INVARIANT REPRESENTATION LEARNING VIA FUNCTIONS IN REPRODUCING KERNEL HILBERT SPACES

By

Bashir Sadeghi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2023

ABSTRACT

Many applications of representation learning, such as privacy preservation and algorithmic fairness, call for explicit control over some unwanted information being discarded. This goal is formulated as satisfying two objectives: maximizing utility for predicting a target attribute while simultaneously being invariant to (independent of) a known sensitive attribute, such as gender or race. Solutions to invariant representation learning (IRepL) problems lead to a trade-off between utility and invariance when the two objectives compete. Most existing works are empirical and implicitly look for single or multiple points on the utility-invariance trade-off; they do not explicitly seek to characterize the entire trade-off front optimally, nor do they provide invariance and convergence guarantees.

In this thesis, we address the shortcomings mentioned above by starting from simple linear models and building upon them. As a first step, we derive a closed-form solution for the global optima of the underlying linear IRepL optimization problem. In a further development, we consider neural network-based encoders, where we model the utility of the target task and the invariance to the sensitive attribute via kernelized ridge regressors. This setting leads to a stable iterative optimization scheme toward global/local optima. However, such a setting cannot guarantee universal invariance. This drawback motivated us to further study the case where the invariance measure is modeled universally via functions in some reproducing kernel Hilbert spaces (RKHS)s.
By modeling the encoder and target networks via functions in some RKHSs as well, we derive a closed formula for a near-optimal trade-off, the corresponding optimal representation dimensionality, and the associated encoder(s). Our findings have an immediate application to fairness in terms of demographic parity.

To my wife: Katharina
To my parents: Farajollah and Farrokh
To my sister’s family: Atekeh, Karen, and Hanzaleh
To my brother’s family: Farhad and Fatemeh

ACKNOWLEDGMENTS

I am thankful to have Dr. Vishnu Naresh Boddeti as my PhD advisor. His attention, expectations, and encouragement helped me learn more than I could have imagined. With his meticulous attention to detail and critical assessment, and by setting himself as an example, Dr. Boddeti helped me decide what kind of researcher I want to be: a solid researcher who formulates scientific problems in a principled way, who observes carefully, who designs and evaluates experiments efficiently and critically, and who presents the findings understandably and intuitively.

I am honored to have Dr. Arun Ross, Dr. Sijia Liu, and Dr. Hamidreza Modares on my thesis committee. I am grateful that I could attend Dr. Sijia Liu’s ‘Adversarial Machine Learning’ course; his deep insight into the topic made the course enjoyable for me. I am also very thankful to Dr. Matthew Hirn for his interesting ‘Mathematics of Deep Learning’ course, which helped fill some gaps between the application and theory of deep learning for me.

I want to thank Gautam Sreekumar for his generous support and encouragement during the challenging times of my PhD. Likewise, I am grateful for the encouragement I received from my other labmates: Hamed Bolandi, Lan Wang, Xiaoxue Wang, Shihua Huang, Rahul Dey, Sepehr Dehdashtian, and Wei Ao. Their valuable comments during group meetings and research discussions helped me improve on what I could do alone. Last but not least, I am very fortunate and grateful that Dr.
Runyi Yu was my MSc advisor. He set himself as an example of a principled scientist who approaches research problems fundamentally. Without any exaggeration, I could not see myself pursuing a PhD without his encouragement and support during my MSc studies.

TABLE OF CONTENTS

LIST OF ABBREVIATIONS vii

Chapter 1 Introduction To Invariant Representation Learning 1
1.1 Introduction 1
1.2 Prior Work 3
1.2.1 Adversarial Representation Learning 3
1.2.2 Invariant Representation Learning 4
1.2.3 Trade-Offs in Invariant Representation Learning 5
1.2.4 Optimization Theory for Adversarial Learning 6
1.2.5 Dimensionality Reduction 7
1.2.6 Dependence Measure 8
1.3 Overview of the Thesis 8
1.4 Contributions of the Thesis 10

Chapter 2 Mathematical Background and Preliminaries 11
2.1 Notations and Definitions 11
2.2 Preliminaries 14
2.2.1 Kernelization 14
2.2.2 Kernelized Dependence Measures 14

Chapter 3 Adversarial Representation Learning Under Linear Invariance 17
3.1 Introduction
17
3.2 Adversarial Representation Learning 20
3.2.1 Problem Setting 20
3.2.2 The Linear Case 21
3.3 Empirical Solution for Linear Encoder 29
3.3.1 Non-Linear Extension Through Kernelization 31
3.4 Analytical Bounds 33
3.5 Computational Complexity 36
3.6 Numerical Experiments 36
3.6.1 Mixture of Four Gaussians 37
3.6.2 Fair Classification 39
3.6.3 Illumination Invariant Face Classification 40
3.6.4 CIFAR-100 42
3.7 Summary 44

Chapter 4 Adversarial Representation Learning With Closed-Form Solvers 46
4.1 Introduction 46
4.2 Problem Setting 48
4.2.1 Motivating Exact Solvers 49
4.3 Exact Adversary and Target Predictor Solvers 51
4.3.1 Closed-Form Adversary and Target Predictor 51
4.3.2 Optimal Embedding Dimensionality 54
4.3.3 Gradient of Closed-Form Solution 58
4.4 Experiments
59
4.4.1 Fair Classification 60
4.4.2 Mitigating Sensitive Information Leakage 62
4.4.3 Ablation Study on Mixture of Four Gaussians 64
4.5 Summary 66

Chapter 5 Universal Invariant Representation Learning 68
5.1 Introduction 68
5.1.1 Adversarial Representation Learning 71
5.1.2 Trade-Offs in Invariant Representation Learning 71
5.2 Deficiency of Mean-Squared Error as A Measure of Dependence 73
5.3 Problem Setting 75
5.3.1 Problem Setup 75
5.4 Choice of Dependence Measure 76
5.5 Exact Kernelized Trade-Off 78
5.5.1 Numerical Complexity 86
5.5.2 Target Task Performance in K-TOpt 87
5.6 Experiments 90
5.6.1 Baselines 90
5.6.2 Datasets 90
5.6.3 Evaluation Metrics 92
5.6.4 Choice of (Y, S) Pair 93
5.6.5 Implementation Details 93
5.6.6 Results
95
5.6.7 Ablation Study 98
5.7 Summary 100

Chapter 6 Conclusion and Future Work 101
6.1 Limitations 101
6.2 Future Work 102
6.3 Broader Impact 102

BIBLIOGRAPHY 103
APPENDIX 117

LIST OF ABBREVIATIONS

Acronyms / Abbreviations
ARL  Adversarial Representation Learning
COCO  Constrained Covariance
DNN  Deep Neural Network
GAN  Generative Adversarial Network
HSIC  Hilbert-Schmidt Independence Criterion
IRepL  Invariant Representation Learning
iff  If and Only If
KCC  Kernelized Canonical Correlation
MLP  Multi-Layer Perceptron
MSE  Mean Square Error
NN  Neural Network
PCA  Principal Component Analysis
RFF  Random Fourier Features
RKHS  Reproducing Kernel Hilbert Space
RV  Random Variable
SGD  Stochastic Gradient Descent
SGDA  Stochastic Gradient Descent Ascent
w.r.t.  With Respect To

Chapter 1 Introduction To Invariant Representation Learning

1.1 Introduction

Real-world applications of representation learning often have to contend with objectives beyond predictive performance. These include cost functions pertaining to invariance (e.g., to photometric or geometric variations), semantic independence (e.g., to age or race for face recognition systems), privacy (e.g., mitigating leakage of sensitive information [1]), algorithmic fairness (e.g., demographic parity [2]), and generalization across multiple domains [3], to name a few.
At its core, the goal of the aforementioned formulations of representation learning is to satisfy two competing objectives: extracting as much information as necessary to predict a target label Y (e.g., face identity), while intentionally and permanently suppressing information about a given sensitive attribute S (e.g., age or gender). See Figure 1.1 for an illustration. An encoder f produces a representation Z = f(X) from the input data X. A target predictor gY operates on the representation Z to predict the target attribute Y. A parametric or non-parametric dependence measure Dep(Z, S) measures the statistical dependency of the representation Z on the sensitive attribute S. For example, Dep(Z, S) can be measured by the loss of a hypothetical adversary that aims to predict the sensitive attribute S.

Figure 1.1: An encoder f in the form of a Borel function produces a representation Z = f(X) from the input data X. A target predictor gY, in the form of a Borel function, operates on the representation Z to predict the target attribute Y. A parametric or non-parametric dependence measure Dep(Z, S) quantifies the statistical dependency of the representation Z on the sensitive attribute S. Invariant representation learning seeks a representation Z = f(X) that contains as much information as necessary for the downstream target predictor gY while being independent of the sensitive attribute S.

Although randomized encoders and target predictors can also be considered, in this dissertation we assume that both f and gY are deterministic Borel functions. When the statistical dependency between Y and S is not negligible, learning a representation Z that is invariant to the sensitive attribute S (i.e., Z ⊥⊥ S) will necessarily degrade the performance of the target prediction; that is, there exists a trade-off between utility and invariance. The primary application of invariant representation learning (IRepL) is invariant prediction.
This is because if Z is independent of S, then the target prediction Ŷ = gY(Z) is independent of S, regardless of the target predictor gY. As a result, to be robust to the choice of target predictor [4], it is preferable to deploy IRepL for invariant prediction rather than enforcing invariance on Ŷ directly.

The existence of a trade-off between utility and invariance has been well established, both theoretically and empirically, under various contexts of representation learning such as fairness [5, 6, 7, 8], invariance [9], and domain adaptation [10]. However, the central question of IRepL remains challenging: what learning algorithm achieves any given point on the utility-invariance trade-off, optimally or via local optima, and how can it be estimated from training data? The vast majority of existing works are empirical in nature. They implicitly look for single or multiple points on the trade-off between utility and invariance to the sensitive information and do not explicitly seek to characterize the entire trade-off front optimally.

Figure 1.2: Adversarial representation learning consists of three entities: an encoder f that obtains a compact representation Z of the input data X, a predictor gY that predicts a desired target attribute Y, and an adversary gS that seeks to extract a sensitive attribute S, both from the embedding Z.

This dissertation aims to address the mentioned shortcoming of existing IRepL approaches by employing functions in some reproducing kernel Hilbert spaces (RKHS)s to model the target predictor gY and the dependence measure Dep(Z, S). For the case where the encoder f is also modeled via functions in some RKHSs, we are able to find a closed-form solution for the optimal encoder. For encoders modeled by neural networks (NN)s, we are able to provide some stability for the underlying iterative optimization problem.
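The claim that invariance of Z propagates to any downstream prediction (a deterministic function of an independent variable stays independent, while leakage in Z survives any gY) can be illustrated with a small simulation. The distributions and the predictor g below are illustrative choices, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# A binary sensitive attribute S and two candidate representations:
# Z_inv is drawn independently of S, while Z_dep leaks S.
S = rng.binomial(1, 0.5, size=n)
Z_inv = rng.normal(size=n)
Z_dep = rng.normal(size=n) + 2.0 * S

def g(z):
    # An arbitrary non-linear downstream predictor.
    return np.tanh(3.0 * z) + 0.1 * z**2

# Sample correlation between the prediction g(Z) and S:
# near zero for the invariant Z, large for the leaky Z.
corr_inv = abs(np.corrcoef(g(Z_inv), S)[0, 1])
corr_dep = abs(np.corrcoef(g(Z_dep), S)[0, 1])
print(corr_inv, corr_dep)
```

Whatever g is chosen, the invariant representation yields predictions uncorrelated with S, which is exactly why enforcing Z ⊥⊥ S is more robust than constraining Ŷ for one particular predictor.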
1.2 Prior Work

1.2.1 Adversarial Representation Learning

Most practical approaches for learning fair, invariant, domain-adaptive, or privacy-preserving representations discussed above are based on adversarial representation learning (ARL). At the core of ARL is the idea of modeling Dep(Z, S) via a proxy adversary that seeks to extract the sensitive attribute S. See Figure 1.2 for an illustration. In the context of image classification, adversarial learning has been utilized to obtain representations that are invariant across domains [3, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. Such representations allow classifiers that are trained on a source domain to generalize to a different target domain. In the context of learning fair and unbiased representations, a number of approaches [1, 2, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] have used and argued for explicit adversarial networks to extract sensitive attributes from the encoded data. All these methods are usually set up as a minimax game between the encoder, a target task, and a proxy adversary. The encoder achieves invariance by maximizing the loss of the proxy adversary, i.e., minimizing the negative log-likelihood or mean square error (MSE) of the sensitive variables as measured by the proxy adversary. Roy et al. [1] identify and address the instability of the optimization in the zero-sum minimax formulation of ARL and propose an alternate non-zero-sum solution, demonstrating improved empirical performance. All the above approaches use deep neural networks (DNN)s to represent the ARL entities, optimize their parameters through stochastic gradient descent ascent (SGDA), and rely on empirical validation. However, none of them seek to study the nature of the ARL formulation itself, i.e., to decouple the role of the expressiveness of the models from the convergence/stability properties of the optimization tools used to learn the parameters of the corresponding models.
This shortcoming motivates us to take some steps toward filling this gap by studying simpler forms of ARL from an optimization perspective in Chapter 3 and building upon it in Chapters 4 and 5.

1.2.2 Invariant Representation Learning

The basic idea of representation learning that discards unwanted semantic information has been explored under many contexts like invariant, fair, or privacy-preserving learning. The concept of learning fair representations was first introduced by Zemel et al. [36]. The goal was to learn a representation of the data by “fair clustering” while maintaining the discriminative features of the prediction task. Building upon this work, many techniques have been proposed to learn an unbiased representation of data while retaining its effectiveness for a prediction task. These include the Variational Fair Autoencoder [37] and the more recent information bottleneck-based objective by Moyer et al. [38]. In domain adaptation [11, 12, 39], the goal is to learn features that are independent of the data domain. In fair learning [2, 21, 40, 36, 41, 42, 43, 23, 24, 22, 44, 26, 45, 46, 47, 48, 49, 50], the goal is to discard the demographic information that leads to unfair outcomes. Similarly, there is growing interest in mitigating unintended leakage of private information from representations [51, 52, 53, 54, 55, 45, 56, 57, 58, 59]. The vast majority of this body of work is empirical in nature: it implicitly looks for single or multiple points on the trade-off between utility and sensitive information and does not explicitly seek to characterize the entire trade-off front. Overall, these approaches are not concerned with, or aware of, the inherent utility-invariance trade-off. In contrast, using functions in some RKHSs, we near-optimally characterize the trade-off and propose a practical learning algorithm that achieves it in Chapter 5.
1.2.3 Trade-Offs in Invariant Representation Learning

Prior work has established the existence of trade-offs in IRepL, both empirically and theoretically. In the following, we categorize this work based on the properties of interest.

Restricted Class of Attributes: A majority of existing work considers IRepL trade-offs under restricted settings, i.e., binary and/or categorical attributes Y and S. For instance, [60] uses information-theoretic tools and characterizes the utility-fairness trade-off in terms of lower bounds when both Y and S are binary labels. Later, [55] provided both upper and lower bounds for binary labels. By leveraging the Chernoff bound, [61] proposed a construction method to generate an ideal representation, beyond the input data, that achieves perfect fairness while maintaining the best performance on the target task. In the case of categorical features, a lower bound on the utility-fairness trade-off has been provided by [6] for the total-invariance scenario (i.e., Z ⊥⊥ S). In contrast to this body of work, our trade-off analysis applies to multidimensional continuous/discrete attributes. To the best of our knowledge, the only prior work with a comparably general setting is [9]. However, in [9], both S and Y are restricted to be continuous/discrete or binary at the same time (e.g., it is not possible to have a binary Y while S is continuous).

Characterizing Exact versus Bounds on the Trade-Off: To the best of our knowledge, all existing approaches characterize the trade-off in terms of upper and/or lower bounds. In contrast, we exactly characterize a near-optimal trade-off with closed-form expressions for encoders belonging to some RKHSs.

Optimal Encoder and Representation: Another property of practical interest is the optimal encoder that achieves the desired point on the utility-invariance trade-off, together with the corresponding representation(s). Existing works that only study bounds on the trade-off do not obtain the encoder that achieves those bounds.
The Hilbert-Schmidt independence criterion (HSIC), a universal measure of dependence, has been adopted by prior work (e.g., [62]) to quantify all types of dependencies between Z and S. However, these methods adopt stochastic gradient descent (SGD) for the underlying non-convex optimization problem. As such, they fail to guarantee that the representation learning problem converges to a global optimum. In contrast, in Chapter 5 we obtain a closed-form solution for the optimal encoder and its corresponding representation while detecting all modes of dependence between Z and S.

1.2.4 Optimization Theory for Adversarial Learning

A growing class of learning algorithms, including ARL, generative adversarial networks (GAN)s, etc., involve more than one objective and are trained via games played by cooperating or dueling NNs. An overview of the challenges presented by such algorithms, and a plausible solution in general n-player games, can be found in [63]. In the context of two-player minimax games such as GANs, several solutions [64, 65, 66, 67, 68, 69, 70, 71] have been proposed to improve the optimization dynamics, many of them relying on the idea of taking an extrapolation step [72]. The non-convex nature of the ARL formulation poses unique challenges from an optimization perspective. Practically, the parameters of the models in ARL are optimized through SGD, either jointly [21, 64] or alternately [11], with the former being a generalization of gradient descent known as SGDA. While the convergence properties of gradient descent and its variants are well understood, there is relatively little work on the convergence and stability of SGDA in adversarial minimax problems. Recently, Mescheder et al. [64] and Nagarajan et al. [65] both leveraged tools from non-linear systems theory [73] to analyze the convergence properties of SGDA, in the context of GANs, around a given equilibrium.
They show that, without the introduction of additional regularization terms to the objective of the zero-sum game, SGDA does not converge. However, their analysis is restricted to the two-player GAN setting and is not concerned with its global/local optima. In contrast, by using kernelized ridge regressors for the target and proxy adversary networks, we are able to optimize these networks exactly for any given representation Z = f(X), which turns the unstable SGDA optimization scheme into the stable SGD scheme.

1.2.5 Dimensionality Reduction

The technical machinery of Chapters 3 and 5 of this dissertation is closely related to principal component analysis (PCA) [74] and its kernelized version [75] for dimensionality reduction. PCA can provide a compact disentangled representation (i.e., one in which different elements of the representation vector are uncorrelated with each other) of the input data, which is efficient for downstream classification, regression, and clustering tasks. In particular, we deploy supervised PCA [76] in this thesis. Kernel methods have also been used previously for fair dimensionality reduction by [77], where the Rayleigh quotient is employed to search for only a single point on the utility-invariance trade-off. In contrast to this work, our approaches in Chapters 3 and 5 aim to characterize the entire trade-off front.

1.2.6 Dependence Measure

In 1959, Rényi introduced dependence measures as quantifiers of the statistical dependence between two random variables (RV)s Z and S by a non-negative value, where zero indicates that Z and S are independent and larger values indicate greater degrees of dependence [78]. One such dependence measure can be defined as the maximum Pearson correlation (a.k.a. correlation coefficient) between α(Z) and β(S) over all Borel functions α and β [78]. Such a measure is not computationally tractable if Z and/or S are continuous [76].
To circumvent this difficulty, [79] demonstrated that any universal RKHS is a sufficient search space for α and β to detect all modes of dependence between Z and S (by ‘all modes of dependence’, we mean all types of linear or non-linear relations, in contrast to only linear or monotonic relations). Later, [80] employed the maximum of the covariance as a measure of independence for α and β belonging to unit balls in some universal RKHSs. Further, [81] proposed HSIC, where they demonstrated that considering the covariance for only the elements of any basis set of the involved RKHSs is sufficient for a universal dependence measure.

1.3 Overview of the Thesis

In the second chapter, we introduce the notations, definitions, mathematical background, and machinery required for this dissertation. The mathematical background of this dissertation includes linear algebra, probability theory, and functional analysis. In particular, we sometimes deploy functions in some RKHSs to model the encoder f and/or the target predictor gY. Further, we sometimes model the dependence measure between the representation Z and the sensitive attribute S via kernel measures of independence.

In the third chapter, we study the simplest ARL setting, where all three players, 1) the encoder f, 2) the target predictor gY, and 3) the proxy adversary gS, are modeled linearly. Under this scenario, we obtain a closed-form solution (both empirically and in population) for the optimal encoder in terms of the eigenvectors of the projection of the input data onto the space that is as close as possible to the target label space while lying in the least explanatory space for the sensitive attribute. We then generalize our formulation and closed-form solutions to encoders in RKHSs while the target and proxy adversary networks remain linear. Moreover, we theoretically obtain the optimal embedding dimensionality (i.e., dim(Z)) as a function of the user-defined trade-off parameter.
In the fourth chapter, we let the encoder be a DNN and aim to circumvent the instability and lack of convergence guarantees induced by SGDA optimization in ARL by modeling the adversary and target networks as kernelized ridge regressors. This, in turn, yields a closed-form solution for the optimal adversary and target predictors for any given representation Z = f(X). Therefore, the SGDA optimization strategy reduces to simple SGD for learning the encoder parameters. Moreover, we theoretically obtain an upper bound on the optimal embedding dimensionality.

In the fifth chapter, motivated by the fact that a proxy adversary loss may not account for all modes of dependence, we model the invariance measure via a near-universal dependence measure rather than a proxy adversary loss. Modeling the target loss in the same way, we are able to find a closed-form solution (both empirically and in population) for encoders in RKHSs. The closed-form solution also leads to the determination of the optimal embedding dimensionality. The utility-invariance trade-off induced by the optimal encoder can be interpreted as an inherent trade-off arising from the triplet of the input data X, the target label Y, and the sensitive attribute S.

We conclude the thesis in the sixth chapter, where we discuss limitations, future work, and the broader impact of this thesis.

1.4 Contributions of the Thesis

• We obtain a closed-form solution for the global optima of ARL with a linear/kernelized encoder and linear target and proxy adversary networks under the MSE loss in Chapter 3. This closed-form solution can be interpreted as a generalization of supervised PCA (and its kernelized version) to the case where there is a sensitive attribute to discard. Consequently, we are able to obtain the exact optimal embedding dimensionality as a function of the user-defined utility-invariance trade-off parameter under this scenario.
• We deploy kernelized ridge regressors for modeling the proxy adversary and target networks, while the encoder can be a DNN, in Chapter 4. In turn, the unstable SGDA optimization involved in ARL reduces to SGD, which is a stable optimization scheme. Furthermore, we obtain an upper bound on the optimal embedding dimensionality.

• We introduce a simplified version of HSIC to measure the dependence between the representation Z and the sensitive attribute S for encoders in RKHSs in Chapter 5. We demonstrate that our proposed dependence measure is near-universal and lends itself to a closed-form solution for the IRepL problem, where the optimal embedding dimensionality can be obtained precisely as a function of the trade-off parameter. The resulting utility-invariance trade-off can be interpreted as a near-optimal trade-off induced by the triplet (X, Y, S).

Chapter 2 Mathematical Background and Preliminaries

2.1 Notations and Definitions

Scalars are denoted by regular lowercase letters, e.g., r, λ. Deterministic vectors are denoted by boldface lowercase letters, e.g., x, s. The L2-norm of a vector x is denoted by ‖x‖, and the inner product between vectors x and s of the same size is denoted by ⟨x, s⟩. We denote n-tuple vectors of ones and zeros by 1_n and 0_n, respectively. Finite or infinite sets are denoted by calligraphic letters, e.g., H, A. The indicator function is denoted by 1_B(·), where

\[ \mathbb{1}_B(m) = \begin{cases} 1 & \text{if } m \in B \\ 0 & \text{if } m \notin B \end{cases} \tag{2.1} \]

We denote deterministic matrices by boldface uppercase letters, e.g., M, K. The element at the i-th row and j-th column of a matrix M is denoted by (M)_{ij} or m_{ij}; its transpose is denoted by M^T; its inverse by M^{-1}; and its Moore-Penrose pseudo-inverse by M^†. Centering, i.e., mean subtraction with respect to (w.r.t.) columns, is denoted by “˜”, e.g., M̃, which can be obtained as

\[ \widetilde{M} = M H, \quad \text{where } M \in \mathbb{R}^{m \times n} \text{ and } H := I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T. \tag{2.2} \]

The subspace spanned by the columns of M is denoted by R(M) or simply M; the orthogonal complement of M is denoted by M⊥, and the null space of M is denoted by N(M). The orthogonal projection onto M is denoted by P_M and can be obtained as

\[ P_M = M \left( M^T M \right)^{\dagger} M^T. \tag{2.3} \]

We denote an n × n identity matrix by I_n or simply I. The trace of a square matrix K (i.e., the sum of its diagonal elements) is denoted by Tr[K]. The Frobenius norm of a matrix M is denoted by ‖M‖_F, which is related to the trace as

\[ \|M\|_F^2 = \operatorname{Tr}\left[ M M^T \right] = \operatorname{Tr}\left[ M^T M \right]. \]

We denote both scalar-valued and multidimensional random variables (RV)s by regular uppercase letters, e.g., X, S. The expectation of the RV X is denoted by E[X], and its covariance matrix is denoted by C_X, where

\[ C_X := \mathbb{E}\left[ (X - \mathbb{E}[X])(X - \mathbb{E}[X])^T \right]. \]

Similarly, the cross-covariance between the RVs X and S, denoted by C_{XS}, is defined as

\[ C_{XS} := \mathbb{E}\left[ (X - \mathbb{E}[X])(S - \mathbb{E}[S])^T \right]. \]

For a positive definite matrix C (denoted by C ≻ 0), the Cholesky factorization yields

\[ C \succ 0 \;\Rightarrow\; C = Q Q^T, \quad Q \text{ is full rank}. \tag{2.4} \]

If C is a positive semi-definite matrix (denoted by C ⪰ 0), then its incomplete Cholesky factorization is

\[ C \succeq 0 \;\Rightarrow\; C = L L^T, \quad L \text{ is full column-rank}. \tag{2.5} \]

Consider the probability space (Ω, F, P), where Ω is the sample space, F is a σ-algebra on Ω, and P is a probability measure on F. We assume that the joint RV (X, Y, S), containing the input data X ∈ R^{d_X}, the target label Y ∈ R^{d_Y}, and the sensitive attribute S ∈ R^{d_S}, is an RV on (Ω, F) with joint distribution p_{X,Y,S}. Furthermore, Y and S can also belong to any finite set, such as a categorical set. This setting enables us to work with classification and multidimensional regression tasks, where the sensitive attribute can be either a categorical or a multidimensional continuous/discrete RV. We let D := {(x_1, y_1, s_1), · · · , (x_n, y_n, s_n)} be the training data, containing n i.i.d. samples from the joint distribution p_{X,Y,S}.
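The centering and projection identities of Section 2.1 are straightforward to verify numerically. The following NumPy sketch (dimensions chosen arbitrarily for illustration) checks Eqs. (2.2) and (2.3) together with the Frobenius-trace identity:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))  # M in R^{m x n} with m = 5, n = 3
m, n = M.shape

# Centering (Eq. 2.2): M~ = M H with H = I_n - (1/n) 1_n 1_n^T.
H = np.eye(n) - np.ones((n, n)) / n
M_tilde = M @ H
assert np.allclose(M_tilde.sum(axis=1), 0.0)  # each row now has zero mean

# Orthogonal projection onto R(M) (Eq. 2.3): P_M = M (M^T M)^† M^T.
P = M @ np.linalg.pinv(M.T @ M) @ M.T
assert np.allclose(P @ P, P)   # idempotent
assert np.allclose(P, P.T)     # symmetric
assert np.allclose(P @ M, M)   # leaves R(M) fixed

# Frobenius norm identity: ||M||_F^2 = Tr[M M^T] = Tr[M^T M].
assert np.isclose(np.linalg.norm(M, 'fro')**2, np.trace(M @ M.T))
print("ok")
```

The pseudo-inverse in the projection matters precisely when M^T M is rank-deficient; for full column-rank M it coincides with the ordinary inverse.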
We also separately define the input, the label, and the sensitive data matrices, respectively, as follows:

X := [x_1, · · · , x_n] ∈ R^{d_X×n}, Y := [y_1, · · · , y_n] ∈ R^{d_Y×n}, S := [s_1, · · · , s_n] ∈ R^{d_S×n}.

2.2 Preliminaries

2.2.1 Kernelization

Let f ∈ H_X, where H_X is an RKHS of functions from R^{d_X} to R with kernel function k_X(·, ·). Invoking the representer theorem [82], it follows that

f(X) = Σ_{i=1}^n θ_i k_X(x_i, X) = θ^T [k_X(x_1, X), · · · , k_X(x_n, X)]^T,

where θ ∈ R^{n×1} and (θ)_i = θ_i. Moreover, let f(X) := [f_1(X), · · · , f_r(X)]^T. Similarly, we have

f(X) = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T, (2.6)

where Θ ∈ R^{r×n} and (Θ)_ji = θ_ji.

2.2.2 Kernelized Dependence Measures

In principle, two RVs Z and S are independent (denoted by Z ⊥⊥ S) if and only if (iff) [83]

Cov(α(Z), β(S)) := E[α(Z) β(S)] − E[α(Z)] E[β(S)] (2.7)

is zero for all Borel functions α : R^r → R and β : R^{d_S} → R belonging to the universal RKHSs H_Z and H_S, respectively. Note that universality ensures that these RKHSs can approximate any Borel function with arbitrary precision [84]. In the remainder of this thesis, we consider the following assumption unless otherwise stated.

Assumption 2.1. We assume that any RKHS H (from R^d to R) is universal and separable and that the corresponding kernel function k(·, ·) is bounded:

E[k(U, U)] < ∞ for any square-integrable d-dimensional U. (2.8)

Note that a Hilbert space is separable iff it has a countable orthonormal basis set, and U is square-integrable iff E[‖U‖²] < ∞. Now consider the following bi-linear functional:

h : H_Z × H_S → R, (α, β) ↦ Cov(α(Z), β(S)),

where H_Z and H_S are RKHSs. This bi-linear functional is bounded due to Assumption 2.1 [85]. Invoking the Riesz representation theorem [86], there exists a unique and bounded operator Σ_SZ : H_Z → H_S such that

Cov(α(Z), β(S)) = h(α, β) = ⟨Σ_SZ α, β⟩_{H_S} ∀ α ∈ H_Z, β ∈ H_S. (2.9)

Consequently, it follows that

Z ⊥⊥ S ⇐⇒ Σ_SZ = 0. (2.10)
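The criterion Z ⊥⊥ S ⇔ Σ_SZ = 0 is what the kernelized dependence measures discussed next estimate from samples. As an illustration only, the following sketch computes the standard biased empirical HSIC estimate Tr[K_Z H K_S H]/(n−1)² from the HSIC literature; the Gaussian kernel, its bandwidth, and the toy data are our illustrative choices, not part of the thesis's construction:

```python
import numpy as np

def rbf_kernel(U, sigma=1.0):
    # Gram matrix k(u_i, u_j) = exp(-||u_i - u_j||^2 / (2 sigma^2));
    # rows of U are samples.
    sq = np.sum(U**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * U @ U.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(Z, S, sigma=1.0):
    # Standard biased empirical HSIC estimate: Tr[K_Z H K_S H] / (n-1)^2.
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_kernel(Z, sigma) @ H @ rbf_kernel(S, sigma) @ H) / (n - 1) ** 2

rng = np.random.default_rng(2)
S = rng.standard_normal((200, 1))
Z_dep = S + 0.1 * rng.standard_normal((200, 1))   # strongly dependent on S
Z_ind = rng.standard_normal((200, 1))             # sampled independently of S

# Dependence registers as a (much) larger HSIC value.
assert hsic(Z_dep, S) > hsic(Z_ind, S)
# HSIC is symmetric in its arguments (cyclicity of the trace).
assert np.isclose(hsic(Z_dep, S), hsic(S, Z_dep))
```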
Notice that Σ_SZ = 0 iff the norm of Σ_SZ is zero for any valid norm, like the spectral norm:

‖Σ_SZ‖_Spectral := sup_{α∈H_Z} ‖Σ_SZ α‖_{H_S} / ‖α‖_{H_Z}, (2.11)

or the Hilbert-Schmidt norm:

‖Σ_SZ‖²_HS := Σ_{α_i∈U_Z, β_j∈U_S} ⟨Σ_SZ α_i, β_j⟩²_{H_S}, (2.12)

where U_Z and U_S are countable orthonormal basis sets for the separable universal RKHSs H_Z and H_S, respectively. These norms have been deployed in constrained covariance (COCO) [80] and HSIC, respectively, which are universal measures of dependence [81]. Moreover, kernelized canonical covariance (KCC), introduced by [79],

KCC(Z, S) := sup_{α∈H_Z, β∈H_S} Cov(α(Z), β(S)) / √(Var(α(Z)) Var(β(S))), (2.13)

has also been widely used as a universal measure of dependence.

Chapter 3
Adversarial Representation Learning Under Linear Invariance

3.1 Introduction

Adversarial representation learning is a promising framework for training image representation models that can control the information encapsulated within the representation. ARL is practically employed to learn representations for a variety of applications, including unsupervised domain adaptation of images [87], censoring sensitive information from images [21], learning fair and unbiased representations [37, 2], learning representations that are controllably invariant to sensitive attributes [24], and mitigating unintended information leakage [1], amongst others. At the core of the ARL formulation is the idea of jointly optimizing three entities: (i) an encoder f that seeks to distill the information from the input data X, retaining the information relevant to the target attribute Y while intentionally and permanently eliminating the information corresponding to the sensitive attribute S; (ii) a predictor g_Y that seeks to predict Y; and (iii) a proxy adversary g_S, playing the role of an unknown adversary, that seeks to extract the sensitive information S. Figure 3.1 shows a pictorial illustration of the ARL problem.
Typical instantiations of ARL represent these entities through non-linear functions in the form of NNs and formulate parameter learning as a minimax optimization problem.

Figure 3.1: Adversarial representation learning consists of three entities: an encoder f that obtains a compact representation Z of the input data X, a predictor g_Y that predicts a desired target attribute Y, and an adversary g_S that seeks to extract a sensitive attribute S, both from the embedding Z.

Practically, optimization is performed through SGDA, wherein small gradient steps are taken simultaneously in the parameter space of the encoder, target predictor, and proxy adversary. The solutions thus obtained have been effective in learning data representations with controlled invariance across applications such as image classification [1], multi-lingual machine translation [24], and domain adaptation [87]. Despite its practical promise, the aforementioned ARL setup suffers from a number of drawbacks:

– Representation learning under adversarial settings is challenging in its own right. The minimax formulation of the problem leads to an optimization problem that is non-convex in the parameter space, both due to the adversarial loss function and due to the non-linear nature of modern NNs. As we show in this chapter, even for simple instances of ARL where each entity is characterized by a linear function, the problem remains non-convex in the parameter space. Similar observations [65] have been made in a different but related context of adversarial learning in generative adversarial networks (GAN)s [88].

– The current paradigm of SGDA to solve the ARL problem provides no provable guarantees while suffering from instability and poor convergence [1, 2].
Again, similar observations [64, 65] have been made in the context of GANs, demonstrating the difficulty posed by the minimax formulation of the optimization and exposing the limitations of standard simultaneous optimization (i.e., SGDA).

– In applications of ARL related to fairness, accountability, and transparency of machine learning models, it is critically important to be able to provide performance bounds in addition to empirical evidence of their efficacy. A major shortcoming of existing works is the difficulty and lack of performance analysis and provable guarantees on unfairness or information leakage.

In this chapter, we take a step back and analytically study the simplest version of the ARL problem from an optimization perspective, with the goal of addressing the aforementioned limitations. Doing so enables us to delineate the contributions of the expressivity of the entities in ARL (i.e., shallow versus DNNs) and the challenges of optimizing the parameters (i.e., local optima through SGDA versus global optima). We first consider the "linear" form of ARL, where the encoder is a linear transformation, the target predictor is a linear regressor, and the proxy adversary is also a linear regressor. We show that this linear ARL leads to an optimization problem that is both non-convex and non-differentiable. Despite this fact, by reducing it to a set of trace problems on a Stiefel manifold, we obtain an exact closed-form solution for the global optima. As part of our solution, we also determine the optimal dimensionality of the embedding space. We then obtain analytical bounds (lower and upper) on the target and adversary objectives and prescribe a procedure to explicitly control the maximal leakage of sensitive information. Finally, we extend the linear-ARL formulation to allow non-linear functions in some RKHSs while still enjoying an exact closed-form solution for the global optima.
Numerical experiments on multiple datasets, both small and large scale, indicate that the global optima solutions for the linear and kernel formulations of ARL are competitive with, and sometimes even outperform, DNN-based ARL trained through SGDA. Practically, we also demonstrate the utility of linear-ARL and kernel-ARL for "imparting" provable invariance to any biased pre-trained data representation. We refer to our proposed algorithm for obtaining the global optima as spectral-ARL and abbreviate it as SARL.

3.2 Adversarial Representation Learning

3.2.1 Problem Setting

The adversarial representation learning problem is formulated with the goal of learning the parameters of an embedding function f(·; Θ) : X ↦ Z with two objectives: (i) aiding a target predictor g_Y(·; Θ_Y) to accurately predict the target attribute Y from Z, and (ii) preventing an adversary g_S(·; Θ_S) from inferring the sensitive attribute S from Z. The ARL problem can be formulated as

min_Θ min_{Θ_Y} E_{X,Y}[L_Y(g_Y(f(X; Θ); Θ_Y), Y)]
s.t. min_{Θ_S} E_{X,S}[L_S(g_S(f(X; Θ); Θ_S), S)] ≥ α, (3.1)

where L_Y and L_S are the loss functions for the target and the adversary predictors, respectively, and α ∈ [0, ∞) is a user-defined value that determines the minimum tolerable loss for the adversary on the sensitive attribute. For example, α = 0 corresponds to ignoring the adversary loss, resulting in standard representation learning, while α → ∞ corresponds to no tolerance for adversary performance. The minimization in the constraint is equivalent to the encoder operating against an optimal adversary. Existing instances of this problem adopt DNNs to represent f, g_Y, and g_S and learn their respective parameters {Θ, Θ_Y, Θ_S} through SGDA. See Figure 3.2 for an illustration.

Figure 3.2: ARL-SGDA: Illustration of training adversarial representation learning through stochastic gradient descent ascent.
i) At first, the target predictor parameters Θ_Y are updated while the encoder and adversary are frozen. ii) Then, the adversary parameters Θ_S are updated while the encoder and Θ_Y are frozen. iii) Finally, the encoder parameters Θ are updated while Θ_Y and Θ_S are frozen. SGDA does not provide any convergence guarantees.

3.2.2 The Linear Case

We first consider the simplest form of the ARL problem and analyze it from an optimization perspective. We model both the adversary and the target predictors as linear regressors,

Ŷ = Θ_Y Z + b_Y, Ŝ = Θ_S Z + b_S, (3.2)

where Z is an encoded version of X, and Ŷ and Ŝ are the predictions corresponding to the target and sensitive attributes, respectively. We also model the encoder through a linear mapping Θ ∈ R^{r×d_X}:

X ↦ Z = ΘX, (3.3)

where r is the dimensionality of the projected space. While existing NN-based solutions select r on an ad-hoc basis, our approach determines r as part of our solution to the ARL problem. See Figure 3.3 for an illustration.

Figure 3.3: Linear-ARL: Illustration of linear adversarial representation learning for learning a fair representation. An encoder f, in the form of a linear mapping, produces a new representation Z = ΘX. A target predictor g_Y and an adversary g_S, in the form of linear regressors, operate on the representation Z. We analytically analyze this ARL setup to obtain a closed-form solution for the globally optimal parameters of the encoder Θ. Provable bounds on the trade-off between utility and fairness of the representation are also derived.

For both the adversary and the target predictors, we adopt the MSE to assess the quality of their respective predictions, i.e.,

L_Y(Y, Ŷ) = E[‖Y − Ŷ‖²], L_S(S, Ŝ) = E[‖S − Ŝ‖²].

3.2.2.1 Optimization Problem

For any given encoder Θ, the following lemma gives the minimum MSE for a linear regressor in terms of covariance matrices and Θ.
The following lemma assumes that the RV X is zero-mean and that the covariance matrix C_X is positive definite. These assumptions are not restrictive, since we can always remove the mean and the dependent features from X.

Lemma 3.1. Let X and U be two RVs with E[X] = 0 and E[U] = b, where C_X ≻ 0. Consider a linear regressor Û = W Z + b, where W ∈ R^{d_U×r} is the parameter matrix and Z ∈ R^r is an encoded version of X for a given Θ: X ↦ Z = ΘX, Θ ∈ R^{r×d_X}. The minimum MSE that can be achieved by designing W is

min_W E[‖U − Û‖²] = Tr[C_U] − ‖P_M Q_X^{-T} C_XU‖²_F,

where M = Q_X Θ^T ∈ R^{d_X×r} and C_X = Q_X^T Q_X (Cholesky factorization).

Proof. See Appendix A.1.

Applying this lemma to the target and adversary regressors, we obtain their minimum MSEs as

J_Y(Θ) := min_{Θ_Y} L_Y(g_Y(f(X; Θ); Θ_Y), Y) = Tr[C_Y] − ‖P_M Q_X^{-T} C_XY‖²_F (3.4)

J_S(Θ) := min_{Θ_S} L_S(g_S(f(X; Θ); Θ_S), S) = Tr[C_S] − ‖P_M Q_X^{-T} C_XS‖²_F. (3.5)

Given the encoder Θ, J_Y(Θ) is related to the performance of the target predictor, whereas J_S(Θ) corresponds to the amount of sensitive information that an adversary is able to extract. Note that the linear models for g_Y and g_S enable us to obtain their respective optimal solutions for a given encoder Θ. On the other hand, when g_Y and g_S are modeled as NNs, doing the same is analytically infeasible and potentially impractical. The orthogonal projector P_M in Lemma 3.1 is a function of two factors: a data-dependent term Q_X and the encoder parameters Θ. While the former is fixed for a given dataset, the latter is our object of interest. Pursuantly, we decompose P_M in order to separably characterize the effect of these two factors. Let the columns of L_X ∈ R^{d_X×d_X} be an orthonormal basis for the column space of Q_X, and let G ∈ R^{d_X×r} be an arbitrary matrix such that L_X G := Q_X Θ^T. Due to the bijection

G = L_X^{-1} Q_X Θ^T ⇔ Θ = G^T L_X^T Q_X^{-T},

determining the encoder parameters Θ is equivalent to determining G.
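Lemma 3.1 can be verified numerically by comparing its closed form against the minimum MSE of a directly fitted linear regressor. In the sketch below, empirical covariances stand in for the population quantities, and all dimensions and the data-generating model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
dX, dU, r, n = 4, 2, 2, 20000

# Zero-mean data with full-rank covariance; U depends linearly on X plus noise.
X = rng.standard_normal((dX, n))
A = rng.standard_normal((dU, dX))
U = A @ X + 0.5 * rng.standard_normal((dU, n))
Theta = rng.standard_normal((r, dX))           # an arbitrary fixed encoder

# Empirical covariances (used here in place of the population quantities).
C_X = X @ X.T / n
C_U = U @ U.T / n
C_XU = X @ U.T / n

# Direct route: for Z = Theta X, the optimal W = C_UZ C_Z^{-1} gives
# min MSE = Tr[C_U] - Tr[C_UZ C_Z^{-1} C_ZU].
C_Z = Theta @ C_X @ Theta.T
C_UZ = C_XU.T @ Theta.T
mse_direct = np.trace(C_U) - np.trace(C_UZ @ np.linalg.inv(C_Z) @ C_UZ.T)

# Lemma 3.1: with C_X = Q_X^T Q_X (Cholesky) and M = Q_X Theta^T,
# min MSE = Tr[C_U] - ||P_M Q_X^{-T} C_XU||_F^2.
Q_X = np.linalg.cholesky(C_X).T                # upper-triangular factor
M = Q_X @ Theta.T
P_M = M @ np.linalg.pinv(M.T @ M) @ M.T
mse_lemma = np.trace(C_U) - np.linalg.norm(P_M @ np.linalg.inv(Q_X.T) @ C_XU) ** 2

assert np.isclose(mse_direct, mse_lemma)
```

The two expressions agree because Q_X^{-1} M = Θ^T, so the projector route reduces algebraically to the direct regression formula.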
The projector P_M can now be expressed in terms of P_G, which depends only on the free parameter G:

P_M = M (M^T M)^† M^T = L_X P_G L_X^T, (3.6)

where we used the equality M = Q_X Θ^T and the fact that L_X^T L_X = I. Now, we turn back to the ARL setup and see how the above decomposition can be leveraged. The optimization problem in (3.1) reduces to

min_G J_Y(G) s.t. J_S(G) ≥ α, (3.7)

where the minimum MSE measures of (3.4) and (3.5) are now expressed in terms of G instead of Θ. Before solving this optimization problem, we will first interpret it geometrically. Consider a simple example where X is a white RV, i.e., C_X = I. Under this setting, Q_X = L_X = I and G = Θ. As a result, the optimization problem in (3.7) can alternatively be solved in terms of G = Θ, where J_Y(G) = Tr[C_Y] − ‖P_G C_XY‖²_F and J_S(G) = Tr[C_S] − ‖P_G C_XS‖²_F.

The constraint J_S(G) ≥ α implies ‖P_G C_XS‖²_F ≤ (Tr[C_S] − α), which is geometrically equivalent to the subspace G being outside (or tangent to) the cone around C_XS. Similarly, minimizing J_Y(G) implies maximizing ‖P_G C_XY‖²_F, which in turn is equivalent to minimizing the angle between the subspace G and the vector C_XY. Therefore, the global optima of (3.7) are any hyperplane G which is outside the cone around C_XS while subtending the smallest angle to C_XY. An illustration of this setting and its solution is shown in Figure 3.4 for d_X = 3, r = 2, and d_Y = d_S = 1. Constrained optimization problems such as (3.7) are commonly solved through their respective unconstrained scalarization formulations [89], as shown below:

min_{G∈R^{d_X×r}} {(1 − λ) J_Y(G) − λ J_S(G)} (3.8)

Figure 3.4: Geometric interpretation: An illustration of a three-dimensional input space X and one-dimensional target and adversary regressors. Therefore, both C_XS and C_XY are one-dimensional. We locate the y-axis in the same direction as C_XS.
The feasible space for the solution G = Θ imposed by the constraint J_S(Θ) ≥ α corresponds to the region outside the cone (specified by C_S and α) around C_XS. The non-convexity of the problem stems from the non-convexity of this feasible set. The objective min J_Y(Θ) corresponds to minimizing the angle between the line C_XY and the plane G. When C_XY is outside the cone, the line C_XY itself, or any plane that contains the line C_XY and does not intersect the cone, is a valid solution. When C_XY is inside the cone, the solution is either a line or, as we illustrate, a tangent hyperplane to the cone that is closest to C_XY. The non-differentiability stems from the fact that the solution can be either a plane or a line.

The scalarization in (3.8) is solved for some parameter 0 ≤ λ < 1. Such an approach affords two main advantages and one disadvantage: (a) A direct and closed-form solution can be obtained. (b) Framing (3.8) in terms of λ and (1 − λ) allows explicit control between the two extremes of no fairness (λ = 0) and only fairness (λ → 1). As a consequence, it can be shown that for every λ ∈ [0, 1), ∃ α ∈ [α_min, α_max] (see Appendix A.2 for a proof). (c) The converse, on the other hand, does not necessarily hold, i.e., for a given tolerable loss α, there may not be a corresponding λ ∈ [0, 1). This is the theoretical limit of solving a scalarized problem instead of the constrained problem. Before we obtain the solution to the scalarization formulation (3.8), we characterize the nature of the optimization problem in the following theorem.

Theorem 3.2. As a function of G ∈ R^{d_X×r}, the objective function in (3.8) is neither convex nor differentiable.

Proof. See Appendix A.3.

3.2.2.2 Learning

Despite the difficulty associated with the objective in (3.8), we derive a closed-form solution for its global optima. Our key insight lies in partitioning the search space R^{d_X×r} based on the rank of the matrix G (i.e., the number of independent rows or columns of G).
For a given rank i, let S_i be the set containing all matrices G of rank i:

S_i = {G ∈ R^{d_X×r} | rank(G) = i}, i = 0, 1, · · · , r.

Since ∪_{i=0}^r S_i = R^{d_X×r}, the optimization problem in (3.8) can be solved by considering r minimization problems, one for each possible rank of G:

min_{i∈{1,...,r}} { min_{G∈S_i} (1 − λ) J_Y(G) − λ J_S(G) } (3.9)

We observe from (3.4), (3.5), and (3.6) that the optimization problem in (3.8) depends only on a subspace G. As such, the solution G is not unique, since many different matrices can span the same subspace. Hence, it is sufficient to solve for any particular G that spans the optimal subspace G. Without loss of generality, we seek an orthonormal basis spanning the optimal subspace G as our desired solution. We constrain G ∈ R^{d_X×i} to be an orthonormal matrix, i.e., G^T G = I_i, where i is the dimensionality of G. Ignoring the constant terms in J_Y and J_S, for each i = 1, . . . , r, the minimization problem over S_i in (3.9) reduces to

min_{G^T G=I_i} J_λ(G), (3.10)

where

J_λ(G) := λ ‖L_X G G^T L_X^T Q_X^{-T} C_XS‖²_F − (1 − λ) ‖L_X G G^T L_X^T Q_X^{-T} C_XY‖²_F.

From basic properties of the trace, we have J_λ(G) = Tr[G^T B G], where B ∈ R^{d_X×d_X} is a symmetric matrix:

B = L_X^T Q_X^{-T} (λ C_SX^T C_SX − (1 − λ) C_YX^T C_YX) Q_X^{-1} L_X. (3.11)

The optimization problem in (3.10) is a trace minimization on a Stiefel manifold, which has closed-form solution(s) (see [90] and [91]). In view of the above discussion, the solution to the optimization problem in (3.8), or equivalently (3.9), can be stated in the next theorem.

Theorem 3.3. Assume that the number of negative eigenvalues of B in (3.11) is j. Denote γ = min{r, j}. Then, the minimum value in (3.9) is given as

β_1 + β_2 + · · · + β_γ, (3.12)

where β_1 ≤ β_2 ≤ . . . ≤ β_γ < 0 are the γ smallest eigenvalues of B, and the minimum can be attained by G = V, where the columns of V are the eigenvectors corresponding to all γ negative eigenvalues of B.

Proof.
Consider the inner optimization problem of (3.10) in (3.9). Using the trace optimization problems and their solutions in [90], we get

min_{G^T G=I_i} J_λ(G) = min_{G^T G=I_i} Tr[G^T B G] = β_1 + β_2 + · · · + β_i,

where β_1, β_2, . . . , β_i are the i smallest eigenvalues of B, and the minimum value can be achieved by the matrix V whose columns are the corresponding eigenvectors. If the number of negative eigenvalues of B is less than r, then the optimum i in (3.9) is j; otherwise, the optimum i is r.

Note that including the eigenvectors corresponding to zero eigenvalues of B in our solution G in Theorem 3.3 does not change the minimum value in (3.12). However, considering only the eigenvectors corresponding to negative eigenvalues results in a G with the least rank and, thereby, an encoder that is less likely to contain sensitive information for an adversary to exploit. Once G is constructed, we can obtain our desired encoder as Θ = G^T L_X^T Q_X^{-T}. Recall that the solution in Theorem 3.3 is under the assumption that the covariance C_X is a full-rank matrix.

3.3 Empirical Solution for Linear Encoder

In many practical scenarios, we only have access to data samples, not to the population mean vectors and covariance matrices. Therefore, the population solution might not be feasible in such a case. In this section, we provide an approach to solve the optimization problem in (3.8) that relies on empirical moments and is valid even if the covariance matrix C_X is not full-rank. Firstly, for a given Θ, we find

J_Y = min_{W_Y, b_Y} MSE(Ŷ − Y).

Note that the above optimization problem can be separated over W_Y and b_Y. Therefore, for a given W_Y, we first minimize over b_Y:

min_{b_Y} E[‖W_Y ΘX + b_Y − Y‖²] = min_{b_Y} (1/n) Σ_{k=1}^n ‖W_Y Θx_k + b_Y − y_k‖² = (1/n) Σ_{k=1}^n ‖W_Y Θx_k + c − y_k‖²,

where we used the empirical expectation and the minimizer c is

c = (1/n) Σ_{k=1}^n (y_k − W_Y Θx_k) = (1/n) Σ_{k=1}^n y_k − W_Y Θ (1/n) Σ_{k=1}^n x_k = E[Y] − W_Y Θ E[X]. (3.13)

Let all the columns of the matrix C be equal to c.
We now have

J_Y = min_{W_Y, b_Y} MSE(Ŷ − Y)
 = min_{W_Y} (1/n) ‖W_Y ΘX + C − Y‖²_F
 = min_{W_Y} (1/n) ‖W_Y ΘX̃ − Ỹ‖²_F
 = min_{W_Y} (1/n) ‖X̃^T Θ^T W_Y^T − Ỹ^T‖²_F
 = min_{W_Y} (1/n) ‖M W_Y^T − P_M Ỹ^T‖²_F + (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖M M^† P_M Ỹ^T − P_M Ỹ^T‖²_F + (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖P_M Ỹ^T‖²_F,

where in the third step we used (3.13) and M = X̃^T Θ^T, and the fifth step is due to the orthogonal decomposition of Ỹ^T into P_M Ỹ^T + P_M⊥ Ỹ^T. Using the same approach, we get

J_S = (1/n) ‖S̃^T‖²_F − (1/n) ‖P_M S̃^T‖²_F. (3.14)

Now, assume that the columns of L_X form an orthonormal basis for the column space of X̃^T. Therefore, for any M, there exists a G such that L_X G = M. In general, there is no bijection between Θ and G in the equality X̃^T Θ^T = L_X G. However, there is a bijection between G and Θ when we restrict attention to Θ's for which R(Θ^T) ⊆ N(X̃^T)^⊥. This restricted bijection is sufficient to consider, since for any Θ^T with R(Θ^T) ⊆ N(X̃^T), we have M = 0. Once G is determined, Θ^T can be obtained as

Θ^T = (X̃^T)^† L_X G + Θ_0, R(Θ_0) ⊆ N(X̃^T).

However, since

‖Θ‖²_F = ‖Θ^T‖²_F = ‖(X̃^T)^† L_X G‖²_F + ‖Θ_0‖²_F,

choosing Θ_0 = 0 results in the minimum ‖Θ‖_F, which is favorable in terms of robustness to noise. By choosing Θ_0 = 0, determining the encoder Θ is equivalent to determining G. Similar to (3.6), we have P_M = L_X P_G L_X^T. If we assume that the rank of P_G is i, J_λ(G) in (3.10) can be expressed as

J_λ(G) = λ ‖L_X G G^T L_X^T S̃^T‖²_F − (1 − λ) ‖L_X G G^T L_X^T Ỹ^T‖²_F,

where G G^T = P_G for some orthonormal matrix G ∈ R^{d_X×i}. This resembles the optimization problem in (3.9), and therefore it has the same solution as Theorem 3.3 with the modified B given by

B = L_X^T (λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ) L_X. (3.15)

Once G is determined, Θ can be obtained as G^T L_X^T X̃^†.

3.3.1 Non-Linear Extension Through Kernelization

We extend the "linear" version of the ARL problem studied thus far to a "non-linear" version through kernelization.
We model the encoder in the ARL problem as a linear function over a non-linear mapping of the inputs, as illustrated in Figure 3.5. Let the data matrix X be mapped non-linearly by a possibly unknown and infinite-dimensional function φ_X(·), and let the corresponding reproducing kernel function be k_X(·, ·). From (2.6), it follows that the representation Z can be expressed as

Z = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. (3.16)

Figure 3.5: Kernel-ARL: Illustration of kernel adversarial representation learning for learning a fair representation. An encoder f, in the form of a linear mapping on top of the kernelized input, produces a new representation Z = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. A target predictor g_Y and an adversary g_S, in the form of linear regressors, operate on the representation Z.

The scalarization formulation of this kernel-ARL setup and its solution share the same form as those of the linear case (3.8). The objective function remains non-convex and non-differentiable, while the matrix B now depends on the kernel matrix K_X instead of the covariance matrix C_X (see Appendix A.5 for details):

B = L_X^T (λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ) L_X, (3.17)

where the columns of L_X are the orthonormal basis for H_{K_X}. Once G is obtained through the eigendecomposition of B, we can obtain the optimal encoder as Θ = G^T L_X^T K_X^†. This non-linear extension in the form of kernelization serves to study the ARL problem under a setting where the encoder possesses greater representational capacity while still being able to obtain the global optima and bounds on the objectives of the target predictor and the adversary.

3.4 Analytical Bounds

In this section, we introduce bounds on the utility and invariance of the representation learned by SARL. We define four bounds: α_min, α_max, γ_min, and γ_max.
γ_min: A lower bound on the minimum achievable target loss, or equivalently an upper bound on the best achievable target performance. This bound can be expressed as the minimum target MSE across all possible encoders Θ and is attained at λ = 0:

γ_min = min_Θ J_Y(Θ)

α_max: An upper bound on the maximum achievable adversary loss, or equivalently a lower bound on the minimum leakage of the sensitive attribute. This bound can be expressed as the maximum adversary MSE across all possible encoders Θ and is attained at λ = 1:

α_max = max_Θ J_S(Θ)

γ_max: An upper bound on the maximum achievable target loss, or equivalently a lower bound on the minimum achievable target performance. This bound corresponds to the scenario where the encoder is constrained to maximally hinder the adversary. In all other cases, one can obtain higher target performance by choosing a better encoder. This bound is attained in the limit λ → 1 and can be expressed as

γ_max = min_{arg max_Θ J_S(Θ)} J_Y(Θ)

α_min: A lower bound on the minimum achievable adversary loss, or equivalently an upper bound on the maximum leakage of the sensitive attribute. The absolute lower bound is obtained in the scenario where the encoder is neither constrained to aid the target nor to hinder the adversary, i.e.,

α*_min = min_Θ J_S(Θ)

However, this is an unrealistic scenario, since in the ARL problem, by definition, the encoder is explicitly designed to aid the target. Therefore, a more realistic lower bound can be defined under the constraint that the encoder maximally aids the target, i.e.,

ᾱ_min = min_{arg min_Θ J_Y(Θ)} J_S(Θ)

However, even this bound is not realistic, since, among all the encoders that aid the target, one can always choose the encoder that minimizes the leakage of the sensitive attribute. The bound corresponding to such an encoder can be expressed as

α_min = max_{arg min_Θ J_Y(Θ)} J_S(Θ)

This bound is attained in the limit λ → 0.
It is easy to see that these bounds are ordinally related as

α*_min ≤ ᾱ_min ≤ α_min.

To summarize, in each of these cases, there exists an encoder that achieves the respective bound. Therefore, given a choice, the encoder that corresponds to α_min is the most desirable. The following theorem gives closed-form expressions for these bounds as a function of the data.

Theorem 3.4. Let the columns of L_X be the orthonormal basis for H_{K_X}. Further, assume that the columns of V_S are the singular vectors corresponding to the zero singular values of S̃ L_X and the columns of V_Y are the singular vectors corresponding to the non-zero singular values of Ỹ L_X. Then, we have

γ_min := min_Θ J_Y(Θ) = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖Ỹ L_X‖²_F

γ_max := min_{arg max_Θ J_S(Θ)} J_Y(Θ) = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖Ỹ L_X V_S‖²_F

α_min := max_{arg min_Θ J_Y(Θ)} J_S(Θ) = (1/n) ‖S̃^T‖²_F − (1/n) ‖S̃ L_X V_Y‖²_F

α_max := max_Θ J_S(Θ) = (1/n) ‖S̃^T‖²_F

Proof. See Appendix A.6.

Under the special case of one-dimensional data, i.e., when X, Y, and S are scalars, the above bounds can be related to the correlation coefficients (i.e., normalized correlations) of the variables involved. Specifically, the normalized bounds γ_min and α_min can be expressed as

γ_min / σ_Y² = 1 − ρ²(X, Y), α_min / σ_S² = 1 − ρ²(X, S),

where ρ(·, ·) denotes the correlation coefficient between two RVs and σ_Y² := Var[Y] (σ_S² is similarly defined). Similarly, the upper bounds γ_max and α_max can be expressed in terms of the variance of the label space as

γ_max / σ_Y² = α_max / σ_S² = 1.

Therefore, in the one-dimensional setting, the achievable bounds are related to the underlying alignment between the subspace spanned by the data X and the respective subspaces spanned by the labels S and Y.

3.5 Computational Complexity

In the case of linear-SARL, calculating the covariance matrices C_X, C_YX, and C_SX requires O(d_X² n), O(d_Y d_X n), and O(d_S d_X n) multiplications, respectively. Next, the complexity of the Cholesky factorization C_X = Q_X^T Q_X and of calculating its inverse Q_X^{-1} is O(d_X³) each.
Finally, solving the optimization problem has a complexity of O(d_X³) to eigendecompose the d_X × d_X matrix B. In the case of kernel-SARL, the eigendecomposition of B requires O(n³) operations. However, for scalability, i.e., large n (e.g., CIFAR-100), the Nyström method with data sampling [92] can be adopted. To summarize, the complexity of the linear and kernel formulations is O(d_X³) and O(n³), respectively.

3.6 Numerical Experiments

We evaluate the efficacy of the proposed Spectral-ARL (SARL) algorithm in finding the global optima and compare it with other ARL baselines that are based on the standard SGDA optimization. In all experiments, we refer to our solution for "linear" ARL as Linear-SARL and to the solution for the "kernel" version of the encoder with linear classifiers for the predictor and adversary as Kernel-SARL.

Figure 3.6: (a) Samples from a mixture of four Gaussians. Each sample has two attributes, shape and color. (b) The trade-off between target performance and leakage of a sensitive attribute by an adversary.

3.6.1 Mixture of Four Gaussians

We first consider a simple example in order to visualize and compare the learned embeddings from different ARL solutions. We consider a three-dimensional problem where each data sample consists of two attributes, color and shape. Specifically, the input data X is generated from a mixture of four different Gaussian distributions corresponding to the different possible combinations of the attributes, i.e., {•, •, ×, ×}, with means at μ_1 = (1, 1, 0), μ_2 = (2, 2, 0), μ_3 = (2, 2.5, 0), μ_4 = (2.5, 3, 0) and identical covariance matrices C_X = diag(0.3², 0.3², 0.3²). The shape attribute is the target, while color is the sensitive attribute, as illustrated in Figure 3.6 (a).
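On synthetic data of this kind, the empirical recipe of Section 3.3 reduces to a centering step, one QR factorization, and one symmetric eigendecomposition. The following sketch is illustrative, not authoritative: the data-generation details (label assignment, sample count) and the helper name sarl_encoder are ours, while the steps follow (3.15) and the encoder recovery Θ^T = (X̃^T)^† L_X G:

```python
import numpy as np

rng = np.random.default_rng(4)
n, dX = 400, 3

# Synthetic data in the spirit of the four-Gaussian example: a binary
# target label y (shape) and a binary sensitive label s (color).
s = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
means = np.array([[1.0, 1.0, 0.0], [2.0, 2.0, 0.0],
                  [2.0, 2.5, 0.0], [2.5, 3.0, 0.0]])
X = (means[2 * s + y] + 0.3 * rng.standard_normal((n, dX))).T  # d_X x n
Y = np.eye(2)[y].T                     # one-hot regression targets, d_Y x n
S = np.eye(2)[s].T                     # one-hot sensitive labels, d_S x n

def sarl_encoder(X, Y, S, lam):
    """Empirical linear-SARL sketch (Section 3.3): closed-form encoder
    for a given trade-off lam in [0, 1)."""
    m = X.shape[1]
    H = np.eye(m) - np.ones((m, m)) / m
    Xt, Yt, St = X @ H, Y @ H, S @ H                 # centered data matrices
    L_X, _ = np.linalg.qr(Xt.T)                      # basis for col space of X~^T
    B = L_X.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_X   # cf. (3.15)
    B = (B + B.T) / 2                                # symmetrize numerically
    w, V = np.linalg.eigh(B)
    G = V[:, w < -1e-10]                             # negative-eigenvalue eigenvectors
    Theta = (np.linalg.pinv(Xt.T) @ L_X @ G).T       # encoder, r x d_X
    return Theta

Theta = sarl_encoder(X, Y, S, lam=0.5)
Z = Theta @ X                                        # learned representation
assert Theta.shape[1] == dX
```

Note that the embedding dimensionality r = Theta.shape[0] falls out of the eigendecomposition rather than being chosen by hand, mirroring Theorem 3.3.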
The goal of the ARL problem is to learn an encoder that projects the data such that it remains separable with respect to the shape attribute and non-separable with respect to the color attribute. We sample 4000 points to learn linear and non-linear (Gaussian RBF kernel) encoders across λ ∈ [0, 1]. To train the encoder, the one-hot encodings of the target and sensitive labels are treated as the regression targets. Then, we freeze the encoder and train logistic regressors for the adversary and the target task for each λ. We evaluate their classification performance on a separate set of 1000 samples. The resulting trade-off front between target and adversary performance is shown in Figure 3.6 (b).

Figure 3.7: Gaussian Mixture: The optimal dimensionality of the embedding Z is 1. Visualization of the embedding histograms w.r.t. each attribute for different relative emphasis, λ, on the target (shape) and sensitive (color) attributes. The top row is color and the bottom row is shape. The first three columns show results for a linear encoder at λ = 0, 0.5, and 1. At λ = 0, the weight on the adversary is 0, so the color is still separable. As the value of λ increases, we observe that the colors are less and less separable. The last column shows results for a kernel encoder at λ = 0.5. We observe that the target attribute is quite separable while the sensitive attribute is entangled.

We make the following observations: (1) For λ = 1, all methods achieve an accuracy of 50% for the adversary, which indicates complete removal of the features corresponding to the sensitive attribute via our encoding. (2) At small values of λ, the objective of Linear-ARL is close to being convex; hence the similarity in the trade-off fronts of Linear-SARL and SGDA-ARL in that region.
However, everywhere else, due to the iterative nature of SGDA, it is unable to find the global solution and achieve the same trade-off as Linear-SARL. (3) The non-linear encoder in the Kernel-SARL solution significantly outperforms both Linear-SARL and SGDA-ARL. The non-linear nature of the encoder enables it to strongly entangle the color attribute (50% adversary accuracy) while simultaneously achieving a higher target accuracy than the linear encoder. Figure 3.7 visualizes the learned embedding space Z for different trade-offs between the target and adversary objectives.

Figure 3.8: Gaussian Mixture: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set. The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively.

Figure 3.8 shows the MSE of the adversary as we vary the relative trade-off parameter λ between the target and adversary objectives. The plot illustrates (1) the lower and upper bounds, α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1]. This is, however, not the case on the test set, as the bounds are computed from empirical moments as opposed to the population covariance matrices.

3.6.2 Fair Classification

We consider the task of learning representations that are invariant to a sensitive attribute on two datasets, Adult and German, from the UCI ML repository [93].
For comparison, apart from the raw features X, we consider several baselines that use NNs and are trained through SGDA: LFR [36], VAE [94], VFAE [37], ML-ARL [24], and MaxEnt-ARL [1]. The Adult dataset contains 14 attributes. There are 30,163 and 15,060 instances in the training and test sets, respectively. The target task is the binary classification of annual income, i.e., more or less than 50K, and the sensitive attribute is gender. Similarly, the German dataset contains 1000 instances of individuals with 20 different attributes. The target is to classify the creditworthiness of individuals as good or bad, with the sensitive attribute being age.

Table 3.1: Fair Classification Performance (in %)

                          Adult Dataset               German Dataset
  Method             Target    Sensitive   ∆*     Target    Sensitive   ∆*
                    (income)   (gender)          (credit)    (age)
  Raw Data            85.0       85.0     17.6     80.0       87.0      6.0
  LFR [36]            82.3       67.0      0.4     72.3       80.5      0.5
  VAE [94]            81.9       66.0      1.4     72.5       79.5      1.5
  VFAE [37]           81.3       67.0      0.4     72.7       79.7      1.3
  ML-ARL [24]         84.4       67.7      0.3     74.4       80.2      0.8
  MaxEnt-ARL [1]      84.6       65.5      1.9     72.5       80.0      1.0
  Linear-SARL         84.1       67.4      0.0     76.3       80.9      0.1
  Kernel-SARL         84.1       67.4      0.0     76.3       80.9      0.1

  ∆*: absolute difference between adversary accuracy and random guess.

We learn encoders on the training set, after which, following the baselines, we freeze the encoder and train the target (logistic regression) and adversary (2-layer network with 64 units) classifiers on the training set. Table 3.1 shows the performance of the target and adversary on both datasets. Both Linear-SARL and Kernel-SARL outperform all NN-based baselines. For either of these tasks, Kernel-SARL does not afford any additional benefit over Linear-SARL. For the Adult dataset, the linear encoder maps the 14 input features to just one dimension. The weights assigned to each feature are shown in Figure 3.9. Notice that the encoder assigns almost zero weight to the gender feature in order to be fair with respect to the gender attribute.
3.6.3 Illumination-Invariant Face Classification

This task pertains to face classification under different illumination conditions on the Extended Yale B dataset [95]. It comprises face images of 38 people under five different light source directions, namely, upper right, lower right, lower left, upper left, and front. The target task is to establish the identity of the person in the image, with the direction of the light being the sensitive attribute. Since the direction of lighting is independent of identity, the ideal ARL solution should obtain a representation Z that is devoid of any sensitive information.

Figure 3.9: Adult Dataset: Magnitude of the learned encoder weights Θ for each semantic input feature (age, work class, final weight, education, education num, marital status, occupation, relationship, race, gender, capital gain, capital loss, hours/week, country).

We first followed the experimental setup of Xie et al. [24] in terms of the train/test split strategy, i.e., 190 samples (5 from each class) for training and 1096 images for testing. Our global solution was able to completely remove illumination from the embedding, resulting in an adversary accuracy of 20%, i.e., random chance. To investigate further, we consider different variations of this problem, flipping the target and sensitive attributes and exchanging the training and test sets. The complete set of results, including NN-based baselines, is reported in Table 3.2 ([EX] corresponds to exchanging the training and testing sets). In all these cases, our solution was able to completely remove the sensitive features, resulting in adversary performance that is no better than random chance. Simultaneously, the embedding is also competitive with the baselines on the target task.
Table 3.2: Extended Yale B Performance (in %)

  Method               Adversary       Target     Adversary      Target
                     (illumination)  (identity)  (identity)  (illumination)
  Raw Data                 96            78           -             -
  VFAE [37]                57            85           -             -
  ML-ARL [24]              57            89           -             -
  MaxEnt-ARL [1]           40            89           -             -
  Linear-SARL              21            81           3            94
  Linear-SARL [EX]         20            86           3            97
  Kernel-SARL              20            86           3            96
  Kernel-SARL [EX]         20            88           3            96

3.6.4 CIFAR-100

The CIFAR-100 dataset [96] consists of 50,000 images from 100 classes that are further grouped into 20 superclasses. Each image is therefore associated with two attributes, a "fine" class label and a "coarse" superclass label. We consider a setup where the "coarse" and "fine" labels are the target and sensitive attributes, respectively. For Linear-SARL, Kernel-SARL (degree-five polynomial kernel), and SGDA, we use 64-dimensional features extracted from a pre-trained ResNet-110 model as the input to the encoder instead of raw images. From these features, the encoder is tasked with aiding the target predictor and hindering the adversary. This setup serves as an example of how invariance can be "imparted" to an existing biased pre-trained representation. We also consider two NN baselines, ML-ARL [24] and MaxEnt-ARL [1]. Unlike our scenario, where the pre-trained layers are not adapted, the baselines optimize the entire encoder for the ARL task. For evaluation, once the encoder is learned and frozen, we train a discriminator and adversary as 2-layer networks with 64 neurons each. Therefore, although our approach uses a linear regressor as the adversary at training, we evaluate against stronger adversaries at test time. In contrast, the baselines train and evaluate against adversaries of equal capacity. Figure 3.10 shows the trade-off in accuracy between the target predictor and adversary.
We observe that (1) Kernel-SARL significantly outperforms Linear-SARL. Since the former implicitly maps the data into a higher-dimensional space, the sensitive features are potentially disentangled sufficiently for the linear encoder in that space to discard such information. Therefore, even for large values of λ, Kernel-SARL is able to simultaneously achieve high target accuracy while keeping the adversary performance low. (2) Despite being handicapped by the fact that Kernel-SARL is evaluated against stronger adversaries than it is trained against, its performance is comparable to that of the NN baselines. In fact, it outperforms both ML-ARL and MaxEnt-ARL with respect to the target task. (3) Despite repeated attempts with different hyper-parameters and choices of optimizers, SGDA was highly unstable across most datasets; it often got stuck in a local optimum and failed to find good solutions.

Figure 3.10: CIFAR-100: Trade-off between target performance and leakage of the sensitive attribute to the adversary.

Figure 3.11 shows the MSE of the adversary as we vary the relative trade-off λ between the target and adversary objectives. The plot illustrates (1) the lower and upper bounds, α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1].

Figure 3.11: CIFAR-100: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set.
The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively. This is, however, not the case on the test set, as the bounds are computed from empirical moments as opposed to the true covariance matrices.

Figure 3.12 plots the optimal embedding dimensionality provided by SARL as a function of the trade-off parameter λ. At small values of λ, the objective favors the target task, i.e., 20-class prediction. Thus, SARL does indeed determine the optimal dimensionality of 19 for a 20-class linear target regressor. However, at large values of λ, the objective only seeks to hinder the sensitive task, i.e., 100-class prediction. In this case, the ideal embedding dimensionality from the perspective of the linear adversary regressor is at least 99. The SARL-ascertained dimensionality of one is, thus, optimal for maximally mitigating the leakage of the sensitive attribute from the embedding. However, unsurprisingly, the target task also suffers significantly.

Figure 3.12: CIFAR-100: Optimal embedding dimensionality learned by SARL. At small values of λ, the objective favors the target task, which predicts 20 classes; thus, an embedding dimensionality of 19 is optimal for a linear target regressor. At large values of λ, the objective only seeks to hinder the adversary; thus, SARL determines the optimal dimensionality of the embedding as one.

3.7 Summary

We studied the "linear" form of adversarial representation learning (ARL), where all the entities are linear functions. We showed that the optimization problem, even for this simplified version, is both non-convex and non-differentiable. Using tools from spectral learning, we obtained a closed-form expression for the global optima and derived analytical bounds on the achievable utility and invariance.
We also extended these results to non-linear parameterizations through kernelization. Numerical experiments on multiple datasets indicated that the globally optimal solution of the "kernel" form of ARL is able to obtain a trade-off between utility and invariance that is comparable to that of locally optimal solutions of NN-based ARL. At the same time, unlike NN-based solutions, the proposed method can (1) analytically determine the achievable utility and invariance bounds and (2) provide explicit control over the trade-off between utility and invariance.

Admittedly, the results presented in this chapter do not extend directly to NN-based formulations of ARL. However, we believe they shed light on the nature of the ARL optimization problem and aid our understanding of it. They help delineate the roles of the optimization algorithm and the choice of embedding function, highlighting the trade-off between the expressivity of the functions and our ability to obtain the global optima of the adversarial game. We consider our contribution a first step towards controlling the non-convexity that naturally appears in game-theoretic representation learning.

Chapter 4
Adversarial Representation Learning With Closed-Form Solvers

4.1 Introduction

In this chapter, we revisit the ARL problem from the NN-based optimization point of view. The vanilla algorithm for learning the parameters of the encoder, target, and adversary networks is SGDA [24, 1], where the players take gradient steps simultaneously; see Figure 4.1 for an illustration. However, SGDA is not an optimal strategy for ARL and is known to suffer from several drawbacks. Firstly, SGDA has undesirable convergence properties: it can fail to converge to a local minimum, can converge to fixed points that are not local minimax, and is very unstable and slow in practice [67, 68].
Secondly, SGDA exhibits strong rotation around fixed points, which requires using very small learning rates [64, 66] to converge. Numerous solutions [64, 65, 69] have been proposed recently to address the aforementioned computational challenges. These approaches, however, seek to obtain solutions to the minimax optimization problem in the general case, where each player is modeled as a complex neural network.

We take a different perspective and propose an alternative optimization algorithm for ARL. Our key insight is to replace the shallow NNs with other analytically tractable models of similar capacity. We propose to adopt simple learning algorithms that admit closed-form solutions, such as linear or kernel ridge regressors, for the target and adversary, while modeling the encoder as a DNN.

Figure 4.1: ARL-SGDA: Illustration of training adversarial representation learning through stochastic gradient descent ascent. (i) First, the target predictor parameters Θ_Y are updated while the encoder and adversary parameters are frozen. (ii) Then, the adversary parameters Θ_S are updated while the encoder and target parameters are frozen. (iii) Finally, the encoder parameters Θ are updated while the target and adversary parameters are frozen. SGDA does not provide any convergence guarantees.

Crucially, such models are particularly suitable for ARL and afford numerous advantages: (1) the closed-form solution allows the learning problems to be optimized globally and efficiently; (2) it yields an analytical upper bound on the optimal dimensionality of the embedding; (3) its simplicity and differentiability allow us to backpropagate through the closed-form solution; and (4) in practice, it resolves the notorious rotational behavior of iterative minimax gradient dynamics, resulting in a simple optimization that is empirically stable, reliable, converges faster to a local optimum, and ultimately yields a more effective encoder.
We demonstrate the practical effectiveness of our approach, dubbed OptNet-ARL, through numerical experiments on an illustrative toy example, fair classification on the UCI Adult and German datasets, and mitigating information leakage on the CelebA dataset. We consider two scenarios where the target and sensitive attributes are (a) dependent and (b) independent. Our results indicate that, in comparison to existing ARL solutions, OptNet-ARL is more stable and converges faster, while also outperforming them in terms of accuracy, especially in the latter scenario.

A number of recent approaches have integrated differentiable solvers, both iterative and closed-form, within end-to-end learning systems. Structured layers for segmentation and higher-order pooling were introduced by [97]. Similarly, [98] proposed an asymmetric architecture that incorporates a correlation filter as a differentiable layer. Differentiable optimization as a layer in NNs was introduced by [99, 100]. More recently, differentiable solvers have also been adopted for meta-learning [101, 102]. The primary motivation for all the aforementioned approaches is to endow DNNs with differentiable optimization and ultimately achieve faster convergence of the end-to-end systems. In contrast, our inspiration for using differentiable closed-form solvers is to control the non-convexity of the optimization in ARL in terms of stability, reliability, and effectiveness.

4.2 Problem Setting

Recall the ARL optimization problem in (3.1) and denote the global minima of the adversary and target estimator losses as

J_Y(Θ) := min_{Θ_Y} E_{X,Y}[L_Y(g_Y(f(X; Θ); Θ_Y), Y)]
J_S(Θ) := min_{Θ_S} E_{X,S}[L_S(g_S(f(X; Θ); Θ_S), S)].        (4.1)

Similar to Chapter 3, instead of solving the constrained optimization problem in (3.1), we solve its scalarized version:

min_Θ { (1 − λ) J_Y(Θ) − λ J_S(Θ) },   0 ≤ λ ≤ 1,              (4.2)

where λ is the trade-off parameter between utility and the leakage of the sensitive information.
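The notion of one operating point dominating another, formalized in Definition 4.1 below, is straightforward to state in code. A minimal sketch, where the first coordinate is the adversary loss J_S (higher means more invariance) and the second is the target loss J_Y (lower means more utility):

```python
def dominates(b, a):
    # (b1, b2) dominates (a1, a2) iff b1 > a1 and b2 < a2 simultaneously
    # (Definition 4.1): strictly higher adversary loss and strictly lower
    # target loss, i.e., strictly better on both objectives at once.
    return b[0] > a[0] and b[1] < a[1]

assert dominates((0.9, 0.2), (0.5, 0.4))        # better on both objectives
assert not dominates((0.9, 0.4), (0.5, 0.4))    # tie in target loss: no domination
```

A tie in either coordinate is not enough; domination requires strict improvement on both objectives, which is exactly what Lemma 4.2 rules out for sub-optimal solvers.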
4.2.1 Motivating Exact Solvers

Most state-of-the-art ARL algorithms (e.g., SGDA) cannot solve the optimization problems in (4.1) optimally. For any given Θ, denote the loss functions of non-optimal target and adversary predictors by J_Y^approx(Θ) and J_S^approx(Θ), respectively. Clearly, for any given Θ, it holds that J_Y^approx(Θ) ≥ J_Y(Θ) and J_S^approx(Θ) ≥ J_S(Θ). Note that the optimization problem arising from non-optimal adversary and target predictors is

min_Θ { (1 − λ) J_Y^approx(Θ) − λ J_S^approx(Θ) },   0 ≤ λ ≤ 1.   (4.3)

Intuitively, solution(s) of (4.3) do not outperform those of (4.2). We now formulate this intuition more concretely.

Definition 4.1. Let (a_1, a_2) and (b_1, b_2) be two arbitrary points in R². We say (b_1, b_2) dominates (a_1, a_2) iff b_1 > a_1 and b_2 < a_2 hold simultaneously.

The following lemma states that no solution obtained by sub-optimal adversary and target predictors can dominate that of exact solvers.

Lemma 4.2. For any λ_1, λ_2 ∈ [0, 1], consider the following optimization problems:

Θ_exact = arg min_Θ { (1 − λ_1) J_Y(Θ) − λ_1 J_S(Θ) }             (4.4)

and

Θ_approx = arg min_Θ { (1 − λ_2) J_Y^approx(Θ) − λ_2 J_S^approx(Θ) }.

Then, the adversary-target objective trade-off generated by (J_S(Θ_exact), J_Y(Θ_exact)) cannot be dominated by the trade-off generated by (J_S(Θ_approx), J_Y(Θ_approx)).

Proof. It is enough to show that (i) if J_S(Θ_approx) > J_S(Θ_exact) then J_Y(Θ_approx) ≥ J_Y(Θ_exact), and (ii) if J_Y(Θ_approx) < J_Y(Θ_exact) then J_S(Θ_approx) ≤ J_S(Θ_exact).

The key point is to observe from (4.4) that, regardless of λ_2, J_Y^approx, and J_S^approx, we have

(1 − λ_1) J_Y(Θ_exact) − λ_1 J_S(Θ_exact) ≤ (1 − λ_1) J_Y(Θ_approx) − λ_1 J_S(Θ_approx).

Now, consider three possible cases for λ_1:

a) λ_1 = 0: In this case we have J_Y(Θ_exact) ≤ J_Y(Θ_approx), and therefore, regardless of J_S(Θ_exact) and J_S(Θ_approx), (ii) cannot happen and (i) holds under its assumption.
b) λ_1 = 1: In this case we have J_S(Θ_exact) ≥ J_S(Θ_approx), and therefore, regardless of J_Y(Θ_exact) and J_Y(Θ_approx), (i) cannot happen and (ii) holds under its assumption.

c) 0 < λ_1 < 1:

(i) If J_S(Θ_approx) > J_S(Θ_exact), then

0 < λ_1 (J_S(Θ_approx) − J_S(Θ_exact)) ≤ (1 − λ_1)(J_Y(Θ_approx) − J_Y(Θ_exact)).

This implies that J_Y(Θ_approx) ≥ J_Y(Θ_exact).

(ii) If J_Y(Θ_approx) < J_Y(Θ_exact), then

0 < (1 − λ_1)(J_Y(Θ_exact) − J_Y(Θ_approx)) ≤ λ_1 (J_S(Θ_exact) − J_S(Θ_approx)).

This implies that J_S(Θ_approx) < J_S(Θ_exact). ∎

Figure 4.2: ARL with kernelized ridge regressors for the adversary and target predictors. This setting turns the typical SGDA optimization of ARL into simple SGD optimization.

4.3 Exact Adversary and Target Predictor Solvers

Existing instances of ARL adopt NNs to represent f, g_Y, and g_S and learn their respective parameters {Θ, Θ_Y, Θ_S} through SGDA. Consequently, the target and adversary objectives in (4.1) are not solved to optimality, thereby resulting in a sub-optimal encoder.

4.3.1 Closed-Form Adversary and Target Predictors

The machine learning literature offers a wealth of methods with exact solutions appropriate for modeling the adversary and target predictors. In this section, we argue for and adopt simple, fast, and differentiable methods such as kernel ridge regressors, as shown in Figure 4.2. Such modeling allows us to obtain the globally optimal estimators for any given encoder f(·; Θ). Moreover, kernelized ridge regressors can be stronger than the shallow NNs that are used in many ARL-based solutions (e.g., [24, 103, 2, 1]). Although it is not the focus of this dissertation, it is worth noting that even DNNs in the infinite-width limit reduce to linear models with a kernel called the neural tangent kernel [104], and as such can be adopted to increase the capacity of our regressors.
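To make the closed-form solve concrete, the sketch below (NumPy, synthetic data, Gaussian RBF kernel with σ = 1; the dimensions and data are arbitrary illustrative choices) computes the ridge minimizer of (1/n)‖Θ_Y K_Y − Y‖²_F + γ_Y ‖Θ_Y‖²_F analytically and checks it against the projection form derived in Lemma 4.3 below:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, d_y, gamma = 50, 3, 2, 1e-2

Z = rng.normal(size=(r, n))    # columns z_i: a stand-in for encoder outputs
Y = rng.normal(size=(d_y, n))  # regression targets (one-hot labels work the same way)

# Gaussian RBF kernel matrix K[i, j] = exp(-||z_i - z_j||^2 / 2), sigma = 1.
sq_dists = ((Z[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq_dists / 2.0)

# Closed-form minimizer of (1/n)||Theta K - Y||_F^2 + gamma ||Theta||_F^2:
# setting the gradient to zero gives Theta = Y K (K^2 + n gamma I)^{-1}.
Theta = Y @ K @ np.linalg.inv(K @ K + n * gamma * np.eye(n))
J_direct = np.linalg.norm(Theta @ K - Y) ** 2 / n + gamma * np.linalg.norm(Theta) ** 2

# Projection form of the same minimum (cf. Lemma 4.3), with
# M = [K; sqrt(n gamma) I] stacked vertically and P_M = M (M^T M)^{-1} M^T.
M = np.vstack([K, np.sqrt(n * gamma) * np.eye(n)])
P_M = M @ np.linalg.inv(M.T @ M) @ M.T
U = np.vstack([Y.T, np.zeros((n, d_y))])
J_proj = (np.linalg.norm(Y) ** 2 - np.linalg.norm(P_M @ U) ** 2) / n
```

The two quantities agree to machine precision, which is the point of the projection form: the minimum regularized MSE is available without ever running an inner optimization loop.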
Consider two RKHSs of functions, H_S and H_Y, for the adversary and target networks, respectively. Let a corresponding pair of feature maps be φ_S(·) ∈ R^{r_S} and φ_Y(·) ∈ R^{r_Y}, where r_S and r_Y are the dimensionalities of the resulting features and can potentially approach infinity. The respective kernel functions of H_S and H_Y can be represented as k_S(z, z′) = ⟨φ_S(z), φ_S(z′)⟩_{H_S} and k_Y(z, z′) = ⟨φ_Y(z), φ_Y(z′)⟩_{H_Y}. Under this setting, we can relate the target and sensitive attributes to any given embedding Z as

Ŷ = Θ_Y [k_Y(z_1, Z), · · · , k_Y(z_n, Z)]^T
Ŝ = Θ_S [k_S(z_1, Z), · · · , k_S(z_n, Z)]^T,                    (4.5)

where Θ_Y ∈ R^{d_Y × n}, Θ_S ∈ R^{d_S × n}, and n is the number of data samples. Let the entire embedding of the input data be denoted as Z := [z_1, · · · , z_n] ∈ R^{r × n}, where z_i = f(x_i) for i = 1, · · · , n. Consequently, it follows that

Ŷ := [ŷ_1, · · · , ŷ_n] = Θ_Y K_Y
Ŝ := [ŝ_1, · · · , ŝ_n] = Θ_S K_S.                               (4.6)

In a typical ARL setting, once an encoder is learned (i.e., for a given fixed embedding Z), we evaluate against the best possible adversary and target predictors. In the following lemma, we obtain the minimum MSE of the kernelized adversary and target predictors for any given embedding Z.

Lemma 4.3. Let J_Y(Z) and J_S(Z) be the regularized minimum MSEs of the target and adversary:

J_Y(Z) = min_{Θ_Y} E[‖Y − Ŷ‖²] + γ_Y ‖Θ_Y‖²_F,
J_S(Z) = min_{Θ_S} E[‖S − Ŝ‖²] + γ_S ‖Θ_S‖²_F,

where γ_Y and γ_S are the regularization parameters of the target and adversary regressors, respectively. Then, for any given embedding matrix Z, the minimum MSEs of the kernelized target and adversary can be obtained as

J_Y(Z) = (1/n) ‖Y‖²_F − (1/n) ‖P_{M_Y} [Y^T; 0_{n×d_Y}]‖²_F,
J_S(Z) = (1/n) ‖S‖²_F − (1/n) ‖P_{M_S} [S^T; 0_{n×d_S}]‖²_F,     (4.7)

where [A; B] denotes vertical stacking,

M_Y := [K_Y; √(nγ_Y) I_n],   M_S := [K_S; √(nγ_S) I_n]

are both full-column-rank matrices, and the orthogonal projection matrix of any full-column-rank matrix M can be obtained as P_M = M (M^T M)^{−1} M^T.

Proof.
Using the empirical mean, we have

J_Y(Z) = min_{Θ_Y} (1/n) ‖Θ_Y K_Y − Y‖²_F + γ_Y ‖Θ_Y‖²_F
       = min_{Θ_Y} (1/n) ‖ M_Y Θ_Y^T − [Y^T; 0_{n×d_Y}] ‖²_F
       = min_{Θ_Y} (1/n) ‖ M_Y Θ_Y^T − P_{M_Y} [Y^T; 0_{n×d_Y}] ‖²_F + (1/n) ‖ P_{M_Y^⊥} [Y^T; 0_{n×d_Y}] ‖²_F
       = (1/n) ‖ P_{M_Y^⊥} [Y^T; 0_{n×d_Y}] ‖²_F
       = (1/n) ‖Y‖²_F − (1/n) ‖ P_{M_Y} [Y^T; 0_{n×d_Y}] ‖²_F,    (4.8)

where we used the orthogonal decomposition w.r.t. M_Y in the third and last steps, and a possible minimizer in the fourth step is Θ_Y^T = M_Y^† [Y^T; 0_{n×d_Y}]. Using the same approach, we get

J_S(Z) = (1/n) ‖S‖²_F − (1/n) ‖ P_{M_S} [S^T; 0_{n×d_S}] ‖²_F,

where M_S := [K_S; √(nγ_S) I_n]. ∎

It is quite straightforward to generalize this method to the case of multiple target and adversary predictors through equation (4.3). In that case, we would have multiple λs to trade off between utility and the leakage of sensitive information.

4.3.2 Optimal Embedding Dimensionality

The ability to effectively optimize the parameters of the encoder critically depends on the dimensionality of the embedding as well. Higher-dimensional embeddings can inherently absorb unnecessary extraneous information in the data. Existing ARL applications, where the target and adversary are non-linear NNs, select the dimensionality of the embedding on an ad hoc basis. Adopting closed-form solvers for the target and adversary enables us to analytically determine an upper bound on the optimal dimensionality of the embedding for OptNet-ARL. To obtain the upper bound, we rely on the observation that a non-linear target predictor and adversary, by virtue of greater capacity, can learn non-linear decision boundaries.
As such, in the context of ARL, the optimal dimensionality required by non-linear models is no larger than the optimal dimensionality of a linear target predictor and adversary. We therefore analytically determine the optimal dimensionality of the embedding in the following theorem.

Theorem 4.4. Let Z in Figure 4.2 be disconnected from the encoder and be a free vector in R^r. Further, assume that both the adversary and the target predictor are linear regressors, and γ_S = γ_Y = 0. Then, for any 0 ≤ λ ≤ 1, the optimal dimensionality r of the embedding vector is the number of negative eigenvalues of

B = λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ.                                   (4.9)

Proof. Recall that for a linear-regressor adversary and target predictor, we have

Ŷ = Θ_Y Z + b_Y,   Ŝ = Θ_S Z + b_S.                             (4.10)

Following the proof of Lemma 4.3, we have

J_Y(Z) = (1/n) ‖Ỹ‖²_F − (1/n) ‖P_M Ỹ^T‖²_F,   J_S(Z) = (1/n) ‖S̃‖²_F − (1/n) ‖P_M S̃^T‖²_F,

where M is the column space of Z̃^T Z̃ or, equivalently, the column space of Z̃^T. Consequently, it follows that

(1 − λ) J_Y(Z) − λ J_S(Z) = (1/n) [ (1 − λ) ‖Ỹ^T‖²_F − λ ‖S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F + λ ‖P_M S̃^T‖²_F ].   (4.11)

Now, consider ‖P_M Ỹ^T‖²_F:

‖P_M Ỹ^T‖²_F = Tr{ Ỹ P_M P_M Ỹ^T } = Tr{ P_M Ỹ^T Ỹ }.

Similarly,

‖P_M S̃^T‖²_F = Tr{ P_M S̃^T S̃ }.

The terms ‖Ỹ^T‖²_F and ‖S̃^T‖²_F on the right side of (4.11) are constant with respect to Z and hence can be ignored. We then get

λ ‖P_M S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F = Tr{ P_M B },

where B = λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ. Noting that any projection matrix P_M ∈ R^{n×n} of rank i ≤ n can be decomposed as V V^T for some orthogonal matrix V ∈ R^{n×i}, we get

min_{r∈{1,···,n}} min_{Z∈R^{r×n}} { λ ‖P_M S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F }
= min_{r∈{1,···,n}} min_{dim M ≤ r} Tr{ P_M B }
= min_{r∈{1,···,n}} min_{i∈{1,···,r}} min_{V^T V = I_i} Tr{ V V^T B }
= min_{i=r∈{1,···,n}} min_{V^T V = I_i} Tr{ V V^T B }.
From trace optimization problems and their solutions in [90], we have

min_{r∈{1,···,n}} min_{V^T V = I_r} Tr{ V V^T B } = min_{r∈{1,···,n}} { β_1 + β_2 + · · · + β_r } = β_1 + β_2 + · · · + β_j,

where β_1, · · · , β_r are the r smallest eigenvalues of B and j denotes the number of negative eigenvalues of B. A possible minimizer is a matrix Z in which the column space of Z̃^T (i.e., M) is the span of the eigenvectors corresponding to all negative eigenvalues of B. ∎

Given a dataset with target and sensitive labels Y and S, respectively, the matrix B in (4.9) and its eigenvalues can be computed offline to determine the upper bound on the optimal dimensionality. By virtue of their greater capacity, the optimal dimensionality required by non-linear models is no larger than that of linear predictors; therefore, Theorem 4.4 gives a tight upper bound on the optimal embedding dimensionality. On large datasets, where B ∈ R^{n×n} can be a very large matrix, the Nyström method with data sampling [92] can be adopted.

4.3.3 Gradient of the Closed-Form Solution

To find the gradient of the encoder loss function in (4.3), with J_Y and J_S given in (4.7), we can ignore the constant terms ‖Y‖_F and ‖S‖_F. The optimization problem in (4.3) is then equivalent to

min_Θ { λ ‖P_{M_S} [S^T; 0_{n×d_S}]‖²_F − (1 − λ) ‖P_{M_Y} [Y^T; 0_{n×d_Y}]‖²_F }
= min_Θ { λ Σ_{k=1}^{d_S} ‖P_{M_S} u_S^k‖² − (1 − λ) Σ_{m=1}^{d_Y} ‖P_{M_Y} u_Y^m‖² },   (4.12)

where the vectors u_S^k and u_Y^m are the k-th and m-th columns of [S^T; 0_{n×d_S}] and [Y^T; 0_{n×d_Y}], respectively. Let M be an arbitrary matrix function of Θ, and let θ be an arbitrary scalar element of Θ. Then, from [105], we have

∂‖P_M u‖² / ∂θ = 2 u^T P_M^⊥ (∂M/∂θ) M^† u,                      (4.13)

where

[∂M/∂θ]_ij = ∇_{z_i}^T [M]_ij ∇_θ(z_i) + ∇_{z_j}^T [M]_ij ∇_θ(z_j) for i ≤ n, and 0 otherwise.
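The matrix-calculus identity in (4.13) can be checked numerically. The sketch below (NumPy) uses a toy parameterization M(t) = M₀ + t E, so that ∂M/∂t = E; this is an illustrative assumption rather than the kernel-induced dependence used in OptNet-ARL, but the identity being tested is the same. It compares a central finite difference of ‖P_M u‖² against the closed-form gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 8, 3
M0 = rng.normal(size=(m, k))    # base matrix, full column rank with high probability
E = rng.normal(size=(m, k))     # analytic dM/dt for the toy map M(t) = M0 + t E
u = rng.normal(size=m)

def proj_norm_sq(t):
    M = M0 + t * E
    P = M @ np.linalg.inv(M.T @ M) @ M.T
    return u @ P @ u            # equals ||P_M u||^2, since P is an orthogonal projection

t, h = 0.3, 1e-6
lhs = (proj_norm_sq(t + h) - proj_norm_sq(t - h)) / (2 * h)     # finite difference

M = M0 + t * E
P = M @ np.linalg.inv(M.T @ M) @ M.T
rhs = 2 * u @ (np.eye(m) - P) @ E @ np.linalg.pinv(M) @ u       # right side of (4.13)
```

The two values agree to several digits, confirming that the projection term P_M^⊥ and the pseudo-inverse M^† together capture how the projected norm responds to a perturbation of M.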
Equation (4.13) can be directly used to obtain the gradient of the objective function in (4.12). Directly computing the gradient in (4.13) requires a pseudo-inverse of the matrix M ∈ R^{2n×n}, which has a complexity of O(n³). For large datasets, this computation can get prohibitively expensive. Therefore, we approximate the gradient using a single batch of data as we optimize the encoder end-to-end. Similar approximations [92] are, in fact, commonly employed to scale up kernel methods. Thus, the computational complexity of computing the loss for OptNet-ARL reduces to O(b³), where b is the batch size. Since batch sizes in training NNs are typically on the order of 10 to 1000, computing the gradient is practically feasible. We note that the procedure presented in this section is simple SGD, whose stability can be guaranteed under Lipschitz and smoothness assumptions on the encoder network [106].

4.4 Experiments

In this section, we evaluate the efficacy of our proposed approach, OptNet-ARL, on three different tasks: fair classification on UCI datasets [107], mitigating leakage of private information on the CelebA dataset, and an ablation study on a Gaussian mixture example. We also compare OptNet-ARL with other ARL baselines in terms of the stability of optimization, the achievable trade-off front between the target and adversary objectives, convergence speed, and the effect of embedding dimensionality. We consider three baselines: (1) SGDA-ARL: vanilla stochastic gradient descent ascent, which is employed by multiple ARL approaches, including [24, 103, 2, 1, 108]; (2) ExtraSGDA-ARL: a state-of-the-art variant of stochastic gradient descent ascent that uses an extra gradient step [72] for optimizing minimax games; specifically, we use the ExtraAdam algorithm from [69]; and (3) SARL: a globally optimal solution for a kernelized-regressor encoder and linear target and adversary [49].
We also use the hypervolume (HV) [109], a metric of the stability and goodness of trade-offs for comparing algorithms under multiple objectives. A larger HV indicates a better Pareto front, and the standard deviation of the HV represents stability.

In the training stage, the encoder, a DNN, is optimized end-to-end against kernel (Gaussian RBF) ridge regressors in the case of OptNet-ARL, and against multi-layer perceptrons (MLPs) for the baselines. Table 4.1 summarizes the network architectures of all experiments. We note that the optimal embedding dimensionality, r, for a binary target is equal to one, which is consistent with Fisher's linear discriminant analysis [110]. The embedding is instance normalized (unit norm), so we adopted a fixed value of σ = 1 for the Gaussian kernel in all experiments. We let the regression regularization parameter be 10⁻⁴ for all experiments. The learning rate is 3 × 10⁻⁴ with a weight decay of 2 × 10⁻⁴, and we use Adam as the optimizer for all experiments. At the inference stage, the encoder is frozen, features are extracted, and a new target predictor and adversary are trained. At this stage, for both OptNet-ARL and the baselines, the target and adversary have the same model capacity. Furthermore, each experiment on each dataset is repeated five times with different random seeds (except for SARL, which has a closed-form solution for the encoder) and for several trade-off parameters λ ∈ [0, 1). We report the median and standard deviation across the five repetitions.

4.4.1 Fair Classification

We consider fair classification on two different tasks.

UCI Adult Dataset: It includes 14 features from 45,222 instances. The task is to classify the annual income of each person as high (50K or above) or low (below 50K). The sensitive attribute we wish to be fair with respect to is the gender of each person.

UCI German Dataset: It contains 1000 instances of individuals with 20 different attributes.
The target task is to predict their creditworthiness while being unbiased with respect to age. The correlation between the target and sensitive attributes is 0.03 and 0.02 for the Adult and German datasets, respectively. This indicates that the target attributes are almost orthogonal to the sensitive attributes. Therefore, the sensitive information can be removed entirely with only a negligible sacrifice in the performance of the target task.

Table 4.1: Network Architectures in Experiments.

Method (ARL)          | Encoder   | Embd Dim        | Target (Train) | Adversary (Train) | Target (Test) | Adversary (Test)

Adult
SGDA [24, 2]          | MLP-4-2   | 1               | MLP-4          | MLP-4             | MLP-4-2       | MLP-4-2
ExtraSGDA [72]        | MLP-4-2   | 1               | MLP-4          | MLP-4             | MLP-4-2       | MLP-4-2
SARL [49]             | RBF krnl  | 1               | linear         | linear            | MLP-4-2       | MLP-4-2
OptNet-ARL (ours)     | MLP-4-2   | 1               | RBF krnl       | RBF krnl          | MLP-4-2       | MLP-4-2

German
SGDA [24, 2]          | MLP-4     | 1               | MLP-2          | MLP-2             | logistic      | logistic
ExtraSGDA [72]        | MLP-4     | 1               | MLP-2          | MLP-2             | logistic      | logistic
SARL [49]             | RBF krnl  | 1               | linear         | linear            | logistic      | logistic
OptNet-ARL (ours)     | MLP-4     | 1               | RBF krnl       | RBF krnl          | logistic      | logistic

CelebA
SGDA [24, 2]          | ResNet-18 | 128             | MLP-64         | MLP-64            | MLP-32-16     | MLP-32-16
ExtraSGDA [72]        | ResNet-18 | 128             | MLP-64-32      | MLP-64-32         | MLP-32-16     | MLP-32-16
OptNet-ARL (ours)     | ResNet-18 | [1, 128]        | RBF krnl       | RBF krnl          | MLP-32-16     | MLP-32-16

Gaussian Mixture
SGDA [24, 2]          | MLP-8-4   | 2               | MLP-8-4        | MLP-8-4           | MLP-4-4       | MLP-4-4
ExtraSGDA [72]        | MLP-8-4   | 2               | MLP-8-4        | MLP-8-4           | MLP-4-4       | MLP-4-4
SARL [49]             | RBF krnl  | 2               | linear         | linear            | MLP-4-4       | MLP-4-4
RBF-OptNet-ARL (ours) | MLP-8-4   | 2               | RBF krnl       | RBF krnl          | MLP-4-4       | MLP-4-4
IMQ-OptNet-ARL (ours) | MLP-8-4   | [1, · · · , 512] | IMQ krnl       | IMQ krnl          | MLP-4-4       | MLP-4-4

Stability and Performance: Since there is no trade-off between the two attributes, we compare stability by reporting the median and standard deviation of the target and adversary performance in Table 4.2. Our results indicate that OptNet-ARL achieves a higher accuracy for the target tasks and lower leakage of the sensitive attribute, with less variance.
For instance, on the Adult dataset, our OptNet-ARL method achieves 83.81% target accuracy with almost zero sensitive leakage. For OptNet-ARL, the standard deviation of the sensitive-attribute accuracy is exactly zero, which demonstrates its effectiveness and stability in comparison to the baselines. Similarly, for the German dataset, OptNet-ARL achieves 80.13% sensitive accuracy, which is close to random chance (around 81%), while having the largest target accuracy compared to the baselines.

Table 4.2: Fair classification on UCI Adult and German datasets (in %).

                     | Adult Dataset                                  | German Dataset
Method               | Target (income) | Sensitive (gender) | Diff   | Target (credit) | Sensitive (age)  | Diff
Raw Data             | 85.0            | 85.0               | 17.6   | 80.0            | 87.0             | 6.0
LFR [36]             | 82.3            | 67.0               | 0.4    | 72.3            | 80.5             | 0.5
AEVB [94]            | 81.9            | 66.0               | 1.4    | 72.5            | 79.5             | 1.5
VFAE [37]            | 81.3            | 67.0               | 0.4    | 72.7            | 79.7             | 1.3
SARL [49]            | 84.1            | 67.4               | 0.0    | 76.3            | 80.9             | 0.1
SGDA-ARL [24]        | 83.61 ± 0.38    | 67.08 ± 0.48       | 0.40   | 76.53 ± 1.07    | 87.13 ± 5.70     | 6.13
ExtraSGDA-ARL [69]   | 83.66 ± 0.26    | 66.98 ± 0.49       | 0.4    | 75.60 ± 1.68    | 86.80 ± 4.05     | 5.80
OptNet-ARL (ours)    | 83.81 ± 0.23    | 67.38 ± 0.00       | 0.00   | 76.67 ± 2.21    | 80.13 ± 1.48     | 0.87

4.4.2 Mitigating Sensitive Information Leakage

The CelebA dataset [111] contains 202,599 face images of 10,177 celebrities. Each image is annotated with 40 different binary attributes (e.g., gender, emotion, age, etc.). Images are pre-processed and aligned to a fixed size of 112 × 96, and we use the official train-test splits. The target task is defined as predicting the presence or absence of high cheekbones (binary), with the sensitive attribute being smiling/not smiling (binary). The choice of this attribute pair is motivated by the presence of a trade-off between them. We observe that the correlation between this attribute pair is equal to 0.45, indicating that there is no encoder that can maintain target performance without leaking the sensitive attribute.
For this experiment, we note that SARL [49] cannot be employed since (1) it does not scale to large datasets like CelebA (its complexity is O(n³)), and (2) it cannot be applied directly to raw images and needs features extracted from a pre-trained network. Most other attribute pairs in this dataset either suffer from severe class imbalance or have small correlation, indicating the lack of a trade-off. Network architecture details are shown in Table 4.1.

Stability and Trade-off: Figure 4.3 (a) shows the attainment surface [112] and hypervolume [109] (median and standard deviation) for all methods. SGDA-ARL spans only a small part of the trade-off and, at the same time, exhibits large variance around the median curves. Overall, both baselines are unstable and unreliable when the two attributes are dependent on each other. On the other hand, OptNet-ARL solutions are very stable while also achieving a better trade-off between target and adversary accuracy.

Optimal Embedding Dimensionality: Figure 4.3 (b) compares the utility-bias trade-off of the sub-optimal embedding dimensionality (r = 128) with that of the optimal dimensionality (r = 1). We observe that the optimal embedding dimensionality (r = 1) produces a more stable trade-off between adversary and target accuracies.

Training Time: It takes five runs for SGDA-ARL and ExtraSGDA-ARL, and two runs for OptNet-ARL, to train a reliable encoder for all 11 different values of λ ∈ [0, 1). A summary of the training times is given in Figure 4.3 (c). ExtraSGDA-ARL takes an extra step to update the weights and is therefore slightly slower than SGDA-ARL. OptNet-ARL, on the other hand, is significantly faster in obtaining reliable results. Even for a single run, OptNet-ARL is faster than the baselines. This is because OptNet-ARL uses closed-form solvers for the adversary and target and therefore does not need to train any additional networks downstream of the encoder.
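The hypervolume indicator used in these comparisons can be computed for a two-objective front with a simple sweep. The sketch below is our own illustrative helper (not the implementation used in the experiments); it assumes both objectives are to be minimized relative to a reference point, so the dominated area is accumulated as rectangular slabs.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (area) dominated by a 2-D front, for minimization of both
    objectives, measured relative to the reference point `ref`."""
    # Keep only points that strictly dominate the reference point.
    pts = np.asarray([p for p in points if p[0] < ref[0] and p[1] < ref[1]])
    if len(pts) == 0:
        return 0.0
    # Sort by the first objective; sweep and accumulate rectangular slabs.
    pts = pts[np.argsort(pts[:, 0])]
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            hv += (ref[0] - x) * (best_y - y)
            best_y = y
    return hv
```

For example, the front {(0, 1), (1, 0)} with reference point (2, 2) dominates an area of 3: two 2 × 1 rectangles overlapping in a unit square.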
Independent Features: We consider the target task to be the binary classification of smiling/not smiling, with the sensitive attribute being gender. In this case, the correlation between gender and the target attribute is 0.02, indicating that the two attributes are almost independent; hence it should be feasible for an encoder to remove the sensitive information without affecting the target task. The results are presented in Figure 4.3 (d). In contrast to the scenario where the two attributes are dependent, we observe that all ARL methods can perfectly hide the sensitive information (gender) from the representation without sacrificing target task performance. Therefore, OptNet-ARL is especially effective in the more practical setting where the target and sensitive attributes are correlated and hence only a trade-off can be attained.

Figure 4.3: CelebA: (a) Trade-off between adversary and target accuracy for the dependent pair (smiling/not-smiling, high cheekbones). (b) Comparison between the trade-offs of the optimal embedding dimensionality r = 1 and that of r = 128. (c) Overall and single-run training time for different ARL methods. (d) Trade-off between adversary and target for the independent pair (smiling/not-smiling, gender).
4.4.3 Ablation Study on Mixture of Four Gaussians

In this experiment, we consider a simple example where the data is generated by a mixture of four different Gaussian distributions. Let {f_i}_{i=1}^{4} be Gaussian distributions with means at (0, 0), (0, 1), (1, 0), and (1, 1), respectively, and covariance matrices all equal to C = 0.2² I_2. Denote by f(X) the distribution of the input data. Then, it follows that

f(X | red) = (1/2) f_1(X) + (1/4) f_2(X) + (1/4) f_3(X),   P{red} = 1/2,
f(X | blue) = (1/2) f_4(X) + (1/4) f_2(X) + (1/4) f_3(X),   P{blue} = 1/2.

The sensitive attribute is assumed to be the color (0 for red and 1 for blue), and the target task is reconstructing the input data. We sample 4000 points for the training set and 1000 points for the testing set independently. For visualization, the testing set is shown in Figure 4.4 (a). In this illustrative dataset, the correlation between the input data and the color is 0.61, and therefore there is no encoder that attains full target performance with no leakage of the sensitive attribute. Network architecture details are shown in Table 4.1.

Stability and Trade-off: Figure 4.4 (b) illustrates the five-run attainment surfaces and median hypervolumes for all methods. Since the dimensionality of both input and output is 2, the optimal embedding dimensionality is equal to 2, which is the value we set in this experiment. We note that SARL achieves a better hypervolume than the SGDA and ExtraSGDA ARLs, which is not surprising given the strong performance of SARL on small datasets. However, SARL is not applicable to large datasets. Among the other baselines, ExtraSGDA-ARL appears to be slightly better. In contrast, the solutions obtained by RBF-OptNet-ARL (Gaussian kernel) outperform all baselines and are highly stable across different runs, which can be observed from both the attainment surfaces and the hypervolumes. In addition to the Gaussian kernel, we also used the inverse multiquadric (IMQ) kernel [113]¹ for OptNet-ARL to examine the effect of the kernel function.
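The two kernels compared here can be written down directly. The helper below is a minimal numpy sketch; the function names and the default bandwidth/offset values are our illustrative assumptions, not values prescribed by the experiments.

```python
import numpy as np

def _sqdist(Z1, Z2):
    """Pairwise squared Euclidean distances between rows of Z1 and Z2."""
    d2 = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.maximum(d2, 0.0)  # guard against tiny negative round-off

def rbf_kernel(Z1, Z2, sigma=1.0):
    """Gaussian (RBF) kernel: k(z, z') = exp(-||z - z'||^2 / (2 sigma^2))."""
    return np.exp(-_sqdist(Z1, Z2) / (2.0 * sigma**2))

def imq_kernel(Z1, Z2, c=1.0):
    """Inverse multiquadric (IMQ) kernel: k(z, z') = 1 / sqrt(||z - z'||^2 + c^2)."""
    return 1.0 / np.sqrt(_sqdist(Z1, Z2) + c**2)
```

Both are positive-definite characteristic kernels; the RBF kernel saturates at 1 on the diagonal, while the IMQ kernel has heavier tails and its diagonal equals 1/c.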
As we observe from Figure 4.4 (b), IMQ-OptNet-ARL performs almost identically to OptNet-ARL with the Gaussian kernel in terms of both trade-off and stability.

Batch Size: In order to examine the effect of batch size on OptNet-ARL (with the Gaussian kernel), we train the encoder with batch sizes between 2 and 4000 (the entire training data). The results are illustrated in Figure 4.4 (c). We observe that the HV is quite insensitive to batch sizes greater than 25, which implies that the mini-batch gradient is an accurate enough estimator of the gradient over the entire data.

Embedding Dimensionality: We also study the effect of the embedding dimensionality (r) by examining different values of r in [1, 512] using RBF-OptNet-ARL. The results are illustrated in Figure 4.4 (d). It is evident that the optimal embedding dimensionality (r = 2) outperforms other values of r. Additionally, the HV for r = 1 suffers severely due to the information loss in the embedding, while for 2 < r ≤ 512 the trade-off performance is comparable to that of the optimal embedding dimensionality, i.e., r = 2.

¹ k(z, z′) = 1/√(‖z − z′‖² + c²)

Figure 4.4: Mixture of Gaussians: (a) Input data. The target task is to learn a representation that is informative enough to reconstruct the input data and, at the same time, hide the color information (red versus blue). (b) The trade-off between the MSEs of the adversary and the target task for different ARL methods. (c) The HVs of OptNet-ARL (Gaussian kernel) versus different batch size values in [2, 4000]. (d) The HV values of OptNet-ARL (Gaussian kernel) versus different values of r in [1, 512].

4.5 Summary

Adversarial representation learning is a minimax game formulation that affords explicit control over unwanted information in learned data representations. Optimization algorithms for ARL, such as SGDA and its variants, are sub-optimal, unstable, and unreliable in practice.
In this chapter, we introduced OptNet-ARL to address this challenge by employing differentiable closed-form solvers, such as kernelized ridge regressors, to model the ARL players that are downstream of the representation. OptNet-ARL reduces iterative SGDA to a simple optimization, leading to a fast, stable, and reliable algorithm that outperforms existing ARL approaches on both small- and large-scale datasets.

Chapter 5

Universal Invariant Representation Learning

5.1 Introduction

Ideally, the utility-invariance trade-off is defined as a bi-objective optimization problem:

inf_{f ∈ H_X, g_Y ∈ H_Y} E_{XY} [L_Y (g_Y (f(X)), Y)]   such that   Dep(f(X), S) ≤ ε,     (5.1)

where f is the encoder that extracts the representation Z = f(X) from X, g_Y predicts Ŷ from the representation Z, H_X and H_Y are the corresponding hypothesis classes, and L_Y is the loss function for predicting the target attribute Y. The function Dep(·, ·) ≥ 0 is a parametric or non-parametric measure of statistical dependence, i.e., Dep(Q, U) = 0 implies Q and U are independent, and Dep(Q, U) > 0 implies Q and U are dependent, with larger values indicating greater degrees of dependence. The scalar ε ≥ 0 is a user-defined parameter that controls the trade-off between the two objectives: ε → ∞ is the standard scenario with no invariance constraints with respect to (w.r.t.) S, while ε → 0 enforces Z ⊥⊥ S (i.e., total invariance). Involving all Borel functions in H_X and H_Y ensures that the best possible trade-off is included within the feasible solution space. For example, when ε → ∞ and L_Y is the MSE loss, the optimal Bayes estimator, g_Y(f(X)) = E[Y | X], is attainable.

Figure 5.1: Invariant representation learning seeks a representation Z = f(X) that contains as much information as necessary for the downstream target predictor g_Y while being independent of the semantic attribute S.
In this chapter, we consider the linear combination of utility and invariance in (5.1) and define the optimal utility-invariance trade-off (denoted by T_Opt) as a single-objective optimization problem:

Definition 5.1.

T_Opt := inf_{f ∈ H_X} { (1 − λ) inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y (f(X)), Y)] + λ Dep(f(X), S) },   0 ≤ λ < 1,     (5.2)

where λ controls the trade-off between utility and invariance (e.g., λ = 0 corresponds to ignoring the invariance and only optimizing the utility, while λ → 1 corresponds to Z ⊥⊥ S).

The motivation for deploying this single-objective IRepL is that any solution to this simplified problem is also a solution to the bi-objective problem in (5.1); moreover, even (5.2) is challenging to solve and has not been fully investigated by existing works. The utility-invariance trade-off is illustrated in Figure 5.2. In this chapter, we restrict H_X to be some RKHSs and Dep(Z, S) to be a simplified version of the Hilbert-Schmidt Independence Criterion (HSIC) [81]. Further, we replace the target loss function in (5.2) by Dep(Z, Y), as presented and justified in Sections 5.3.1 and 5.5.2.

Figure 5.2: The trade-off (denoted by T_Opt) between utility (target task performance) and invariance (measured by the dependence metric Dep(Z, S)) induced by a controlled representation learner in the hypothesis class of all Borel functions.

The basic idea of representation learning that discards unwanted semantic information has been explored in many contexts, such as invariant, fair, or privacy-preserving learning. In domain adaptation [11, 12, 39], the goal is to learn features that are independent of the data domain. In fair learning [41, 42, 43, 40, 36, 21, 23, 24, 22, 44, 2, 26, 45, 46, 47, 48, 49], the goal is to discard the demographic information that leads to unfair outcomes.
Similarly, there is growing interest in mitigating unintended leakage of private information from representations [51, 52, 1, 53, 54]. A vast majority of this body of work is empirical in nature. These methods implicitly look for single or multiple points on the trade-off between utility and semantic information and do not explicitly seek to characterize the whole trade-off front. Overall, these approaches are not concerned with, or aware of, the inherent utility-invariance trade-off. In contrast, at the cost of restricting encoders to lie in some RKHSs, we exactly characterize the trade-off and propose a practical learning algorithm that achieves it.

5.1.1 Adversarial Representation Learning

Most practical approaches for learning the fair, invariant, domain-adaptive, or privacy-preserving representations discussed above are based on adversarial representation learning (ARL). ARL is typically formulated as

inf_{f ∈ H_X} { (1 − λ) inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y (f(X)), Y)] − λ inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S (f(X)), S)] },     (5.3)

where L_S is the loss function of a hypothetical adversary g_S who intends to extract the semantic attribute S through the best estimator within the hypothesis class H_S, and 0 ≤ λ < 1 is the utility-invariance trade-off parameter. ARL is a special case of (5.2) where the negative loss of the adversary, − inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S (f(X)), S)], plays the role of Dep(f(X), S). However, this form of adversarial learning suffers from a drawback. The induced independence measure is not guaranteed to account for all modes of non-linear dependence between S and Z if the adversary loss function L_S is not bounded, unlike MSE or cross-entropy [56, 114]. In the case of MSE loss, even if the loss is maximized at a bounded value, where the corresponding representation Z = f(X) is also bounded, it is still not guaranteed that Z ⊥⊥ S is attainable (see Section 5.2 for more details).
This implies that designing an adversary loss in ARL that accounts for all modes of dependence is challenging, and it can be infeasible for some loss functions.

5.1.2 Trade-Offs in Invariant Representation Learning

Prior work has established the existence of trade-offs in IRepL, both empirically and theoretically. In the following, we categorize prior work based on the properties of interest.

Restricted Class of Attributes: A majority of existing work considers IRepL trade-offs under restricted settings, i.e., binary and/or categorical attributes Y and S. For instance, [60] uses information-theoretic tools and characterizes the utility-fairness trade-off in terms of lower bounds when both Y and S are binary labels. Later, [55] provided both upper and lower bounds for binary labels. By leveraging the Chernoff bound, [61] proposed a construction method to generate an ideal representation, beyond the input data, that achieves perfect fairness while maintaining the best performance on the target task. In the case of categorical features, a lower bound on the utility-fairness trade-off has been provided by [6] for the total-invariance scenario (i.e., Z ⊥⊥ S). In contrast to this body of work, our trade-off analysis applies to multidimensional continuous/discrete attributes. To the best of our knowledge, the only prior works with a general setting are [49] and [9]. However, in [9], both S and Y are restricted to be continuous/discrete or binary at the same time (e.g., it is not possible to have Y binary while S is continuous).

Characterizing Exact versus Bounds on Trade-Off: To the best of our knowledge, all existing approaches except [49], which obtains the trade-off for linear dependence only, characterize the trade-off in terms of upper and/or lower bounds. In contrast, we exactly characterize a near-optimal trade-off with closed-form expressions for encoders belonging to some RKHSs.
Optimal Encoder and Representation: Another property of practical interest is the optimal encoder that achieves the desired point on the utility-invariance trade-off, together with the corresponding representation(s). Existing works that only study bounds on the trade-off do not obtain the encoder that achieves those bounds. [49] do develop a learning algorithm that obtains a globally optimal encoder, but only under a linear dependence measure between Z and S. HSIC, a universal measure of dependence, has been adopted by prior work (e.g., [62]) to quantify all types of dependencies between Z and S. However, these methods adopt stochastic gradient descent for optimizing the underlying non-convex optimization problem. As such, they fail to guarantee that the representation learning problem converges to a global optimum. In contrast, we obtain a closed-form solution for the optimal encoder and its corresponding representation while detecting all modes of dependence between Z and S.

Summary of Contributions: i) We design a dependence measure that accounts for all modes of dependence between Z and S (under a mild assumption) while allowing for analytical tractability. ii) We employ functions in RKHSs and obtain closed-form solutions for the IRepL optimization problem. Consequently, we exactly characterize a near-optimal approximation of T_Opt via encoders restricted to RKHSs. iii) We obtain a closed-form estimator for the encoder that achieves the near-optimal trade-off, and we establish its numerical convergence. iv) Using random Fourier features (RFF) [115], we provide a scalable version (in terms of both memory and computation) of our IRepL algorithm. v) We numerically quantify our T_Opt (denoted by K-T_Opt) on an illustrative problem as well as on the large-scale real-world datasets Folktables [116] and CelebA [111], where we compare K-T_Opt to the trade-offs obtained by existing works.

5.2 Deficiency of Mean-Squared Error as a Measure of Dependence

Theorem 5.2.
Let H_S contain all Borel functions, let S be a d_S-dimensional RV, and let L_S(·, ·) be the MSE loss. Then,

Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }  ⇔  E[S | Z] = E[S].

Proof. Let S_i, (g_S(Z))_i, and (E[S | Z])_i denote the i-th entries of S, g_S(Z), and E[S | Z], respectively. Then, it follows that

inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)]
  = inf_{g_S ∈ H_S} Σ_{i=1}^{d_S} E_{X,S} [ ((g_S(Z))_i − S_i)² ]
  = Σ_{i=1}^{d_S} E_{X,S} [ ((E[S | Z])_i − S_i)² ]
  ≤ Σ_{i=1}^{d_S} E_S [ ((E[S])_i − S_i)² ] = Σ_{i=1}^{d_S} Var[S_i],

where the second step is due to the optimality of the conditional mean (i.e., the Bayes estimator) for MSE [117], and the last step holds because independence between Z and S attains this upper bound on the MSE. Therefore, if Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }, then E[S | Z] = E[S]. On the other hand, if E[S | Z] = E[S], then it follows immediately that Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }.

This theorem implies that an optimal adversary does not necessarily lead to a representation Z that is statistically independent of S; instead, it leads to S being mean independent of the representation Z.

5.3 Problem Setting

5.3.1 Problem Setup

The representation RV Z can be expressed as

Z = f(X) := [Z_1, · · · , Z_r]^T ∈ R^r,   Z_j = f_j(X),   f_j ∈ H_X   ∀ j = 1, . . . , r,
  = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T,     (5.4)

where r is the dimensionality of the representation and Θ ∈ R^{r×n}. As we will discuss in Corollary 5.1, unlike common practice where r is chosen on an ad-hoc basis, r is itself an object of interest for optimization. We consider a general scenario where both Y and S can be continuous/discrete or categorical, or one of Y or S is continuous/discrete while the other is categorical. To accomplish this, we replace the target loss, inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y(Z), Y)] in (5.2), by the negative of a non-parametric measure of dependence, i.e., −Dep(Z, Y).
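A minimal sketch of the representer-theorem parameterization in (5.4): each component f_j of the encoder lies in the span of kernel functions centered at the n training points, so the whole encoder is determined by the coefficient matrix Θ. This is a toy illustration under a Gaussian kernel; the class and function names are hypothetical, and the dissertation's actual implementation details may differ.

```python
import numpy as np

def rbf(x1, x2, sigma=1.0):
    """Gaussian kernel evaluated on a single pair of points."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

class KernelEncoder:
    """Encoder of (5.4): Z = f(X) = Theta [k_X(x_1, X), ..., k_X(x_n, X)]^T."""
    def __init__(self, X_train, Theta, sigma=1.0):
        self.X_train = X_train  # n x d matrix of training inputs
        self.Theta = Theta      # r x n coefficient matrix
        self.sigma = sigma

    def __call__(self, x):
        # Kernel vector of the query point against all training points.
        k_vec = np.array([rbf(xi, x, self.sigma) for xi in self.X_train])
        return self.Theta @ k_vec  # r-dimensional representation Z
```

Optimizing the encoder then amounts to optimizing the finite matrix Θ, which is what makes the closed-form solutions later in this chapter possible.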
The main reason for this replacement is that maximizing the statistical dependence between the representation Z and the target attribute Y allows us to flexibly learn a representation that is applicable to different downstream target tasks, including regression, classification, clustering, etc. [118]. In particular, Theorem 5.8 in Section 5.5.2 indicates that, with an appropriate choice of the RKHS involved in Dep(Z, Y), we can learn a representation that lends itself to an estimator that performs as optimally as the Bayes estimator, i.e., E_X[Y | X]. Furthermore, in an unsupervised setting, where there is no target attribute Y, the target loss can be replaced with Dep(Z, X), which implicitly forces the representation Z to be as dependent as possible on the input data X. This scenario is of practical interest when a data producer aims to provide an invariant representation for an unknown downstream target task.

Figure 5.3: Our IRepL model consists of three components: i) an r-dimensional encoder f belonging to the universal RKHS H_X; ii) a measure of dependence that accounts for all dependence modes between the data representation Z and the semantic attribute S, induced by the covariance between Z = f(X) and β_S(S), where β_S belongs to a universal RKHS H_S; iii) a measure of dependence between Z and the target attribute Y, defined similarly to that for S.

5.4 Choice of Dependence Measure

We only discuss Dep(Z, S), since Dep(Z, Y) follows similarly. Accounting for all possible non-linear relations between RVs is a key desideratum of dependence measures. A well-known example of such measures is MI (e.g., MINE [119]). However, calculating MI for a multidimensional continuous representation is analytically challenging and computationally intractable.
Kernel-based measures are an alternative solution, with the attractive properties of being computationally feasible/efficient and analytically tractable [83]. In principle, Z ⊥⊥ S iff Cov(α(Z), β_S(S)) = 0 for all Borel functions α : R^r → R and β_S : R^{d_S} → R belonging to the universal RKHSs H_Z and H_S, respectively. Alternatively, Z ⊥⊥ S iff HSIC(Z, S) = 0, where HSIC [81] is defined as

HSIC(Z, S) := Σ_{α ∈ U_Z} Σ_{β_S ∈ U_S} Cov² (α(Z), β_S(S)),     (5.5)

where U_Z and U_S are countable orthonormal basis sets for the separable universal RKHSs H_Z and H_S, respectively. However, since Z = f(X), calculating Cov(α(Z), β_S(S)) necessitates the application of a cascade of kernels, which limits the analytical tractability of our solution. Therefore, we adopt a simplified version of HSIC that considers a transformation on S only but affords analytical tractability for solving the IRepL optimization problem. We define this measure as

Dep(Z, S) := Σ_{j=1}^{r} Σ_{β_S ∈ U_S} Cov² (Z_j, β_S(S)),     (5.6)

where Z_j = f_j(X) for the f_j's defined in (5.4). We note that Dep(·, ·), unlike HSIC and other kernelization-based dependence measures, is not symmetric. However, symmetry is not necessary for measuring statistical dependence. The measure Dep(Z, S) in (5.6) captures all modes of non-linear dependence under the assumption that the distribution of a low-dimensional projection of high-dimensional data is approximately normal [120], [121]. To see why this reasoning is relevant, we note from (5.4) that Z can be expressed as Z = ΘV, where V ∈ R^n and Θ ∈ R^{r×n}. This indicates that, for large n and small r (which is the case for most real-world datasets), Z is indeed a low-dimensional projection of high-dimensional data. In other words, (Z, β_S(S)) is approximately a jointly Gaussian RV. In our numerical experiments in Section 5.6, we empirically observe that Dep(Z, S) enjoys a monotonic relation with the underlying invariance measure and captures all modes of dependency in practice, especially as Z ⊥⊥ S.
Nevertheless, if the normality assumption on the distribution of (Z, β_S(S)) fails, Dep(Z, S) reduces to measuring the linear dependence between Z and β_S(S) for all Borel functions β_S. This corresponds to measuring the mean independence of Z from S, i.e., how much information a predictor (linear or non-linear) can infer (in the sense of MSE) about Z from S. See Section 5.2 for more technical details on mean independence.

Lemma 5.3. Let K_X, K_S ∈ R^{n×n} be the Gram matrices corresponding to H_X and H_S, respectively, i.e., (K_X)_{ij} = k_X(x_i, x_j) and (K_S)_{ij} = k_S(s_i, s_j), where the covariance is empirically estimated as

Cov(f_j(X), β_S(S)) ≈ (1/n) Σ_{i=1}^{n} f_j(x_i) β_S(s_i) − (1/n²) Σ_{i=1}^{n} Σ_{k=1}^{n} f_j(x_i) β_S(s_k).

It follows that the corresponding empirical estimator of Dep(Z, S) is

Dep_emp(Z, S) = (1/n²) ‖Θ K_X H L_S‖²_F,     (5.7)

where H = I_n − (1/n) 1_n 1_n^T is the centering matrix and L_S is a full column-rank matrix such that L_S L_S^T = K_S (Cholesky factorization). Furthermore, the empirical estimator in (5.7) has a bias of O(n⁻¹) and a convergence rate of O(n^{−1/2}).

Proof. See Appendix B.2.

Notice that the dependence measure between Z and Y can be defined similarly.

5.5 Exact Kernelized Trade-Off

Consider the optimization problem corresponding to T_Opt in (5.2). Recall that Z = f(X) is an r-dimensional RV, where the embedding dimensionality r is also a variable to be optimized. A common desideratum of learned representations is compactness [122], i.e., avoiding representations with redundant information in which different dimensions are highly correlated with each other. Therefore, going beyond the assumption that each component of f (i.e., the f_j's) belongs to the universal RKHS H_X, we impose additional constraints on the representation. Specifically, we constrain the search space of the encoder f(·) to learn a disentangled representation [122] as follows:

A_r := { (f_1, · · · , f_r) | f_i, f_j ∈ H_X, Cov(f_i(X), f_j(X)) + γ ⟨f_i, f_j⟩_{H_X} = δ_{i,j} }.
(5.8)

In the above set, the Cov(f_i(X), f_j(X)) part enforces the covariance matrix of Z = f(X) to be an identity matrix. This kind of disentanglement is used in principal component analysis (PCA); it encourages the variance of each entry of Z to be one and different entries of Z to be uncorrelated with each other. The regularization part, γ ⟨f_i, f_j⟩_{H_X}, encourages the encoder components to be as orthogonal as possible to each other and to be of unit norm, which aids numerical stability during empirical estimation [85]. As the following theorem states formally, such disentanglement is an invertible transformation, and therefore it does not nullify any information.

Theorem 5.4. Let Z = f(X) be an arbitrary representation of the input data, where f ∈ H_X. Then, there exists an invertible Borel function h such that h ◦ f belongs to A_r.

Proof. See Appendix B.3.

This theorem implies that the disentanglement preserves the performance of the downstream task, since any target network can revert the disentanglement h and access the original representation Z. In addition, any deterministic measurable transformation of Z will not add any information about S that does not already exist in Z. We define our K-T_Opt as

sup_{f ∈ A_r} { J(f, λ) := (1 − λ) Dep(f(X), Y) − λ Dep(f(X), S) },   0 ≤ λ < 1,     (5.9)

where λ is the utility-invariance trade-off parameter. Fortunately, the above optimization problem lends itself to a closed-form solution, as follows.

Theorem 5.5. Let the operator Σ_SX be induced by the bi-linear functional Cov(α(X), β_S(S)) = ⟨β_S, Σ_SX α⟩_{H_S}, and define Σ_YX and Σ_XX similarly.
Then, a global optimum for the optimization problem in (5.9) is given by the eigenfunctions corresponding to the r largest eigenvalues of the following generalized eigenvalue problem:

((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f = τ (Σ_XX + γ I_X) f,   (5.10)

where γ is the disentanglement regularization parameter defined in (5.8), I_X is the identity operator in H_X, and Σ* is the adjoint of Σ.

Proof. Consider Dep(Z, S) in (5.6):

Dep(Z, S) = Σ_{β_S ∈ U_S} Σ_{j=1}^r Cov²(f_j(X), β_S(S))
          = Σ_{j=1}^r Σ_{β_S ∈ U_S} ⟨β_S, Σ_SX f_j⟩²_{H_S}
          = Σ_{j=1}^r ‖Σ_SX f_j‖²_{H_S},

where the last step is due to Parseval's identity for the orthonormal basis set. Similarly, we have Dep(Z, Y) = Σ_{j=1}^r ‖Σ_YX f_j‖²_{H_Y}. Recall that Z = f(X) = [f_1(X), · · · , f_r(X)]; it then follows that

J(f(X)) = (1 − λ) Σ_{j=1}^r ‖Σ_YX f_j‖²_{H_Y} − λ Σ_{j=1}^r ‖Σ_SX f_j‖²_{H_S}
        = (1 − λ) Σ_{j=1}^r ⟨Σ_YX f_j, Σ_YX f_j⟩_{H_Y} − λ Σ_{j=1}^r ⟨Σ_SX f_j, Σ_SX f_j⟩_{H_S}
        = Σ_{j=1}^r ⟨f_j, ((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f_j⟩_{H_X},

where Σ* is the adjoint operator of Σ. Further, note that Cov(f_i(X), f_j(X)) = ⟨f_i, Σ_XX f_j⟩_{H_X}. As a result, the optimization problem in (5.9) can be restated as

sup Σ_{j=1}^r ⟨f_j, ((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f_j⟩_{H_X}   s.t.   ⟨f_i, (Σ_XX + γ I_X) f_k⟩_{H_X} = δ_{i,k},   1 ≤ i, k ≤ r,

where I_X denotes the identity operator from H_X to H_X. This optimization problem is a generalized Rayleigh quotient [123], and a possible solution to it is given by the eigenfunctions corresponding to the r largest eigenvalues of the generalized problem

((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f = τ (Σ_XX + γ I_X) f.

Remark. If the trade-off parameter λ = 0 (i.e., no semantic independence constraint is imposed) and γ → 0, the solution in Theorem 5.5 is equivalent to supervised kernel-PCA. On the other hand, if λ → 1 (i.e., utility is ignored and only semantic independence is considered), the solution in Theorem 5.5 consists of the eigenfunctions corresponding to the r smallest eigenvalues of Σ*_SX Σ_SX, which are the directions that are least explanatory of the semantic attribute S.
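Before turning to the empirical problem, the estimator (5.7) can be sketched numerically. This is a minimal illustration, not the author's implementation: all function names are ours, the RBF bandwidth is fixed arbitrarily, and the factor L_S is obtained via an eigendecomposition (a valid full column-rank factor when K_S is rank-deficient) rather than a strict Cholesky factorization.

```python
import numpy as np

def rbf_gram(A, sigma):
    # Gram matrix K[i, j] = exp(-||a_i - a_j||^2 / (2 sigma^2))
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def dep_emp(Theta, K_X, K_S):
    # Empirical dependence (5.7): (1/n^2) * ||Theta K_X H L_S||_F^2,
    # where H is the centering matrix and L_S L_S^T = K_S.
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    w, V = np.linalg.eigh(K_S)          # Cholesky-like factor of PSD K_S
    keep = w > 1e-10
    L_S = V[:, keep] * np.sqrt(w[keep])
    M = Theta @ K_X @ H @ L_S
    return np.linalg.norm(M, 'fro') ** 2 / n ** 2
```

As a sanity check, a representation built from S itself yields a much larger value of Dep_emp than an independent one.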
Now, consider the empirical counterpart of the optimization problem (5.9),

sup_{f ∈ A_r} { J_emp(f, λ) := (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) },   0 ≤ λ < 1,   (5.11)

where Dep_emp(f(X), S) is given in (5.7) and Dep_emp(f(X), Y) is defined similarly.

Theorem 5.6. Let the Cholesky factorization of K_X be K_X = L_X L_X^T, where L_X ∈ ℝ^{n×d} (d ≤ n) is a full column-rank matrix. Let r ≤ d; then a solution to (5.11) is

f_Opt(X) = Θ_Opt [k_X(x_1, X), · · · , k_X(x_n, X)]^T,

where Θ_Opt = U^T L_X^† and the columns of U are the eigenvectors corresponding to the r largest eigenvalues of the following generalized eigenvalue problem:

L_X^T ((1 − λ) H K_Y H − λ H K_S H) L_X u = τ ((1/n) L_X^T H L_X + γ I) u.   (5.12)

Further, the objective value of (5.11) is equal to Σ_{j=1}^r β_j, where {β_1, · · · , β_r} are the r largest eigenvalues of (5.12).

Proof. Consider the Cholesky factorization K_X = L_X L_X^T, where L_X is a full column-rank matrix. Using the representer theorem, the disentanglement property in (5.8) can be expressed as

Cov(f_i(X), f_j(X)) + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) Σ_{k=1}^n f_i(x_k) f_j(x_k) − (1/n²) Σ_{k=1}^n f_i(x_k) Σ_{m=1}^n f_j(x_m) + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) Σ_{k=1}^n ( Σ_{t=1}^n K_X(x_k, x_t) θ_it ) ( Σ_{m=1}^n K_X(x_k, x_m) θ_jm ) − (1/n²) θ_i^T K_X 1_n 1_n^T K_X θ_j + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) (K_X θ_i)^T K_X θ_j − (1/n²) θ_i^T K_X 1_n 1_n^T K_X θ_j + γ ⟨ Σ_{k=1}^n θ_ik k_X(·, x_k), Σ_{t=1}^n θ_jt k_X(·, x_t) ⟩_{H_X}
= (1/n) θ_i^T K_X H K_X θ_j + γ θ_i^T K_X θ_j
= (1/n) θ_i^T L_X (L_X^T H L_X + nγ I) L_X^T θ_j
= δ_{i,j}.

As a result, f ∈ A_r is equivalent to

Θ L_X C L_X^T Θ^T = I_r,   with   C := (1/n) L_X^T H L_X + γ I,

where Θ := [θ_1, · · · , θ_r]^T ∈ ℝ^{r×n}.
Let V = L_X^T Θ^T and consider the optimization problem in (5.11):

sup_{f ∈ A_r} { (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) }
= sup_{f ∈ A_r} (1/n²) { (1 − λ) ‖Θ K_X H L_Y‖²_F − λ ‖Θ K_X H L_S‖²_F }
= sup_{f ∈ A_r} (1/n²) { (1 − λ) Tr[Θ K_X H K_Y H K_X Θ^T] − λ Tr[Θ K_X H K_S H K_X Θ^T] }
= max_{V^T C V = I_r} (1/n²) Tr[Θ L_X B L_X^T Θ^T]
= max_{V^T C V = I_r} (1/n²) Tr[V^T B V],   (5.13)

where the second step is due to (5.7) and

B := L_X^T ((1 − λ) H K_Y H − λ H K_S H) L_X.

It is shown in [90] that an optimizer of (5.13) (the optimal V is not unique) is any matrix U whose columns are the eigenvectors corresponding to the r largest eigenvalues of the generalized problem

B u = τ C u,   (5.14)

and the maximum value is the sum of the r largest eigenvalues. Once U is determined, any Θ satisfying L_X^T Θ^T = U is an optimal Θ (denoted by Θ_Opt). Note that Θ_Opt is not unique and has the general form

Θ^T = (L_X^T)^† U + Λ_0,   R(Λ_0) ⊆ N(L_X^T).

However, setting Λ_0 to zero leads to the minimum-norm Θ. Therefore, we set Θ_Opt = U^T L_X^†.

Corollary 5.1 (Embedding Dimensionality). A useful corollary of Theorem 5.6 is a characterization of the optimal embedding dimensionality as a function of the trade-off parameter λ:

r_Opt(λ) := arg sup_{0 ≤ r ≤ d} sup_{f ∈ A_r} { J_emp(f, λ) } = number of non-negative eigenvalues of (5.12).

Proof. From the proof of Theorem 5.6, we know that

sup_{f ∈ A_r} { (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) } = Σ_{j=1}^r τ_j,

where {τ_1, · · · , τ_d} are the eigenvalues of the generalized problem in (5.12) in decreasing order. It follows immediately that

arg sup_r Σ_{j=1}^r τ_j = number of non-negative elements of {τ_1, · · · , τ_d}.

To examine these results, consider two extreme cases: i) If there is no semantic independence constraint (i.e., λ = 0), all eigenvalues of (5.12) are non-negative, since H K_Y H is a non-negative definite matrix and (1/n) L_X^T H L_X + γ I is a positive definite matrix.
This indicates that r_Opt is equal to the maximum possible value (namely d), and therefore Z is not required to nullify any information in X. ii) If we are only concerned with semantic independence and ignore the target-task utility (i.e., λ → 1), all eigenvalues of (5.12) are non-positive, and therefore r_Opt is the number of zero eigenvalues of (5.12). This indicates that Dep_emp(Z, S) in (5.7) is equal to zero, since Θ_Opt K_X is zero for the zero eigenvalues of (5.12) when λ → 1. In this case, adding more dimensions to Z would necessarily increase Dep_emp(Z, S).

The following theorem characterizes the convergence behavior of the empirical K−T_Opt to its population counterpart.

Theorem 5.7. Assume that k_S and k_Y are bounded by one and f_j²(x_i) ≤ M for any j = 1, . . . , r and i = 1, . . . , n for which f = (f_1, . . . , f_r) ∈ A_r. Then, for any n > 1 and 0 < δ < 1, with probability at least 1 − δ, we have

| sup_{f ∈ A_r} J(f, λ) − sup_{f ∈ A_r} J_emp(f, λ) | ≤ r M √( log(6/δ) / (0.222 n) ) + O(1/n).

Proof. See Appendix B.4.

Note that, for any x in the training set, f_j(x) can be calculated as f_j(x) = Σ_{i=1}^n θ_ji k_X(x_i, x). We can assume that k_X(·, ·) is bounded. For example, in RBF Gaussian and Laplacian RKHSs (which are universal), k_X(·, ·) ≤ 1. This implies that f_j²(x) ≤ n‖θ_j‖², where θ_j is the j-th row of Θ in equation (5.4). One can always normalize f_j(x) by dividing it by the maximum of √n ‖θ_j‖ over j, or by the maximum of |f_j(x_i)| over i and j. Notice that this normalization is only a scalar multiplication and has no effect on the invariance of Z = f(X) to S or on the utility of any downstream target-task predictor g_Y(Z).

5.5.1 Numerical Complexity

Computational Complexity: If L_X in (5.12) is provided with the training dataset, then the computational complexity of obtaining the optimal encoder is O(l³), where l ≤ n is the numerical rank of the Gram matrix K_X.
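Assuming a factor L_X is available, the eigen-solver step in Theorem 5.6 together with the dimensionality rule of Corollary 5.1 can be sketched as follows. This is a hedged illustration, not the author's code: the function name is ours, and the generalized problem (5.12) is reduced to an ordinary symmetric eigenproblem by whitening with the Cholesky factor of C, which is positive definite for γ > 0.

```python
import numpy as np

def solve_ktopt(L_X, K_Y, K_S, lam, gamma):
    """Sketch of solving (5.12): B u = tau * C u, and of Corollary 5.1."""
    n, d = L_X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    B = L_X.T @ ((1 - lam) * H @ K_Y @ H - lam * H @ K_S @ H) @ L_X
    C = L_X.T @ H @ L_X / n + gamma * np.eye(d)
    # Whiten with C = E E^T, then solve E^{-1} B E^{-T} v = tau v.
    E = np.linalg.cholesky(C)
    Einv = np.linalg.inv(E)
    tau, V = np.linalg.eigh(Einv @ B @ Einv.T)
    tau, V = tau[::-1], V[:, ::-1]            # decreasing order
    U = Einv.T @ V                            # generalized eigenvectors, U^T C U = I
    r_opt = int(np.sum(tau >= 0))             # Corollary 5.1: count non-negative
    Theta_opt = U[:, :r_opt].T @ np.linalg.pinv(L_X)   # Theta_Opt = U^T L_X^dagger
    return Theta_opt, tau, r_opt
```

One recovers the disentanglement constraint V^T C V = I_r from the returned encoder weights, since L_X^T Θ_Opt^T = U when L_X has full column rank.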
However, the dominating part of the computational complexity is due to the Cholesky factorization K_X = L_X L_X^T, which is O(n³). Using random Fourier features (RFF) [115], k_X(x, x′) can be approximated by r_X(x)^T r_X(x′), where r_X(x) ∈ ℝ^d. In this situation, the Cholesky factor can be written down directly as

L_X = [r_X(x_1)^T ; · · · ; r_X(x_n)^T] ∈ ℝ^{n×d}.   (5.15)

As a result, the computational complexity of obtaining the optimal encoder becomes O(d³), where the RFF dimension d can be significantly smaller than the sample size n with a negligible sacrifice in the approximation k_X(x, x′) ≈ r_X(x)^T r_X(x′).

Memory Complexity: The memory complexity of (5.12), if computed naively, is O(n²), since K_Y and K_S are n × n matrices. However, using RFF together with the Cholesky factorizations K_Y = L_Y L_Y^T and K_S = L_S L_S^T, the left-hand side of (5.12) can be re-arranged as

(1 − λ) (L_X^T L̃_Y)(L̃_Y^T L_X) − λ (L_X^T L̃_S)(L̃_S^T L_X),   (5.16)

where L̃_Y := H L_Y = L_Y − (1/n) 1_n (1_n^T L_Y); therefore, the required memory complexity is O(nd). Note that L̃_S and H L_X can be calculated similarly.

5.5.2 Target Task Performance in K−T_Opt

Assume that the desired target loss function is MSE. In the following theorem, we show that maximizing Dep(f(X), Y) over f ∈ A_r can learn a representation Z that is informative enough for a target predictor on Z to achieve the optimal estimate, i.e., the Bayes estimator E[Y | X].

Theorem 5.8. Let f* be the encoder obtained by maximizing Dep(f(X), Y), where γ → 0 and H_Y is a linear RKHS. Then, there exist W ∈ ℝ^{d_Y × r} and b ∈ ℝ^{d_Y} such that W f*(X) + b is the Bayes estimator, i.e.,

E_{X,Y}[ ‖W f*(X) + b − Y‖² ] = inf_{h Borel} E_{X,Y}[ ‖h(X) − Y‖² ] = E_{X,Y}[ ‖E[Y | X] − Y‖² ].

Proof. We prove this theorem only for the empirical version, owing to its convergence to the population counterpart. The optimal Bayes estimator can be obtained as the composition of the kernelized encoder Z = f(X) and a linear regressor on top of it.
More specifically, Ŷ = W f(X) + b can approach E[Y | X] if we optimize f, W, and b jointly. This is because f ∈ H_X can approximate any Borel function (due to the universality of H_X) and, since r ≥ d_Y, W can be surjective. Let Z := [z_1, · · · , z_n] ∈ ℝ^{r×n} and Y := [y_1, · · · , y_n] ∈ ℝ^{d_Y × n}. Further, let Z̃ and Ỹ be the centered (i.e., mean-subtracted) versions of Z and Y, respectively. We first optimize b for any given f, r, and W:

b_Opt := arg min_b (1/n) Σ_{i=1}^n ‖W z_i + b − y_i‖² = (1/n) Σ_{i=1}^n y_i − W (1/n) Σ_{i=1}^n z_i.

Then, optimizing over W leads to

min_W (1/n) ‖W Z̃ − Ỹ‖²_F = min_W (1/n) ‖Z̃^T W^T − Ỹ^T‖²_F
= min_W (1/n) ‖Z̃^T W^T − P_Z̃ Ỹ^T‖²_F + (1/n) ‖P_Z̃⊥ Ỹ^T‖²_F
= (1/n) ‖P_Z̃⊥ Ỹ^T‖²_F = (1/n) ‖Ỹ‖²_F − (1/n) ‖P_Z̃ Ỹ^T‖²_F,

where P_Z̃ denotes the orthogonal projector onto the column space of Z̃^T, and a possible minimizer is W_Opt^T = (Z̃^T)^† Ỹ^T, or equivalently W_Opt = Ỹ Z̃^†. Since the MSE loss is a function only of the range (column space) of Z̃^T, we can restrict attention to Z̃^T with orthonormal columns, or equivalently (1/n) Z̃ Z̃^T = I_r. In this setting, it holds that P_Z̃ = (1/n) Z̃^T Z̃. Now, consider optimizing f(X) = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. We have Z̃ = Θ K_X H, where H is the centering matrix. Let V = L_X^T Θ^T and C = (1/n) L_X^T H L_X; it then follows that

min_{Θ K_X H K_X Θ^T = n I_r} (1/n) ‖Ỹ‖²_F − (1/n) ‖P_Z̃ Ỹ^T‖²_F
= (1/n) ‖Ỹ‖²_F − max_{Θ K_X H K_X Θ^T = n I_r} (1/n) ‖P_Z̃ Ỹ^T‖²_F
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[Ỹ H K_X Θ^T Θ K_X H Ỹ^T]
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[Θ K_X H Ỹ^T Ỹ H K_X Θ^T]
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[V^T L_X^T Ỹ^T Ỹ L_X V]
= (1/n) ‖Ỹ‖²_F − (1/n²) Σ_{j=1}^r τ_j,

where τ_1, · · · , τ_r are the r largest eigenvalues of the generalized problem

B_0 u = τ C u,   B_0 := L_X^T Ỹ^T Ỹ L_X.

This resembles the eigenvalue problem in (5.14) with λ = 0, H_Y a linear RKHS, and γ → 0.
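The closed-form bias and weights from the proof above can be sketched directly. This is an illustrative least-squares implementation under our own naming; it computes b as the mean offset and W via the pseudoinverse of the centered representation, as in the derivation.

```python
import numpy as np

def linear_head(Z, Y):
    """Closed-form linear target head: W = Y_tilde @ pinv(Z_tilde),
    b = mean(Y) - W @ mean(Z). Z: r x n, Y: d_Y x n (one column per sample)."""
    z_bar = Z.mean(axis=1, keepdims=True)
    y_bar = Y.mean(axis=1, keepdims=True)
    Zt, Yt = Z - z_bar, Y - y_bar        # centered versions
    W = Yt @ np.linalg.pinv(Zt)          # W_Opt = Y_tilde Z_tilde^dagger
    b = y_bar - W @ z_bar                # b_Opt as derived above
    return W, b
```

On data that is exactly linear in Z, this head recovers the generating map, which is the sense in which the representation only needs to be "simple enough for a linear regressor".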
This theorem implies that maximizing Dep(f(X), Y) not only preserves in Z all the information necessary to optimally predict Y, but also yields a representation simple enough for a linear regressor to achieve the optimal performance.

5.6 Experiments

In this section, we numerically evaluate our K−T_Opt through the closed-form solution for the encoder obtained in Section 5.5 on an illustrative toy example and two real-world datasets, Folktables and CelebA.

5.6.1 Baselines

We consider two types of baselines: (1) ARL (the main framework for IRepL) with MSE or cross-entropy as the adversarial loss. Such methods are expected to fail to learn a fully invariant representation [56, 114]. These include [24, 22, 2] and SARL [49]. (2) An HSIC-based adversarial loss, which accounts for all modes of dependence and as such is theoretically expected to learn a fully invariant representation [62]. Among these baselines, all except SARL are optimized via iterative minimax optimization, which is often unstable and not guaranteed to converge. On the other hand, SARL obtains a closed-form solution for the global optima of the minimax optimization under a linear dependence measure between Z and S, which may fail to capture all modes of dependence between Z and S.

5.6.2 Datasets

Gaussian Toy Example: We design an illustrative toy example where X and S are mean independent in some dimensions but not fully independent in those dimensions. Specifically, X and S are 4-dimensional continuous RVs generated as follows:

U = [U_1, U_2, U_3, U_4] ∼ N(0_4, I_4),   N ∼ N(0_4, I_4),   U ⊥⊥ N,
X = cos(π/6 · U) + 0.005 N,   S = [ sin(π/6 · [U_1, U_2]), cos(π/6 · [U_3, U_4]) ],   (5.17)

where sin(·) and cos(·) are applied point-wise. To generate the target attribute, we define four binary RVs as

Y_i = 1_{|U_i| > T}(U_i),   i = 1, 2, 3, 4,

where 1_B(·) is the indicator function, and we set T = 0.6744, so that P[Y_i = 0] = P[Y_i = 1] = 0.5 for i = 1, 2, 3, 4.
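The generative process (5.17) can be sketched as follows; the function name is ours, and the threshold T = 0.6744 is the value from the text (the 0.75 quantile of the standard normal, so that each bit is balanced).

```python
import numpy as np

def sample_toy(n, seed=0):
    """Sample (X, S, Y_bits) from the Gaussian toy example (5.17)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, 4))
    N = rng.normal(size=(n, 4))
    X = np.cos(np.pi / 6 * U) + 0.005 * N
    S = np.concatenate([np.sin(np.pi / 6 * U[:, :2]),
                        np.cos(np.pi / 6 * U[:, 2:])], axis=1)
    Y_bits = (np.abs(U) > 0.6744).astype(int)   # binary targets Y_1..Y_4
    return X, S, Y_bits
```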
Finally, we define Y as a 16-class categorical RV obtained by concatenating the Y_i's. Since S is dependent on X through all the dimensions of X, a wholly invariant Z (i.e., Z ⊥⊥ S) should not contain any information about X. However, since [S_1, S_2] is only mean independent of [X_1, X_2] (i.e., E[S_1, S_2 | X_1, X_2] = E[S_1, S_2]), the ARL baselines with MSE as the adversarial loss, i.e., [24, 22, 2] and SARL, cannot capture the dependency of Z on [S_1, S_2] and produce a representation that remains dependent on [S_1, S_2] (see Section 5.2 for theoretical details). We sample 18,000 instances from p_{X,Y,S} independently and split these samples equally into training, validation, and testing partitions.

Folktables: We consider a fair representation learning task on the Folktables dataset [116] (a derivative of the US census data). In particular, we use the 2018-WA (Washington) and 2018-NY (New York) census data, where the target attribute Y is the employment status (binary for WA and 4 categories for NY) and the semantic attribute S is age (a discrete value between 0 and 95 years). We seek to learn a representation that predicts employment status while being fair in terms of demographic parity (DP) w.r.t. age. DP requires that the prediction Ŷ be independent of S, which can be achieved by enforcing Z ⊥⊥ S. The WA and NY datasets contain 76,225 and 196,967 samples, respectively, each described by 16 different features. We randomly split the data into training (70%), validation (15%), and testing (15%) partitions. Further, we adopt embeddings for categorical features (learned in a supervised fashion via Y) and normalization for continuous/discrete features (by dividing by the maximum value).

CelebA: The CelebA dataset [111] contains 202,599 face images of 10,177 different celebrities with standard training, validation, and testing splits. Each image is annotated with 40 different attributes.
We choose the target attribute Y to be the high-cheekbone attribute (binary) and the semantic attribute S to be the concatenation of gender and age (a 4-class categorical RV). The objective of this experiment is similar to that of Folktables. Since raw image data is not appropriate for kernel methods, we pre-train a ResNet-18 [124] (supervised by Y) on CelebA images and extract features of dimension 256. These features are used as the input data for all methods.

5.6.3 Evaluation Metrics

We use the accuracy of the classification tasks (16-class classification for the Gaussian toy example, employment prediction for Folktables, and high-cheekbone prediction for CelebA) as the utility. For the Folktables and CelebA datasets, we define the DP violation as

DPV(Ŷ, S) := E_Ŷ[ Var_S( P[Ŷ | S] ) ]   (5.18)

and use it as a metric to measure the variance (unfairness) of the prediction Ŷ w.r.t. the semantic attribute S. For the Gaussian toy example, the above metric is challenging to compute because S is a continuous RV. To circumvent this difficulty, we deploy KCC [79],

KCC(Z, S) := sup_{α ∈ H_Z, β ∈ H_S} Cov(α(Z), β(S)) / √( Var(α(Z)) Var(β(S)) ),   (5.19)

as a measure of the invariance of Z to S, where H_Z and H_S are RBF-Gaussian RKHSs. The reason for using KCC instead of HSIC is that, unlike HSIC, KCC is normalized, and therefore it is a more readily interpretable measure for comparing the invariance of representations across different methods.

5.6.4 Choice of (Y, S) Pair

The existence of a utility-invariance trade-off ultimately depends on the statistical dependency between the target and semantic attributes. If Dep(Y, S) is negligible, then there is no trade-off. Keeping this in mind, we first chose the semantic attribute to be a sensitive attribute for the Folktables (i.e., age) and CelebA (i.e., concatenation of age and gender) datasets. Then, we calculated the data imbalance (i.e., |P[Y = 0] − 0.5|) and KCC(Y, S) for all possible Y's.
Finally, we chose Y with a small data imbalance and a moderate KCC(Y, S). For the Folktables dataset, |P[employment = 0] − 0.5| = 0.04 and KCC(employment, age) = 0.4. For the CelebA dataset, |P[high cheekbone = 0] − 0.5| = 0.05 and KCC(high cheekbone, [age, gender]) = 0.1.

5.6.5 Implementation Details

For all methods, we pick different values of λ (100 λ's for the Gaussian toy example and 70 λ's for the Folktables and CelebA datasets) between zero and one to obtain the utility-invariance trade-off. We train the baselines that use a neural network as the encoder five times with different random seeds. We let the random seed also change the training-validation-testing split for the Folktables dataset (the CelebA and Gaussian datasets have fixed splits).

Embedding Dimensionality: None of the baseline methods has a strategy for finding the optimal embedding dimensionality (r); they all keep r constant w.r.t. λ. Therefore, for the baseline methods, we set r = 15 (i.e., the minimum dimensionality required to linearly classify 16 different categories) for the Gaussian toy example and r = 3 (i.e., the minimum dimensionality required to linearly classify 4 different categories) for the Folktables-NY dataset, which is also equal to r_Opt when λ = 0.

Figure 5.4: Plots of r_Opt(λ) versus the dependence trade-off parameter 1 − λ for (a) the Gaussian toy dataset and (b) the Folktables-NY dataset. There is a non-decreasing relation between r_Opt(λ) and 1 − λ.

For K−T_Opt, we use r_Opt(λ) from Corollary 5.1. See Figure 5.4 for the plot of r_Opt versus λ for the toy Gaussian and Folktables-NY datasets. For the Folktables-WA and CelebA datasets, r_Opt(λ = 0) is equal to one, and therefore we let r = 1 for all methods and all 0 ≤ λ < 1.

K−T_Opt: We let H_X, H_S, and H_Y be RBF Gaussian RKHSs, where we compute the corresponding bandwidths (i.e., the σ's) using the median strategy introduced by [125].
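The median bandwidth strategy can be sketched as follows; the function name, subsampling cap, and use of the plain (unsquared) pairwise distance are our assumptions about one common variant of the heuristic.

```python
import numpy as np

def median_bandwidth(A, max_points=1000, seed=0):
    """Median heuristic: set the RBF bandwidth sigma to the median
    pairwise Euclidean distance of (a subsample of) the data A (n x d)."""
    rng = np.random.default_rng(seed)
    if A.shape[0] > max_points:                 # subsample for O(max_points^2) cost
        A = A[rng.choice(A.shape[0], max_points, replace=False)]
    d2 = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    dists = np.sqrt(d2[np.triu_indices_from(d2, k=1)])   # upper-triangle pairs
    return np.median(dists)
```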
We optimize the regularization parameter γ in the disentanglement set (5.8) by minimizing the corresponding target losses over γ ∈ {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1} on the validation sets. RFF (as discussed in Section 5.5.1) is deployed for all datasets. For the RFF dimensionality, we started with a small value and gradually increased it until reaching the maximum possible performance for λ = 0 (i.e., standard unconstrained representation learning) on the corresponding validation sets, which results in 100 for the Gaussian dataset, 5000 for the Folktables dataset, and 1000 for the CelebA dataset.

SARL [49]: The SARL method is similar to our K−T_Opt except that H_Y and H_S are linear RKHSs; we therefore set σ_X and γ as in K−T_Opt.

ARL [24, 22, 2]: The representation Z = f(X) is extracted via the encoder f, which is an MLP (4 hidden layers and 15, 15 neurons for the Gaussian data; 3 hidden layers and 128, 64 neurons for the Folktables and CelebA datasets). These architecture choices were made by starting with a single linear layer and gradually increasing the number of layers and neurons until over-fitting was observed. This results in 735 encoder parameters for the Gaussian toy example, whereas for K−T_Opt the count ranges from 100 = 100 · r_Opt(λ → 1) ≤ 100 · r_Opt(λ) ≤ 100 · r_Opt(λ = 0) = 1500. For Folktables and CelebA, the counts are 41,024 and 15,616, respectively, for ARL, and 5000 and 1000 for K−T_Opt. The representation Z is fed to a target-task predictor g_Y and a proxy adversary g_S, both MLPs (2 hidden layers and 16 neurons for the Gaussian data; 2 hidden layers and 128 neurons for the Folktables and CelebA datasets). All involved networks (f, g_Y, g_S) are trained end-to-end. We use stochastic gradient descent-ascent (SGDA) [24] with AdamW [126] as the optimizer to alternately train the encoder, target predictor, and proxy adversary networks.
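The alternating descent-ascent scheme can be illustrated on a toy quadratic game (not the actual networks, which require automatic differentiation); the objective, step size, and manual gradients below are all our own choices, picked so the iterates spiral into the equilibrium rather than diverge.

```python
def sgda(steps=1000, lr=0.05, x=1.0, y=1.0):
    """Alternating gradient descent-ascent on the toy minimax
    min_x max_y f(x, y) = x*y + 0.1*x**2 - 0.1*y**2 (manual gradients).
    x plays the encoder's role (descent), y the adversary's (ascent);
    the equilibrium is (0, 0)."""
    for _ in range(steps):
        x = x - lr * (y + 0.2 * x)   # descent step on x: df/dx = y + 0.2*x
        y = y + lr * (x - 0.2 * y)   # ascent step on y: df/dy = x - 0.2*y
    return x, y
```

Without the small quadratic regularization terms, the same alternating updates on the purely bilinear game x·y would oscillate or diverge, which is the instability of minimax training referred to throughout this chapter.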
We choose a batch size of 500 for the Gaussian data and 128 for the Folktables and CelebA datasets. The corresponding learning rates are then optimized over {10⁻², 10⁻³, 5 × 10⁻⁴, 10⁻⁴, 10⁻⁵} by minimizing the target loss on the corresponding validation sets.

HSIC-IRepL [62]: This method can be formulated as (5.2) with Dep(Z, S) replaced by HSIC(Z, S). The encoder and target predictor networks have the same architecture as in ARL. Therefore, we follow the same optimization procedure as ARL to train the involved neural networks.

5.6.6 Results

Utility-Invariance Trade-offs: Figures 5.5 and 5.6 show the utility-invariance and Dep(Z, Y)−Dep(Z, S) trade-offs for the toy Gaussian, Folktables-WA, Folktables-NY, and CelebA datasets, respectively. The invariance measure for the Gaussian toy example is KCC (5.19), and the invariance measure for the Folktables and CelebA datasets is the fairness measure DPV (5.18). We make the following observations: 1) K−T_Opt is highly stable and almost spans the entire trade-off front for all datasets except Folktables-NY, which can be due to the inability of the scalarized single-objective formulation in (5.2), in contrast to the constrained optimization in (5.1), to find all Pareto-optimal points. 2) The trade-off between Dep(Z, Y) and Dep(Z, S) (Figure 5.6) follows almost the same trend as the utility-invariance trade-off (Figure 5.5). This is a desired observation, since the Dep(Z, Y)−Dep(Z, S) trade-off is what we actually optimize in (5.9) as a surrogate for the utility-invariance trade-off. 3) The baseline HSIC-IRepL, despite using a universal dependence measure, leads to a sub-optimal trade-off front due to the lack of convergence guarantees to the global optima. 4) The baselines ARL and SARL span only a small portion of the trade-off front in the Gaussian toy example, since some dimensions of the semantic attribute S in (5.17) are mean independent (but not fully independent) of some dimensions of X, and therefore the adversary provides no signal to the encoder to discard [S_1, S_2] from the representation. On this dataset, the ARL and SARL baselines do not approach Z ⊥⊥ S, i.e., KCC(Z, S) = 0 cannot be attained for any value of the trade-off parameter λ. 5) ARL shows high deviation on the Folktables dataset due to the unstable nature of the minimax optimization. 6) SARL performs as well as K−T_Opt on the CelebA dataset.

Figure 5.5: Utility versus invariance trade-offs obtained by K−T_Opt and other baselines for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. K−T_Opt stably spans the entire trade-off front and considerably dominates the other methods on all datasets. In (a), ARL and SARL span a small portion of the trade-off front since S is mean independent (but not fully independent) of X in some dimensions; HSIC-IRepL, despite using a universal dependence measure, performs sub-optimally due to the lack of convergence guarantees to the global optima.

Figure 5.6: Dep(Z, Y) versus Dep(Z, S) in K−T_Opt for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. The Dep(Z, Y)−Dep(Z, S) trade-off follows the same trend as the utility-invariance trade-off in Figure 5.5.
This is because both S and Y are categorical for the CelebA dataset, and therefore a linear RKHS on the one-hot encoded attribute performs just as well as a universal RKHS [127].

Universality of Dep(Z, S): We empirically examine the practical validity of our assumption in Section 5.4 and verify whether our dependence measure Dep(Z, S), defined in (5.6), can capture all modes of dependency between Z and S. Figure 5.7 (a) plots the universal dependence measure KCC(Z, S) versus Dep(Z, S) for the Gaussian dataset, and Figures 5.7 (b, c) illustrate the relation between DPV(Ŷ, S) and Dep(Z, S) for the Folktables and CelebA datasets, respectively.

Figure 5.7: Invariance versus Dep(Z, S) of K−T_Opt for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. Dep(Z, S) enjoys a monotonic relation with the underlying invariance measures.

We observe that there is a non-decreasing relation between the corresponding invariance measures and Dep(Z, S). More importantly, as KCC(Z, S) → 0 (or DPV(Ŷ, S) → 0), so does Dep(Z, S). These observations verify that Dep(Z, S) accounts for all modes of dependence between Z and S.

5.6.7 Ablation Study

Effect of Embedding Dimensionality: In this experiment, we examine the significance of the embedding dimensionality, r_Opt(λ), discussed in Corollary 5.1. We obtain the utility-invariance trade-off when the embedding dimensionality is fixed at r = r_Opt(λ = 0) = 15. A comparison between the utility-invariance trade-off induced by r_Opt(λ) and that of the fixed r = 15 is illustrated in Figure 5.8 (a).
We observe that not only is the utility-invariance trade-off for fixed r dominated by that of r_Opt(λ), but fixed r is also unable to achieve the totally invariant representation, i.e., Z ⊥⊥ S. Further, r_Opt(λ) and some of the largest eigenvalues of (5.12) versus the invariance trade-off parameter λ are plotted in Figures 5.4 and 5.8 (b), respectively. We recall from Corollary 5.1 that, for any given λ, r_Opt is the number of non-negative eigenvalues of (5.12).

Figure 5.8: (a) Comparison between the utility-invariance trade-offs induced by the optimal embedding dimensionality r_Opt(λ) and that of fixed r = 15. Fixed r = 15 is significantly dominated by r_Opt(λ) and fails to attain Z ⊥⊥ S. (b) The first, fifth, tenth, and fifteenth largest eigenvalues in (5.12) versus 1 − λ. Given λ, r_Opt is equal to the number of non-negative eigenvalues. As 1 − λ decreases, the largest eigenvalues approach negative values.

Figure 5.9: (a) Utility versus invariance trade-offs for all methods when age (i.e., the sensitive attribute) is discarded from the input data. (b) A comparison between the trade-offs of K−T_Opt when age is present versus discarded from the input data. Removing the age attribute slightly degrades the trade-off due to the discarded information.

5.7 Summary

Invariant representation learning (IRepL) often involves a trade-off between utility and invariance. While the existence of such a trade-off and its bounds have been studied, its exact characterization has not been investigated.
This chapter takes steps toward addressing this problem by i) establishing the exact kernelized trade-off (denoted by K−T_Opt), ii) determining the optimal dimensionality of the data representation necessary to achieve a desired optimal trade-off point, and iii) developing a scalable learning algorithm for encoders in some RKHSs to achieve K−T_Opt. Numerical results on both an illustrative example and two real-world datasets show that commonly used adversarial representation learning techniques are unable to attain the optimal trade-off estimated by our solution.

Chapter 6

Conclusion and Future Work

Invariant representation learning often introduces a trade-off between utility and invariance. However, devising a learning algorithm that can optimally achieve any point on the trade-off front is challenging. To circumvent this difficulty, in this dissertation we propose to model the IRepL players via functions in some RKHSs. Consequently, we found a closed-form solution for the underlying IRepL problem, and the optimal embedding dimensionality is also determined. We also considered the case where the encoder is an NN. In this situation, we model the target network and the invariance measure via kernelized ridge regressors, which yield closed-form solutions for the target network and the invariance measure. As a result, the unstable SGDA optimization scheme turns into SGD, which is a stable optimization scheme. Numerical experiments on real-world datasets such as the US Census and CelebA confirm the optimality and efficiency of our proposed methods compared to baseline IRepL algorithms. Finally, our theoretical results and empirical solutions shed light on the utility-invariance trade-offs in various settings, such as algorithmic fairness and privacy-preserving learning.

6.1 Limitations

This dissertation restricts the target predictor to be modeled via functions in some RKHSs to afford analytical tractability and theoretical guarantees in Chapters 3 and 5.
Our proposed IRepL formulation is a scalarization of the bi-objective trade-off formulation. Even though any solution to the scalarized version is a solution to the original bi-objective IRepL problem, some regions of the utility-invariance trade-off may not be reachable by the scalarized counterpart. In general, IRepL depends on the dependence measure used to quantify the dependence of the learned representations on the sensitive attribute. As such, the trade-off obtained in this dissertation is optimal for HSIC-like dependence measures and may not be optimal for other invariance measures such as KCC or COCO.

6.2 Future Work

As a result of the limitations mentioned above, studying the bi-objective trade-off (rather than its scalarization) and other universal measures are interesting directions for future work. Moreover, the invariance criterion deployed in this dissertation is analogous to demographic parity in fairness, which can be an unsuitable fairness notion in some practical scenarios [128, 129]. Our method can be extended to other notions of fairness, such as equalized odds and equality of opportunity [128], by modifying the invariance measure to capture the conditional dependency of the representation on the sensitive attribute given the target label.

6.3 Broader Impact

IRepL can enable many machine learning systems to prevent the leakage of private (sensitive) information while remaining effective for the desired prediction task(s). In particular, IRepL has an immediate application in fairness, which is a significant societal problem. This dissertation proposes algorithms that can be used to learn fair representations of data. More generally, these approaches can enable machine learning systems to discard specific data before making predictions. Of course, as with any technological or algorithmic solution, such methods can also be employed for harmful purposes.

BIBLIOGRAPHY

[1] P. Roy and V. N.
Boddeti, “Mitigating information leakage in image representations: A max- imum entropy approach,” IEEE Conference on Computer Vision and Pattern Recognition, 2019. [2] D. Madras, E. Creager, T. Pitassi, and R. Zemel, “Learning adversarially fair and transfer- able representations,” arXiv preprint arXiv:1802.06309, 2018. [3] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016. [4] V. Grari, O. E. Hajouji, S. Lamprier, and M. Detyniecki, “Learning unbiased representations via rényi minimization,” in Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 749–764, Springer, 2021. [5] A. K. Menon and R. C. Williamson, “The cost of fairness in binary classification,” Confer- ence on Fairness, Accountability and Transparency, pp. 107–118, 2018. [6] H. Zhao and G. J. Gordon, “Inherent tradeoffs in learning fair representations,” arXiv preprint arXiv:1906.08386, 2019. [7] T. L. Gouic, J.-M. Loubes, and P. Rigollet, “Projection to fairness in statistical learning,” arXiv preprint arXiv:2005.11720, 2020. [8] H. Zhao, “Costs and benefits of wasserstein fair regression,” arXiv preprint arXiv:2106.08812, 2021. [9] H. Zhao, C. Dan, B. Aragam, T. S. Jaakkola, G. J. Gordon, and P. Ravikumar, “Fundamental limits and tradeoffs in invariant representation learning,” arXiv preprint arXiv:2012.10713, 2020. [10] H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon, “On learning invariant represen- tations for domain adaptation,” International Conference on Machine Learning, pp. 7523– 7532, 2019. [11] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” Inter- national Conference on Machine Learning, 2015. [12] E. Tzeng, J. Hoffman, K. Saenko, and T. 
Darrell, “Adversarial discriminative domain adap- tation,” IEEE Conference on Computer Vision and Pattern Recognition, 2017. [13] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, “Domain generalization: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 104 [14] L. Zhang and X. Gao, “Transfer adaptation learning: A decade survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022. [15] T. Isobe, D. Li, L. Tian, W. Chen, Y. Shan, and S. Wang, “Towards discriminative represen- tation learning for unsupervised person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8526–8536, 2021. [16] H. Xia, H. Zhao, and Z. Ding, “Adaptive adversarial network for source-free domain adap- tation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9010–9019, 2021. [17] X. Liu, Z. Guo, S. Li, F. Xing, J. You, C.-C. J. Kuo, G. El Fakhri, and J. Woo, “Adversarial unsupervised domain adaptation with conditional and label shift: Infer, align and iterate,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10367– 10376, 2021. [18] W. Zhang, X. Li, H. Ma, Z. Luo, and X. Li, “Universal domain adaptation in fault diag- nostics with hybrid weighted deep adversarial learning,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 7957–7967, 2021. [19] X. Fan, Q. Wang, J. Ke, F. Yang, B. Gong, and M. Zhou, “Adversarially adaptive normal- ization for single domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8208–8217, 2021. [20] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,” arXiv preprint arXiv:2108.13624, 2021. [21] H. Edwards and A. Storkey, “Censoring representations with an adversary,” arXiv preprint arXiv:1511.05897, 2015. [22] B. H. Zhang, B. Lemoine, and M. 
Mitchell, “Mitigating unwanted biases with adversarial learning,” AAAI/ACM Conference on AI, Ethics, and Society, 2018. [23] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi, “Data decisions and theoretical implications when adversarially learning fair representations,” arXiv preprint arXiv:1707.00075, 2017. [24] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig, “Controllable invariance through adversarial feature learning,” Advances in Neural Information Processing Systems, pp. 585–596, 2017. [25] V. Mirjalili, S. Raschka, A. Namboodiri, and A. Ross, “Semi-adversarial networks: Convo- lutional autoencoders for imparting privacy to face images,” in International Conference on Biometrics, 2018. 105 [26] M. Bertran, N. Martinez, A. Papadaki, Q. Qiu, M. Rodrigues, G. Reeves, and G. Sapiro, “Adversarially learned representations for information obfuscation and inference,” Interna- tional Conference on Machine Learning, 2019. [27] A. B. Arrieta, N. Dı́az-Rodrı́guez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcı́a, S. Gil-López, D. Molina, R. Benjamins, et al., “Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai,” Information fusion, vol. 58, pp. 82–115, 2020. [28] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” arXiv preprint arXiv:1908.09635, 2019. [29] S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Language (technology) is power: A critical survey of” bias” in nlp,” arXiv preprint arXiv:2005.14050, 2020. [30] S. Qian, V. H. Pham, T. Lutellier, Z. Hu, J. Kim, L. Tan, Y. Yu, J. Chen, and S. Shah, “Are my deep learning systems fair? an empirical study of fixed-seed training,” Advances in Neural Information Processing Systems, vol. 34, pp. 30211–30227, 2021. [31] S. Ravfogel, M. Twiton, Y. Goldberg, and R. D. Cotterell, “Linear adversarial concept era- sure,” in International Conference on Machine Learning, pp. 
18400–18421, PMLR, 2022. [32] K. Karkkainen and J. Joo, “Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558, 2021. [33] M. Du, S. Mukherjee, G. Wang, R. Tang, A. Awadallah, and X. Hu, “Fairness via rep- resentation neutralization,” Advances in Neural Information Processing Systems, vol. 34, pp. 12091–12103, 2021. [34] A. Wang, A. Liu, R. Zhang, A. Kleiman, L. Kim, D. Zhao, I. Shirai, A. Narayanan, and O. Russakovsky, “Revise: A tool for measuring and mitigating bias in visual datasets,” International Journal of Computer Vision, pp. 1–21, 2022. [35] L. Wu, L. Chen, P. Shao, R. Hong, X. Wang, and M. Wang, “Learning fair representations for recommendation: A graph-based perspective,” in Proceedings of the Web Conference 2021, pp. 2198–2208, 2021. [36] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” International Conference on Machine Learning, pp. 325–333, 2013. [37] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” arXiv preprint arXiv:1511.00830, 2015. 106 [38] D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg, “Invariant representa- tions without adversarial training,” in Advances in Neural Information Processing Systems, pp. 9102–9111, 2018. [39] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” Advances in Neural Information Processing Systems, vol. 31, pp. 8559–8570, 2018. [40] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney, “Optimized pre-processing for discrimination prevention,” Advances in Neural Information Processing Systems, pp. 3992–4001, 2017. [41] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012. 
[42] S. Ruggieri, “Using t-closeness anonymity to control for non-discrimination.,” Transactions on Data Privacy, vol. 7, no. 2, pp. 99–129, 2014. [43] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certi- fying and removing disparate impact,” ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, pp. 259–268, 2015. [44] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon, “Learning controllable fair represen- tations,” International Conference on Artificial Intelligence and Statistics, 2019. [45] E. Creager, D. Madras, J.-H. Jacobsen, M. A. Weis, K. Swersky, T. Pitassi, and R. Zemel, “Flexibly fair representation learning by disentanglement,” International Conference on Machine Learning, 2019. [46] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem, “On the fair- ness of disentangled representations,” Advances in Neural Information Processing Systems, pp. 14611–14624, 2019. [47] J. Mary, C. Calauzenes, and N. El Karoui, “Fairness-aware learning for continuous attributes and treatments,” International Conference on Machine Learning, pp. 4382–4391, 2019. [48] N. Martinez, M. Bertran, and G. Sapiro, “Minimax pareto fairness: A multi objective per- spective,” International Conference on Machine Learning, pp. 6755–6764, 2020. [49] B. Sadeghi, R. Yu, and V. Boddeti, “On the global optima of kernelized adversarial rep- resentation learning,” IEEE International Conference on Computer Vision, pp. 7971–7979, 2019. 107 [50] H. Zhang, Y.-F. Zhang, W. Liu, A. Weller, B. Schölkopf, and E. P. Xing, “Towards principled disentanglement for domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8024–8034, 2022. [51] J. Hamm, “Minimax filter: Learning to preserve privacy from inference attacks,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 4704–4734, 2017. [52] M. Coavoux, S. Narayan, and S. B. 
Cohen, “Privacy-preserving neural representations of text,” arXiv preprint arXiv:1808.09408, 2018. [53] T. Xiao, Y.-H. Tsai, K. Sohn, M. Chandraker, and M.-H. Yang, “Adversarial learning of privacy-preserving and task-oriented representations,” Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12434–12441, 2020. [54] M. Dusmanu, J. L. Schönberger, S. N. Sinha, and M. Pollefeys, “Privacy-preserving visual feature descriptors through adversarial affine subspace embedding,” IEEE Conference on Computer Vision and Pattern Recognition, 2021. [55] D. McNamara, C. S. Ong, and R. C. Williamson, “Costs and benefits of fair representation learning,” AAAI/ACM Conference on AI, Ethics, and Society, pp. 263–270, 2019. [56] E. Adeli, Q. Zhao, A. Pfefferbaum, E. V. Sullivan, L. Fei-Fei, J. C. Niebles, and K. M. Pohl, “Representation learning with statistical independence to mitigate bias,” IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2513–2523, 2021. [57] C. Agarwal, H. Lakkaraju, and M. Zitnik, “Towards a unified framework for fair and sta- ble graph representation learning,” in Uncertainty in Artificial Intelligence, pp. 2114–2124, PMLR, 2021. [58] K. Yang, J. H. Yau, L. Fei-Fei, J. Deng, and O. Russakovsky, “A study of face obfuscation in imagenet,” in International Conference on Machine Learning, pp. 25313–25330, PMLR, 2022. [59] B. Rodrı́guez-Gálvez, R. Thobaben, and M. Skoglund, “A variational approach to privacy and fairness,” in 2021 IEEE Information Theory Workshop (ITW), pp. 1–6, IEEE, 2021. [60] H. Zhao, J. Chi, Y. Tian, and G. J. Gordon, “Trade-offs and guarantees of adversarial repre- sentation learning for information obfuscation,” arXiv preprint arXiv:1906.07902, 2019. [61] S. Dutta, D. Wei, H. Yueksel, P.-Y. Chen, S. Liu, and K. Varshney, “Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing,” Inter- national Conference on Machine Learning, pp. 2803–2813, 2020. 108 [62] N. 
Quadrianto, V. Sharmanska, and O. Thomas, “Discovering fair representations in the data domain,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8227–8236, 2019. [63] A. Letcher, D. Balduzzi, S. Racaniere, J. Martens, J. N. Foerster, K. Tuyls, and T. Graepel, “Differentiable game mechanics.,” Journal of Machine Learning Research, vol. 20, no. 84, pp. 1–40, 2019. [64] L. Mescheder, S. Nowozin, and A. Geiger, “The numerics of gans,” in Advances in Neural Information Processing Systems, pp. 1825–1835, 2017. [65] V. Nagarajan and J. Z. Kolter, “Gradient descent gan optimization is locally stable,” in Ad- vances in Neural Information Processing Systems, pp. 5585–5595, 2017. [66] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel, “The mechanics of n-player differentiable games,” in International Conference on Machine Learning, 2018. [67] C. Daskalakis and I. Panageas, “The limit points of (optimistic) gradient descent in min-max optimization,” in Advances in Neural Information Processing Systems, 2018. [68] C. Jin, P. Netrapalli, and M. I. Jordan, “What is local optimality in nonconvex-nonconcave minimax optimization?,” arXiv preprint arXiv:1902.00618, 2019. [69] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequal- ity perspective on generative adversarial networks,” International Conference on Learning Representations, 2019. [70] S. Wang, Y. Teng, and P. Perdikaris, “Understanding and mitigating gradient flow patholo- gies in physics-informed neural networks,” SIAM Journal on Scientific Computing, vol. 43, no. 5, pp. A3055–A3081, 2021. [71] N. Loizou, H. Berard, G. Gidel, I. Mitliagkas, and S. Lacoste-Julien, “Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis un- der expected co-coercivity,” Advances in Neural Information Processing Systems, vol. 34, pp. 19095–19108, 2021. [72] G. 
Korpelevich, “The extragradient method for finding saddle points and other problems,” Matecon, vol. 12, pp. 747–756, 1976. [73] H. K. Khalil, “Nonlinear systems,” Printice-Hall Inc, 1996. [74] I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent develop- ments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016. 109 [75] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998. [76] S. Baharlouei, M. Nouiehed, A. Beirami, and M. Razaviyayn, “R’enyi fair inference,” arXiv preprint arXiv:1906.12005, 2019. [77] A. Pérez-Suay, V. Laparra, G. Mateo-Garcı́a, J. Muñoz-Marı́, L. Gómez-Chova, and G. Camps-Valls, “Fair kernel learning,” Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 339–355, 2017. [78] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hun- garica, vol. 10, no. 3-4, pp. 441–451, 1959. [79] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. 6, pp. 1–48, 2002. [80] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, B. Schölkopf, and N. Logothetis, “Be- haviour and convergence of the constrained covariance,” Max Planck Institute for Biological Cybernetics, 2004. [81] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” International Conference on Algorithmic Learning Theory, pp. 63–77, 2005. [82] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge Uni- versity Press, 2004. [83] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf, “Kernel methods for measuring independence,” Journal of Machine Learning Research, vol. 6, no. 12, pp. 2075– 2129, 2005. [84] B. K. Sriperumbudur, K. Fukumizu, and G. R. 
Lanckriet, “Universality, characteristic ker- nels and RKHS embedding of measures,” Journal of Machine Learning Research, vol. 12, no. 7, 2011. [85] K. Fukumizu, F. R. Bach, and A. Gretton, “Statistical consistency of kernel canonical cor- relation analysis.,” Journal of Machine Learning Research, vol. 8, no. 2, 2007. [86] E. Kreyszig, Introductory functional analysis with applications, vol. 1. wiley New York, 1978. [87] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014. 110 [88] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [89] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997. [90] E. Kokiopoulou, J. Chen, and Y. Saad, “Trace optimization and eigenproblems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565– 602, 2011. [91] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303– 353, 1998. [92] S. Kumar, M. Mohri, and A. Talwalkar, “Sampling methods for the Nyström method,” Jour- nal of Machine Learning Research, vol. 13, no. Apr, pp. 981–1006, 2012. [93] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [94] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013. [95] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 643–660, 2001. [96] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009. [97] C. 
Ionescu, O. Vantzos, and C. Sminchisescu, “Training deep networks with structured lay- ers by matrix backpropagation,” arXiv preprint arXiv:1509.07838, 2015. [98] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, “End-to-end representa- tion learning for correlation filter based tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017. [99] B. Amos and J. Z. Kolter, “Optnet: Differentiable optimization as a layer in neural net- works,” in International Conference on Machine Learning, 2017. [100] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter, “Differentiable convex optimization layers,” in Advances in Neural Information Processing Systems, 2019. [101] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in International Conference on Learning Representations, 2018. 111 [102] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019. [103] Y. Elazar and Y. Goldberg, “Adversarial removal of demographic attributes from text data,” Empirical Methods in Natural Language Processing, 2018. [104] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generaliza- tion in neural networks,” in Advances in Neural Information Processing Systems, 2018. [105] G. H. Golub and V. Pereyra, “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate,” SIAM Journal on Numerical Analysis, vol. 10, no. 2, pp. 413–432, 1973. [106] M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: Stability of stochas- tic gradient descent,” in International Conference on Machine Learning, pp. 1225–1234, PMLR, 2016. [107] D. Dua and C. Graff, “Uci machine learning repository (2017),” URL http://archive. ics. uci. edu/ml, 2017. [108] B. Kim, H. Kim, K. Kim, S. Kim, and J. 
Kim, “Learning not to learn: Training deep neural networks with biased data,” in IEEE Conference on Computer Vision and Pattern Recogni- tion, 2019. [109] E. Zitzler and L. Thiele, “Multiobjective optimization using evolutionary algorithms—a comparative case study,” in International Conference on Parallel Problem Solving from Na- ture, pp. 292–301, 1998. [110] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugen- ics, vol. 7, no. 2, pp. 179–188, 1936. [111] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” IEEE International Conference on Computer Vision, 2015. [112] J. Knowles, “A summary-attainment-surface plotting method for visualizing the perfor- mance of stochastic multiobjective optimizers,” in International Conference on Intelligent Systems Design and Applications (ISDA), 2005. [113] C. R. Souza, “Kernel functions for machine learning applications,” Creative Commons Attribution-Noncommercial-Share Alike, vol. 3, p. 29, 2010. [114] V. Grari, O. E. Hajouji, S. Lamprier, and M. Detyniecki, “Learning unbiased representations via Rényi minimization,” arXiv preprint arXiv:2009.03183, 2020. [115] A. Rahimi, B. Recht, et al., “Random features for large-scale kernel machines.,” Advances in Neural Information Processing Systems, vol. 3, no. 4, p. 5, 2007. 112 [116] F. Ding, M. Hardt, J. Miller, and L. Schmidt, “Retiring adult: New datasets for fair machine learning,” arXiv preprint arXiv:2108.04884, 2021. [117] J. Jacod and P. Protter, Probability essentials. Springer Science & Business Media, 2012. [118] E. Barshan, A. Ghodsi, Z. Azimifar, and M. Z. Jahromi, “Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds,” Pat- tern Recognition, vol. 44, no. 7, pp. 1357–1371, 2011. [119] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. 
Hjelm, “Mutual information neural estimation,” International Conference on Machine Learning, pp. 531–540, 2018. [120] P. Diaconis and D. Freedman, “Asymptotics of graphical projection pursuit,” The Annals of Statistics, pp. 793–815, 1984. [121] P. Hall and K.-C. Li, “On almost linearity of low dimensional projections from high dimen- sional data,” The Annals of Statistics, pp. 867–889, 1993. [122] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new per- spectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. [123] R. L. Strawderman, “The symmetric eigenvalue problem (classics in applied mathematics, number 20),” Journal of the American Statistical Association, vol. 94, no. 446, p. 657, 1999. [124] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770– 778, 2016. [125] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel sta- tistical test of independence,” Advances in Neural Information Processing Systems, vol. 20, pp. 585–592, 2007. [126] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017. [127] Y. Li, R. Pogodin, D. J. Sutherland, and A. Gretton, “Self-supervised learning with kernel dependence maximization,” arXiv preprint arXiv:2106.08320, 2021. [128] M. Hardt, E. Price, N. Srebro, et al., “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems, pp. 3315–3323, 2016. [129] A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. big data 5, 2 (2017), 153–163,” arXiv preprint arXiv:1610.07524, 2017. 113 [130] J. Adebayo and L. 
Kagal, “Iterative orthogonal feature projection for diagnosing bias in black-box models,” Fairness, Accountability, and Transparency in Machine Learning, 2016. [131] P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian, “Auditing black-box models for indirect influence,” Knowledge and Information Systems, vol. 54, no. 1, pp. 95–122, 2018. [132] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in Advances in Neural Information Processing Systems, pp. 4349–4357, 2016. [133] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004. [134] B. Fish, J. Kun, and A. D. Lelkes, “Fair boosting: a case study,” in Workshop on Fairness, Accountability, and Transparency in Machine Learning, Citeseer, 2015. [135] G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander, “Satisfying real-world goals with dataset constraints,” in Advances in Neural Information Processing Systems, pp. 2415–2423, 2016. [136] F. Kamiran and T. Calders, “Classification with no discrimination by preferential sampling,” in Proc. 19th Machine Learning Conf. Belgium and The Netherlands, vol. 1, Citeseer, 2010. [137] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Fairness-aware classifier with prejudice remover regularizer,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 35–50, Springer, 2012. [138] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Enhancement of the neutrality in rec- ommendation.,” in Decisions@ RecSys, pp. 8–14, Citeseer, 2012. [139] T. Kamishima, S. Akaho, H. Asoh, and I. Sato, “Model-based approaches for independence- enhanced recommendation,” in International Conference on Data Mining Workshops, pp. 860–867, IEEE, 2016. [140] E. Kazemi, M. Zadimoghaddam, and A. 
Karbasi, “Scalable deletion-robust submodular maximization: Data summarization with privacy and fairness constraints,” in International Conference on Machine Learning, pp. 2549–2558, 2018. [141] D. H. Kim, T. Kong, and S. Jeong, “Finding solutions to generative adversarial privacy,” arXiv preprint arXiv:1810.02069, 2018. [142] J. Komiyama, A. Takeda, J. Honda, and H. Shimao, “Nonconvex optimization for regression with fairness constraints,” in International Conference on Machine Learning, pp. 2742– 2751, 2018. 114 [143] A. J. Laub, Matrix analysis for scientists and engineers, vol. 91. Siam, 2005. [144] G. Meanti, L. Carratino, L. Rosasco, and A. Rudi, “Kernel methods through the roof: han- dling billions of points efficiently,” arXiv preprint arXiv:2006.10350, 2020. [145] G. Ristanoski, W. Liu, and J. Bailey, “Discrimination aware classification for imbalanced datasets,” in ACM International Conference on Information & Knowledge Management, pp. 1529–1532, ACM, 2013. [146] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala, “The price of fair pca: One extra dimension,” in Advances in Neural Information Processing Systems, pp. 10999–11010, 2018. [147] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond dis- parate treatment & disparate impact: Learning classification without disparate mistreat- ment,” in International Conference on World Wide Web, pp. 1171–1180, International World Wide Web Conferences Steering Committee, 2017. [148] I. Žliobaite, F. Kamiran, and T. Calders, “Handling conditional discrimination,” in Interna- tional Conference on Data Mining, pp. 992–1001, IEEE, 2011. [149] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, pp. 630–645, Springer, 2016. [150] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [151] Y. Li, K. Swersky, and R. 
Zemel, “Learning unbiased features,” arXiv preprint arXiv:1412.5244, 2014. [152] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in Inter- national Conference on Computational Learning Theory, 2001. [153] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2651–2667, 2006. [154] K. Fukumizu, F. R. Bach, and M. I. Jordan, “Dimensionality reduction for supervised learn- ing with reproducing kernel hilbert spaces,” Journal of Machine Learning Research, vol. 5, no. Jan, pp. 73–99, 2004. [155] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, May, vol. 23, p. 2016, 2016. [156] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in com- mercial gender classification,” in Conference on Fairness, Accountability and Transparency, pp. 77–91, 2018. 115 [157] A. Datta, M. C. Tschantz, and A. Datta, “Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination,” Proceedings on Privacy Enhancing Technolo- gies, vol. 2015, no. 1, pp. 92–112, 2015. [158] L. Sweeney, “Discrimination in online ad delivery,” Queue, vol. 11, no. 3, pp. 10–29, 2013. [159] A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automatically from lan- guage corpora contain human-like biases,” Science, vol. 356, no. 6334, pp. 183–186, 2017. [160] S. Verma and J. Rubin, “Fairness definitions explained,” in IEEE/ACM International Work- shop on Software Fairness (FairWare), pp. 1–7, IEEE, 2018. [161] T. Kamishima, S. Akaho, and J. Sakuma, “Fairness-aware learning through regularization approach,” in International Conference on Data Mining Workshops, pp. 643–650, IEEE, 2011. [162] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond dis- parate treatment & disparate impact: Learning classification without disparate mistreat- ment,” in International Conference on World Wide Web, pp. 
1171–1180, 2017. [163] Y. Bechavod and K. Ligett, “Penalizing unfairness in binary classification,” arXiv preprint arXiv:1707.00044, 2017. [164] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth, “A convex framework for fair regression,” arXiv preprint arXiv:1706.02409, 2017. [165] M. Kearns, S. Neel, A. Roth, and Z. S. Wu, “Preventing fairness gerrymandering: Audit- ing and learning for subgroup fairness,” in International Conference on Machine Learning, pp. 2564–2572, 2018. [166] B. Fish, J. Kun, and Á. D. Lelkes, “A confidence-based approach for balancing fairness and accuracy,” in International Conference on Data Mining, pp. 144–152, SIAM, 2016. [167] D. Alabi, N. Immorlica, and A. T. Kalai, “Unleashing linear optimizers for group-fair learn- ing and optimization,” arXiv preprint arXiv:1804.04503, 2018. [168] L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi, “Classification with fairness con- straints: A meta-algorithm with provable guarantees,” in Conference on Fairness, Account- ability, and Transparency, pp. 319–328, 2019. [169] T. Calders, F. Kamiran, and M. Pechenizkiy, “Building classifiers with independency con- straints,” in IEEE International Conference on Data Mining Workshops, pp. 13–18, 2009. [170] F. Kamiran and T. Calders, “Classifying without discriminating,” in International Confer- ence on Computer, Control and Communication, pp. 1–6, 2009. 116 [171] F. Kamiran and T. Calders, “Data preprocessing techniques for classification without dis- crimination,” Knowledge and Information Systems, vol. 33, no. 1, pp. 1–33, 2012. [172] M. Wick, J.-B. Tristan, et al., “Unlocking fairness: a trade-off revisited,” Advances in neural information processing systems, vol. 32, 2019. [173] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The Col- lected Works of Wassily Hoeffding, pp. 409–426, Springer, 1994. [174] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. 
Schölkopf, “Distinguishing cause from effect using observational data: methods and benchmarks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1103–1204, 2016.
[175] D. Greenfeld and U. Shalit, “Robust learning with the hilbert-schmidt independence criterion,” in International Conference on Machine Learning, pp. 3759–3768, PMLR, 2020.
[176] S. Yu, F. Alesiani, X. Yu, R. Jenssen, and J. C. Principe, “Measuring dependence with matrix-based entropy functional,” arXiv preprint arXiv:2101.10160, 2021.
[177] B. Sadeghi, L. Wang, and V. N. Boddeti, “Adversarial representation learning with closed-form solvers,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 731–748, Springer, 2021.
[178] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio, “Learning with a wasserstein loss,” arXiv preprint arXiv:1506.05439, 2015.
[179] L. G. S. Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2014.
[180] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” Advances in Neural Information Processing Systems, vol. 19, 2006.
[181] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon, “Learning controllable fair representations,” International Conference on Artificial Intelligence and Statistics, 2019.
[182] B. Sadeghi and V. N. Boddeti, “Imparting fairness to pre-trained biased representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2020.

APPENDIX

A.1 Proof of Lemma 3.1

Lemma. Let $X$ and $U$ be two RVs with $\mathbb{E}[X] = 0$ and $\mathbb{E}[U] = b$, where $C_X \succ 0$. Consider a linear regressor $\hat{U} = W Z + b$, where $W \in \mathbb{R}^{d_U \times r}$ is the parameter matrix and $Z \in \mathbb{R}^{r}$ is an encoded version of $X$ for a given $\Theta \colon X \mapsto Z = \Theta X$, $\Theta \in \mathbb{R}^{r \times d_X}$.
The minimum MSE that can be achieved by designing $W$ is
\[
\min_W \mathbb{E}\big[\|U - \widehat{U}\|_2^2\big] = \mathrm{Tr}[C_U] - \big\|P_M\, Q_X^{-T} C_{XU}\big\|_F^2,
\]
where $M = Q_X \Theta^T \in \mathbb{R}^{d_X \times r}$ and $C_X = Q_X^T Q_X$ (Cholesky factorization).

Proof. Direct calculation yields
\begin{align*}
J_U &:= \mathbb{E}\big[\|U - \widehat{U}\|^2\big] \\
&= \mathrm{Tr}\,\mathbb{E}\big[(U - b - WZ)(U - b - WZ)^T\big] \\
&= \mathrm{Tr}\,\mathbb{E}\big[(U-b)(U-b)^T + (W\Theta X)(W\Theta X)^T - (U-b)(W\Theta X)^T - (W\Theta X)(U-b)^T\big] \\
&= \mathrm{Tr}\big[C_U + (W\Theta)\, C_X (W\Theta)^T - C_{UX}(W\Theta)^T - (W\Theta)\, C_{UX}^T\big] \\
&= \mathrm{Tr}\big[C_U + (W\Theta Q_X^T)(W\Theta Q_X^T)^T - C_{UX}(W\Theta)^T - (W\Theta)\, C_{UX}^T\big] \\
&= \mathrm{Tr}\big[(W\Theta Q_X^T - C_{UX} Q_X^{-1})(W\Theta Q_X^T - C_{UX} Q_X^{-1})^T + C_U - (C_{UX} Q_X^{-1})(C_{UX} Q_X^{-1})^T\big] \\
&= \big\|M W^T - Q_X^{-T} C_{XU}\big\|_F^2 - \big\|Q_X^{-T} C_{XU}\big\|_F^2 + \mathrm{Tr}[C_U],
\end{align*}
where $M = Q_X \Theta^T$. Hence, the minimizer of $J_U$ is obtained by minimizing the first term in the last equation, which is a standard least-squares problem; its minimizer is $W^T = M^{\dagger} Q_X^{-T} C_{XU}$. Finally, using the orthogonal decompositions
\[
\big\|Q_X^{-T} C_{XU}\big\|_F^2 = \big\|P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2
\]
and
\begin{align*}
\big\|M W^T - Q_X^{-T} C_{XU}\big\|_F^2 &= \big\|M W^T - P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2 \\
&= \big\|\underbrace{M M^{\dagger}}_{P_M} Q_X^{-T} C_{XU} - P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2 \\
&= \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2,
\end{align*}
we obtain the minimum value of $J_U$ as $\mathrm{Tr}[C_U] - \big\|P_M Q_X^{-T} C_{XU}\big\|_F^2$.

A.2 Relation Between the Constrained Optimization Problem in (3.7) and Its Scalarization in (3.8)

Consider the optimization problem in (3.7),
\[
G_{\alpha} = \arg\min_G J_Y(G), \quad \text{s.t. } J_S(G) \geq \alpha, \tag{1}
\]
and the optimization problem in (3.8),
\[
G_{\lambda} = \arg\min_G J_{\lambda}(G), \tag{2}
\]
where $J_{\lambda}(G) = (1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$, $\lambda \in [0, 1)$.

Claim. For each $\lambda \in [0, 1)$, the solution $G_{\lambda}$ of (2) is also a solution of (1) with
\[
\alpha = J_S(G_{\lambda}). \tag{3}
\]

Proof. Let us consider (1) while assuming that (2) is satisfied. For each $\lambda$ and corresponding solution $G_{\lambda}$, let $\alpha$ be given as in (3). For an arbitrary $G$ satisfying $J_S(G) \geq \alpha$, we have
\[
(1-\lambda)\, J_Y(G_{\lambda}) - \lambda\, \alpha = (1-\lambda)\, J_Y(G_{\lambda}) - \lambda\, J_S(G_{\lambda}) \leq (1-\lambda)\, J_Y(G) - \lambda\, J_S(G), \tag{4}
\]
where the second step follows from the assumption that (2) is satisfied. Consequently, we have
\[
(1-\lambda)\, \big[J_Y(G) - J_Y(G_{\lambda})\big] \geq \lambda\, \big[J_S(G) - \alpha\big] \geq 0.
\]
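As a quick sanity check of Lemma 3.1, the closed form can be verified numerically by treating a finite, centered sample as the population. The sketch below (all names, dimensions, and the random data are illustrative, not from the thesis) compares a direct least-squares fit of $W$ against the formula $\mathrm{Tr}[C_U] - \|P_M Q_X^{-T} C_{XU}\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
dX, dU, r, n = 5, 3, 2, 400

# Treat a centered empirical distribution as the population.
X = rng.standard_normal((dX, n)); X -= X.mean(axis=1, keepdims=True)
U = rng.standard_normal((dU, dX)) @ X + 0.1 * rng.standard_normal((dU, n))
U -= U.mean(axis=1, keepdims=True)
Theta = rng.standard_normal((r, dX))          # fixed linear encoder, Z = Theta X

C_X  = X @ X.T / n                            # covariance of X (full rank here)
C_XU = X @ U.T / n
C_U  = U @ U.T / n

# Direct route: least-squares fit of W in U ~ W Theta X, then the empirical MSE.
Z = Theta @ X
W = np.linalg.lstsq(Z.T, U.T, rcond=None)[0].T
mse_direct = np.sum((U - W @ Z) ** 2) / n

# Lemma 3.1 route: C_X = Q_X^T Q_X (Cholesky), M = Q_X Theta^T, P_M projection.
L = np.linalg.cholesky(C_X)                   # C_X = L L^T, so Q_X = L^T
Q_X = L.T
M = Q_X @ Theta.T
P_M = M @ np.linalg.pinv(M.T @ M) @ M.T
mse_formula = np.trace(C_U) - np.linalg.norm(P_M @ np.linalg.inv(Q_X.T) @ C_XU, "fro") ** 2

assert np.isclose(mse_direct, mse_formula)    # the two routes agree
```

The two quantities agree to numerical precision, since the projection identity $Q_X^{-1} P_M Q_X^{-T} = \Theta^T (\Theta C_X \Theta^T)^{\dagger} \Theta$ reduces the formula to the usual least-squares residual.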
Since $J_S(G) \geq \alpha$, it follows that $J_Y(G) \geq J_Y(G_{\lambda})$, and consequently $G_{\lambda}$ is a minimizer of problem (1).

A.3 Proof of Theorem 3.2

Theorem. As a function of $G \in \mathbb{R}^{d_X \times r}$, the objective function in equation (3.8) is neither convex nor differentiable.

Proof. Recall that $P_G$ is equal to $G (G^T G)^{\dagger} G^T$. Therefore, due to the involvement of the pseudo-inverse, (3.8) is not differentiable (see [105]). For non-convexity, we use the fact that $f(G)$ is convex in $G \in \mathbb{R}^{d_X \times r}$ if and only if $h(t) = f(t\, G_1 + G_2)$ is convex in $t \in \mathbb{R}$ for any constant matrices $G_1, G_2 \in \mathbb{R}^{d_X \times r}$ (see [133]). In order to use this fact, consider the rank-one matrices
\[
G_1 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}
\quad \text{and} \quad
G_2 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}.
\]
Define $G = t\, G_1 + G_2$. Then
\[
P_G(t) = G (G^T G)^{\dagger} G^T = \frac{1}{(t+1)^2 + 1} \begin{bmatrix} (t+1)^2 & (t+1) & 0 & \cdots & 0 \\ (t+1) & 1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}.
\]
Using basic properties of the trace together with Lemma 3.1, we get
\[
(1-\lambda)\, J_Y(G) - \lambda\, J_S(G) = \mathrm{Tr}\big[P_G(t)\, B\big],
\]
where the matrix $B$ is given in (3.11). Writing $B$ as the symmetric matrix $[b_{ij}]$ with $b_{ij} = b_{ji}$, we obtain
\[
\mathrm{Tr}\big[P_G(t)\, B\big] = b_{11} + \frac{2 b_{12}(t+1) + b_{22} - b_{11}}{(t+1)^2 + 1}.
\]
It can be shown that this function of $t$ is convex only if $b_{12} = 0$ and $b_{11} = b_{22}$. On the other hand, if these two conditions hold, it can similarly be shown that $(1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$ is non-convex by considering a different pair of matrices $G_1$ and $G_2$. This implies that $(1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$ is not convex.

A.4 Proof of Theorem 3.3

Theorem. Assume that the number of negative eigenvalues of $B$ in (3.11) is $j$, and denote $\gamma = \min\{r, j\}$. Then the minimum value in (3.9) is given as $\beta_1 + \beta_2 + \cdots + \beta_{\gamma}$, where $\beta_1 \leq \beta_2 \leq \cdots \leq \beta_{\gamma} < 0$ are the $\gamma$ smallest eigenvalues of $B$.
The minimum can be attained by $G = V$, where the columns of $V$ are the eigenvectors corresponding to the $\gamma$ smallest (negative) eigenvalues of $B$.

Proof. Consider the inner optimization problem of (3.10) in (3.9). Using the trace optimization problems and their solutions in [90], we get
\[
\min_{G^T G = I_i} J_{\lambda}(G) = \min_{G^T G = I_i} \mathrm{Tr}\big[G^T B G\big] = \beta_1 + \beta_2 + \cdots + \beta_i,
\]
where $\beta_1, \beta_2, \ldots, \beta_i$ are the $i$ smallest eigenvalues of $B$, and the minimum can be achieved by the matrix $V$ whose columns are the corresponding eigenvectors. If the number of negative eigenvalues of $B$ is less than $r$, then the optimal $i$ in (3.9) is $j$; otherwise, the optimal $i$ is $r$.

A.5 Non-Linear Extension Through Kernelization

We assume that $X$ is non-linearly mapped to $\phi_X(X)$ as illustrated in Figure 3.5. Recall from (2.6) that $Z = \Theta\, [k_X(x_1, X), \cdots, k_X(x_n, X)]^T$. For a given fixed $\Theta$, we have
\[
J_Y = \min_{W_Y, b_Y} \mathrm{MSE}(\widehat{Y} - Y).
\]
Note that the above optimization problem can be separated over $W_Y$ and $b_Y$. Therefore, for a given $W_Y$, we first minimize over $b_Y$:
\[
\min_{b_Y} \mathbb{E}\big[\|W_Y Z + b_Y - Y\|^2\big] = \min_{b_Y} \frac{1}{n} \sum_{k=1}^{n} \|W_Y z_k + b_Y - y_k\|^2 = \frac{1}{n} \sum_{k=1}^{n} \|W_Y z_k + c - y_k\|^2,
\]
where the minimizer $c$ is
\[
c = \frac{1}{n} \sum_{k=1}^{n} (y_k - W_Y z_k) = \frac{1}{n} \sum_{k=1}^{n} y_k - W_Y\, \frac{1}{n} \sum_{k=1}^{n} z_k = \mathbb{E}\{Y\} - W_Y\, \mathbb{E}\{Z\}. \tag{6}
\]
Let all the columns of $C$ be equal to $c$. Therefore, we now have
\begin{align*}
\min_{W_Y, b_Y} \mathrm{MSE}(\widehat{Y} - Y) &= \min_{W_Y} \frac{1}{n} \big\|W_Y \Theta K_X + C - Y\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|W_Y \Theta K_X H - \widetilde{Y}\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|H K_X \Theta^T W_Y^T - \widetilde{Y}^T\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|M W_Y^T - P_M \widetilde{Y}^T\big\|_F^2 + \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \tag{7} \\
&= \frac{1}{n} \big\|\underbrace{M M^{\dagger}}_{P_M} \widetilde{Y}^T - P_M \widetilde{Y}^T\big\|_F^2 + \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|P_M \widetilde{Y}^T\big\|_F^2,
\end{align*}
where the second step is due to (6), $M = H K_X \Theta^T$, and the fourth step is the orthogonal decomposition w.r.t. $M$. Using the same approach, we get
\[
J_S = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|P_M \widetilde{S}^T\big\|_F^2. \tag{8}
\]
Assume that the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. For any $M$, there exists $G$ such that $L_X G = M$. In general, there is no bijection between $\Theta$ and $G$ in the equality $H K_X \Theta^T = L_X G$.
However, there is a bijection between $G$ and $\Theta$ when we restrict attention to $\Theta$'s for which $\mathcal{R}(\Theta^T) \subseteq \mathcal{N}(H K_X)^{\perp}$. This restricted bijection is sufficient, since for any $\Theta^T$ with columns in $\mathcal{N}(H K_X)$ we have $M = 0$. Once $G$ is determined, $\Theta^T$ can be obtained as
\[
\Theta^T = (H K_X)^{\dagger} L_X G + \Theta_0, \qquad \mathcal{R}(\Theta_0) \subseteq \mathcal{N}(H K_X).
\]
However, since
\[
\|\Theta\|_F^2 = \big\|\Theta^T\big\|_F^2 = \big\|(H K_X)^{\dagger} L_X G\big\|_F^2 + \|\Theta_0\|_F^2,
\]
choosing $\Theta_0 = 0$ results in the minimum $\|\Theta\|_F$, which is favorable in terms of robustness to noise. Similar to (3.6), we have $P_M = L_X P_G L_X^T$. If we assume that the rank of $P_G$ is $i$, then $J_{\lambda}(G)$ in (3.10) can be expressed (up to additive constants) as
\[
J_{\lambda}(G) = \lambda\, \big\|L_X G G^T L_X^T \widetilde{S}^T\big\|_F^2 - (1-\lambda)\, \big\|L_X G G^T L_X^T \widetilde{Y}^T\big\|_F^2,
\]
where $P_G = G G^T$ for some column-orthonormal matrix $G$. This resembles the optimization problem in (3.9) and therefore has the same solution as Theorem 3.3 with the modified
\[
B = L_X^T \big(\lambda\, \widetilde{S}^T \widetilde{S} - (1-\lambda)\, \widetilde{Y}^T \widetilde{Y}\big)\, L_X. \tag{9}
\]
Once $G$ is determined, $\Theta$ can be computed as $\Theta = G^T L_X^T (H K_X)^{\dagger}$.

A.6 Proof of Theorem 3.4

Theorem. Let the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. Further, assume that the columns of $V_S$ are the singular vectors corresponding to the zero singular values of $\widetilde{S} L_X$, and the columns of $V_Y$ are the singular vectors corresponding to the non-zero singular values of $\widetilde{Y} L_X$. Then, we have
\begin{align*}
\gamma_{\min} &:= \min_{\Theta} J_Y(\Theta) = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X\big\|_F^2, \\
\gamma_{\max} &:= \min_{\Theta \in \arg\max J_S(\Theta)} J_Y(\Theta) = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X V_S\big\|_F^2, \\
\alpha_{\min} &:= \max_{\Theta \in \arg\min J_Y(\Theta)} J_S(\Theta) = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{S} L_X V_Y\big\|_F^2, \\
\alpha_{\max} &:= \max_{\Theta} J_S(\Theta) = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2.
\end{align*}

Proof. Firstly, recall from the previous section that instead of $\Theta$ we consider $G$. These two matrices are related by $H K_X \Theta^T = L_X G = M$, where the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. Therefore, we can express the projection onto $M$ in terms of the projection onto $G$, i.e., $P_M = L_X P_G L_X^T$.
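The kernelized recipe of Section A.5 (an orthonormal basis $L_X$ for $HK_X$, the modified $B$ in (9), eigenvectors of the negative eigenvalues, and the recovery $\Theta = G^T L_X^T (HK_X)^{\dagger}$) can be sketched numerically as follows. This is an illustrative implementation, not the thesis code: the RBF kernel, median bandwidth, data, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, lam = 60, 2, 0.5

X = rng.standard_normal((n, 4))
Y = X @ rng.standard_normal((4, 1)) + 0.1 * rng.standard_normal((n, 1))
S = X @ rng.standard_normal((4, 1)) + 0.1 * rng.standard_normal((n, 1))

# RBF Gram matrix K_X (an illustrative kernel choice).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K_X = np.exp(-sq / (2 * np.median(sq)))

H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
Yt = (H @ Y).T                                 # centered targets, tilde-Y
St = (H @ S).T                                 # centered sensitive attribute, tilde-S

# Orthonormal basis L_X for the column space of H K_X (via thin SVD).
Um, sv, _ = np.linalg.svd(H @ K_X)
L_X = Um[:, sv > 1e-10 * sv[0]]

# Modified B from (9); keep eigenvectors of its negative eigenvalues.
B = L_X.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_X
w, V = np.linalg.eigh(B)                       # eigenvalues in ascending order
G = V[:, w < 0][:, :r]                         # at most r most-negative directions

# Recover the kernelized encoder: Theta = G^T L_X^T (H K_X)^dagger.
Theta = G.T @ L_X.T @ np.linalg.pinv(H @ K_X)
Z = Theta @ K_X                                # train-set embedding, (dim <= r) x n
assert Z.shape == (G.shape[1], n)
```

The pseudo-inverse recovery corresponds to the minimum-norm choice $\Theta_0 = 0$ discussed above, which is the preferred solution for noise robustness.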
Using (7), we get
\begin{align*}
\gamma_{\min} &= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{\Theta} \big\|P_M \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{G} \big\|L_X P_G L_X^T \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{i} \Big\{ \max_{G^T G = I_i} \mathrm{Tr}\big[G^T L_X^T \widetilde{Y}^T \widetilde{Y} L_X G\big] \Big\} \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \mathrm{Tr}\big[V_Y^T L_X^T \widetilde{Y}^T \widetilde{Y} L_X V_Y\big] \tag{10} \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \sum_{\sigma_k > 0} \sigma_k^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X\big\|_F^2,
\end{align*}
where the fourth step is borrowed from the trace optimization problems studied in [90] and the $\sigma_k$'s are the singular values of $\widetilde{Y} L_X$.

In order to interpret the bounds in more detail, we consider the one-dimensional case where $X, Y \in \mathbb{R}$. In this setting, the correlation coefficient (denoted by $\rho(\cdot, \cdot)$) between $X$ and $Y$ is
\[
\rho(X, Y) = \frac{\widetilde{Y} \widetilde{X}^T}{\sqrt{\widetilde{Y} \widetilde{Y}^T\, \widetilde{X} \widetilde{X}^T}} = \frac{\big\|\widetilde{Y} L_X\big\|_F}{\sqrt{n}\, \sigma_Y} = \sqrt{1 - \frac{\gamma_{\min}}{\sigma_Y^2}}, \tag{11}
\]
where $\sigma_Y^2 = \|\widetilde{Y}\|_F^2 / n$. As a result, the normalized MSE can be expressed as
\[
\frac{\gamma_{\min}}{\sigma_Y^2} = 1 - \rho^2(X, Y). \tag{12}
\]
Therefore, the lower bound of the target's MSE is independent of the encoder and is only related to the alignment between the subspaces spanned by the data and the labels.

Next, we find an encoder that allows the target task to obtain its optimal loss, $\gamma_{\min}$, while leaking as little of the sensitive attribute as possible. Thus, we constrain the domain of the encoder to $\{\arg\min_{\Theta} J_Y(\Theta)\}$. Assume that the columns of the encoder $G$ are the concatenation of the columns of $V_Y$ together with at least one singular vector corresponding to a zero singular value of $\widetilde{Y} L_X$. Then $V_Y \subseteq G$, and consequently $\|L_X P_{V_Y} L_X^T U\|_F^2 \leq \|L_X P_G L_X^T U\|_F^2$ for an arbitrary matrix $U$. As a result, $J_S(G) \leq J_S(V_Y)$, while at the same time $J_Y(G) = J_Y(V_Y)$; hence the maximum of $J_S$ over this constrained domain is attained at $G = V_Y$. The latter equality can be observed from
\begin{align*}
\big\|L_X P_G L_X^T \widetilde{Y}^T\big\|_F^2 &= \big\|\widetilde{Y} L_X G G^T L_X^T\big\|_F^2 \\
&= \big\|\widetilde{Y} L_X V_Y V_Y^T L_X^T\big\|_F^2 \tag{13} \\
&= \big\|L_X P_{V_Y} L_X^T \widetilde{Y}^T\big\|_F^2.
\end{align*}
We then have
\begin{align*}
\alpha_{\min} &= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|L_X P_{V_Y} L_X^T \widetilde{S}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \mathrm{Tr}\big[V_Y^T L_X^T \widetilde{S}^T \widetilde{S} L_X V_Y\big] \tag{14} \\
&= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{S} L_X V_Y\big\|_F^2.
\end{align*}
This bound can again be interpreted under the one-dimensional setting $X, S \in \mathbb{R}$ as
\[
\frac{\alpha_{\min}}{\sigma_S^2} = 1 - \rho^2(X, S). \tag{15}
\]
On the other hand, $\alpha_{\max}$ turns out to be
\[
\alpha_{\max} = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 = \sigma_S^2, \tag{16}
\]
which can be achieved via the trivial choice $G = 0$. However, we instead let the columns of $G$ be the singular vectors corresponding to all the zero singular values of $\widetilde{S} L_X$, in order to maximize $\|P_M \widetilde{Y}^T\|_F$ while at the same time ensuring that $J_S(G)$ equals $\alpha_{\max}$. As a result, we have
\[
\gamma_{\max} = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X V_S\big\|_F^2.
\]
For the one-dimensional case, i.e., $X, Y, S \in \mathbb{R}$, we get $V_S = 0$ and consequently
\[
\gamma_{\max} = \sigma_Y^2. \tag{17}
\]

B.1 A Population Expression for the Definition in (5.6)

A population expression for $\mathrm{Dep}(Z, S)$ in (5.6) is given in the following:
\begin{align*}
\mathrm{Dep}(Z, S) = \sum_{j=1}^{r} \Big\{ &\mathbb{E}_{X,S,X',S'}\big[f_j(X)\, f_j(X')\, k_S(S, S')\big] \\
+\; &\mathbb{E}_X\big[f_j(X)\big]\, \mathbb{E}_{X'}\big[f_j(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] \\
-\; 2\, &\mathbb{E}_{X,S}\big[f_j(X)\, \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] \Big\},
\end{align*}
where $(X', S')$ is independent of $(X, S)$ with the same distribution $p_{XS}$.

Proof. We first note that this population expression is inspired by that of HSIC [81]. Consider the operator $\Sigma_{SX}$ induced by the linear functional $\mathrm{Cov}(\alpha(X), \beta_S(S)) = \langle \beta_S, \Sigma_{SX}\, \alpha \rangle_{\mathcal{H}_S}$. Then, it follows that
\begin{align*}
\mathrm{Dep}(Z, S) &= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \mathrm{Cov}^2\big(f_j(X), \beta_S(S)\big) \\
&= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \big\langle \beta_S, \Sigma_{SX} f_j \big\rangle_{\mathcal{H}_S}^2 \\
&\overset{(a)}{=} \sum_{j=1}^{r} \big\|\Sigma_{SX} f_j\big\|_{\mathcal{H}_S}^2 \\
&= \sum_{j=1}^{r} \big\langle \Sigma_{SX} f_j, \Sigma_{SX} f_j \big\rangle_{\mathcal{H}_S} \\
&\overset{(b)}{=} \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), (\Sigma_{SX} f_j)(S)\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), \langle k_S(\cdot, S), \Sigma_{SX} f_j \rangle_{\mathcal{H}_S}\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), \mathrm{Cov}(f_j(X'), k_S(S', S))\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X),\; \mathbb{E}_{X',S'}[f_j(X')\, k_S(S, S')] - \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big) \\
&= \sum_{j=1}^{r} \Big\{ \mathbb{E}_{X,S,X',S'}\big[f_j(X)\, f_j(X')\, k_S(S, S')\big] + \mathbb{E}_X\big[f_j(X)\big]\, \mathbb{E}_{X'}\big[f_j(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] \\
&\qquad\quad - 2\, \mathbb{E}_{X,S}\big[f_j(X)\, \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] \Big\},
\end{align*}
where $(a)$ is due to the Parseval relation for an orthonormal basis and $(b)$ is from the definition of $\Sigma_{SX}$.

B.2 Proof of Lemma 5.3

Lemma.
Let $K_X, K_S \in \mathbb{R}^{n \times n}$ be the Gram matrices corresponding to $\mathcal{H}_X$ and $\mathcal{H}_S$, respectively, i.e., $(K_X)_{ij} = k_X(x_i, x_j)$ and $(K_S)_{ij} = k_S(s_i, s_j)$, where the covariance is empirically estimated as
\[
\mathrm{Cov}\big(f_j(X), \beta_S(S)\big) \approx \frac{1}{n} \sum_{i=1}^{n} f_j(x_i)\, \beta_S(s_i) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{k=1}^{n} f_j(x_i)\, \beta_S(s_k).
\]
It follows that the corresponding empirical estimate of $\mathrm{Dep}(Z, S)$ is
\[
\mathrm{Dep}_{\mathrm{emp}}(Z, S) := \frac{1}{n^2} \big\|\Theta K_X H L_S\big\|_F^2, \tag{18}
\]
where $H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T$ is the centering matrix, and $L_S$ is a full column-rank matrix in which $L_S L_S^T = K_S$ (Cholesky factorization). Furthermore, the empirical estimator in (5.7) has a bias of $O(n^{-1})$ and a convergence rate of $O(n^{-1/2})$.

Proof. Firstly, let us reconstruct the orthonormal set $U_S$ when $n$ i.i.d. observations $\{s_j\}_{j=1}^{n}$ are given. Invoking the representer theorem, for two arbitrary elements $\beta_i$ and $\beta_m$ in $U_S$, we have
\begin{align*}
\langle \beta_i, \beta_m \rangle_{\mathcal{H}_S} &= \Big\langle \sum_{j=1}^{n} \alpha_j\, k_S(s_j, \cdot),\; \sum_{l=1}^{n} \eta_l\, k_S(s_l, \cdot) \Big\rangle_{\mathcal{H}_S} \\
&= \sum_{j=1}^{n} \sum_{l=1}^{n} \alpha_j\, \eta_l\, k_S(s_j, s_l) \\
&= \alpha^T K_S\, \eta \\
&= \big\langle L_S^T \alpha,\; L_S^T \eta \big\rangle_{\mathbb{R}^q},
\end{align*}
where $L_S \in \mathbb{R}^{n \times q}$ is a full column-rank matrix and $K_S = L_S L_S^T$ is the Cholesky factorization of $K_S$. As a result, searching for $\beta_i \in U_S$ is equivalent to searching for $L_S^T \alpha \in U_q$, where $U_q$ is any complete orthonormal set for $\mathbb{R}^q$. Using the empirical expression for the covariance, we get
\begin{align*}
\mathrm{Dep}_{\mathrm{emp}}(Z, S) &:= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \Big\{ \frac{1}{n} \sum_{i=1}^{n} f_j(x_i)\, \beta_S(s_i) - \frac{1}{n^2} \sum_{i=1}^{n} f_j(x_i) \sum_{k=1}^{n} \beta_S(s_k) \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X K_S\, \alpha - \frac{1}{n^2}\, \theta_j^T K_X \mathbf{1}_n \mathbf{1}_n^T K_S\, \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H K_S\, \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H L_S L_S^T \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{\zeta \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H L_S\, \zeta \Big\}^2 \\
&= \frac{1}{n^2} \sum_{\zeta \in U_q} \big\|\Theta K_X H L_S\, \zeta\big\|_2^2 \\
&= \frac{1}{n^2} \big\|\Theta K_X H L_S\big\|_F^2,
\end{align*}
where $f(X) = \Theta\, [k_X(x_1, X), \cdots, k_X(x_n, X)]^T$ and $\Theta := [\theta_1, \cdots, \theta_r]^T$.

We now show that the bias of $\mathrm{Dep}_{\mathrm{emp}}(Z, S)$ for estimating $\mathrm{Dep}(Z, S)$ in (5.7) is $O\big(\frac{1}{n}\big)$. To
To 127 achieve this, we split Depepm (Z, S) into three terms as, 1 1 n o 2 T kΘK X HL S Fk = Tr ΘK X HK S HK X Θ n2 n2       1 1 T 1 T T = Tr ΘKX I − 11 KS I − 11 KX Θ n2 n n 1 n o 2 n o = Tr K Θ T ΘK K − Tr 1 T K ΘT ΘK K 1 X X S X X S n2 | {z } n | 3 {z } I II 1 n o + 4 Tr 1T KX ΘT ΘKX 11T KS 1 (19) n | {z } III Let cn p denote the set of all p-tuples drawn without replacement from {1, · · · , n}. Moreover, let Θ = [θ1 , · · · , θr ]T ∈ Rr×n and (A)ij denote the element of an arbitrary matrix A at i-th row and j-th column. Then, it follows that (I): h n oi E Tr KX ΘT ΘKX KS    r     X   = T E Tr KX θk θk KX KS    | {z }   k=1 :=αk Xr h n oi = E Tr αk αTk KS k=1   Xr X X  = E  (αk αTk )ii (KS )ii + (αk αTk )ij (KS )ji  k=1 i (i,j)∈cn 2 Xr h i = n EX,S fk2 (X)kS (S, S) k=1 X r n!   + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (n − 2)! k=1 Xr n!   = O(n) + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (20) (n − 2)! k=1 where (X, S) and (X 0 , S 0 ) are independently drawn from the joint distribution pXS . (II): h i E 1T KX ΘT ΘKX KS 1   r X   = E 1T KX θk θkT KX KS 1 | {z } k=1 αk 128 Xr h i = T T E 1 α k α k KS 1 k=1   Xr Xn X n X n = E (αk αTk )mi (KS )mj  k=1 m=1 i=1 j=1   r X X X  = E  (αk αTk )ii (KS )ii + (αk αTk )mm (KS )mj  k=1 i (m,j)∈cn2   Xr  X X  + E (αk αTk )mi (KS )mm + (αk αTk )mj (KS )mj  k=1 (m,i)∈cn2 (m,j)∈cn2   Xr X   + E (αk αTk )mi (KS )mj  k=1 (m,i,j)∈cn3 Xr h i = n EX,S fk2 (X)kS (S, S) k=1 n! X r h i + 2 EX,S,S 0 fk (X)kS (S, S ) 0 (n − 2)! k=1 X r n!   + EX,S,X 0 fk (X)fk (X 0 )kS (S, S) (n − 2)! k=1 X r n!   + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (n − 2)! k=1 X r n!   + EX,S fk (X)EX 0 [fk (X 0 )]ES 0 [kS (S, S 0 )] (n − 3)! k=1 X r n!   = EX,S fk (X)EX 0 [fk (X 0 )]ES 0 [kS (S, S 0 )] (n − 3)! k=1 + O(n2 ). 
(III):
\begin{align*}
\mathbb{E}\big[\mathbf{1}^T K_X \Theta^T \Theta K_X \mathbf{1} \mathbf{1}^T K_S \mathbf{1}\big] &= \sum_{k=1}^{r} \mathbb{E}\big[\mathbf{1}^T \alpha_k \alpha_k^T \mathbf{1}\, \mathbf{1}^T K_S \mathbf{1}\big] \\
&= \sum_{k=1}^{r} \mathbb{E}\Big[\sum_{i,j,m,l} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}\Big] \\
&= O(n^3) + \sum_{k=1}^{r} \mathbb{E}\Big[\sum_{(i,j,m,l) \in c^n_4} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}\Big] \\
&= \frac{n!}{(n-4)!} \sum_{k=1}^{r} \mathbb{E}_X[f_k(X)]\, \mathbb{E}_{X'}\big[f_k(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] + O(n^3). \tag{22}
\end{align*}
Using the above calculations together with Lemma 2 leads to
\[
\mathrm{Dep}(Z, S) = \mathbb{E}\big[\mathrm{Dep}_{\mathrm{emp}}(Z, S)\big] + O\Big(\frac{1}{n}\Big).
\]
We now obtain the convergence of $\mathrm{Dep}_{\mathrm{emp}}(Z, S)$. Consider the decomposition in (19) together with (20), (21), and (22). Let $\alpha_k := K_X \theta_k$; then it follows that
\begin{align*}
&\mathbb{P}\big\{\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S) \geq t\big\} \\
&\leq \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_{X,S,X',S'}\big[f_k(X) f_k(X')\, k_S(S, S')\big] - \frac{(n-2)!}{n!} \sum_{k=1}^{r} \sum_{(i,j) \in c^n_2} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ji} + O\Big(\frac{1}{n}\Big) \geq a t\Big\} \\
&\quad + \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_{X,S}\big[f_k(X)\, \mathbb{E}_{X'}[f_k(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] - \frac{(n-3)!}{n!} \sum_{k=1}^{r} \sum_{(i,j,m) \in c^n_3} (\alpha_k \alpha_k^T)_{mi} (K_S)_{mj} + O\Big(\frac{1}{n}\Big) \geq b t\Big\} \\
&\quad + \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_X[f_k(X)]\, \mathbb{E}_{X'}\big[f_k(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] - \frac{(n-4)!}{n!} \sum_{k=1}^{r} \sum_{(i,j,m,l) \in c^n_4} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml} + O\Big(\frac{1}{n}\Big) \geq (1 - a - b)\, t\Big\},
\end{align*}
where $a, b > 0$ and $a + b < 1$. For convenience, we omit the $O\big(\frac{1}{n}\big)$ terms and add them back at the last stage. Define $\zeta := (X, S)$ and consider the following U-statistics [173]:
\begin{align*}
u_1(\zeta_i, \zeta_j) &= \frac{(n-2)!}{n!} \sum_{(i,j) \in c^n_2} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ij}, \\
u_2(\zeta_i, \zeta_j, \zeta_m) &= \frac{(n-3)!}{n!} \sum_{(i,j,m) \in c^n_3} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{mi} (K_S)_{mj}, \\
u_3(\zeta_i, \zeta_j, \zeta_m, \zeta_l) &= \frac{(n-4)!}{n!} \sum_{(i,j,m,l) \in c^n_4} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}.
\end{align*}
Then, from Hoeffding's inequality [173], it follows that
\[
\mathbb{P}\big\{\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S) \geq t\big\} \leq e^{\frac{-2 a^2 t^2 n}{2 r^2 M^2}} + e^{\frac{-2 b^2 t^2 n}{3 r^2 M^2}} + e^{\frac{-2 (1-a-b)^2 t^2 n}{4 r^2 M^2}},
\]
where we assumed that $k_S(\cdot, \cdot)$ is bounded by one and $f_k^2(x_i)$ is bounded by $M$ for any $k = 1, \cdots, r$ and $i = 1, \cdots, n$. Further, if $0.22 \leq a < 1$, it holds that
\[
e^{\frac{-2 a^2 t^2 n}{2 r^2 M^2}} + e^{\frac{-2 b^2 t^2 n}{3 r^2 M^2}} + e^{\frac{-2 (1-a-b)^2 t^2 n}{4 r^2 M^2}} \leq 3\, e^{\frac{-a^2 t^2 n}{r^2 M^2}}.
\]
Consequently, we have
\[
\mathbb{P}\big\{\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \geq t\big\} \leq 6\, e^{\frac{-a^2 t^2 n}{r^2 M^2}}.
\]
Therefore, with probability at least $1 - \delta$, it holds that
\[
\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{a^2\, n}} + O\Big(\frac{1}{n}\Big). \tag{23}
\]

B.3 Proof of Theorem 5.4

Theorem. Let $Z = f(X)$ be an arbitrary representation of the input data, where $f \in \mathcal{H}_X$. Then, there exists an invertible Borel function $h$ such that $h \circ f$ belongs to $\mathcal{A}_r$.

Proof. Recall that the space of disentangled representations is
\[
\mathcal{A}_r := \Big\{ (f_1, \cdots, f_r)\; \Big|\; f_i, f_j \in \mathcal{H}_X,\; \mathrm{Cov}\big(f_i(X), f_j(X)\big) + \gamma\, \langle f_i, f_j \rangle_{\mathcal{H}_X} = \delta_{i,j} \Big\},
\]
where $\gamma > 0$. Let $I_X$ denote the identity operator from $\mathcal{H}_X$ to $\mathcal{H}_X$. We claim that $h := [h_1, \cdots, h_r]$, where
\[
G_0 = \begin{bmatrix} \langle f_1, f_1 \rangle_{\mathcal{H}_X} & \cdots & \langle f_1, f_r \rangle_{\mathcal{H}_X} \\ \vdots & \ddots & \vdots \\ \langle f_r, f_1 \rangle_{\mathcal{H}_X} & \cdots & \langle f_r, f_r \rangle_{\mathcal{H}_X} \end{bmatrix}, \qquad G = G_0^{-1/2},
\]
\[
h_j \circ f = \sum_{m=1}^{r} g_{jm}\, (\Sigma_{XX} + \gamma I_X)^{-1/2} f_m, \qquad \forall j = 1, \cdots, r,
\]
is the desired invertible transformation. To see this, construct
\begin{align*}
\mathrm{Cov}\big(h_i(f(X)), h_j(f(X))\big) + \gamma\, \langle h_i \circ f, h_j \circ f \rangle_{\mathcal{H}_X} &= \big\langle h_i \circ f,\; (\Sigma_{XX} + \gamma I_X)\, h_j \circ f \big\rangle_{\mathcal{H}_X} \\
&= \Big\langle \sum_{m=1}^{r} g_{im} (\Sigma_{XX} + \gamma I_X)^{-1/2} f_m,\; \sum_{k=1}^{r} g_{jk} (\Sigma_{XX} + \gamma I_X) (\Sigma_{XX} + \gamma I_X)^{-1/2} f_k \Big\rangle_{\mathcal{H}_X} \\
&= \sum_{m=1}^{r} \sum_{k=1}^{r} g_{im}\, g_{jk}\, \langle f_m, f_k \rangle_{\mathcal{H}_X} = (G\, G_0\, G)_{ij} = \delta_{i,j}.
\end{align*}
The inverse of $h$ is $h' := [h'_1, \cdots, h'_r]$, where $H = G_0^{1/2}$ and
\[
h'_j \circ h = \sum_{m=1}^{r} h_{jm}\, (\Sigma_{XX} + \gamma I_X)^{1/2}\, h_m, \qquad \forall j = 1, \cdots, r.
\]

B.4 Proof of Theorem 5.7

Theorem. Assume that $k_S$ and $k_Y$ are bounded by one and $f_j^2(x_i) \leq M$ for any $j = 1, \ldots, r$ and $i = 1, \ldots, n$, for which $f = (f_1, \ldots, f_r) \in \mathcal{A}_r$. Then, for any $n > 1$ and $0 < \delta < 1$, with probability at least $1 - \delta$, we have
\[
\Big|\sup_{f \in \mathcal{A}_r} J(f, \lambda) - \sup_{f \in \mathcal{A}_r} J_{\mathrm{emp}}(f, \lambda)\Big| \leq r M \sqrt{\frac{\log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]

Proof. Recall that in the proof of Lemma 5.3 we have shown that with probability at least $1 - \delta$, the following inequality holds:
\[
\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
Using the same reasoning for $\mathrm{Dep}(Z, Y)$, with probability at least $1 - \delta$, we have
\[
\big|\mathrm{Dep}(Z, Y) - \mathrm{Dep}_{\mathrm{emp}}(Z, Y)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
Since $J(f, \lambda) = (1 - \lambda)\, \mathrm{Dep}(Z, Y) - \lambda\, \mathrm{Dep}(Z, S)$ and $J_{\mathrm{emp}}(f, \lambda) := (1 - \lambda)\, \mathrm{Dep}_{\mathrm{emp}}(Z, Y) - \lambda\, \mathrm{Dep}_{\mathrm{emp}}(Z, S)$, it follows that with probability at least $1 - \delta$,
\[
\big|J(f, \lambda) - J_{\mathrm{emp}}(f, \lambda)\big| \leq r M \sqrt{\frac{\log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
We complete the proof by noting that the following inequality holds for any bounded $J$ and $J_{\mathrm{emp}}$:
\[
\Big|\sup_{f \in \mathcal{A}_r} J(f, \lambda) - \sup_{f \in \mathcal{A}_r} J_{\mathrm{emp}}(f, \lambda)\Big| \leq \sup_{f \in \mathcal{A}_r} \big|J(f, \lambda) - J_{\mathrm{emp}}(f, \lambda)\big|.
\]
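To make the estimator of Lemma 5.3 concrete, the quantity in (18) takes only a few lines to compute. The sketch below is illustrative, not taken from the thesis: it assumes RBF kernels with a median-heuristic bandwidth, uses an eigen-factorization of $K_S$ in place of a Cholesky factor (so that rank-deficient Gram matrices are handled), and evaluates $\mathrm{Dep}_{\mathrm{emp}}$ once with an input that is a function of the sensitive attribute and once with an independent input:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 200, 2

def rbf_gram(A):
    """RBF Gram matrix with a median-heuristic bandwidth (illustrative choice)."""
    sq = ((A[:, None] - A[None]) ** 2).sum(-1)
    return np.exp(-sq / (2 * np.median(sq) + 1e-12))

def dep_emp(Theta, K_X, K_S):
    """Empirical dependence (18): (1/n^2) * ||Theta K_X H L_S||_F^2."""
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # Full column-rank factor L_S with L_S L_S^T = K_S; an eigen-factorization
    # is used here because a Gram matrix can be rank-deficient, which would
    # defeat a plain Cholesky factorization.
    w, V = np.linalg.eigh(K_S)
    keep = w > 1e-10 * w.max()
    L_S = V[:, keep] * np.sqrt(w[keep])
    return np.linalg.norm(Theta @ K_X @ H @ L_S, "fro") ** 2 / n ** 2

S = rng.standard_normal((n, 1))
X_dep = S.copy()                                # input is a function of S
X_ind = rng.standard_normal((n, 1))             # input independent of S

Theta = rng.standard_normal((r, n)) / n         # a fixed (untrained) encoder
K_S = rbf_gram(S)
d_dep = dep_emp(Theta, rbf_gram(X_dep), K_S)
d_ind = dep_emp(Theta, rbf_gram(X_ind), K_S)
assert d_dep > d_ind                            # dependence is detected
```

Consistent with the $O(n^{-1})$ bias and $O(n^{-1/2})$ convergence established above, the estimate in the independent case shrinks toward zero as $n$ grows, while the dependent case stays bounded away from it.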