DATA-CENTRIC PRIVACY-PRESERVING MACHINE LEARNING

By Junyuan Hong

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

With the rapid popularization of machine learning (ML), privacy emerges as a critical obstacle to extracting knowledge from sensitive data into models. Traditional machine learning methods curate data from millions of clients (e.g., edge devices) and train models upon the data, which causes tremendous risks of leaking data providers' sensitive information. In this thesis, we develop learning algorithms that protect data providers' privacy based on three different ways of data use, ranging from central to distributed settings. First, we investigate private centralized learning (CL) under the rigorous notion of differential privacy (DP), where we search for an iterative, dynamic privacy allocation in gradient descent toward higher model utility. Our theoretical work shows that optimal privacy allocation can improve the sample efficiency of DP learning. Though data can be protected during learning, CL has to assume that the data-managing institute can be trusted, an assumption without technical guarantees. Rather than aggregating data, federated learning (FL) coordinates clients to periodically share models trained on local data. Confronting the great data and device heterogeneity among clients in FL, we propose novel algorithms that can effectively train models from clients with heterogeneous data distributions and device capabilities. One of our methods enhances knowledge transfer from a supervised domain to an unsupervised domain and can reduce the performance gap between clients from different social groups. We also developed a hardware-adaptive learning algorithm that makes FL inclusive for devices with various capabilities. Beyond training, our algorithm also enables models to be customized at test time to accommodate dynamic computation budgets. Though FL mitigates the risks by keeping data distributed, the local training of large models could still be a significant burden for resource-limited edge devices. Last, observing the limitations of CL in the privacy of data storage and of FL in computation, we propose a new computation paradigm, outsourcing training without uploading data (OT). To learn effective knowledge about private data, we sample a proximal proxy dataset from open-source data for cloud training. Our method can efficiently and effectively spot similar samples in privacy-free open-source data, and therefore can transfer the computation costs of training to the cloud server.

This thesis is dedicated to my parents.

ACKNOWLEDGEMENTS

First and foremost, I would like to extend my profound gratitude to my advisor, Dr. Jiayu Zhou, whose invaluable counsel, motivation, and unwavering support have been indispensable throughout my doctoral voyage. Dr. Zhou's role in my academic journey transcends that of a mere advisor, concurrently acting as a trusted confidante who imparts his extensive knowledge and experience in various facets of my Ph.D. odyssey. My experience in Dr. Zhou's research laboratory has been nothing short of extraordinary, a consequence of a fortunate decision taken years prior. The wealth of knowledge that I have accumulated under his guidance is vast, ranging from efficacious research methodologies to the art of influential leadership.
Ranked among the world's most esteemed advisors, Dr. Zhou's proclivity towards nurturing the growth of his students, demonstrated by his constant support and prompt responses, is unparalleled. His enduring support over the years has been a linchpin in my academic accomplishments, and for that, I will remain forever appreciative.

I would like to express my deepest appreciation to my dissertation committee members, Dr. Anil K. Jain, Dr. Sijia Liu, and Dr. Zhangyang Wang, for their insightful comments and helpful suggestions. Their expertise and feedback have played a pivotal role in shaping both this dissertation and my job talk. I would like to extend my profound gratitude to Dr. Zhangyang Wang, who has diligently co-supervised me alongside Dr. Zhou throughout my doctoral journey. His unflagging zeal, remarkable intellect, and unwavering professionalism have continuously served as the ideal I aspire to emulate in my future endeavors.

I was fortunate to work as an intern at Sony AI with amazing colleagues and mentors: Dr. Lingjuan Lyu, Dr. Michael Spranger, Yuyang Deng, Yi Zeng, Dr. Nidham Gazagnadou, Dr. Virat Shejwalkar, Jiahua Dong, Dr. Chen Chen, Dr. Weiming Zhuang, Dr. Jingtao Li, Yubin Hu, and Ronghang Zhu. I enjoyed the wonderful and productive months with you.

I am grateful to have had the pleasure and fortune of having supportive and encouraging colleagues during my Ph.D. I am thankful to all my colleagues from the Intelligent Data Analytics (ILLIDAN) Lab: Zhuangdi Zhu, Mengying Sun, Kaixiang Lin, Boyang Liu, Qi Wang, Inci Baytas, Ikechukwu Uchendu, Andy Tang, Liyang Xie, Jun Chen, Shuyang Yu, Haobo Zhang, Siqi Liang, Yijiang Pang, and Hoang Cao Bao. I would love to extend my deepest appreciation to my collaborators: Dr. Haotao Wang, Dr. Zhangyang Wang, Dr. Ruoxi Jia, Dr. Hiroko Dodge, Dr. Xiaomin Ouyang, Dr. Mehrdad Mahdavi, Dr. Yang Zhou, Dr. Vishnu Boddeti, Dr. Steve Drew, Dr. Aston Zhang, Dr. Zenglin Xu, and Dun Zeng.

I would also like to extend my heartfelt appreciation to my dear friends at MSU who provided unwavering support and encouragement during the moments when I felt frustrated during my Ph.D. study. Dr. Wei Wang, Dr. Kaixiong Zhou, Meijun Gao, Tianxudong Tang, Fei Zhang, Rundong Zhao, Xu Dong, Feiran Li, Xitong Zhang, Jingwen Shi, Wentao Wang, Xinyu Lei, Wei Jin, Wentao Bao, Junwen Chen, Xinda Qi, Pengyu Chu, Xiaorui Liu, Shaohua Yang, Qiaozi Gao, Dong Chen, Dr. Nan Du, Dr. Ding Wang, Mi Gong, Guangjing Wang, Ze Zhang, Ruoze Su, Hao Wang, Ruoqiao Chen, Pai Li, Xuan Xie, Cathy Bing, Zeyu Qin, Zhiyuan Ren, Shengjie Zhu, and Xinyi Wang, your friendship and belief in me have been truly invaluable.

Finally, I would like to thank my family and my girlfriend, Zhujie Hong, Yafen Chen, and Han Meng, for their unconditional love and support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Overview of Thesis Structure
1.3 Definition of Privacy-Preserving Learning
1.4 A Glance of Prior Arts

CHAPTER 2 DYNAMIC PRIVACY BUDGET ALLOCATION FOR ENHANCED CENTRALIZED PRIVATE LEARNING
2.1 Introduction
2.2 Related Work
2.3 Private Gradient Descent
2.4 Dynamic Policies by Minimizing Utility Upper Bounds
2.5 Experiments

CHAPTER 3 FEDERATED ADVERSARIAL DEBIASING FOR TRANSFERABLE AND FAIR REPRESENTATIONS
3.1 Introduction
3.2 Related Work
3.3 Federated Adversarial Debiasing
3.4 Optimality Analysis
3.5 Experiments on Unsupervised Domain Adaptation
3.6 Experiments on Fair Federated Learning

CHAPTER 4 EFFICIENT FEDERATED LEARNING FOR ON-DEMAND AND IN-SITU CUSTOMIZATION
4.1 Introduction
4.2 Related Work
4.3 Problem Setting
4.4 Method
4.5 Empirical Studies

CHAPTER 5 OUTSOURCING TRAINING WITHOUT UPLOADING DATA
5.1 Introduction
5.2 Related Work
5.3 Outsourcing Model Training With Open-Source Data
5.4 Empirical Results

CHAPTER 6 OVERVIEW

APPENDIX A DYNAMIC PRIVACY BUDGET ALLOCATION
APPENDIX B FEDERATED ADVERSARIAL DEBIASING FOR FAIR AND TRANSFERABLE REPRESENTATIONS
APPENDIX C EFFICIENT FEDERATED LEARNING FOR ON-DEMAND AND IN-SITU CUSTOMIZATION
APPENDIX D OUTSOURCING TRAINING WITHOUT UPLOADING DATA

CHAPTER 1
INTRODUCTION

1.1 Motivation

As machine learning becomes a fundamental building block of artificial intelligence, recent advances in deep learning hinge on large amounts of data. Such demand for massive data casts significant challenges on the broader application of deep learning in practical scenarios, where data are collected from a large number of distributed personal devices, contain private information, and therefore cannot be freely gathered without privacy, communication, and computation costs. For example, the widely-used ImageNet dataset [26] gathers millions of images from the Internet, which are captured by camera sensors distributed across the world. When human subjects are involved in the system, the privacy of the personal information (including identities, activities, and preferences) in the photos has to be carefully protected. In this thesis, we study privacy protection from a data-centric perspective: how are data used in private learning? We answer this question from three ways of data use. 1) Centralized data.
First, centralized learning (CL) has been a long-standing topic, where all data are gathered into one institute that conducts the follow-up learning. With centralized data storage, the institute responsible for data management promises to protect the users' data, and the model delivered by centralized learning should not leak private information by any means. 2) Distributed data and training. However, the storage institute may break this trust when clients cannot monitor the data use. For example, Facebook sold users' data to Cambridge Analytica without users' knowledge [2]. The leakage strengthened people's doubts about the trustworthiness of big corporations. Therefore, in the last decades, many countries and organizations have enacted laws to constrain the use of private data by any institute, including the General Data Protection Regulation [1] and the Health Insurance Portability and Accountability Act (HIPAA) [4]. 3) Distributed data but centralized training. Even if the data storage is trustworthy, the trained model may be reverse-engineered to leak private information upon publishing [45]. For distributed learning, private information can be mined by exploring the transmitted gradients [166]. Meanwhile, the computation overhead brought by local training calls for full cloud training without accessing private data.

Observing the risks to data privacy, we aim to protect privacy in machine learning from a data-centric perspective and focus on efficient, effective, and flexible privacy-preserving learning. First, we use the notion of differential privacy (DP) [38] to quantify the privacy risks in centralized learning and design algorithms for sample-efficient DP learning. Second, we consider distributed data and training based on a federated learning framework, where data and computation tasks are distributed to clients with heterogeneous data, devices, and demands. The heterogeneity of clients makes it difficult to scale federated learning to more data and clients, and therefore calls for advanced learning algorithms that are flexible with respect to hardware capabilities and data biases. Third, in our ongoing work, we consider cloud training where data are maintained locally by a client and computation-intensive training is done on the cloud. To train a model that is effective on local data, it is essential to find proxy data that is privacy-free and shares similar features with the private data, critically via a privacy-preserving collaboration.

1.2 Overview of Thesis Structure

This section summarizes each of the chapters in this thesis. Chapter 2 elaborates on the theoretical analysis of differentially private gradient descent to establish a utility-enhanced convergence bound. Chapter 3 discusses how to close the domain or group biases in federated learning towards fair and transferable representations. In Chapter 4, we study the heterogeneity of hardware and the split-mix principle for model-customizable training and inference. In Chapter 5, we present our ongoing work on outsourcing training without uploading data by sampling a proximal dataset from open-source data.

Table 1.1: Overview of abbreviations.
DP: Differential Privacy. GD: Gradient Descent. PGD: Private Gradient Descent. FL: Federated Learning.

1.3 Definition of Privacy-Preserving Learning

In this section, we present definitions of privacy-preserving learning (PPL) in three different ways of data use.
1.3.1 Private Gradient Descent for Centralized Learning

We consider a learning task by empirical risk minimization (ERM), f(θ) = (1/N) Σ_{n=1}^N f(θ; x_n), on a private dataset {x_n}_{n=1}^N with θ ∈ R^D. The gradient methods are defined as θ_{t+1} = θ_t − η_t ∇_t, where ∇_t = ∇f(θ_t) = (1/N) Σ_n ∇f(θ_t; x_n) denotes the non-private gradient at iteration t and η_t is the step learning rate. ∇_t^{(n)} = ∇f(θ_t; x_n) denotes the gradient on a sample x_n. I_c denotes the indicator function that returns 1 if the condition c holds, and 0 otherwise. Generally, we define the Private Gradient Descent (PGD) method as iterations for t = 1, ..., T:

θ_{t+1} = θ_t − η_t φ_t = θ_t − η_t (∇_t + σ_t G ν_t / N), (1.1)

where φ_t = g_t is the gradient privatized from ∇_t as shown in Algorithm 2.1, G/N is the bound on the sensitivity of the aggregated gradient to excluding one sample gradient, and ν_t ∼ N(0, I) is a vector whose elements are subject to the Gaussian distribution. We use σ_t to denote the noise scale at step t, and use σ to collectively represent the schedule (σ_1, ..., σ_T) when not confusing. When the Lipschitz constant is unknown, we can control the upper bound by scaling the gradient if it exceeds some constant. The scaling operation is often called clipping in the literature since it clips the gradient norm at a threshold. After the gradient is noised, we apply a modification, φ(·), to enhance its utility.

1.3.2 Data-Distributed Learning via Federated Learning

As centralizing data into an untrustworthy center causes uncontrollable risks, learning with distributed data and computation could alleviate the issue accordingly. One extraordinary example of such distributed learning is the federated learning (FL) strategy, which yields a global model by aggregating models from many clients. Formally, FL minimizes the objective

(1 / Σ_k |D_k|) Σ_{k=1}^K Σ_{(x,y)∈D_k} L(f(x; θ), y),

where L is a loss function (e.g., the cross-entropy loss), f is a model parameterized by θ, and {D_k}_{k=1}^K are the training datasets on K participants with image-label pairs (x, y). Following the standard FL setting [96], only model parameters can be shared to protect privacy.

1.3.3 Outsourcing Training without Uploading Private Data

As FL requires intensive local training, it is less suitable for resource-constrained edge devices. To transfer the computation overhead to the cloud without using private data, the goal of Outsourcing Training without Uploading Private Data (OT) is to train a model on a proxy dataset D_proxy such that the model can transfer to the target private data distribution. The training can be formulated as empirical risk minimization on the cloud dataset. The loss is defined as

L(θ; D_proxy) = (1/|D_proxy|) Σ_{(x,y)∈D_proxy} L(f(x; θ), y).

A good proxy dataset should make the trained model transferable to the private data, for which purpose the proxy data should follow a similar distribution as the private data. To circumvent the privacy risk, the proxy dataset should be constructed from a non-private source, and any interaction with the private data should be accounted for with privacy risks.
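To make the federated objective in Section 1.3.2 concrete, the following is a minimal numpy sketch of how a server could evaluate that objective and aggregate client models by data-size-weighted averaging, in the style of FedAvg [96]. The names (clients, client_update, loss_fn, model_fn) are illustrative placeholders, not part of any specific library or of the algorithms proposed later in this thesis.

import numpy as np

def weighted_objective(theta, clients, loss_fn, model_fn):
    # Global FL objective: (1 / sum_k |D_k|) * sum_k sum_{(x,y) in D_k} L(f(x; theta), y).
    total_loss, total_size = 0.0, 0
    for data in clients:                     # data: list of (x, y) pairs held by one client
        total_loss += sum(loss_fn(model_fn(x, theta), y) for x, y in data)
        total_size += len(data)
    return total_loss / total_size

def fedavg_round(theta, clients, client_update):
    # One communication round: each client trains locally on its own data, then the
    # server averages the returned parameters weighted by local dataset size.
    sizes = np.array([len(data) for data in clients], dtype=float)
    local_models = [client_update(theta, data) for data in clients]
    weights = sizes / sizes.sum()
    return sum(w * th for w, th in zip(weights, local_models))

Only model parameters cross the network in this sketch, matching the constraint that raw data never leave the clients.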
1.4 A Glance of Prior Arts

In this section, we briefly overview recent advances in privacy-preserving learning approaches in the three settings.

Centralized Private Learning. (1) Private Learning. Recent years witnessed increasing attention to the privacy risk associated with learning from sensitive training data. For example, an attacker could retrieve the training data from models generated by the widely-used empirical risk minimization (ERM) [45]. Many efforts have been devoted to privacy-preserving learning. With the introduction of differential privacy (DP) [34], we are now able to measure and defend against the risk quantitatively [11, 70, 120, 128]. The main idea is to introduce stochastic perturbations into the learning process, and the perturbations can be applied to any query operation [35], such as gradient computations [3] or objective evaluations [20]. When proper noise is introduced before publishing the model, such as via the Gaussian mechanism [39], one can no longer easily retrieve the training data by resampling [45]. (2) Adaptive Privacy Perturbation. The key to achieving high utility in privacy-preserving learning is perturbation control. [162] improved the performance of models in a stochastically private manner by selecting gradient candidates. [75] proposed adaptively and privately querying the effects of the noised gradient updates. Both mechanisms rely on querying the model outputs several times via an exponential noise mechanism [34], which degrades effectiveness. Instead, [9] showed that a simple adaptive scaling based on the noised value is capable of reducing the expected error. Inspired by this idea, our gradient protector sequentially predicts optimization updates based on current and previous protected gradients, which reduces both the number of queries and the privacy costs.

Federated Learning [96] is a distributed learning framework that allows users with different capabilities to collaboratively train a model without sharing their own data. A critical challenge in FL is the heterogeneity among users. Viewing the learning process of FL as knowledge transfer among different users, heterogeneity in user data leads to negative transfer between users and compromises generalization [13]. One idea to alleviate the negative effect of heterogeneity during training is to find a consensus among users. For example, in [49, 85, 77, 43], the consensus on task knowledge is achieved by distillation. In this work, we seek an alternative and efficient approach by adversarially debiasing the users of different groups.

Training Outsourcing. There is a series of efforts studying how to leverage the data and computation resources on the cloud to assist client model training, especially when client data cannot be shared [156, 142]. We categorize them as follows. 1) Feature sharing: Methods like group knowledge transfer [49], split learning [136], and domain adaptation [32, 30] transfer edge knowledge by communicating features extracted by networks. To provide a theoretical guarantee of privacy protection, [105] proposed an advanced information-removal technique to disentangle sensitive attributes from shared features. Under a rigorous privacy definition, Liu et al. leverage public data to assist private information release [89]. Earlier, data encryption was used for outsourcing, which however is too computation-intensive for a client and less applicable to large-scale data and deep networks [21, 76]. Federated Learning (FL) considers the same constraint on data sharing, but the private knowledge is shared via models trained locally [97]. 2) Private labeling: PATE and its variants were proposed to generate client-approximated labels for unlabeled public data, on which a model can be trained [108, 109]. Without training multiple models on clients, Private kNN is a more efficient alternative that explores the private neighborhood of public images for labeling [168].
These approaches are based on the strong assumption that public data are available and i.i.d. with the local data. This paper considers a more practical yet challenging setting where public data come from multiple agnostic sources with heterogeneous features.

CHAPTER 2
DYNAMIC PRIVACY BUDGET ALLOCATION FOR ENHANCED CENTRALIZED PRIVATE LEARNING

Protecting privacy in learning while maintaining model performance has become increasingly critical in many applications that involve sensitive data. Private Gradient Descent (PGD) is a commonly used private learning framework, which noises gradients based on the Differential Privacy protocol. Recent studies show that dynamic privacy schedules of decreasing noise magnitudes can improve the loss at the final iteration, and yet theoretical understanding of the effectiveness of such schedules and their connections to optimization algorithms remains limited. In this paper, we provide a comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions. We first present a dynamic noise schedule minimizing the utility upper bound of PGD, and show how the noise influence from each optimization step collectively impacts the utility of the final model. Our study also reveals how the impacts of dynamic noise influence change when momentum is used. We empirically show that the connection exists for general non-convex losses, and that the influence is greatly affected by the loss curvature.

2.1 Introduction

In the era of big data, privacy protection in machine learning systems is becoming a crucial topic, as increasing amounts of personal data are involved in training models [37] and malicious attackers are present [126, 45]. In response to the growing demand, differentially private (DP) machine learning [38] provides a computational framework for privacy protection and has been widely studied in various settings, including both convex and non-convex optimization [139, 138, 64].

One widely used procedure for privacy-preserving learning is (Differentially) Private Gradient Descent (PGD) [11, 3]. A typical gradient descent procedure updates its model by gradients of the loss evaluated on a training dataset. When the data are sensitive, the gradients should be privatized to prevent excess privacy leakage. PGD privatizes a gradient by adding controlled noise. As such, the models from PGD are expected to have lower utility compared to those from unprotected algorithms. In cases where strict privacy control is exercised, or equivalently, under a tight privacy budget, accumulating effects from highly-noised gradients may lead to unacceptable model performance. It is thus critical to design effective privatization procedures for PGD that maintain a good balance between utility and privacy.

Recent years witnessed a promising direction of privatization that dynamically allocates a privacy budget for each iteration to boost utility, under the constraint of a specified total privacy budget. One example is [75], which reduces the budget-bounded noise magnitude when the loss does not decrease, based on the observation that gradients become very small when approaching convergence, and a static noise scale will overwhelm these gradients. Another example is [159], which periodically decreases the magnitude following a predefined strategy, e.g., exponential decaying or step decaying. Both approaches confirmed the empirical advantages of decreasing noise magnitudes.
Intuitively, the dynamic mechanism may coordinate with certain properties of the learning task, e.g., the training data and the loss surface. Following these works, improved allocation policies have been proposed, e.g., policies transferred from auxiliary datasets [57], policies with distributed budgets [22], and a combination with adaptive learning rates [155]. Yet there is little theoretical analysis available, and two important questions remain unanswered: 1) What is the form of utility-preferred budget (or, equivalently, noise) schedules? 2) When and to what extent does such an allocation policy improve utility?

Though there are theoretical studies of static-allocation policies, e.g., [139], data efficiency is not their focus, as discussions usually assume an unlimited amount of data is available. However, we argue that data efficiency with a limited data size is critical in practice, especially when DP makes the learning more data-hungry [98]. One example is federated learning [96, 98], a distributed learning framework that aggregates many local models to form a stronger global model, where each model is privately trained on a local client, typically with very limited private data. Another example is biomedical applications, where collecting samples involves expensive clinical trials or cohort studies, resulting in the scarcity of training data. To study biomarkers of Alzheimer's, NIH funded the Alzheimer's Disease Neuroimaging Initiative for $40 million, which collected imaging and genetic biomarkers from only 800 patients after 5 years of extensive and collaborative efforts [145]. Therefore, we believe data efficiency needs to be taken into account in developing a private learning algorithm.

To answer these questions, in this paper we develop a principled approach to construct dynamic schedules and quantify their utility bounds in different learning algorithms. Our contributions are summarized as follows. 1) For the class of loss functions satisfying the Polyak-Lojasiewicz condition [114], we show that dynamic schedules that improve the utility upper bound with high data-efficiency are shaped by the changing influence of per-iteration noise on the final loss. As the influence is tightly connected to the loss curvature, the advantage of using dynamic schedules therefore depends on the loss function. 2) Beyond vanilla gradient descent, our results show that gradient methods with momentum implicitly introduce a dynamic schedule and result in a non-monotonic influence trend. 3) We also show that our results generalize to population bounds, either in high probability or based on uniform-stability theorems. Though our major focus is the theoretical study, we empirically validate the results on a non-convex loss function formulated by a neural network. The empirical results suggest that a preferred dynamic schedule admits an exponentially decaying form and works better when learning with high-curvature loss functions. Moreover, dynamic schedules give higher utility under stricter privacy conditions (e.g., smaller sample size and less privacy budget).

2.2 Related Work

Differentially Private Learning. Differential privacy (DP) characterizes the chance that an algorithm output (e.g., a learned model) leaks private information of its training data when the output distribution is known. Since the outputs of many learning algorithms have undetermined distributions, the probabilistic risk is hard to measure.
A common approach to tackle this issue is to inject randomness with a known distribution to privatize the learning procedures. Classical methods include output perturbation [20], objective perturbation [20], and gradient perturbation [3, 11, 148]. Among these approaches, Private Gradient Descent (PGD) has attracted extensive attention in recent years because it can be flexibly integrated with variants of gradient-based iteration methods, e.g., stochastic gradient descent, momentum methods [116], and Adam [71], for (non-)convex problems.

Dynamic Policies for Privacy Protection. [139] studied empirical risk minimization using dynamic variance reduction of perturbed gradients. They showed that the utility upper bound can be achieved by gradient methods under uniform noise parameters. Instead of enhancing the gradients, [159, 75] showed the benefits of using a dynamic schedule of privacy parameters or, equivalently, noise scales. Following [75], a series of works [22, 61, 152, 164] adaptively allocate the privacy budget toward a better privacy-utility trade-off. Moreover, adaptive sensitivity control [113, 130] and dynamic batch sizes [42] have also been shown to improve convergence.

Utility Upper Bounds. Utility upper bounds are a critical metric for privacy schedules; they characterize the maximum utility that a schedule can deliver in theory. [139] is the first to prove the utility bound under the PL condition. Recently, [165] proved a utility bound by using the momentum of gradients [115, 71]. In this paper, we improve the upper bound by a more accurate estimation of the dynamic influence of step noise. We show that introducing a dynamic schedule further boosts the sample efficiency of the upper bound. Table 2.1 summarizes the upper bounds of a selection of state-of-the-art algorithms based on private gradients (upper block, see Section A.2 for the full list) and the methods studied in this paper (lower block), showing the benefits of dynamic influence. Notably, a closely related work by Feldman et al. achieved a convergence rate similar to ours in terms of generalization error bounds by dynamically adjusting batch sizes [42]. However, the approach requires controllable batch sizes, which may not be feasible in many applications. In federated learning, for example, where users update models locally and then
pass the parameters to the server for aggregation, the server has no control over batch sizes, and coordinating users to use varying batch sizes may not be realistic. On the other hand, our proposed method can still be applied for enhancing utility, as the server can dynamically allocate the privacy budget for each round when the presence of a user in the global aggregation is privatized [98].

Table 2.1: Comparison of utility upper bounds using different privacy schedules. The algorithms are T-iteration ½R-zCDP under the PL condition (unless marked with * for convexity). The O notation in this table drops other ln terms. Unless otherwise specified, all algorithms (including non-private GD) terminate at step T = O(ln(N²R/D)). Assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. Key notations: O_p – bound holds with probability p; D – feature dimension; N – sample size; R – privacy budget, where R_{ε,δ} is the equivalent budget accounted by (ε, δ)-DP; c_i – constants.

Algorithm | Schedule (σ_t²) | Utility Upper Bound
*GD+Adv [11] | O(ln(N/δ)/R_{ε,δ}) | O(D ln³N / (N R_{ε,δ}))
GD+MA [139] | O(T/R_{ε,δ}) | O(D ln²N / (N² R_{ε,δ}))
GD+MA (adjusted utility) [157] | O(T/R_{ε,δ}) | O(min{√D/(N R_{ε,δ}), D ln N/(N² R²_{ε,δ})})
*GD+Adv+BBImp [24] | O(n ln(n/δ)/R_{ε,δ}) | O_p(D ln²(1/p)/(R_{ε,δ} N^{1−c}))
Adam+MA [165] | O(T/R_{ε,δ}) | O_p(√(D ln(ND/(1−p)))/(N R_{ε,δ}))
GD, Non-Private | 0 | O(D/(N²R))
GD+zCDP, Static Schedule | T/R | O(D ln N/(N²R))
GD+zCDP, Dynamic Schedule | O(γ^{(t−T)/2} T/R) | O(D/(N²R))
Momentum+zCDP, Static Schedule | T/R | O(D/(N²R) (c + ln N · I_{T>T̂}))
Momentum+zCDP, Dynamic Schedule | O((c₁γ^{T+t} + c₂γ^{(T−t)/2}) T/R) | O(D/(N²R) (1 + cD/(N²R) · I_{T>T̂}))

2.3 Private Gradient Descent

Notations. We consider a learning task by empirical risk minimization (ERM), f(θ) = (1/N) Σ_{n=1}^N f(θ; x_n), on a private dataset {x_n}_{n=1}^N with θ ∈ R^D. The gradient methods are defined as θ_{t+1} = θ_t − η_t ∇_t, where ∇_t = ∇f(θ_t) = (1/N) Σ_n ∇f(θ_t; x_n) denotes the non-private gradient at iteration t and η_t is the step learning rate. ∇_t^{(n)} = ∇f(θ_t; x_n) denotes the gradient on a sample x_n. I_c denotes the indicator function that returns 1 if the condition c holds, and 0 otherwise.

Assumptions. (1) In this paper, we assume f(θ) is continuous and differentiable. Many commonly used loss functions satisfy this assumption, e.g., the logistic function. (2) For a learning task, only a finite amount of privacy cost is allowed; the maximum cost is called the privacy budget and is denoted as R. (3) Generally, we assume that the sample-wise loss functions f(θ; x) are G-Lipschitz continuous and that the empirical loss f(θ) is M-smooth.

Definition 1 (G-Lipschitz continuity). A function f(·) is G-Lipschitz continuous if, for G > 0 and all x, y in the domain of f(·), f(·) satisfies ||f(y) − f(x)|| ≤ G ||y − x||.

Definition 2 (m-strong convexity). A function f(·) is m-strongly convex if f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2) ||y − x||², for some m > 0 and all x, y in the domain of f(·).

Definition 3 (M-smoothness). A function is M-smooth w.r.t. the l2 norm if f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2) ||y − x||², for some constant M > 0 and all x, y in the domain of f(·).

For a private algorithm M(d), which maps a dataset d to some output, the privacy cost is measured by the bound of the output difference on adjacent datasets. Adjacent datasets are defined as datasets that differ in only one sample. In this paper, we use zero-Concentrated Differential Privacy (zCDP, see Definition 4) as the privacy measurement, because it provides simplicity and the possibility of adaptively composing privacy costs at each iteration. Various privacy metrics are discussed or reviewed in [27]. A notable example is the Moments Accountant (MA) [3], which adopts a similar principle for composing privacy costs but is less tight for a smaller privacy budget. We note that alternative metrics can be adapted to our study without major impacts on the analysis.

Definition 4 (ρ-zCDP [18]). Let ρ > 0. A randomized algorithm M : D^n → R satisfies ρ-zCDP if, for all adjacent datasets d, d' ∈ D^n, D_α(M(d) || M(d')) ≤ ρα, ∀α ∈ (1, ∞), where D_α(·||·) denotes the Rényi divergence [119] of order α.
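As a concrete illustration of Definition 4 and of the composition behavior used below, here is a minimal zCDP accounting sketch. It assumes the standard results that the Gaussian mechanism with noise standard deviation equal to (noise multiplier × sensitivity) satisfies ρ = 1/(2σ²)-zCDP, and that ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP; the chapter absorbs such constant factors into the budget R, so this is a sketch rather than the thesis's exact bookkeeping.

import math

def zcdp_of_gaussian(noise_multiplier):
    # Gaussian mechanism with std = noise_multiplier * sensitivity.
    return 1.0 / (2.0 * noise_multiplier ** 2)

def compose(rhos):
    # zCDP composes additively across adaptive steps.
    return sum(rhos)

def zcdp_to_dp(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP.
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: T steps with a decreasing noise schedule (illustrative numbers only).
sigmas = [2.0 * 0.99 ** t for t in range(100)]
rho_total = compose(zcdp_of_gaussian(s) for s in sigmas)
print(rho_total, zcdp_to_dp(rho_total, delta=1e-8))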
The zCDP provides a linear composition of the privacy costs of sub-routine algorithms. When the input vector is privatized by injecting Gaussian noise of N(0, σ_t² I) at the t-th iteration, the composed privacy cost is proportional to Σ_t ρ_t, where the step cost is ρ_t = 1/σ_t². For simplicity, we absorb the constant coefficient into the (residual) privacy budget R. The formal theorems for the privacy cost computation of composition and Gaussian noising are included in Lemmas 2 and 3.

Algorithm 2.1: Privatizing Gradients
Input: raw gradients [∇_t^{(1)}, ..., ∇_t^{(n)}] (n = N by default), momentum state m_t, residual privacy budget R_t, assuming the full budget is R and R_1 = R.
1: ρ_t ← 1/σ_t², ∇_t ← (1/n) Σ_{i=1}^n ∇_t^{(i)} (Budget request)
2: if ρ_t < R_t then
3:   R_{t+1} ← R_t − ρ_t
4:   g_t ← ∇_t + G σ_t ν_t / N, ν_t ∼ N(0, I) (Privacy noise)
5:   m_{t+1} ← φ(m_t, g_t), or g_1 if t = 1
6:   return η_t m_{t+1}, R_{t+1} (Utility projection)
7: else
8:   Terminate

Generally, we define the Private Gradient Descent (PGD) method as iterations for t = 1, ..., T:

θ_{t+1} = θ_t − η_t φ_t = θ_t − η_t (∇_t + σ_t G ν_t / N), (2.1)

where φ_t = g_t is the gradient privatized from ∇_t as shown in Algorithm 2.1, G/N is the bound on the sensitivity of the aggregated gradient to excluding one sample gradient, and ν_t ∼ N(0, I) is a vector whose elements are subject to the Gaussian distribution. We use σ_t to denote the noise scale at step t, and use σ to collectively represent the schedule (σ_1, ..., σ_T) when not confusing. When the Lipschitz constant is unknown, we can control the upper bound by scaling the gradient if it exceeds some constant. The scaling operation is often called clipping in the literature since it clips the gradient norm at a threshold. After the gradient is noised, we apply a modification, φ(·), to enhance its utility. In this paper, we consider two types of φ(·):

φ(m_t, g_t) = g_t, (GD) (2.2)
φ(m_t, g_t) = [β(1 − β^{t−1}) m_t + (1 − β) g_t] / (1 − β^t). (Momentum) (2.3)

We now show that the PGD using Algorithm 2.1 guarantees a privacy cost less than R:

Theorem 1. Suppose f(θ; x) is G-Lipschitz continuous and the PGD algorithm with privatized gradients defined by Algorithm 2.1 stops at step T. The PGD algorithm outputs θ_T and satisfies ρ-zCDP where ρ ≤ (1/2) R.

Note that Theorem 1 allows σ_t to differ throughout iterations. Next we present a principled approach for deriving dynamic schedules optimized for the final loss f(θ_T).
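For readers who prefer code, the following is a minimal numpy sketch of one step of Algorithm 2.1: per-sample gradients are clipped to enforce the G-Lipschitz sensitivity bound, averaged, noised with the Gaussian mechanism, and optionally combined by the momentum-type φ of Eq. (2.3). Variable names are illustrative, and the sketch omits the Lemma 2/3 constants that the chapter absorbs into R.

import numpy as np

def privatize_gradients(per_sample_grads, sigma_t, G, budget_left,
                        m_prev=None, beta=0.0, t=1):
    # One step of Algorithm 2.1 (sketch). per_sample_grads: array of shape (n, D).
    n = per_sample_grads.shape[0]
    rho_t = 1.0 / sigma_t ** 2                      # budget request for this step
    if rho_t >= budget_left:
        return None                                 # terminate: budget exhausted
    # Clip each sample gradient to norm G so the average has sensitivity G/n.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, G / np.maximum(norms, 1e-12))
    grad = clipped.mean(axis=0)
    # Gaussian privacy noise scaled by the sensitivity bound and the noise scale.
    g_t = grad + sigma_t * G * np.random.randn(*grad.shape) / n
    # Utility projection phi: plain GD (beta = 0) or bias-corrected momentum, Eq. (2.3).
    if m_prev is None or t == 1 or beta == 0.0:
        m_t = g_t
    else:
        m_t = (beta * (1 - beta ** (t - 1)) * m_prev + (1 - beta) * g_t) / (1 - beta ** t)
    return m_t, budget_left - rho_t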
2.4 Dynamic Policies by Minimizing Utility Upper Bounds

To characterize the utility of the PGD, we adopt the Expected Excess Risk (EER), a notion widely used for analyzing the convergence of randomized algorithms, e.g., [11, 139]. Due to the presence of the noise and the limited number of learning iterations, optimization using private gradients is expected to reach a point with a higher loss (i.e., excess risk) compared to the optimal solution without privacy protection. Define θ* = arg min_θ f(θ); after Algorithm 2.1 is iterated T times in total, the EER gives the expected utility degradation:

EER = E_ν[f(θ_{T+1})] − f(θ*).

Due to the variety of loss functions and the complexity of recursive iterations, an exact EER with noise is intractable for most functions. Instead, we study the worst-case scenario, i.e., the upper bound of the EER, and our goal is to minimize this upper bound. For consistency, we call the upper bound of the EER divided by the initial error the ERUB. Since the analytical form of the EER is either intractable or complicated due to the recursive iterations of noise, studying the ERUB is a convenient and tractable alternative. The upper bound often has convenient functional forms which are (1) sufficiently simple, such that we can directly minimize it, and (2) closely related to the landscape of the objective, depending on both the training dataset and the loss function. As a consequence, it is also used in the previous PGD literature [113, 139] for choosing proper parameters. Moreover, we let ERUB_min be the achievable optimal upper bound given a specific choice of parameters, e.g., σ and T.

As the EER is iteratively determined by Eq. (2.1), we define the influence of the dynamics in noise magnitude σ_t as the derivative q_t* = ∂EER/∂σ_t. Accordingly, we can approximate the EER shift as q_t* Δσ_t when σ_t increases by Δσ_t. However, because the EER is strongly data-dependent, the derived q_t* on a given dataset may not generalize to another dataset. Instead, we consider a more general term based on the ERUB, i.e., q_t = ∂ERUB/∂σ_t.

In this paper, we consider the class of loss functions satisfying the Polyak-Lojasiewicz (PL) condition, which bounds losses by the corresponding gradient norms. It is more general than m-strong convexity: if f is differentiable and M-smooth, then m-strong convexity implies the PL condition.

Definition 5 (Polyak-Lojasiewicz condition [114]). For f(θ), there exists µ > 0 such that for every θ, ||∇f(θ)||² ≥ 2µ(f(θ) − f(θ*)).

The PL condition helps us reveal how the influence of step noise propagates to the final excess error, i.e., the EER. Though the assumption was also used previously in [139, 165], neither work discussed the propagated influence of noise. In the following sections, we show how the influence can tighten the upper bound in gradient descent and its momentum variant.

2.4.1 Gradient Descent Methods and Noise Influences

For brevity, we first define the following summarized constants:

non-private ERUB: α ≜ DG² / (2RMN²(f(θ₁) − f(θ*))) ≤ O(DG²/(RMN²)), (2.4)
curvature: κ ≜ M/µ, (2.5)
convergence rate: γ ≜ 1 − 1/κ, (2.6)

which satisfy κ ≥ 1 and γ ∈ [0, 1). Here, α upper bounds the non-private ERUB within T = O(ln(N²R(M − µ)/(DG²))) iterations. Therefore, α provides a simple reference for an ideal convergence bound, reaching which indicates superior performance with a privacy guarantee. κ characterizes the curvature of f(·), which is the condition number of f(·) if f(·) is strongly convex, and γ is the convergence rate of non-private SGD (c.f. Theorem 2 with σ_t = 0). κ tends to be large if the function is sensitive to small differences in inputs, and 1/α tends to be large if more samples are provided and the privacy budget is less strict. The convergence of PGD under the PL condition has been studied for private [139] and non-private [68, 101, 118] ERM. Below we extend the bound in [139] by considering the dynamic influence of noise and relaxing σ_t to be dynamic:

Theorem 2. Let α, κ and γ be defined in Eqs. (2.4) to (2.6), and η_t = 1/M. Suppose f(θ; x_i) is G-Lipschitz and f(θ) is M-smooth satisfying the Polyak-Lojasiewicz condition. For PGD, the following holds:

ERUB = γ^T + R Σ_{t=1}^T q_t σ_t², where q_t ≜ γ^{T−t} α. (2.7)

Theorem 2 degenerates to its non-private variant when no noise is applied, i.e., σ_t = 0 for all t. In Eq. (2.7), the step noise magnitude σ_t² has an exponential influence, q_t, on the EER. Note that we ignore the constant factor R in the influence. Eq. (2.7) implies that the influence of noise at step t increases quickly at an exponential rate.
Importantly, the increasing rate is the same as the convergence rate, i.e., the first term in Eq. (2.7). The dynamic characteristic of the influence is the key to proving a tighter bound. Moreover, in the presence of the dynamic influence, it is natural to choose a dynamic σ_t². When relaxing q_t to a static 1, a static σ_t² was studied by [139]. They proved a bound which is nearly optimal except for a ln²N factor. To obtain the optimal bound, in the following sections we look for the σ and T that minimize the upper bound.

Uniform Schedule. The uniform setting of σ_t has been previously studied in [139]. Here, we show that the bound can be further tightened by considering the dynamic influence of iterations and a proper T.

Theorem 3. Suppose the conditions in Theorem 2 are satisfied. When σ_t² = T/R, let α, γ and κ be defined in Eqs. (2.4) to (2.6) and let T = O(κ ln(1 + 1/(κα))). Meanwhile, if κ ≥ 1/(1−c) > 1 and 1/α > 1/α₀ for some constant c ∈ (0, 1) and α₀ > 0, the corresponding bound is:

ERUB_min^uniform = Θ( (κ²/(κ + 1/α)) ln(1 + 1/(κα)) ). (2.8)

Sketch of proof. The key of the proof is to find a proper T to minimize

ERUB = γ^T + Σ_{t=1}^T γ^{T−t} α R σ_t² = γ^T + αT (1 − γ^T)/(1 − γ) = γ^T + ακ(1 − γ^T) T, (2.9)

where we use σ_t² = T/R. Setting its gradient to zero amounts to solving γ^T ln γ + ακ(1 − γ^T) − ακT γ^T ln γ = 0, which however is intractable. In [139], T is chosen to be O(ln(1/α)) and the ERUB is relaxed as γ^T + ακT². The approximation results in a less tight bound, O(α(1 + κ ln²(1/α))), which explodes as κ → ∞. We observe that for a very sharp loss function, i.e., a large κ, any minor perturbation may result in tremendously fluctuating loss values. In this case, not stepping forward will be a good choice. Thus, we choose T = (1/ln(1/γ)) ln(1 + ln(1/γ)/α) ≤ O(κ ln(1 + 1/(κα))), which converges to 0 as κ → +∞. The full proof is deferred to the appendix.

Dynamic Schedule. A dynamic schedule can improve the upper bound delivered by the uniform schedule. First, we observe that the excess risk in Eq. (2.7) is upper bounded by two terms: the first term characterizes the error due to the finite number of gradient descent iterations; the second term, a weighted sum, comes from the error propagated from the noise at each iteration. We now show that for any {q_t | q_t > 0, t = 1, ..., T} (not limited to the q_t defined in Eq. (2.7)), there is a unique σ minimizing the weighted sum:

Lemma 1 (Dynamic schedule). Suppose σ_t satisfies Σ_{t=1}^T σ_t^{−2} = R. Given a positive sequence {q_t}, the following equation holds:

min_σ R Σ_{t=1}^T q_t σ_t² = (Σ_{t=1}^T √q_t)², when σ_t² = (1/R) Σ_{i=1}^T √(q_i / q_t). (2.10)

Remarkably, the difference between the minimum and T Σ_{t=1}^T q_t (uniform σ_t) monotonically increases with the variance of √q_t w.r.t. t.

We see that the dynamics in σ_t come from the non-uniform nature of the weight q_t. Since q_t represents the impact of σ_t on the final error, we call it the influence. Given the dynamic schedule in Eq. (2.10), it is of interest to what extent the ERUB can be improved. First, we present Theorem 4 to show the optimal T and ERUB.

Theorem 4. Suppose the conditions in Theorem 2 are satisfied. Let α, κ and γ be defined in Eqs. (2.4) to (2.6). When η_t = 1/M, the σ_t (based on Eqs. (2.7) and (2.10)) and the T minimizing the ERUB are σ_t² = (1/R) (1 − γ^{T/2}) γ^{(t−T)/2} / (1 − γ^{1/2}) and T = ⌈2κ ln(1 + 1/(κα))⌉. Meanwhile, when κ ≥ 1 and 1/α ≥ 1/α₀ for some positive constant α₀, the minimal bound is:

ERUB_min^dynamic = Θ( κ² / (κ² + 1/α) ). (2.11)
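To make Lemma 1 and Theorem 4 concrete, the following sketch computes the influence weights q_t = γ^{T−t} α of Eq. (2.7) and the resulting schedule σ_t² = (1/R) Σ_i √(q_i/q_t) of Eq. (2.10). The numeric values of κ, α and R below are illustrative only.

import numpy as np

def influence_weights(T, gamma, alpha):
    # q_t = gamma^(T - t) * alpha for t = 1..T (Eq. 2.7): later steps weigh more.
    t = np.arange(1, T + 1)
    return alpha * gamma ** (T - t)

def optimal_schedule(q, R):
    # Lemma 1: sigma_t^2 = (1/R) * sum_i sqrt(q_i / q_t) minimizes R * sum_t q_t * sigma_t^2
    # subject to the budget constraint sum_t sigma_t^{-2} = R.
    sqrt_q = np.sqrt(q)
    return sqrt_q.sum() / (R * sqrt_q)

kappa, alpha, R = 50.0, 1e-3, 0.39                   # illustrative values
gamma = 1.0 - 1.0 / kappa
T = int(np.ceil(2 * kappa * np.log(1 + 1 / (kappa * alpha))))
q = influence_weights(T, gamma, alpha)
sigma_sq = optimal_schedule(q, R)
assert np.isclose((1.0 / sigma_sq).sum(), R)             # budget constraint holds
assert np.isclose(R * (q * sigma_sq).sum(), np.sqrt(q).sum() ** 2)  # minimum of Eq. (2.10)

Since q_t grows geometrically in t, σ_t² decays geometrically: the schedule injects larger noise early and less noise near convergence, matching the exponential-decay schedules observed empirically in Section 2.5.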
2.4.2 Discussion

In Theorems 3 and 4, we present the tightest bounds for functions satisfying the PL condition, to the best of our knowledge. We further analyze the advantages of our bounds from two aspects: sample efficiency and robustness to sharp losses.

Sample efficiency. Since a dataset cannot be infinitely large, it is critical to know how accurately a model can be trained privately with a limited number of samples. Formally, it is of interest to study the case when κ is fixed and N is large enough such that α ≪ 1. Then we have the upper bound in Eq. (2.8) as

ERUB_min^uniform ≤ O( κ²α ln(1/(κα)) ) ≤ Õ( DG² ln(N) / (MN²R) ), (2.12)

where we ignore κ and other logarithmic constants in Õ, as done in [139]. As a result, we get a bound very similar to [139], except that R is replaced by R_MA = ε²/ln(1/δ) using the Moments Accountant. In comparison, based on Lemma 4, R = 2ρ = 2ε + 4 ln(1/δ) + 4√(ln(1/δ)(ε + ln(1/δ))) if θ_T satisfies ρ-zCDP. Because ln(1/δ) > 1, it is easy to see that R = R_zCDP > R_MA when ε ≤ 2 ln(1/δ). As compared to the bound reported in [139], our bound saves a factor of ln N and thus requires fewer samples to achieve the same accuracy. Remarkably, the saving is due to maintaining the influence terms, as shown in the proof of Theorem 3.

Using the dynamic schedule, we have ERUB_min^dynamic ≤ O(α) = O(DG²/(MN²R)), which saves another ln N factor in comparison to the uniform schedule in Eq. (2.12). As shown in Table 2.1, this advantage is maintained when comparing with other baselines and reaches the ideal non-private case, recalling the meaning of α.

Stability on ill-conditioned loss. Besides sample efficiency, we are also interested in the robustness of convergence in the presence of privacy noise. Because of the privacy noise, private gradient descent will be unable to converge to where a non-private algorithm can reach. Specifically, when the samples are noisy or have noisy labels, the loss curvature may be sharp. The sharpness also implies lower smoothness, i.e., a small M, or a very small PL parameter. Thus, gradients may change tremendously at some steps, especially in the presence of privacy noise. Such changes have a more critical impact when only a smaller number of iterations can be executed due to the privacy constraint. Assume α is some constant while κ ≫ 1/α; we immediately get:

ERUB_min^uniform = Θ( κ ln(1 + 1/(κα)) ) = Θ(1/α) ≤ O( MN²R/(DG²) ),
ERUB_min^dynamic = Θ(1).

Both are robust, but the dynamic schedule has a smaller factor since 1/α could be a large number. In addition, the factor implies that when more samples are used, the dynamic schedule is more robust.

2.4.3 Gradient Descent Methods with Momentum

Section 2.4.1 shows that the step noise has an exponentially increasing influence on the final loss, and therefore a decreasing noise magnitude improves the utility upper bound by a ln N factor. However, the proper schedule can be hard to find when the curvature information, e.g., κ, is absent. A parameterized method that depends less on the curvature information is preferred. On the other hand, long-term iterations will result in forgetting of the initial iterations, since accumulated noise overwhelms the information propagated from the beginning. This effect will reduce the efficiency of the recursive learning framework. As an alternative to GD, the momentum method can mitigate these two issues. It was originally proposed to stabilize the gradient estimation [115]. In this section, we show that momentum (agnostic about the curvature) can flatten the dynamic influence and improve the utility upper bound. Previously, the momentum was used
as an estimation of the gradient mean, without discussion of convergence improvements. [165] gave a bound for Adam with DP. However, the derivation is based on the gradient norm, which results in a looser bound (see Table 2.1).

The momentum method stabilizes gradients by moving-averaging the historical coordinate values and thus greatly reduces the variance. The φ(m_t, g_t) can be rewritten as:

m_{t+1} = φ(m_t, g_t) = v_{t+1} / (1 − β^t),
v_{t+1} = β v_t + (1 − β) g_t = (1 − β) Σ_{i=1}^t β^{t−i} g_i, v_1 = 0, (2.13)

where β ∈ [0, 1]. Note that v_{t+1} is a biased estimation of the gradient expectation while m_{t+1} is unbiased.

Theorem 5 (Convergence under PL condition). Suppose f(θ; x_i) is G-Lipschitz, and f(θ) is M-smooth and satisfies the Polyak-Lojasiewicz condition. Assume β ≠ γ and β ∈ (0, 1). Let η_t = η₀/(2M) and η₀ ≤ 8 [√(1 + 64βγ(γ − β)^{−2}(1 − β)^{−3}) + 1]^{−1}. Then the following holds:

EER ≤ [γ^T + 2Rη₀ α U₃(σ, T)] (f(θ₁) − f(θ*)) − ζ (η₀/(2M)) Σ_{t=1}^T γ^{T−t} E||v_{t+1}||², (2.14)

where the second term inside the brackets is the noise-variance part and the last term is the momentum effect, γ = 1 − η₀/(2κ), ζ = 1 − ¼η₀ − η₀² β(1 − β)^{−3} ≥ 0, and U₃ = Σ_{t=1}^T γ^{T−t} [(1 − β)/(1 − β^t)]² Σ_{i=1}^t β^{2(t−i)} σ_i².

The upper bound includes three parts that influence the bound differently. (1) Convergence. The convergence term is mainly determined by η₀ and κ. η₀ should be in (0, κ) such that the upper bound can converge. A large η₀ is preferred to speed up convergence if it does not worsen the other two terms. (2) Noise Variance. The second term, compressed in U₃, is the effect of the averaged noise, Σ_{i=1}^t β^{2(t−i)} σ_i². One difference introduced by the momentum is the factor (1 − β)/(1 − β^t), which is less than γ^t at the beginning and converges to the non-zero constant 1 − β. Therefore, in U₃, γ^{T−t}(1 − β)/(1 − β^t) will remain less than γ^T in the meantime. Furthermore, when t > T̂, the moving average Σ_{i=1}^t β^{2(t−i)} σ_i² smooths the influence of each σ_t. (3) Momentum Effect. The momentum-effect term can improve the upper bound when η₀ is small. For example, when β = 0.9 and γ = 0.99, then η₀ ≤ 0.98/M, which is a reasonable value. Following the analysis, when M is large, which means the gradient norms may fluctuate significantly, the momentum term may take the lead. Adjusting the noise scale in this case may be less useful for improving utility. To give insight into the effect of the dynamic schedule, we provide the following utility bounds.

Theorem 6 (Uniform schedule). Suppose the assumptions in Theorem 5 are satisfied. Let σ_t² = T/R, and let T̂ = max{t s.t. γ^{t−1} ≥ (1 − β)/(1 − β^t)} and T = ⌈O((κ/η₀) ln(1 + η₀/(κα)))⌉. Given some positive constant c and α₀ > 0 with 1/α > 1/α₀, the following inequality holds:

ERUB_min ≤ O( (κ²/(κ + η₀/α)) [ I_{T≤T̂} + γ^{T̂−1} ln(1 + η₀/(κα)) I_{T>T̂} ] ).

Theorem 7 (Dynamic schedule). Suppose the assumptions in Theorem 5 are satisfied. Let α₀ = 2η₀α(1 − β)²/(γ(1 − γβ)²), β < γ, and T̂ = max{t s.t. γ^{t−1} ≥ (1 − β)/(1 − β^t)}. Use the following schedule: σ_t² = (1/R) Σ_{i=1}^T √(q_i/q_t) and T_dyn = ⌈O(2(κ/η₀) ln(1 + η₀/(κα)))⌉, where q_t = c₁ γ^{T+t} I_{T≤T̂} + γ^{T̂−1} c₂ γ^{T−t} I_{T>T̂} for some positive constants c₁ and c₂. The following inequalities hold:

ERUB ≤ γ^T + 2η₀ α Σ_{t=1}^T R q_t σ_t²,
ERUB_min ≤ O( (κα/(κα + η₀)) I_{T≤T̂} + (κα/(κα + η₀)) I_{T>T̂} ).
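As a complement to Eqs. (2.1), (2.3) and (2.13), the following is a minimal numpy sketch of private gradient descent with the bias-corrected momentum φ. It reuses the privatize_gradients sketch given earlier; grad_fn, theta0 and the schedule sigmas are user-supplied placeholders, and the step size follows the η_t = η₀/(2M) convention of Theorem 5.

import numpy as np

def private_momentum_descent(theta0, grad_fn, sigmas, R, G, M, eta0=0.5, beta=0.9):
    # DP gradient descent with momentum (sketch). grad_fn(theta) -> (n, D) per-sample grads.
    theta, m, budget = theta0.copy(), None, R
    eta = eta0 / (2.0 * M)                         # eta_t = eta0 / (2M)
    for t, sigma_t in enumerate(sigmas, start=1):
        out = privatize_gradients(grad_fn(theta), sigma_t, G, budget,
                                  m_prev=m, beta=beta, t=t)
        if out is None:                            # privacy budget exhausted
            break
        m, budget = out
        theta = theta - eta * m                    # theta_{t+1} = theta_t - eta_t * m_{t+1}
    return theta

With beta = 0 this reduces to the plain PGD update of Eq. (2.1); with beta > 0 the noise of earlier steps is averaged into later updates, which is exactly the flattening of the influence discussed above.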
Discussion. Theoretically, the dynamic schedule is more influential in vanilla gradient descent methods than in the momentum variant. The result is mainly attributed to the averaging operation: the moving average, (1 − β) Σ_{i=1}^t β^{t−i} g_i / (1 − β^t), increases the influence of the under-represented initial steps and decreases that of the over-sensitive last steps. Counterintuitively, the preferred dynamic schedule should be increasing, since q_t decreases when t ≤ T̂.

2.4.4 Extension to Private Stochastic Gradient Descent

Though PGD provides a guarantee for both utility and privacy, computing gradients of the whole dataset is impractical for large-scale problems. For this reason, studying the convergence of Private Stochastic Gradient Descent (PSGD) is meaningful. Algorithm 2.1 can be easily extended to PSGD by subsampling n gradients, where the batch size n ≪ N. According to [159], when privacy is measured by zCDP, there are two ways to account for the privacy cost of PSGD depending on the batch-sampling method: sub-sampling with or without replacement. In this paper, we focus on random subsampling with replacement since it is widely used in deep learning, e.g., [3, 42]. Accordingly, we replace N in the definition of α by n because the term comes from the sensitivity of the batch data (see Eq. (2.1)). For clarity, we assume that T is the number of iterations rather than epochs and that ∇̃_t is the mean stochastic gradient.

When a batch of data is randomly sampled, the privacy cost of one iteration is c p²/σ_t², where c is some constant, p = n/N is the sample rate, and 1/σ_t² is the full-batch privacy cost. Details of the sub-sampling theorems are referred to Theorem 3 of [159] and their empirical setting. Therefore, we can replace the privacy constraint Σ_t p²/σ_t² = R by Σ_t 1/σ_t² = R', where R' = R/p² = (N²/n²) R. Remarkably, we omit the constant c because it will not affect the results regarding uniform or dynamic schedules. Notice that N²R in α is replaced by n²R' = N²R. Thus, the form of α is not changed, which provides convenience for the following derivations.

We now study the utility bound of PSGD. To quantify the randomness of batch sampling, we define a random vector ξ_t with E[ξ_t] = 0 and E||ξ_t||² ≤ D such that ∇̃_t = ∇_t + σ_g ξ_t / n for some positive constant σ_g. Because ξ_t has a similar property to the privacy noise ν_t, we can easily extend the PGD bounds to PSGD bounds by the following theorems.

Theorem 8 (Utility bounds of PSGD). Let α, κ and γ be defined in Eqs. (2.4) to (2.6), and η_t = 1/M. Suppose f(θ; x_i) is G-Lipschitz and f(θ) is M-smooth satisfying the Polyak-Lojasiewicz condition. For PSGD, when the batch size satisfies n = max{N√R, 1}, the following holds: ERUB = γ^T + α_g σ_g² + R' Σ_{t=1}^T q_t σ_t², where q_t ≜ γ^{T−t} α, Σ_t 1/σ_t² = R', and α_g = D/(2µN²R(f(θ₁) − f(θ*))).

Theorem 9 (PSGD with momentum). Let α_g = D/(2µN²R(f(θ₁) − f(θ*))). Suppose the assumptions in Theorem 5 hold. When the batch size satisfies n = max{N√R, 1}, U₃(σ, T) has to be replaced by Ũ₃ = U₃^g + U₃, with αR'U₃^g ≤ α_g σ_g², when PSGD is used. See the proof in Appendix A.

As shown above, the utility bound of PSGD differs from that of PGD merely by α_g σ_g². Note that α_g = O(D/(N²R)), which fits the order of the dynamic-schedule bounds. In addition, α and other variables are not changed. Hence, the conclusions w.r.t. the dynamic/uniform schedules remain the same.
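The batch-size and budget substitutions above can be summarized in a small sketch; it only restates the bookkeeping of this subsection (p = n/N, R' = R/p², n = max{N√R, 1}) and omits the subsampling constant c that Theorem 3 of [159] makes precise.

import math

def psgd_budget(N, R):
    # Batch size used in Theorems 8 and 9, and the rescaled per-batch budget.
    n = max(int(N * math.sqrt(R)), 1)
    p = n / N                      # sampling rate
    R_prime = R / p ** 2           # constraint becomes sum_t 1/sigma_t^2 = R'
    return n, p, R_prime

n, p, R_prime = psgd_budget(N=50_000, R=0.39)
# n**2 * R_prime == N**2 * R, so alpha keeps the same form as in the full-batch case.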
2.4.5 Comparison of Generalization Bounds

In addition to the empirical risk bounds in Table 2.1, in this section we study the true risk bounds, or generalization error bounds. True risk bounds characterize how well the learnt model can generalize to unseen samples subject to the inherent data distribution. By leveraging generic learning-theory tools, we extend our results to the True Excess Risk (TER) for strongly convex functions as follows. For a model θ, its TER is defined as

TER ≜ E_{x∼X}[E[f(θ; x)]] − min_{θ̂} E_{x∼X}[f(θ̂; x)],

where the second expectation is over the randomness of generating θ (e.g., the noise and stochastic batches). Assume a dataset d consists of N samples drawn i.i.d. from the distribution X. Two approaches could be used to extend the empirical bounds to the true excess risk. One is proposed by [124], where the true excess risk of PGD can be bounded in high probability. For example, [11] achieved a ln²N/N bound with N² iterations. Alternatively, instead of relying on the probabilistic bound, [12] used uniform stability to give a tighter bound. Later, [42] improved the efficiency of gradient computation to achieve a similar bound. Both approaches introduce an additive term to the empirical bounds. In this section, we adopt both approaches to investigate the two types of resulting true risk bounds.

Table 2.2: Comparison of true excess risk bounds. The algorithms are T-iteration ½R-zCDP or, equivalently, (ε, δ)-DP under the µ-strong-convexity condition. The O notation in this table drops other ln terms. Assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. * marks the method with a convexity assumption.

Algorithm | Utility Upper Bound | T
GD+Adv [11] | O_{1−p}(√D ln N ln(1/p)/(p µ N R_{ε,δ})) | O(N²)
SVRG+MA [139] | O(D ln N/(µ N² R_{ε,δ})) | O(ln(N² R_{ε,δ}/D))
SSGD+zCDP [42] | O((1/√N + √D/(√R N)) ln² N) | O(N²/(16D/R² + 4N))
*SGD+MA [12] | O(max{√D/(N R_{ε,δ}), 1/√N}) | O(min{N/8, N² R_{ε,δ}/(32D)})
True risk in high probability:
GD+zCDP, Static Schedule | O_{1−p}((G²/(µN)) (√(D ln(N) ln(1/p)/(NR)) + 4/p)) | O(ln(N²R/D))
GD+zCDP, Dynamic Schedule | O_{1−p}((G²/(µN)) (√(D ln(1/p)/(NR)) + 4/p)) | O(ln(N²R/D))
Momentum+zCDP, Static Sch. | O_{1−p}((G²/(µN)) (√(D ln(1/p)/(NR)) (c + ln N · I_{T>T̂}) + 4/p)) | O(ln(N²R/D))
Momentum+zCDP, Dynamic Sch. | O_{1−p}((G²/(µN)) (√(D ln(1/p)/(NR)) (1 + cD/(N²R) · I_{T>T̂}) + 4/p)) | O(ln(N²R/D))
True risk by uniform stability:
GD, Non-Private | O(D/(N²R)) | O(ln(N²R/D))
GD+zCDP, Static Schedule | O(D ln N/(N²R)) | O(ln(N²R/D))
GD+zCDP, Dynamic Schedule | O(D/(N²R)) | O(ln(N²R/D))
Momentum+zCDP, Static Sch. | O(D/(N²R) (c + ln N · I_{T>T̂})) | O(ln(N²R/D))
Momentum+zCDP, Dynamic Sch. | O(D/(N²R) (1 + cD/(N²R) · I_{T>T̂})) | O(ln(N²R/D))

(1) True Risk in High Probability. First, we consider the high-probability true risk bound. Based on Section 5.4 of [124] (restated in Theorem 10), we can relate the EER to the TER.

Theorem 10. Let f(θ; x) be a G-Lipschitz and f(θ) a µ-strongly convex loss function for any x ∈ X. With probability at least 1 − p over the randomness of sampling the dataset d, the following inequality holds:

TER(θ) ≤ f(θ) − f(θ*) + √( 2G²(f(θ) − f(θ*))/(µN) ) + 4G²/(pµN), (2.15)

where θ* = arg min_θ f(θ).

To apply Eq. (2.15), we need to extend the EER, an expectation bound, to a high-probability bound. Following [11] (Section D), we repeat the PGD with privacy budget R/k for k times. Note that the output of all repetitions together still consumes a budget of R. When k = 1, let the EER of the algorithm be denoted as F(R). Then the EER of one execution of the k repetitions is F(R/k), where the privacy is accounted by zCDP. When k = log₂(1/p) for p ∈ [0, 1], by Markov's inequality, there exists one repetition whose EER is F(R/log₂(1/p)) with probability at least 1 − 1/2^k = 1 − p. Combined with Eq.
Combined with Eq. (2.15), we use the bounds of the uniform and dynamic schedules in Section 2.4.2 to obtain:
TER_uniform ≤ Õ( (G²/(µN)) ( √(D ln(N) ln(1/p)/(NR)) + 4/p ) ),    (2.16)
TER_dynamic ≤ Õ( (G²/(µN)) ( √(D ln(1/p)/(NR)) + 4/p ) ),    (2.17)
where we again ignore κ and other constants. Similarly, we can extend the momentum methods.

(2) True Risk by Uniform Stability. Following [12], we use uniform stability (defined in Definition 6) to extend the empirical bounds. We restate the related definition and theorems as follows.

Definition 6 (Uniform stability). Let s > 0. A randomized algorithm M: D^N → Θ is s-uniformly stable w.r.t. the loss function f if for any neighboring datasets d and d′ we have
sup_{x∈X} E[f(M(d); x) − f(M(d′); x)] ≤ s,
where the expectation is over the internal randomness of M.

Theorem 11 (see, e.g., [123]). Suppose M: D^N → Θ is an s-uniformly stable algorithm w.r.t. the loss function f. Let D be any distribution over the data space and let d ∼ D^N. Then the following holds:
E_{d∼D^N}[E[f(M(d); D) − f(M(d); d)]] ≤ s,
where the second expectation is over the internal randomness of M, and f(M(d); D) and f(M(d); d) denote the true loss and the empirical loss, respectively.

Theorem 12 (Uniform stability of PGD, from [12]). Suppose η < 2/M for an M-smooth, G-Lipschitz f(θ; x). Then PGD is s-uniformly stable with s = G²Tη/N.

Combining Theorems 11 and 12, we obtain:
TER ≤ EER + G²ηT/N.
Because the EER in this paper comprises a γ^T or similar exponential term, unlike [12] we cannot directly minimize the TER upper bound w.r.t. T and η in the presence of a polynomial form of γ^T and T. Therefore, we still use the T = O(ln(N²R/D)) and η that minimize the EER. Note that
G²ηT/N ≤ O( (G²/(MN)) ln(N²R/D) ) ≤ O( G²/M ),
where we assume N ≫ D and use ln N ≤ N. Because the term O(G²/M) is constant and independent of the dimension, we follow [12] and drop it when comparing the bounds. After dropping the additive term, it is clear that the advantage of dynamic schedules is maintained, since TER ≤ EER. A similar extension can be derived for [139]. We summarize the results and compare them to prior works in Table 2.2, where we include an additional method, Snowball Stochastic Gradient Descent (SSGD), which dynamically schedules the batch size to achieve an optimal convergence rate in linear time.

Discussion. By using uniform stability, we successfully transfer the advantage of our dynamic schedules from empirical bounds to true risk bounds. The inherent reason is that our bounds only need ln N iterations to reach the preferred final loss. With uniform stability, the logarithmic T reduces the gap caused by the transfer. Compared to [42, 12], our method remarkably improves the efficiency in T from N or N² to ln(N), which implies that fewer iterations are required to converge to the same generalization error.

2.5 Experiments

We empirically validate the properties of privacy schedules and their connections to learning algorithms. In this section, we briefly review the schedule behavior on quadratic losses under varying data sensitivity.

Dataset. We create a subset of the MNIST dataset [74] including 1,000 handwritten images of 10 digits (MNIST). We also construct a subset of the MNIST dataset with digits 3 and 5 only, denoted as MNIST35. Compared to the original dataset (70,000 samples), the small sets are more vulnerable to attacks, and private learning requires larger noise (see the 1/N factor in Eq. (2.1)).
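Before detailing the rest of the setup, the following is a minimal sketch of one private (stochastic) gradient step with a per-iteration noise scale σ_t, the quantity that the schedules in this chapter control. The clipping threshold, decay constants, and function names are illustrative assumptions rather than the exact experimental configuration.

import torch

def private_sgd_step(model, loss_fn, batch, lr=0.1, clip_norm=4.0, sigma_t=1.0):
    """One private gradient step: clip per-sample gradients, average, add Gaussian noise.

    A sketch only; per-sample gradients are computed with a simple loop, which is
    inefficient but keeps the privacy mechanism explicit.
    """
    xs, ys = batch
    n = xs.shape[0]
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)  # bound per-sample sensitivity
        for s, g in zip(summed, grads):
            s += scale * g
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * clip_norm * sigma_t / n  # Gaussian noise at scale sigma_t
            p -= lr * (s / n + noise)

def exponential_sigma_schedule(T, sigma0=10.0, decay=0.05):
    """Exponentially decaying noise scales; in practice they would be normalized offline
    so that sum_t 1/sigma_t^2 matches the total privacy budget R."""
    return [sigma0 * (1.0 - decay) ** t for t in range(T)]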
Following the preprocessing in [3], we project the vectorized images into a 60-dimensional subspace extracted by PCA.

Setup. The samples are first normalized so that Σ_{n=1}^{N} x_n = 0 and the standard deviation is 1. Then the sample norms are scaled such that max_n ‖x_n‖ = 10 (i.e., the data scale). On the scaled data, we train a 2-layer deep neural network (DNN) with 1000 hidden units using the logistic-regression loss. We fix the learning rate to 0.1 based on the corresponding experiments of non-private training (the same setting without noise). The total privacy budget is (4, 10⁻⁸)-DP, equal to 0.1963-zCDP, which implies R = 0.3927. To control the sensitivity of the gradients, we clip gradients with a clipping norm fixed at 4; formally, we scale a sample gradient down to length 4 if its norm is larger than 4. Because the schedule highly depends on the number of iterations T, we grid-search the best T in the range [50, 150] for the compared methods, and we ignore the privacy cost of such tuning in our experiments, a protocol also used in previous work [3, 148]. All experiments are repeated 100 times and the metrics are averaged afterwards.

Figure 2.1: Comparison of the dynamic schedule and the uniform schedule on different data scales. The left pane shows the influence by iteration estimated by retraining. The other two panes show the performance of a DNN trained on the MNIST35 dataset with a varying total number of training samples, when the exponential influence is estimated on a randomly-generated auxiliary dataset.

We first show the estimated influence of the step noise q_t (obtained by retraining the private learning algorithms) in Fig. 2.1, left. We see that the influence trends are approximately exponential in t. By Eq. (2.10), the resulting schedule of the noise scale σ_t is a normalized exponential decay. This observation motivates the use of an exponential-decay schedule in practice. To estimate the influence without extra privacy cost, we use an auxiliary set, randomly sampled from a Gaussian distribution, to pick a proper influence curvature parameterized by an exponential function. We use auxiliary synthetic datasets of the same size as the corresponding private datasets to tune the parameters. We vary the size of the training data to examine the data efficiency of the dynamic schedule, denoted as exp. For a fair comparison, we also choose the hyper-parameters of the uniform schedule (uni) on the same auxiliary dataset. We show that as the training size increases, exp outperforms uni on both the training loss and the test accuracy. The result verifies our theoretical conclusion: the dynamic schedule is more data-efficient than the static schedule.

CHAPTER 3
FEDERATED ADVERSARIAL DEBIASING FOR TRANSFERABLE AND FAIR REPRESENTATIONS

Federated learning is a distributed learning framework that is communication-efficient and provides protection over participating users' raw training data. One outstanding challenge of federated learning comes from the users' heterogeneity, and learning from such data may yield biased and unfair models for minority groups. While adversarial learning is commonly used in centralized learning for mitigating bias, there are significant barriers when extending it to the federated framework.
In this work, we study these barriers and address them by proposing a novel approach Federated Adversarial DEbiasing (FADE). FADE does not require users’ sensitive group information for debiasing and offers users the freedom to opt-out from the adversarial component when privacy or computational costs become a concern. We show that ideally, FADE can attain the same global optimality as the one by the centralized algorithm. We then analyze when its convergence may fail in practice and propose a simple yet effective method to address the problem. Finally, we demonstrate the effectiveness of the proposed framework through extensive empirical studies, including the problem settings of unsupervised domain adaptation and fair learning. 3.1 Introduction The last decade witnessed the surging adoption of personal devices such as smartphones, smartwatches, and smart personal assistants. These devices directly interface with the users, collect personal data, conduct light-weighted computations, and use machine learning models to offer personalized services. The challenges from privacy concerns of sensitive personal data, limited computational resources, performance issues of localized learning all together lead to the federated learning (FL) paradigm [100, 16]. FedAvg [96], for example, provides an efficient and privacy-aware FL framework. Users train models locally, upload them to a central server iteratively aggregated to form a global model. FL greatly alleviated privacy concerns because the server can only access model parameters from the users instead of the 29 raw data used for training. One major challenge of FL comes from the user heterogeneity where users provide statis- tically different data for training local models [29, 41]. Such heterogeneity may come from different sources. For example, the users may collect data under various conditions according to preferential or usages differences. Consider the learning of handwashing behavior from accelerometers of smartwatches, where patterns can drastically change when using different basins worldwide. Such domain shift [66] can lead to negative impacts during knowledge transfer among users [106]. Another common source of heterogeneity comes from the sensi- tive group information such as age, gender, and social groups, which are variables typically not to be identified during learning. Heterogeneity from this source is often associated with critical fairness issues [36] after deploying the models, where groups with less resource or smaller computation capability may be biased or even ignored during the learning [93], and the resulting global model may perform worse in minority groups. Adversarial learning [47] has been a powerful approach to mitigate bias in centralized learning, in which an adversarial objective minimizes the information extracted by an encoder that can be maximally recovered by a parameterized model, discriminator. For example, it has been applied to disentangle task-specific features that may cause negative transfer [88], to perform unsupervised domain adaptation [46, 133], and recently to achieve fair learn- ing [161]. However, there are significant barriers when applying adversarial techniques in FL: 1) Most existing approaches follow a top-down principle. In the context of FL, the adversarial objective requires the server to access the sensitive group variable (e.g., gender) to construct an adversarial loss. 
This requirement directly violates the privacy consideration design for FL, and users may not want to disclose their sensitive group variables. 2) adver- sarial learning demands extra information from users for training the adversarial component and imposes an additional computational burden on smart devices that may not be able to afford. 3) besides, it remains unknown how the introduction of an adversarial component would impact the distributed learning behavior (e.g., convergence property) of FL. 30 (b) Federated with shared embeddings (c) Federated Adversarial Debiasing (a) Central [111] (proposed) Figure 3.1: Illustrations of different adversarial learning frameworks for debiasing. f , D and G are classifier (task model), discriminator and encoder, respectively. C1 , C2 , C3 represents the task supervisions, for example, ground-truth classes, in the corresponding users. g1 and g2 represents the two groups of users. The encoders are adversarially trained such that the embeddings are informative for distinguishing C1 , C2 , C3 but not g1 , g2 . The proposed FADE tackles a more challenging problem than other two because of isolated and non-sharing group/user data (or embeddings) and class-wise non-iid users within groups. To address the challenges mentioned above, we propose a novel adversarial framework for debiasing federated learning following a bottom-up principle, called Federated Adversarial DEbiasing (FADE). Besides the benefits from typical FL on communication efficiency and data privacy, FADE aims to achieve the following goals: • Privacy-Protecting: The learning algorithm conforms to the privacy design of FL and does not require users’ group variable to achieve debiasing w.r.t. the group variable. • Autonomous: A user can choose to join and opt-out from the adversarial component anytime (e.g., due to computational budget or privacy budget) while still participate in the regular federated learning. • Satisfiable: Under above restrictions, the distributed learning should output a debi- ased and accurate model, despite the user heterogeneity and unpredictable user par- ticipation. To achieve these goals, we first propose a generic algorithm for FADE and show that ideally, it can attain the same global optimality as the one by the central algorithm. We then show how its convergence may fail in practice and propose a simple yet effective method to address 31 the problem. Finally, we demonstrate the effectiveness of the proposed framework through extensive empirical studies on various applications. 3.2 Related Work Federated Learning (FL) [96] is a distributed learning framework that allows users with different capabilities to collaboratively train a model without sharing their own data. A critical challenge in FL is the heterogeneity among users. Viewing the learning process of FL as knowledge transfer among different users, heterogeneity in user data leads to negative transfer between users and compromises generalization [13]. One idea to alleviate the nega- tive effect from the heterogeneity during the training, is to find the consensus among users. For example, in [49, 85, 77, 43], the consensus on task knowledge is achieved by distillation. In this work, we seek an alternative and efficient approach by adversarial debiasing the users of different groups. Adversarial Learning has been widely applied in various domains, such as neural language recognition [88], image-to-image (dense) prediction [95], image generation [47], and etc. 
Conceptually, adversarial learning aims to solve a two-player (or multi-player) game between two adversarial objectives, which typically leads to a min-max optimization problem. Existing approaches can be briefly categorized as: 1) Sample-to-Sample (S2S) adversarial learning, where the adversarial objective quantifies the difference between synthetic and real samples; examples include adversarial learning against adversarial attacks [94] and generative adversarial networks [47]. 2) Group-to-Group (G2G) adversarial learning, which aims to reduce the maximal discrepancy (bias) between group distributions, for example, adversarial domain adaptation [46], adversarial fairness [161] and adversarial multi-task learning [88]. All these variants assume the availability of the adversarial groups on the same computation node, e.g., by aggregating data as in Fig. 3.1a, and thus cannot be directly extended to federated learning due to the violation of its privacy design (requiring access to the sensitive group information). A recent effort by [111] shares the embeddings of different groups (see Fig. 3.1b). Nevertheless, sharing either data or embeddings could induce additional privacy risks and communication costs. The proposed FADE eliminates these requirements, leading to private and efficient distributed collaboration between users/groups.

3.3 Federated Adversarial Debiasing

In this section, we first formulate the proposed Federated Adversarial Debiasing (FADE) framework. We work on the standard federated learning problem setting, which learns one model from a set of distributed participating users. Users conduct local learning based on their own data and periodically send the parameters of the learned models to a server. The server aggregates the local models to form a global model. We assume the users have non-iid data and each user belongs to one of E user groups, as indicated by a group variable (e.g., age, gender, race) that is not to be shared outside of the local learning.

The model of each user consists of three components: a decoder f for the learning task (e.g., the classification target), an encoder G, and a group discriminator D, as illustrated in Fig. 3.1c. In the two-group setting (a data point belongs to either group 0 or group 1), D outputs a scalar in (0, 1) approximating the probability that an input data point x belongs to group 0. More generally, for E groups, we use a softmax mapping in the last layer of D, which outputs an E-dimensional vector. The FADE objective learns f, D, G by:
min_{f,G} L(f, G) = Σ_{g=1}^{E} Σ_{i=1}^{m_g} L_{i,g}(f, G),    (3.1)
L_{i,g}(f, G) = L_i^task(f, G) + λ max_D L_{i,g}^adv(G, D),    (3.2)
where L_i^task(f, G) is the task loss of the i-th user, L_{i,g}^adv(G, D) is the adversarial loss, and m_g is the number of users in group g. Note that we absorb the model variable D into L_{i,g} in Eq. (3.2), and the objective is still an optimization over f, D, G. For classification tasks, the task loss can be defined as L_i^task(f, G) ≜ E_{(x,y)∼p_i(x,y)}[ℰ(f(G(x)), y)], where ℰ denotes the cross-entropy loss and p_i is the data distribution of user i. The adversarial loss is defined as L_{i,g}^adv(G, D) ≜ E_{x∼p_i(x)}[log D_g(G(x))], where D_g(G(x)) is the g-th entry of the softmax output. The optimal solution of the min-max problem is the adversarial balance, at which D is unable to tell the difference in G(x) among groups.
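As a concrete illustration of Eqs. (3.1)–(3.2), the following is a minimal PyTorch-style sketch of the two loss terms computed by a single user in group g; the function and variable names are illustrative assumptions, and note that only the user's own group index appears.

import torch.nn.functional as F

def fade_user_losses(f, G, D, x, y, g):
    """Per-user FADE loss terms (cf. Eqs. (3.1)-(3.2)), a sketch.

    f: task model, G: encoder, D: group discriminator with E softmax outputs.
    Returns L_task and L_adv = E[log D_g(G(x))]; the objective combines them as
    L_task + lambda * L_adv, with D maximizing and (f, G) minimizing.
    """
    z = G(x)                                            # shared representation
    task_loss = F.cross_entropy(f(z), y)                # L_task for a supervised user
    adv_loss = F.log_softmax(D(z), dim=1)[:, g].mean()  # L_adv using only group g
    return task_loss, adv_loss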
For the two-group case, the adversarial loss can be modified as:
L_{i,g}^adv(G, D) = E_{x∼p_i(x)}[ I(g = 0) log D(G(x)) + I(g = 1) log(1 − D(G(x))) ],    (3.3)
where I(·) is the indicator function.

One fundamental difference between traditional adversarial learning and FADE is that FADE has only one group's data in the loss function. Hence, users have no sense of what an adversary (a user from another group) looks like. Directly optimizing this objective may fail to find the right direction towards convergence; in the worst case, the optimal solution may not be the adversarial balance. In the next section, we provide a principled analysis showing that the adversarial balance is achievable under appropriate conditions.

We summarize the server and user update strategies in Algorithms 3.1 and 3.2. The server is responsible for aggregating users' models and dispatching the global models to users. Meanwhile, users train the received global model and the adversarial component using local data. Note that we use the gradient-reversal strategy to implement the min-max optimization in Algorithm 3.1. Our algorithm enjoys two nice properties:

Autonomous: Different from vanilla FL, FADE allows users to decide whether or not to join the learning of the discriminator D at each iteration. A user can opt into the discriminator learning at a low frequency or opt out completely when privacy becomes a concern or when computational resources are restrictive. For example, in the adversarial domain adaptation setting [111], where some users have supervision and others do not, a supervised user may not want to help unsupervised users. FADE significantly reduces the involved communication cost and privacy-risk overhead by cutting down the interactions from these users.

Privacy: In the proposed FADE framework, the group label g is restricted to local learning, and the group debiasing is done through the discriminator model D. Thus, users are not able to obtain other users' sensitive attributes, including the group variable. Moreover, following [98], the privacy of FADE can be strictly protected by directly injecting differential-privacy noise during the gradient descent procedure.

Algorithm 3.1: FADE User Update
Require: f, G, D received from the server, learning rate η, adversarial parameter λ, user data distribution p_i
1: f_0, D_0, G_0 = f, D, G
2: for t = 1, ..., K do
3:   Sample a batch by x ∼ p_i(x) or (x, y) ∼ p_i(x, y)
4:   z ← G(x)
5:   ∇f ← ∂L_i^task/∂f,  ∇D ← ∂L_i^adv/∂D
6:   if the adversarial game D is accepted by user i then
7:     ∇G ← (∂z/∂G) (∂L_i^task/∂z + λ ∂L_i^adv/∂z)
8:     D_{t+1} ← D_t + η∇D
9:     G_{t+1} ← G_t − η∇G
10:  else
11:    ∇G ← (∂z/∂G) (∂L_i^task/∂z)
12:    D_{t+1} ← D_t
13:    G_{t+1} ← G_t − η∇G
14:  f_{t+1} ← f_t − η∇f
return f_{K+1}, G_{K+1}, D_{K+1}

Algorithm 3.2: FADE Server Aggregation
Require: Initial models f, D, G, momentum parameter β
1: for t ∈ 1, ..., T_max do
2:   Select m active users uniformly at random into A
3:   Broadcast θ_t = (f_t, G_t, D_t) to the m users
4:   for user i in A in parallel do
5:     User updates by Algorithm 3.1
6:   Aggregate {θ_t^i = (f_t^i, G_t^i, D_t^i)}_{i=1}^m and average: θ_{t+1} ← β Σ_{i=1}^m (n_i/N) θ_t^i + (1 − β) θ_t
return f_t, G_t, D_t

3.4 Optimality Analysis

Despite the fact that FADE enables autonomy and improves privacy in learning, it is critical to ask whether the algorithm gives a satisfiable solution and what the optimal solution of Eq. (3.1) is. Remarkably, FADE differs from traditional adversarial learning by Eq. (3.3), where only one group is used to evaluate the adversarial objective.
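To make the local update concrete, below is a minimal PyTorch-style sketch of one iteration of Algorithm 3.1 implemented with a gradient-reversal layer; the class and function names are illustrative assumptions rather than the exact implementation, and the optimizer is assumed to hold the parameters of f, G and D. Only the user's own group index g ever enters the computation.

import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def local_step(f, G, D, optimizer, x, y, g, lam=1.0):
    """One local FADE iteration for a user in group g (cf. Algorithm 3.1).

    The cross-entropy on D equals the negative adversarial term, so a single
    descent step lets D ascend on L_adv while G descends on L_task + lam * L_adv
    through the reversal layer.
    """
    z = G(x)
    task_loss = F.cross_entropy(f(z), y)
    group = torch.full((x.shape[0],), g, dtype=torch.long, device=x.device)
    adv_ce = F.cross_entropy(D(GradReverse.apply(z, lam)), group)  # = -L_adv
    optimizer.zero_grad()
    (task_loss + adv_ce).backward()
    optimizer.step()
    return task_loss.item(), adv_ce.item()

Through the reversal layer, one backward pass realizes the opposite update directions of D and G in lines 7–9 of Algorithm 3.1.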
This single-group evaluation imposes a unique challenge in learning, as it may compromise convergence. Below, we give a formal analysis of the optimality when Algorithm 3.2 is iterated with users from two groups participating with non-zero probability. Since most multi-group adversarial problems can be transformed into two-group problems, we focus on the two-group case for ease of analysis.

Consider the case where each group has only one user. The data distributions of the two users are p1 and p2, respectively. We single out the min-max optimization in Eq. (3.1) as:
min_G max_D E_{p1}[log D(G(x))] + E_{p2}[log(1 − D(G(x)))].
For simplicity, we denote G(x) by z and slightly abuse notation by writing p1(z) for p1(x) in our discussion. Hence, we can define:
D_{p1,p2} = max_D E_{p1}[log D(z)] + E_{p2}[log(1 − D(z))],
which is the maximal discrepancy between p1(z) and p2(z) that D can characterize. Now we can rewrite the min-max problem as min_G D_{p1,p2}(G), which minimizes the distribution distance over z. Alternatively, we can formulate it as min_{p1,p2} D_{p1,p2}, since p1 and p2 are parameterized by G.

Because users may participate in federated learning at varying frequencies, we use an auxiliary random variable ξ_i ∈ {0, 1} for i = 1, 2 to denote whether user i is active for training. We assume ξ_i follows the Bernoulli distribution B(1, α_i). Plugging ξ_i into D_{p1,p2} gives D_{p1,p2} = max_D E_{p1}[ξ_1 log D(z)] + E_{p2}[ξ_2 log(1 − D(z))], and taking the expectation:
D̃_{p1,p2} ≜ E_{ξ1,ξ2}[D_{p1,p2}] = max_D E_{p1}[α_1 log D(z)] + E_{p2}[α_2 log(1 − D(z))].    (3.4)
Therefore, our problem is transformed into minimizing D̃_{p1,p2}. Note that with p1 and p2 given, the solution of the maximization in D̃_{p1,p2} is:
D*_{α1,α2}(z) = α_1 p1(z) / (α_1 p1(z) + α_2 p2(z)),    (3.5)
with which we can derive the following sufficient condition for optimality.

Theorem 13. The condition p1(z) = p2(z) is a sufficient condition for minimizing D̃_{p1,p2}, and the minimal value is α_1 log α_1 + α_2 log α_2 − (α_1 + α_2) log(α_1 + α_2).

Theorem 13 shows that even if some users are inactive, the distribution matching p1 = p2 remains a sufficient optimality condition. We remark that the above result can be generalized to multiple users when all users are i.i.d. and ξ_i represents the ratio of group i among the users. In addition, we note that Theorem 13 does not guarantee stable convergence or exclude other undesired solutions. We discuss these issues in the following.

3.4.1 The effect of imbalanced groups

Although Theorem 13 shows the optimality of the matched distributions, the optimization may still fail to converge, especially when one group of users is relatively inactive, e.g., α_1 ≪ α_2. When α_1 ≪ α_2 or the reverse, we call the situation imbalanced groups. Imbalanced groups arise because users are free to quit or join the training. From Eq. (3.5), we observe that D*(x) will be less sensitive to changes in p1(x) if α_1 ≪ α_2, and vice versa. Meanwhile, log D*(x) → −∞ and D̃_{p1,p2} approaches its minimum even if p1 and p2 are quite different.

Theorem 14. Let ε be a positive constant. Suppose |log p1(x) − log p2(x)| ≤ ε for any x in the support of p1 and p2. Then we have D̃_{p1,p2} = O(α_1/(α_1 + α_2)) when α_1 ≪ α_2.

Theorem 14 reveals that the imbalance between groups can greatly reduce the sensitivity of D̃_{p1,p2} to the discrepancy ε between p1 and p2. A less sensitive discriminator will ignore the minor differences between groups. The importance of discrepancy sensitivity for adversarial convergence was also discussed in [8].
It is easy to see the negative impact of the low sensitivity: 1) higher communication costs are incurred, because more communication rounds are required to check the discrepancy; 2) the optimization may fail to converge due to vanishing gradients (scaled by α_1).

3.4.2 Squared adversarial loss

In Eq. (3.4), when α_1 → 0 and α_2 → 1, we notice that D̃_{p1,p2} approaches 0 while E_{p1}[log D(z)] → −∞. In other words, the large magnitude of E_{p1}[log D(z)] is neglected due to its coefficient α_1. To re-emphasize this value, a heuristic is to increase the weight when |E_{p1}[log D(z)]| is large. Thus, we propose to replace L_{i,g}^adv(G, D) by:
L_{i,g,2}^adv(D, G) = −(1/2) (L_{i,g}^adv(G, D))²,    (3.6)
which we call the squared adversarial loss (a short sketch of this loss is given at the end of Section 3.4.3). We can write the corresponding discrepancy D̃_{p1,p2}^{(2)} as:
min_D α_1 E²_{p1}[log D(z)] + α_2 E²_{p2}[log(1 − D(z))].
Though we derive the squared adversarial loss in a heuristic manner, the loss can be explained from the view of resource-fair federated learning [79]. Because the adversarial objective pays more attention to the more frequent group, we can interpret the problem as unfairness between groups. Following [79], we generalize our adversarial loss function as:
L_{i,g,q}^adv(D, G) ≜ (−1)^{q−1} (1/q) E_x[ℓ^q(D, G; x)],    (3.7)
where q ≥ 1. If q = 1, the loss degrades to the vanilla one.

3.4.3 The effect of non-iid users

It is well known that typical federated learning approaches suffer from very heterogeneous users, since users sample data from very different distributions. The adversarial objective captures and decreases the group heterogeneity by design. Another kind of heterogeneity is related to the users' tasks. We argue that this heterogeneity is natural and could be essential for task discriminability, but it may be accidentally eliminated by adversarial learning. For example, suppose three users are non-iid by holding three different classes. After FADE training, the non-iid users may collapse to similar distributions due to a wrong sense of the group discrepancy.

To prove the existence of user-collapsed solutions for FADE, we consider z ∼ p(z|T = t), or simply z ∼ p(z|t), where t is a discrete hidden variable related to the users' tasks. For example, each user has one class of samples in a classification task; then t is the corresponding class. In addition, we define p̂1(z) = (1/m) Σ_{t=1}^m p(z|t), which is a p.d.f. For simplicity, we assume all users always participate in the learning, i.e., α_i = 1 for all users. Hence, we can obtain D_{p1,p2} as
max_D Σ_{t=1}^m E_{p(z|t)}[log D(z)] + E_{p2}[log(1 − D(z))] = max_D m E_{p̂1(z)}[log D(z)] + E_{p2}[log(1 − D(z))],
whose maximizer is given by D*(z) = m p̂1(z) / (m p̂1(z) + p2(z)). Using similar derivations as in Theorem 13, we can show that p̂1(z) = p2(z) is a sufficient optimality condition, which implies:
Σ_{t=1}^m p(z|t) = m p2(z).    (3.8)
First, we can still obtain p1(z) Σ_{t=1}^m p(t|z)/p(t) = m p2(z) from Eq. (3.8), where we use p(z|t) = p1(z) p(t|z)/p(t). If Σ_{t=1}^m p(t|z)/p(t) = m, then we get the vanilla solution, p1(z) = p2(z). Besides the vanilla solution, a trivial solution to Eq. (3.8) is p(z|t) = p2(z). However, this solution could hurt the task utility, since it may eliminate the inherent differences between tasks. For instance, if t represents the classification label, the solution removes the class discriminability of the representation z. We call this scenario user collapse. It is worth noting that user collapse can happen even if p1 and p2 are matched.
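To make the re-weighted loss of Eq. (3.6) concrete (as referenced in Section 3.4.2), the following is a minimal sketch computing the squared adversarial loss from a user's per-sample log D_g values; the function name is an illustrative assumption.

import torch

def squared_adversarial_loss(log_dg):
    """Squared adversarial loss of Eq. (3.6), a sketch.

    log_dg: per-sample values of log D_g(G(x)) for the user's own group g,
    as a torch.Tensor. The vanilla loss is their mean; squaring it (with the
    sign flip) re-emphasizes users whose adversarial loss has a large
    magnitude, mitigating the imbalanced-group issue of Section 3.4.1.
    """
    vanilla = log_dg.mean()          # L_adv = E_x[log D_g(G(x))]
    return -0.5 * vanilla.pow(2)     # -(1/2) * (L_adv)^2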
3.4.4 Mitigate user collapse

Since there are arbitrarily many solutions to Σ_{t=1}^m p(t|z)/p(t) = m, we need to constrain the feasible solutions such that the collapsed solution is eliminated. Notice that p(t|z)/p(t) = p(t, z)/(p(z)p(t)) is related to the mutual information between t and z. Conceptually, we can modify the adversarial loss to:
L̂_{i,g,2}^adv(D, G) = L_{i,g,2}^adv(D, G) + I(G(x); t | i),
where I(G(x); t | i) is the mutual information conditioned on user i. Because mutual information is hard to estimate in practice (especially given few samples), we provide surrogate solutions. If t represents the class labels and supervision is available, then I(G(x); t | i) is already encouraged by L^task. If supervision is not available, we may maximize the entropy of the output of the classifier f such that the correlation between users' tasks and representations does not disappear during training. Useful techniques were previously exploited for unsupervised domain adaptation, e.g., [92], and we defer the technical details to Section 3.5.2.

3.4.5 Privacy risks from malicious FADE users

Our analysis suggests the feasibility of using adversarial training in the federated setting. Distribution matching is achievable in a variety of cases, including imbalanced groups, although the success rate may vary. But such power also implies a potential privacy overhead associated with FADE. Consider a malicious user i who wants to steal data from others: FADE can match p_i(x) with a victim's data p_j(x). The empirical study in [53] also discussed the risk that a malicious attacker may take advantage of the discriminator to steal other users' data. Our results in Theorem 13 theoretically show that such an attack is possible in general. During the learning of the adversarial discriminator, injecting predefined noise is known to be effective for defending against such attacks [131]. Meanwhile, users can quit or frequently opt out of the federated communication when the privacy budget (quantified by noise and a differential-privacy metric [38]) is low. Based on Theorem 14, when more and more users opt out of the communication, the adversary's discriminator can hardly sense an individual victim's distribution.

3.5 Experiments on Unsupervised Domain Adaptation

In this section, we evaluate the FADE algorithms on Unsupervised Domain Adaptation (UDA) [111, 82, 43]. UDA aims to mitigate the domain shift between supervised and unsupervised data such that the trained classifiers can generalize to unlabeled data. We call the supervised user (domain) the source user (domain). Each domain may include multiple users.

Related work. [111] is among the first to discuss adversarial UDA under federated constraints, through sharing the embeddings of samples. However, we consider a more challenging problem: federated adversarial learning without sharing data. Recently, learning without access to the source data has gained increasing attention. SHOT [82] considered domain adaptation using only the source-domain model, which surprisingly outperformed most traditional UDA methods that use source supervision. However, its success relies on representation distributions being pre-matched (though not well discriminated) by the batch normalization (BN) layers. In the FADE setting, the BN layers fail to match representations since the local estimates of their means are easily biased. In addition, in [43], distillation is used to avoid directly passing data.
Differing from [43], FADE is more efficient since it does not need to upload all models from the source domain to the target domain. For example, if M_s (M_t) users in the source (target) domain take part in training, sending models involves M_s·M_t communications; instead, FADE only needs M_s + M_t communications between the domains.

Figure 3.2: Comparison of the vanilla adversarial loss versus the squared adversarial loss on MNIST-to-USPS (top) and USPS-to-MNIST (bottom) UDA. We vary the probability of target users. For both UDA experiments, the SOTA central methods [81] can achieve over 98% accuracy. From left to right, the columns are target-domain accuracies, classification losses and adversarial losses of target-domain users.

Network architectures. We adopt the same network architecture as the state of the art in UDA [81]. As presented in Fig. B.1, we first use a backbone network to extract sample features. Specifically, we use a modified LeNet [92] for digit recognition, ResNet50 [51] for the Office and Office-Home datasets, and ResNet101 for the VisDA-C dataset. We use a one-layer bottleneck to reduce the feature dimension; after the bottleneck, we obtain a 256-dimensional representation. A single fully-connected layer is used for classification at the end. The discriminators are small-scale networks to match the capability of the classifiers. The networks and algorithms are implemented using PyTorch 1.7. The ResNet backbones pre-trained on ImageNet are retrieved from the torchvision 0.8 package.

Table 3.1: Averaged classification UDA accuracies (%) on the Office and OfficeHome datasets with 3 non-iid target users and 1 source user. Underlines indicate the occurrence of non-converged results. Standard deviations are included in brackets.

Method | A→D | A→W | D→A | D→W | W→A | W→D | Re→Ar | Re→Cl | Re→Pr | Avg.
Federated methods
Source only | 79.5 | 73.4 | 59.6 | 91.6 | 58.2 | 95.8 | 67.0 | 46.5 | 78.2 | 72.2
Non-iid target users w/ 20 (Office) or 45 (OfficeHome) classes per user
FADE-DANN | 85.4 (1.9) | 81.8 (1.8) | 43.1 (33) | 97.7 (0.5) | 64.8 (0.5) | 99.7 (0.2) | 46.4 (37) | 34.9 (27) | 78.8 (0.1) | 70.3
FADE-CDAN | 92.3 (1.2) | 91.6 (0.5) | 65.9 (9.3) | 98.9 (0.2) | 70.2 (0.8) | 99.9 (0.1) | 70.3 (1.6) | 54.9 (4.6) | 82.2 (0.1) | 80.7
FedAvg-SHOT | 83.6 (0.5) | 83.1 (0.5) | 64.7 (1.4) | 91.7 (0.2) | 64.7 (2.2) | 97.4 (0.5) | 70.7 (0.5) | 55.4 (0.5) | 80.1 (0.3) | 76.8
iid target users
FADE-DANN | 84.2 (1.5) | 81.3 (0.4) | 66.3 (0.3) | 97.5 (1.2) | 59.4 (10.6) | 99.9 (0.2) | 67.3 (0.9) | 51.3 (0.4) | 79.0 (0.6) | 76.2
FADE-CDAN | 93.6 (0.8) | 92.2 (1.3) | 71.2 (1.0) | 98.7 (0.4) | 71.3 (0.7) | 100 (0.0) | 70.6 (1.3) | 55.1 (1.0) | 82.3 (0.2) | 81.7
FedAvg-SHOT | 96.3 (0.5) | 94.3 (1.1) | 70.9 (2.0) | 98.4 (0.4) | 72.7 (0.9) | 99.8 (0.0) | 74.8 (0.3) | 60.0 (0.1) | 84.9 (0.2) | 83.6
Central methods
ResNet [51] | 68.9 | 68.4 | 62.5 | 96.7 | 60.7 | 99.3 | 53.9 | 41.2 | 59.9 | 67.9
Source [81] | 80.8 | 76.9 | 60.3 | 95.3 | 63.6 | 98.7 | 65.3 | 45.4 | 78.0 | 73.8
DANN [46] | 79.7 | 82.0 | 68.2 | 96.9 | 67.4 | 99.1 | 63.2 | 51.8 | 76.8 | 76.1
CDAN [92] | 92.9 | 94.1 | 71.0 | 98.6 | 69.3 | 100 | 70.9 | 56.7 | 81.6 | 81.7
SHOT [81] | 94.0 | 90.1 | 74.7 | 98.4 | 74.3 | 99.9 | 73.3 | 58.8 | 84.3 | 83.1

3.5.1 Digit recognition with imbalanced groups

As discussed in Section 3.4.1, group imbalance can result in a mismatch of the group distributions. Here, we empirically evaluate the effect of imbalanced groups on convergence, adversarial losses and utility performance.

The Digit dataset is a standard UDA benchmark built on digit images collected from different environments; the 10 digits, from 0 to 9, are included. We follow the UDA protocol of [54] and use two subsets: MNIST and USPS. The MNIST dataset contains 60,000 training and 10,000 testing 28 × 28 gray-scale images.
USPS consists of 7291 training and 2007 testing 16 × 16 gray-scale images. We augment the USPS training set by resizing and random rotation. Setup. We assume 2 users from source and target domain, respectively. In each round, we select one user with predefined probability. For example, the case that source and target users are of 0.05 and 0.95 probability implies severe imbalance. If a user/group has high 42 probability, that means the user/group will actively participate in the adversarial learning and the other will activate less. The experiment can also be generalized to multiple users in same frequency while one domain has more users. Both situations imply the imbalance between two groups. In experiments, we fix the batch size to 32 and run one user per communication round. In total, we train the users for global 8600 rounds. In each global round, the users will train locally for 10 iterations. Experiments are repeated 3 times with three fixed seeds. At the beginning, we train the models with adversarial coefficient λ = 0 when all source users are involved until the classification loss converges. Then, we follow [46, 81] to use the decaying schedule of learning rates and schedule the adversarial coefficient λ from 0 to 1. Results are reported in Fig. 3.2. Left two figures show the negative impact of imbal- anced groups. When the imbalance is severe (large or small target probability), the drop in target accuracies is more obvious. In the middle pane, the convergence curves of imbalanced groups fluctuate more significantly and fail to converge. In the last pane, the imbalanced cases have large adversarial losses which barely decrease by federated iterations. It explains why the corresponding classification tasks fail to converge. When using the squared adver- sarial losses, the ignored adversarial losses of low-frequent users are reduced during federated learning. Meanwhile, the convergence of utility losses are faster. Thus, the negative impact of imbalanced groups is mitigated. 3.5.2 Object recognition with non-iid users In Section 3.4.3, we prove that the non-iid distribution of users will lead to a trivial solution which may lose the natural discrepancy between users. For federated classification learning where each user only has a partial set of classes, the loss of user discrepancy will make the representations non-discriminative to classes. Here, we conduct experiments to reveal the impact of the non-iid users. Dataset. We adopt three object recognition datasets, Office [121] (small size), Office- Home [135] (medium size) and VisDA-C [112] (large size), including image of office products. 43 The former two are standard benchmarks widely used for UDA. The Office dataset contains three domains: Amazon (A), DSLR (D) and Webcam (W) with 2817, 498, 795 images, respectively. 31 object classes of images are taken under different office environments (corre- sponding to domains). The Office-Home datasets have 65 categories and 4 domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Re) with 2427, 4365, 4439 and 4357 images, respectively. The VisDA dataset is a challenging large-scale benchmark. The source domain comprises 12-way synthetic classification data. In total, 1.5 × 105 images are synthesized by rendering 3D models and are adapted to 55, 000 unla- beled real-world images. Setup. In total, 4 users are generated from two domain datasets. First, we let the single source domain user with all classes. 
Second, we generate 3 non-iid target domain users with partial set of classes following the standard federated setting [96]. For Office dataset, each user has 20 classes and adjacent users have consecutive classes with 10-class stride. For instance, user 1 has class 0 to 20 and user 2 has class 10 to 30. For OfficeHome dataset, each user has 45 classes with 20-class stride. For VisDA-C dataset, each user has 5 classes with 4-class stride. All users in the same domain will have the same number of samples. We select 2 users per communication round when training on OfficeHome. For VisDA-C dataset, we adopt 1 user per round. In this case, the major difficulty comes from non-iid distributions of users conditioned on the subset of classes. In experiments, the parameters for SHOT follows [81]. Details of network architectures and learning setup are discussed in Section B.2. Baselines. We compare different UDA methods extended by FADE upon the presence of non-iid users. DANN [46] is the first work on adversarial domain adaptation based on which many recent methods are developed. CDAN [92] is the first to condition the discriminator prediction on the estimated classes, which aligns with our purpose to maximize the mutual information between user (related to classes) and representation. SHOT [81] (extended by FedAvg [96]) is the current state-of-the-art method in domain adaptation which does not use source data, assuming approximately mitigated domain shift. 44 Table 3.2: Comparison of target accuracies on Visda-C dataset. Methods Source only DANN SHOT CDAN Central 46.6 57.6 82.9 73.9 FADE 54.3 56.4 69.2 73.1 (+SHOT) Results. We summarize the results in Tables 3.1 and 3.2. Note that the straightforward extension of DANN without constraints will suffer from the user heterogeneity. Therefore, we observe catastrophic failures by DANN, for example, D→A with only a low accuracy. This kind of failures happens when both D (498) is of less samples than A (2817). The possible reason is that the discriminators fail to sense the position of target domain batches which is a small ratio of all target-domain samples and changes frequently by iterations. In comparison, when regulated by estimated classes, SHOT and methods combined with SHOT perform better. Notably, because SHOT relies on BN states to mitigate domain shift, its accuracies are much worse than its central version. Since SHOT can provide pseudo supervisions which conditions on the estimated users’ local classes, DANN+SHOT outperforms DANN. In reverse, DANN helps SHOT to mitigate the domain shift. We further explore CDAN+SHOT, which conditions group discrimination on local classifier outputs (correlated to users’ classes). As a result, CDAN+SHOT outperforms other methods and is close to the central version of CDAN. Plus, CDAN+SHOT achieves the best average accuracies when the number of users per round varies from 1 to 4. Remarkably, in the hardest case where only one user is trained per round, CDAN+SHOT gains the best accuracies on 8 out of 9 tasks. In a more challenging large-scale VisDA-C dataset, CDAN+SHOT also shows its advantage against other baselines (see Table 3.2). We note that adversarial methods are more robust to the non-iid users. 3.6 Experiments on Fair Federated Learning The fair federated learning is motivated by the imbalanced groups in training. For example, when vendor rallies people to use their software and train model with locally collected data, the global model may be biased by the majority, e.g., male users. 
When a user of another gender uses the software, she/he may find that the model performs poorly. As a result, the minority group vanishes while the majority continues to dominate. Thus, a method that actively debiases w.r.t. the groups is essential to counteract this tendency.

Figure 3.3: FADE with/without adversarial losses. In each subfigure, the left pane shows fairness measured by ∆EO, where smaller values indicate better fairness; the right pane shows the trade-off between fairness and utility, where the top-left is the preferred balance. (a) Results on Adult by varying the number of male users. (b) Results on MCI data by varying the number of users per round.

Related work. Fairness in federated learning was first discussed in [79], where users are assumed to have different computation capabilities. Fairness was enforced by increasing the weights of large losses, which were less optimized. In this experiment, we consider the unfairness brought by the difference in group distributions. With FADE, we use a local discriminator to judge whether a user's representations are biased away from the other group. Related central algorithms have been exploited [93, 137]. To the best of our knowledge, we are the first to encourage such group-based fairness in the federated setting. Importantly, our method preserves the privacy of group variables. Concerns about the privacy of group variables were previously discussed in [48], where Hashimoto et al. assume that the group membership and the number of groups are unknown to the central learning server when users interact with the system and contribute data. Our FADE extends this setting to a distributed framework where the private group information remains unknown to other parties, including the aggregation server.

We utilize Equalized Odds (∆EO) to evaluate the degree of fairness. Consider a binary classifier f: Z → {0, 1} predicting the label variable y when representations z ∈ Z are sampled from two groups. We denote the conditional p.d.f. p(z|g, y) by z_{g,y}, which shapes the probability of z for group g and class y. An algorithm is said to be fair if the positive ∆EO (defined below) is close to 0:
∆EO ≜ |E_{z_{0,0}}[f(z)] − E_{z_{1,0}}[f(z)]| + |E_{z_{0,1}}[1 − f(z)] − E_{z_{1,1}}[1 − f(z)]|,    (3.9)
which comprises the absolute difference in false positive rates and the absolute difference in false negative rates.

3.6.1 Fair adult income prediction

Dataset. We evaluate our algorithm on the UCI Adult dataset¹, which is a standard benchmark for fair classification. The dataset consists of over 40,000 vector samples from the 1994 US Census. Each sample includes 14 attributes used to predict whether a person's income is over 50,000 dollars.

Setup. We adversarially disentangle the unfair representations from the gender attribute. Keeping the total data size fixed, we construct one female user and vary the number of male users. Each synthesized user evenly splits the samples within its group. We run FADE for 8,000 communication rounds. In every round, 2 users are selected to train for 1 local iteration on a batch of 64 samples. The accuracies and fairness are evaluated on the left-out 10% of samples. The network structure is shown in Fig. B.2. We set the hyper-parameter β = 0.5 and the initial learning rate to 10⁻³.

Results are depicted in Fig. 3.3a. Without adversarial training, the unfairness is aggravated when the imbalance between groups worsens. When more male users are involved, the squared adversarial loss is able to further improve the fairness. In contrast, the vanilla adversarial loss performs better when the two groups are balanced.
Both adversarial losses will maintain the utility performance close to the non-adversarial method. 3.6.2 Fair MCI detection Dataset. Mild Cognition Impairment (MCI) is the pre-symptom of Alzheimer’s Disease (AD) which typically happens on elders. Early detection of MCI is important for prevention 1 https://archive.ics.uci.edu/ml/datasets/adult 47 of AD occurrence and treatment [5, 134]. Details of the dataset is comprised in Section B.2.3 where females forms the majority group (over 94%). The prediction task here is to classify the disease condition, Normal Cognition (NC) or MCI, based on the daily activities (walking speed, etc.). Setup. In the original dataset, there are 88 users with different number of samples. We notice the imbalance between NC and MCI users will greatly degrade the model quality. To focus on our fairness task, we manually select 26 users such that 13 users was diagnosed as NC at least once and the other 13 ones are stable MCI patients. Because male users are much fewer than female ones, we prefer to select male users when balancing the two classes. After downsampling, users have 39 samples on average. Among the 26 users, there are 6 males and 20 females in total. Details of features, preprocessing and network architectures are deferred to Section B.2.3. We set hyper-parameters as β = 0.5, the initial learning rate as 10−2 and batch size as 16. In the 700 communication rounds, we first train without adversarial losses for 400 rounds and then schedule the λ and learning rates as the Adult experiments. Results. We compare the convergence of the training F1 -score (utility) and ∆EO (fair- ness) by varying the number of users per round. As shown in Fig. 3.3b, the unfairness is obvious with ∆EO over 0.2 when no adversarial losses are used. We see that the vanilla ad- versarial loss failed to debias in most cases. In contrast, the squared adversarial loss stably debias the unfair performance in all cases. When the number of users per round is less than 10, even the non-adversarial loss is more fair. The natural debiasing could be attributed to the random selection of users, which breaks the domination of one group in a short span. 48 CHAPTER 4 EFFICIENT FEDERATED LEARNING FOR ON-DEMAND AND IN-SITU CUSTOMIZATION Federated learning (FL) provides a distributed learning framework for multiple participants to collaborate learning without sharing raw data. In many practical FL scenarios, partici- pants have heterogeneous resources due to disparities in hardware and inference dynamics that require quickly loading models of different sizes and levels of robustness. The hetero- geneity and dynamics together impose significant challenges to existing FL approaches and thus greatly limit FL’s applicability. In this paper, we propose a novel Split-Mix FL strategy for heterogeneous participants that, once training is done, provides in-situ customization of model sizes and robustness. Specifically, we achieve customization by learning a set of base sub-networks of different sizes and robustness levels, which are later aggregated on- demand according to inference requirements. This split-mix strategy achieves customization with high efficiency in communication, storage, and inference. Extensive experiments demon- strate that our method provides better in-situ customization than the existing heterogeneous- architecture FL methods. 
4.1 Introduction Federated learning (FL) [72] is a distributed learning paradigm that leverages data from remote participants and aggregates their knowledge without requiring their raw data to be transferred to a central server, thereby largely reducing the concerns from data security and privacy. FedAvg [96] is among the most popular federated instantiations, which aggregates knowledge by averaging models uploaded from different participants. When deploying federated learning, one challenge in real-world applications is the run- time (i.e., test-time) dynamics: The requirements on model properties (e.g., inference ef- ficiency, robustness, etc.) can be constantly changing during the run-time, depending on the status of the devices or the outside environment. One common and specific type of dynamics is resource dynamics: For each application, the allocated on-device resources (e.g., 49 run-time memory, CPU bandwidth, etc.) may vary drastically during run-time, depending on how the resource allocation of the running programs are prioritized on a participant’s de- vice [153]. Another type of dynamics is the robustness dynamics: The constantly changing outside environment can make different requirements on the safety (or robustness) level of the model [140]. For instance, the quality of real-time videos captured by autonomous cars can suddenly degrade, e.g., on entering a poor-lighted alley or tunnel from a well-lighted avenue, on entering a section of bumpy road which leads to a sudden burst of blurring in the videos, etc. In such cases, a more robust model should be quickly switch in and replace the one used on benign conditions, in order to prevent catastrophic accidents caused by wrong recognition under poor visual conditions. Such dynamic run-time requirements demand the flexibility to customize the model. However, as illustrated in Fig. 4.1a, we show that conven- tional static-model FL methods, represented FedAvg, cannot provide such customization. A naive solution is to train multiple models with different desired properties and keep them all on device. However, this leads to extra training and storage costs proportional to the number of models. Moreover, since it is not practical to keep all models simultaneously in run-time memory on a resource-limited device, it also introduces inference overhead to swap the models into and out of the run-time memory [33]. To effectively and efficiently train models for on-demand an in-situ customization, new challenges will be raised by the ubiquitous heterogeneity of federated learning participants. Fist, the participants can have resource heterogeneity: Different participants have differ- ent hardware resources available, such as memory, computing power, and network band- width [63]. For example, in a learning task for face recognition, clients may use different types of devices (e.g., computers, tablets or smartphones) to participate in learning. To ac- commodate different hardware, one can turn to more resource-flexible architectures trained by distillation from ensemble [85], partial model averaging [28], or directly combining predic- tions [125]. Specifically, [28] is the first heterogeneous-width solution (HeteroFL) allowing in-situ model-size switching. Nevertheless, it suffers from under-training in its large models 50 (a) Illustration of FedAvg [96] with a device-incompatible model and a heterogenous-architecture variant (HeteroFL) [28] with under-trained wide models. 
(b) The proposed Split-Mix framework provides in-situ customization of widths and adversarial robustness to address heterogeneity and dynamics, enabling efficient training and inference. In this example, we use a subnet with 1/4 channels (or widths) per layer as a base model for model-width customization. For simplicity, we denote it ×0.25 net, and a ×1 net can be split into 4 ×0.25 base models. Figure 4.1: Comparison of traditional and proposed methods. due to local budget constraints as shown in Fig. 4.1a. The degradation could be worsened as facing data heterogeneity: The training datasets from participants are not independent and identically distributed (non-i.i.d.) [80, 41, 59, 169]. When one device with a unique data distribution cannot afford training a large model, the global large model may not transfer to the unseen distribution [106]. Thus, HeteroFL may not provide effective customization such that more parameters brings in higher accuracy and how to train an effectively customizable model still remains unknown. To address the aforementioned challenges from heterogeneity and dynamics, we study a novel Split-Mix approach to enable FL on heterogeneous devices and achieve in-situ model customization for resource efficiency and robustness: The size and robustness of the resultant model can be efficiently customized at run-time. Specifically, we first split the complete knowledge in a large model into several small base sub-networks (shards) according to model 51 widths and robustness levels. To complete the knowledge, we let the base models be fully trained on all clients. To provide customized models, we mix selected base models to construct the desired model size and robustness. We illustrate the main idea in Fig. 4.1b. Overall, our contributions can be summarized in three folds: • Within the domain of heterogeneous federated learning, we are the first to study training a model with the capability of in-situ customization with heterogeneous local computation budgets, which cannot be resolved by existing methods yet as shown in Fig. 4.1a. • To address the challenge, we propose a novel Split-Mix framework that aggregates knowl- edge from heterogeneous clients into a width- and robustness-adjustable model structure. Remarkably, due to fewer parameters and modular nature, our framework is not only effi- cient in federated communication and flexibly adaptable to various client budgets during training, but also efficient and flexible in storage, model loading and execution during inference. • Empirically, we demonstrate that the performance of the proposed method is better than other FL baselines under heterogeneous budget constraints. Moreover, we show its effec- tiveness when facing the challenge of data heterogeneity. 4.2 Related Work Heterogeneous Federated Learning. As increasing concerns have been gained on data privacy leakage in machine learning [35, 147, 149, 57], federated learning (FL) protects data privacy by training the model locally on users’ own devices without sharing data. In real- world applications, FL with budget-insufficient devices (e.g., mobile devices) has attracted a great amount of attention. For example, FedDistill [85] used logit averages to distill a network of the same size or several prototypes, which will be communicated with users. FedDistill made an assumption that the central server has access to a public dataset of the same learning task, which is impractical for many applications. 
[49] introduced a distillation-based method that aggregates private representations from all participants. The method closely resembles centralized learning because all encoded samples are gathered; however, it is less efficient when local clients have large data dimensions or sample sizes. Importantly, the method may not transfer adversarial-robustness knowledge through the intermediate representations, due to the decoupling of the input and the prediction. On the other hand, HeteroFL [28] avoids distillation, allowing participants to train different prototypes and share parameters among prototypes to reduce redundant parameters. However, HeteroFL also reduces the samples available for training each prototype, which leads to degraded performance. Considering the high cost of training robust models, [56] proposed an efficient way to transfer model robustness from budget-sufficient devices to insufficient ones. FedEnsemble [125] is technically related to the proposed approach; it uses an ensemble of diverse models to accommodate non-i.i.d. heterogeneity. The authors showed that combining multiple base models trained simultaneously in FL can outperform a single base model in testing. A critical difference between the proposed approach and FedEnsemble comes from the challenging problem setting of constrained heterogeneous computation budgets. For the first time, we show that base models can be trained adaptively under budget constraints and used to customize efficient inference networks that outperform a single model of the same width trained in a heterogeneous-architecture way.

Customizable Models. To the best of our knowledge, this is the first paper discussing in-situ customization in federated learning, so here we review similar concepts in central learning. First, customization of robustness and accuracy was discussed by [140], where an adjustable conditional layer together with decoupled batch normalization was used. The conditional layer enables a continuous trade-off but also brings in more parameters. In comparison, our method uses a simple weighted combination without additional parameters and is very efficient in both communication and inference. In terms of model complexity, a series of works on dynamic neural networks provide an instant, adaptive and efficient trade-off between accuracy and latency (mainly related to model complexity) at the inference stage. Typically, sub-path networks of different complexity [87] or sub-depth networks [60, 150, 143, 163] are trained together. However, due to the large memory footprint brought by a constant number of channels per layer, the memory footprint at inference is barely reduced. To address this challenge, slimmable neural networks (SNN) [158] were proposed to train networks with adaptive layer widths. Distinct from SNN, we consider a more challenging scenario with distributed, non-sharable data and heterogeneous client capabilities.

4.3 Problem Setting

The goal of this work is to develop a heterogeneous federated learning (FL) strategy that yields a global model which can be efficiently customized for dynamic needs. Formally, FL minimizes the objective
(1 / Σ_k |D_k|) Σ_{k=1}^K Σ_{(x,y)∈D_k} L(f(x; W), y),
where L is a loss function (e.g., the cross-entropy loss), f is a model parameterized by W, and {D_k}_{k=1}^K are the training datasets on the K participants with image-label pairs (x, y). Following the standard FL setting [96], only model parameters can be shared to protect privacy.
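For reference, a minimal sketch of the standard FedAvg-style aggregation that optimizes this objective is given below (sample-size-weighted averaging of client parameters); the function and variable names are illustrative assumptions.

import copy

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client state_dicts, proportional to |D_k| (a sketch).

    client_states: list of state_dicts returned by the participating clients.
    client_sizes:  list of local dataset sizes |D_k| used as weights.
    Only parameters are exchanged; raw data never leaves the clients.
    """
    total = float(sum(client_sizes))
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return avg_state  # load into the global model with load_state_dict(avg_state)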
We require efficient run-time customization of the model $f$ for resource dynamics (Section 4.4.2) and robustness dynamics (Section 4.4.3). (Customization of other properties can also be applied within our framework; we consider model size and robustness given their practical importance.)

When training customizable/adjustable models, significant challenges arise from heterogeneity among clients. In this paper, we consider the following. 1) Heterogeneous computational budgets during training constrain the maximal complexity of local models. The complexity of deep neural networks can be defined along multiple dimensions, such as depth or width [160, 129]. In this paper, we consider the width of hidden channels, because it can significantly reduce not only the number of model parameters but also the layer output cache. Thus, we assume clients have confined width capabilities $\{R_k \in (0, 1]\}_{k=1}^{K}$, defined by width ratios w.r.t. the original model, i.e., an $\times R_k$ net as presented in Fig. 4.1b. Similar to FedAvg and HeteroFL, the same architecture is used by all clients, and therefore the model width can be tailored according to local budgets. In many applications, there are usually a significant number of devices with insufficient computational budgets [40]. For example, we may assume exponentially distributed budgets over uniformly divided client groups: $R_k = (1/2)^{\lceil 4k/K \rceil}$, i.e., the first group, containing 1/4 of the clients, can afford the full width, and the remaining groups can afford the 0.5-, 0.25-, and 0.125-widths, respectively. This budget distribution simulates a realistic scenario where most federated mobile phones prefer a low-energy solution, training smaller models (possibly in the background) while reserving resources for primary tasks. Other budget distributions, e.g., log-normal distributions, are discussed in Section C.1.5. 2) Heterogeneous data distributions, e.g., non-i.i.d. features with $D_k \sim \mathcal{D}_i$ for an ordered domain $i$, induce additional challenges under a skewed budget distribution, since training a single model on multiple domains [80] or training models within a single domain (due to budget constraints) [59, 31] is known to be suboptimal in transfer performance.

In summary, training models of various sizes faces challenges from 1) resource heterogeneity, where the disparity in computation budgets significantly affects the training of large models because they can only be trained on the scarce budget-sufficient clients; and 2) data heterogeneity, where the under-trained large models may perform poorly on unseen in-domain or out-of-domain samples.

4.4 Method

To provide efficient in-situ customization, we introduce a simple yet powerful principle, Split-Mix: shatter complete knowledge into smaller pieces, and customize by flexible formations of the pieces. Based on this principle, we propose customizations of model size and robustness in the following.

4.4.1 Case Study: Customizable Networks from Budget-Constrained FL

For motivation, we set up a standard non-i.i.d. FL benchmark [80] using the DomainNet dataset (refer to experimental details in Section 5.4) and study how the budget constraint impedes effective training of customizable models.

Case 1: FL without budget constraint. First, we individually train networks of different widths by FedAvg. We observe that the slimmer networks converge more slowly and are less generalizable, even though they are trained on all clients (solid lines in Fig. 4.2).

Figure 4.2: Convergence of different-width models on DomainNet.
The results justify the motivation for training wider networks rather than slimmer ones.

Case 2: FL with budget constraint. Under the above budget constraint, i.e., $R_k = (1/2)^{\lceil 4k/K \rceil}$, we deploy HeteroFL to train budget-compatible prototype models locally. Since HeteroFL was not designed for model customization, data are not fully used: each model prototype is trained by only 1/4 of the clients, even though clients with the $R_k = 1$ budget can actually afford all slimmer models. Therefore, we extend HeteroFL to a slimmable version (SHeteroFL) by training all affordable prototypes locally. As shown in Fig. 4.2, wider models (e.g., the ×1 net) converge to validation accuracy lower than not only the FedAvg counterparts but also the slimmer ×0.25 net, showing that the widest model may not be a good candidate model. Therefore, switching to wider configurations lowers efficiency without improving accuracy, which is not a valid customization. From the perspective of data allocation, it is not surprising that SHeteroFL exhibits a non-monotonic relation between model width and accuracy. For ×1 nets, only 1/4 of the clients and data are accessible for training. In comparison, 3/4 of the data are accessible to ×0.25 nets, and the additional data gives ×0.25 nets better generalization than ×1 nets.

4.4.2 Customize Model Size

Motivated by the above observations, we propose to increase the accessible training data by splitting wide networks into universally budget-compatible sub-networks and re-mixing them afterward. The overall FL algorithm is summarized in Algorithm 4.1.

More accessible data by splitting wide networks. Since a wide network cannot fit into budget-insufficient clients, we split it into budget-compatible sub-networks by channels (width) while maintaining the total width. Within memory limits, each sub-network can be painlessly and individually trained on all clients. However, sequentially training multiple slim base networks could be much slower than training a single integrated one and increases blocking time on communication. Noticing that all base models can be evaluated independently, we instead efficiently train the ×r base models in parallel (see Algorithm 4.3). Despite this benefit, a client's budget constraint ($R$) limits the number of parallel base models to $\lfloor R/r \rfloor$, as wider channels result in a larger intermediate layer cache (activations), and excludes the remaining base models from that client. Fortunately, we can select different sets of base models for a budget-limited client per round, which is inspired by FedEnsemble [125] and is presented in Algorithm 4.2. Hence, every base model is eventually trained on the client over multiple communication rounds, though not continuously in every round. Note that since the federated training processes of the base models are independent and do not interfere, the training can be as stable as FedAvg with partial participation. In addition, the combination of slimmer base models is flexible and can conform to a variety of client budgets.

Boost accuracy by a mixture of subnet experts. To craft a wide model, we combine the outputs of multiple ×r base models until the size of the ensemble matches the desired number of channels, e.g., $\lfloor R/r \rfloor$ bases for an ×R net. We randomly initialize base models independently such that the diverse bases can extract different features from the same image [6]. Therefore, the ensemble can predict based on a variety of features, resembling an integrated wide network.
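For concreteness, the following Python sketch illustrates the spirit of the base-model selection described above (formalized in Algorithm 4.2 below) and the output averaging that forms an ×R ensemble from ⌊R/r⌋ base models. The names `base_models`, `order`, and `pointer` are illustrative assumptions, not the thesis implementation.

```python
import random

def sample_base_models(order, pointer, num_bases):
    """Pick num_bases base-model indexes, always including the next one in the shuffled order."""
    if pointer >= len(order):
        random.shuffle(order)   # re-shuffle once every base model has taken its turn
        pointer = 0
    chosen = {order[pointer]}
    remaining = [i for i in order if i not in chosen]
    chosen.update(random.sample(remaining, num_bases - 1))
    return sorted(chosen), pointer + 1

def ensemble_predict(base_models, x, width_ratio, base_ratio=0.125):
    """Average the logits of the first floor(R/r) base models to emulate an xR net."""
    k = int(width_ratio / base_ratio)
    outputs = [f(x) for f in base_models[:k]]
    return sum(outputs) / len(outputs)
```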
We follow the common practice, Kaiming's method [50], for initializing base networks with ReLU layers. As Kaiming's method is width-dependent, we parameterize the initialization based on the width of the ×1 net instead of the ×r one, which leads to smaller initial convolutional weights. In Section C.1.3, intensive ablation studies show that the rescaled initialization is critical for improving the test accuracy of wider networks.

Algorithm 4.1: Federated Split-Mix Learning
Input: Client datasets $\{D_k\}_{k=1}^K$, the number of total communication rounds $T$, $M = \lfloor 1/r \rfloor$ randomly initialized base ×r nets parameterized by $\{w_{i,r}\}_{i=1}^M$, client budgets $\{R_k\}_{k=1}^K$, the number of local epochs $E$, learning rates $\{\eta_t\}_{t=1}^T$
1: Initialize $\{w_{i,r}^0\}_{i=1}^M$, model indexes $P = \mathrm{Shuffle}([1, \cdots, M])$ and current index $p = 1$
2: for round $t \in \{1, \cdots, T\}$ do
3:   Initialize $W^t \leftarrow \{w_{i,r}^t \leftarrow 0\}_{i=1}^M$ and aggregation weights $c_i = 0$ for $i \in \{1, \ldots, M\}$
4:   for $k \in \{1, \cdots, K\}$ do
5:     Sample base models $W_k^{t-1}, p \leftarrow \mathrm{SampleBaseModels}(P, p, \lfloor R_k/r \rfloor, W^{t-1})$
6:     Send $W_k^{t-1}$ to client $k$ and train $\hat{W}_k^t \leftarrow \mathrm{LocalTrain}(W_k^{t-1}, D_k, E, \eta_t)$
7:     Aggregate $\hat{w}_{i,r}^{t,k} \in \hat{W}_k^t$ to the server: $w_{i,r}^t \leftarrow w_{i,r}^t + \hat{w}_{i,r}^{t,k} |D_k|$, $c_i \leftarrow c_i + |D_k|$
8:   Server update $W^t \leftarrow \{w_{i,r}^t \leftarrow w_{i,r}^t / c_i\}_{i=1}^M$
9: (Optional) Sort $W^T = \{w_{i,r}^T\}_{i=1}^M$ in descending order of the validation accuracy of $w_{i,r}^T$
10: Output the customizable model $W^T = \{w_{i,r}^T\}_{i=1}^M$
11: Customize an R-width model by $F(x; W^T) = \frac{1}{K_R} \sum_{i=1}^{K_R} f(x; w_{i,r}^T)$, where $K_R = \lfloor R/r \rfloor$

Algorithm 4.2: SampleBaseModels($P, p, n, W$)
1: if $p > |P|$ then Shuffle $P$ and $p \leftarrow 1$
2: Initialize $W = \{w_{P[p],r}\}$
3: if $n > 1$ then
4:   Uniformly sample $n-1$ values into $S$ from $P \setminus \{P[p]\}$ without replacement
5:   $W \leftarrow W \cup \{w_{i,r}, \forall i \in S\}$
6: Return $W, p+1$

Algorithm 4.3: LocalTrain($W_k, D_k, E, \eta$)
1: Initialize models $\hat{W}_k$ by $W_k$
2: for $e \in \{1, \cdots, E\}$ do
3:   for batch data $(x, y)$ in $D_k$ do
4:     for $\hat{w}_{i,r} \in \hat{W}_k$ in parallel do
5:       $\hat{w}_{i,r} \leftarrow \hat{w}_{i,r} - \eta \, \partial L(f(x; \hat{w}_{i,r}), y) / \partial \hat{w}_{i,r}$
6: Return $\hat{W}_k$

Figure 4.3: Illustration of dual batch-normalization (DBN) in training and inference. The BN$_c$ and BN$_n$ are for clean and noised samples, respectively.

4.4.3 Extension to Adversarial Robustness Customization

In this section, we extend the customization from one dimension to two dimensions, by jointly customizing model size and model robustness under adversarial attacks [?]. Model robustness has gained increasing interest [? 141], especially in high-stakes federated learning applications [?]. Adversarial training (AT) [94] is arguably the most popular and effective defense strategy against adversarial attacks. Specifically, it uses on-the-fly adversarial samples as augmentation to improve robustness. Formally, AT minimizes the following augmented loss:

$L(f) = (1 - \lambda_n) L_{CE}(f(x), y) + \lambda_n \max_{\|\delta\|_\infty \le \epsilon} L_{CE}(f(x + \delta), y)$,    (4.1)

where $\delta$ is a subtle $\epsilon$-constrained adversarial noise that transfers a clean sample $x$ into an adversarial sample $x + \delta$. In Eq. (4.1), $L_{CE}$ is the cross-entropy loss, and the hyper-parameter $\lambda_n$ trades off accuracy (the first term) and robustness (the second term). When $\lambda_n = 0$ or $1$, the optimization yields a standard-training (ST) model or an AT model, respectively.
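A minimal PyTorch sketch of the adversarial training objective in Eq. (4.1), with the inner maximization approximated by PGD, is given below. It is illustrative only (not the thesis code); `model`, `eps`, `alpha`, and `steps` are assumed hyper-parameters, and the caller is expected to zero the model gradients before its optimizer step.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Approximate the inner max of Eq. (4.1) with projected gradient descent."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)      # keep the L-infinity constraint
        delta.grad.zero_()
    return delta.detach()

def at_loss(model, x, y, lambda_n=0.5):
    """(1 - lambda_n) * clean cross-entropy + lambda_n * adversarial cross-entropy."""
    clean = F.cross_entropy(model(x), y)
    delta = pgd_attack(model, x, y)
    robust = F.cross_entropy(model(x + delta), y)
    return (1 - lambda_n) * clean + lambda_n * robust
```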
Since an AT model is commonly less accurate on standard images [132], there is usually no sweet spot of $\lambda_n$ that simultaneously maximizes robustness and accuracy, and one typically needs to carefully gauge the trade-off according to the demand for robustness in the specific application context.

Splitting and sharing parameters. Since standard performance and adversarial robustness are irreconcilable, we could directly use two separate ST and AT models to maximally capture each property. But do we really need two totally separate models? Intuitively, the two models share some common knowledge, given that an adversarial image shares a large portion of features with its original version. As introduced by [151], sharing all parameters except the batch-normalization (BN) layers can maximize robustness and accuracy, respectively, through the specialized BNs. Accordingly, we propose to split the BN layers (instead of the whole model) into two sub-components: one for standard performance and the other for robustness. At training time, the first loss in Eq. (4.1) is computed with the clean BN (BN$_c$) only, and the second, adversarial loss is computed with the noised BN (BN$_n$). The FL local training is elaborated in Algorithm C.1. As the two BNs are decoupled, there is no longer a trade-off in the loss of Eq. (4.1), and we therefore choose $\lambda_n = 0.5$ to balance their effects on parameter updates.

Customizable layer-wise mixing. After training, the question is how to mix the two models (with different BNs) for prediction. A straightforward solution is to average their outputs. However, the forward memory footprint would be doubled in this way, as the two models have to be executed separately. To avoid the doubled intermediate outputs, a gate function can be used to adaptively choose BNs, as in [86]. Inspired by that method, we further propose a simple parameter-free alternative that weight-averages the outputs of each BN layer (see Fig. 4.3):

DBN(x) = (1 − λ) BN$_c$(x) + λ BN$_n$(x),    (4.2)

given a BN-layer input $x$. We note that this averaging strategy is entirely training-free and uses no extra parameters, and the customization weight $\lambda$ provides an intuitive trade-off between standard accuracy (SA) and robust accuracy (RA). Lastly, it is worth noting that the DBN structure is rather lightweight in terms of model complexity. As investigated in [158] (Table 2), BN parameters account for no more than 1% of popular deep architectures, e.g., ResNet or MobileNet. Therefore, we can plug DBN into base models in place of BN, replace Algorithm 4.3 with Algorithm C.1, and jointly customize robustness and model widths.

4.5 Empirical Studies

We design experiments to compare the proposed method against FL classification benchmarks. For the class non-i.i.d. configuration, we use the CIFAR10 dataset [73] with a pre-activated ResNet (PreResNet18) [51]. CIFAR10 contains over 50,000 32×32 images of 10 classes. The CIFAR10 data are uniformly split into 100 clients with 3 classes per client. For the (feature) non-i.i.d. configuration, we use the Digits dataset with a CNN defined in [80] and the DomainNet dataset [80] with AlexNet extended with BN layers after each convolutional or linear layer [80]. The first dataset is a subset (30%) of Digits, a benchmark for domain adaptation [111]. Digits has 28 × 28 images and serves as a commonly used benchmark for FL [19, 96, 78]. The dataset includes 5 different domains: MNIST [74], SVHN [102], USPS [62], SynthDigits [46], and MNIST-M [46].
The second dataset is DomainNet [110] processed by [80], which contains 6 distinct domains of large 256 × 256 real-world images: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. Each domain of Digits (or DomainNet) is split into 10 (or 5) clients, yielding 50 (or 30) clients in total. We defer other details such as hyper-parameters to Section D.2 and focus on discussing the results.

4.5.1 Customize model sizes

In this section, we evaluate the proposed Split-Mix on the task of customizing model sizes by adjusting model widths. Recall that in Section 4.3, we assume a specific heterogeneous training budget to facilitate our discussion, such that one client can only train models up to a maximal width, and the resource distribution is imbalanced among clients: $R_k = (1/2)^{\lceil 4k/K \rceil}$. Following the common FL setting, we do not consider algorithms that use public data or representation sharing in our baselines.

Baselines. As an ideal upper bound, but a memory-incompatible baseline, we (re-)train networks from scratch by FedAvg to obtain individual models of different widths. The state-of-the-art heterogeneous-architecture FL method is HeteroFL [28], which trains different slim models in different clients. For a fair comparison, we extend HeteroFL with bounded-slimmable training in clients that can afford the larger models, named SHeteroFL. For example, if a client in HeteroFL can afford the ×0.5 net, then under SHeteroFL the client also trains the ×0.25 net and other smaller subnets in the slimmable training manner [158].

Figure 4.4: Validation accuracy of budget-compatible full-width nets by iterations.

Convergence. In Fig. 4.4, we compare the convergence curves of full-width models on three datasets. For FedAvg, the budget-compatible width is ×0.125. For both SHeteroFL and Split-Mix, the full width is ×1. Compared to the baselines, the proposed Split-Mix converges faster, largely due to the splitting strategy. Specifically, because all base models are trained independently, the convergence of each base model mainly depends on how frequently it is trained. When enough clients participate in FL, the training frequency of the ×1 net exceeds one, as all bases will be selected at least once in a communication round.

Performance. In Table 4.1, we compare the test accuracy of different model widths. We measure latency in terms of MACs (the number of multiplication-and-addition operations) and model size in the number of parameters. With the same width, our method outperforms SHeteroFL while using far fewer parameters, and thus conducts inference with much lower latency. Remarkably, compared to the best model by SHeteroFL, e.g., the 81.8% ×1 net in CIFAR10, Split-Mix uses only 1.8% of the parameters and 1.6% of the MACs (×0.125 net) to reach a similar level of test accuracy. We notice that SHeteroFL has a much lower test accuracy with the ×0.125 net on the CIFAR10 dataset. By investigating the loss curves, we find that the interference among parameter-shared prototypes results in unstable convergence of the ×0.125 net. Thanks to the independent splitting, our method converges more stably across different widths.
Remarkably, Split-Mix only requires 12.7% of the parameters and 19.8% of the MACs (with the ×1 ensemble) to achieve accuracy comparable to the ×1 individual network on Digits. Results on CIFAR10 and DomainNet show a potential limitation of our method: the accuracy of Split-Mix is not comparable with the wider individual models due to the locally limited complexity. However, this limitation stems from the problem setting itself and does not undermine our advantage in budget-limited FL, since wide individual models cannot be trained in this case.

Table 4.1: Test results of customizing model width. MACs and the number of parameters are counted at inference time. Grey text indicates that the training cannot conform to the predefined budget constraint. The 'M' after metric values means ×10^6.

                 Individual FedAvg          SHeteroFL                  Split-Mix (ours)
  width     Acc     MACs    #Params    Acc     MACs    #Params    Acc     MACs    #Params
  CIFAR10 class non-i.i.d. FL
  ×0.125    43.4%   0.9M    0.2M       49.1%   0.9M    0.2M       48.0%   0.9M    0.2M
  ×0.25     45.7%   3.5M    0.7M       51.4%   3.5M    0.7M       51.1%   1.8M    0.4M
  ×0.5      50.3%   14.0M   2.8M       51.5%   14.0M   2.8M       52.1%   3.6M    0.7M
  ×1        53.3%   55.7M   11.2M      49.9%   55.7M   11.2M      52.7%   7.2M    1.4M
  Digits feature non-i.i.d. FL
  ×0.125    86.1%   0.1M    0.2M       86.8%   0.1M    0.2M       84.6%   0.1M    0.2M
  ×0.25     87.3%   0.4M    0.9M       87.9%   0.4M    0.9M       87.5%   0.2M    0.4M
  ×0.5      88.7%   1.3M    3.6M       86.9%   1.3M    3.6M       89.0%   0.5M    0.9M
  ×1        89.6%   4.8M    14.2M      81.3%   4.8M    14.2M      89.8%   0.9M    1.8M
  DomainNet feature non-i.i.d. FL
  ×0.125    67.2%   2.5M    0.9M       66.9%   2.5M    0.9M       68.4%   2.5M    0.9M
  ×0.25     69.8%   7.5M    3.6M       67.8%   7.5M    3.6M       71.9%   5.0M    1.8M
  ×0.5      72.6%   25.5M   14.3M      66.9%   25.5M   14.3M      73.0%   9.9M    3.6M
  ×1        72.9%   92.5M   57.1M      58.7%   92.5M   57.1M      74.2%   19.8M   7.2M

Client-wise evaluation. In addition to comparisons of average inference performance, we also report the statistics of test accuracy and of training and communication efficiency in Fig. 4.5. Conforming to the budget constraints, our method produces more accurate full-width models, transfers fewer parameters per round, and executes fewer multiplication-and-add operations for gradient descent than SHeteroFL, both in terms of average and variance. Because only the individual ×0.125 net can fit into the budget constraint, FedAvg requires the fewest MACs and parameters, which however significantly sacrifices the final accuracy.

Figure 4.5: Client-wise statistics of test accuracy, and of training and communication efficiency under budget constraints. The MACs quantify the complexity of one batch optimization in a client, and the number of parameters per round counts those uploaded to (or downloaded from) the server. Test accuracy is measured with the full-width networks. The results of FedAvg are from budget-compatible ×0.125 nets.

Domain-wise evaluation. To understand why Split-Mix outperforms SHeteroFL, we investigate the total percentage of parameters that can be trained in each domain, shown in Fig. 4.6 (left).
We count the total parameters that were ever trained on the clients of a domain during learning. Thanks to the base-model sampling strategy (i.e., Algorithm 4.2), Split-Mix allows all base models, rather than a subset, to be trained on all clients. On the other hand, varying client budgets greatly impact SHeteroFL by limiting the width of models trained in budget-insufficient clients, e.g., in the Clipart, Infograph, and Painting domains. Hence, SHeteroFL leaves a large number of parameters under-trained and suffers from larger accuracy losses in these three domains and for wider models. In comparison, Split-Mix not only has smaller accuracy drops but is also more stable across all domains.

Figure 4.6: Per domain in DomainNet, the total percentage of parameters that are locally trained (left panel) and the accuracy (%) drops compared to FedAvg individual models (right two panels).

4.5.2 Customize robustness

Training and evaluation. For local AT, we use an n-step projected gradient descent (PGD) attack [94] with a constant noise magnitude $\epsilon$. Following [94], we set $(\epsilon, n) = (8/255, 7)$ with an attack inner-loop step size of 2/255 for training, validation, and test. For simplicity, we temporarily relax the budget constraint $R_k$ and let the base model be the ×1 net. For comparison, we extend FedAvg with AT, which yields individual models optimized with different trade-off variables, i.e., $\lambda_n \in \{0, 0.1, 0.2, 0.3, 0.5, 1\}$. We also extend FedAvg with the state-of-the-art in-situ trade-off method, OAT [140], as a baseline. Split-Mix+DAT is the extension of Split-Mix with the proposed DBN-based AT in Algorithm C.1. We evaluate and contrast models on two metrics: standard accuracy (SA) on the clean test samples and robust accuracy (RA) on adversarial images generated from the clean test set by the PGD attack. Both metrics are averaged over users. We evaluate the trade-off effectiveness by comparing the RA-SA curves; a better curve approaches the upper-right corner. To plot the curves for FedAvg+OAT and Split-Mix+DAT, we vary their condition variable $\lambda$ in $\{0, 0.2, 0.5, 0.8, 1\}$.

Figure 4.7: Trade-off between robust accuracy (RA) and standard accuracy (SA) with full width on (a) CIFAR10 and (b) Digits, and with customizable widths on (c) CIFAR10 and (d) DomainNet.

Trade-off curves are presented in Fig. 4.7 (a) and (b). Because the naive extension of OAT to FedAvg adopts heterogeneous objectives and over-parameterized conditional layers, it suffers from convergence issues: its adversarial training does not converge, and RA is remarkably poor in some cases. As a result, its trade-off curve is not smooth. In contrast, the proposed Split-Mix+DAT method yields a smoother trade-off curve without heavy conditional training or over-parameterization. By training in one pass, Split-Mix+DAT even outperforms, and is more efficient than, the FedAvg+AT baselines.
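For reference, the λ-controlled customization evaluated here only requires the layer-wise mixing of Eq. (4.2). A minimal PyTorch sketch of such a dual-BN layer is shown below; it is an illustrative stand-in rather than the thesis implementation, and assumes 2D feature maps.

```python
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """Eq. (4.2): DBN(x) = (1 - lambda) * BN_c(x) + lambda * BN_n(x)."""
    def __init__(self, num_channels):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_channels)    # BN_c: clean statistics
        self.bn_noised = nn.BatchNorm2d(num_channels)   # BN_n: adversarial statistics
        self.lambda_ = 0.0                              # set at test time: 0 favors SA, 1 favors RA

    def forward(self, x):
        return (1 - self.lambda_) * self.bn_clean(x) + self.lambda_ * self.bn_noised(x)
```

During training, the clean and adversarial losses would route through `bn_clean` and `bn_noised` separately; the weighted mixing is only used for run-time customization.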
Joint customization of width and robustness under budget constraints. We now consider width customization under the same training budgets as in Section 4.5.1. Due to the constraint, FedAvg can only train the ×0.125 net. We omit OAT because of its unstable convergence and use SplitMixDAT as a short name for Split-Mix+DAT. In Fig. 4.7 (c) and (d), the trade-off curves with different widths are depicted. As the width increases, both RA and SA of SplitMixDAT improve while being smoothly traded off. The results demonstrate the flexibility and the modular nature of our method.

CHAPTER 5
OUTSOURCING TRAINING WITHOUT UPLOADING DATA

As deep learning blooms with growing demands for computation and data resources, outsourcing model training to a powerful cloud server becomes an attractive alternative to training on a low-power and cost-effective end device. Traditional outsourcing requires uploading device data to the cloud server, which can be infeasible in many real-world applications due to the often sensitive nature of the collected data and the limited communication bandwidth. To tackle these challenges, we propose to leverage widely available open-source data, a massive collection gathered from public and heterogeneous sources (e.g., Internet images). We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training, in lieu of client data. ECOS probes open-source data on the cloud server to sense the distribution of client data via a communication- and computation-efficient sampling process, which only communicates a few compressed public features and client scalar responses. Extensive empirical studies show that the proposed ECOS improves the quality of automated client labeling, model compression, and label outsourcing when applied in various learning scenarios.

5.1 Introduction

Nowadays, powerful machine learning services are essential in many devices that support our daily routines. Delivering such services is typically done through client devices that are power-efficient and thus very restricted in computing capacity. The client devices can collect data through built-in sensors and make predictions with machine learning models. However, their limited computing power often makes local training prohibitive, especially for high-capacity deep models. One widely adopted solution is to outsource the cumbersome training to cloud servers equipped with massive computational power, using machine-learning-as-a-service (MLaaS). Amazon SageMaker [83], Google ML Engine [15], and Microsoft Azure ML Studio [10] are among the most successful industrial adoptions, where users upload training data to designated cloud storage, and the optimized machine learning engines then handle the training. One major challenge of the outsourcing solution in many applications is that the locally collected data are sensitive and protected by regulations that prohibit data sharing. Notable examples include the General Data Protection Regulation (GDPR) [1] and the Health Insurance Portability and Accountability Act (HIPAA) [4].
On the other hand, recent years have witnessed a surging number of massive, general-purpose datasets authorized for public use, such as ImageNet [26], CelebA [90], and MIMIC [67]. Moreover, many task-specific datasets used by local clients can be well considered as biased subsets of these large public datasets [110, 80]. Therefore, the availability of these datasets allows us to use them to model confidential local data, facilitating training outsourcing without directly sharing the local data. One approach is to use the private client dataset to craft pseudo labels for a public dataset in a confidential manner [168, 109], assuming that the public and local data are independently and identically distributed (iid). In addition, Alon et al. showed that iid public data can strongly supplement client learning, which greatly reduces the private sample complexity [7]. However, the iid assumption is often too strong for general-purpose open-source datasets, since they are usually collected from heterogeneous sources with distributional biases from varying environments. For example, an online search for 'digits' yields digit images ranging from handwriting scans and photos to artwork.

In this paper, we relax the iid assumption in training outsourcing and instead consider the availability of an open-source dataset. We first study the gap between iid data and heterogeneous open-source data in training outsourcing and show the low sample efficiency of open-source data: to effectively train a model from open-source data that transfers to the client data, more samples need to be communicated than with iid data. The main reason behind this low sample efficiency is that out-of-distribution (OoD) samples are accidentally included, which poison the training and significantly degrade accuracy on the target (client) data distribution [14]. We propose a novel framework called Efficient Collaborative Open-source Sampling (ECOS) to tackle this challenge, which filters the open-source dataset through an efficient collaboration between the client and server and does not require client data to be shared. During the collaboration, the server sends compressed representative features (centroids) of the open-source dataset to the client. The client then identifies and excludes OoD centroids and returns their privately computed categorical scores to the server. The server then adaptively and diversely decompresses the neighbors of the selected centroids. The main idea is illustrated in Fig. 5.1.

Figure 5.1: Illustration of the proposed ECOS framework. Instead of uploading local data for cloud training, ECOS downloads the centroids of clustered open-source features to efficiently sense the client distribution, where the client counts the local neighbor samples of the centroids as the coverage score. Based on the scores of the centroids, the server adaptively samples proximal and diverse data for training a transferable model on the cloud.

Our major contributions are summarized as follows:

• New problem and insight: Motivated by the strong demand for efficient and confidential outsourcing, using public data in place of the client data is an attractive solution. However, the impact of the heterogeneous sources of such public data, namely open-source data, is rarely studied in existing works. Our empirical study shows the potential challenges due to such heterogeneity.
• New sampling paradigm: We propose a new unified sampling paradigm, in which the server sends only a few query data points to the client and requests only a few responses that efficiently and privately guide the cloud across various learning settings on open-source data. To the best of our knowledge, our method enables efficient cloud outsourcing under the most practical assumption of open-source public data, and it requires neither access to raw client data nor cumbersome local training.

• Compelling results: In all three practical learning scenarios, our method improves model accuracy under pseudo, manual, or pre-trained supervision. Besides, our method shows competitive efficiency in terms of both communication and computation.

5.2 Related Work

A series of efforts has studied how to leverage the data and computation resources on the cloud to assist client model training, especially when client data cannot be shared [156, 142]. We categorize them as follows. 1) Feature sharing: Methods like group knowledge transfer [49], split learning [136], and domain adaptation [32, 30] transfer edge knowledge by communicating features extracted by networks. To provide a theoretical guarantee of privacy protection, [105] proposed an advanced information-removal technique to disentangle sensitive attributes from shared features. Under a rigorous privacy definition, Liu et al. leverage public data to assist private information release [89]. Earlier, data encryption was used for outsourcing, which however is too computation-intensive for a client and less applicable to large-scale data and deep networks [21, 76]. Federated Learning (FL) considers the same constraint on data sharing, but the private knowledge is shared via models trained locally [97]. 2) Private labeling: PATE and its variants were proposed to generate client-approximated labels for unlabeled public data, on which a model can be trained [108, 109]. Without training multiple models on clients, Private kNN is a more efficient alternative that explores the private neighborhood of public images for labeling [168]. These approaches rely on the strong assumption that public data is available and iid with the local data. This paper considers a more practical yet challenging setting where public data come from multiple agnostic sources with heterogeneous features.

Sampling from public data has been explored in centralized settings. For example, Xu et al. [154] used a few target-domain samples as a seed dataset to filter open-domain datasets by positive-unlabeled learning [91]. In self-supervised contrastive learning, model-aware K-center (MAK) used a model pre-trained on the seed dataset to find desired-class samples from an open-world dataset [65]. Though these methods provide effective sampling, they are less applicable when the seed dataset resides at the low-energy edge, because the private seed data at the edge cannot be shared with the cloud for filtering and the edge device is incapable of computation-intensive training. To address these challenges, we develop a new sampling strategy that requires only lightweight computation at the edge.

5.3 Outsourcing Model Training With Open-Source Data

5.3.1 Problem Setting and Challenges

As motivated in Section 5.1, we aim to outsource the training process from computation-constrained devices to a powerful cloud server, where a proxy public dataset without privacy concerns is used in place of the client dataset for cloud training.
One solution is (private) client labeling by k-nearest-neighbors (kNN) [168], where the client and cloud server communicate the pseudo-labels of a public dataset privately, and the server trains a classifier on the labeled and unlabeled samples in a semi-supervised manner. The success of this strategy depends on the key assumption that the public data in the cloud and the private data on the client are iid, which is rather strong in practice and thus prevents its use in many real-world applications. In this work, we make the more realistic assumption that the public datasets are as accessible as open-source data. The open-source data consists of biased features from multiple heterogeneous sources (feature domains), and therefore includes not only in-distribution (ID) samples similar to the client data but also multi-domain OoD samples.

The immediate question is how the OoD samples affect the outsourced training. In Table 5.1, we empirically study the problem using a 5-domain dataset, Digits, where 50% of one domain is used on the client and the remaining 50%, together with the other 4 domains, serves as the public dataset on the cloud. To conduct the cloud training, we leverage the client data to generate pseudo labels for the unlabeled public samples, following [168]. It turns out that the presence of OoD samples in the cloud greatly degrades the model accuracy. The inherent reason for the degradation is that the distributional shift of the data [117] compromises the transferability of the model to the client data [146].

Table 5.1: Test accuracy (%) with different client domains (columns). Cloud data are either identically distributed as the client data (ID) or include additional data from 5 distinct domains (ID+OoD) without overlapping samples. We first label a number of randomly selected cloud examples (i.e., the sampling budget) privately with client data [168], and then train a classifier to recognize digit images. The privacy cost ε is accounted for in the notion of differential privacy. Larger budgets imply higher privacy and communication costs. More results on different settings are enclosed in Section D.2.3.

  Cloud    Sampling   MNIST          SVHN           USPS           SynthDigits    MNIST-M        Average
  Data     Budget     Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓
  ID       1000       84.3±2.4/4.48  51.6±1.4/4.08  87.1±0.5/4.51  73.2±1.5/4.57  55.5±1.0/4.46  70.4/4.42
  ID+OoD   1000       78.0±3.5/4.30  40.6±1.6/3.75  82.2±2.7/4.32  62.1±1.6/4.41  49.1±1.0/4.27  62.4/4.21
  ID+OoD   8000       82.2±4.1/5.89  47.9±1.8/5.89  85.4±0.5/5.89  64.4±3.6/5.89  53.3±2.2/5.89  66.6/5.89
  ID+OoD   16000      82.6±1.4/7.17  48.5±1.7/7.17  86.7±1.9/7.17  67.5±2.3/7.17  52.0±3.0/7.17  67.4/7.17
  ID+OoD   32000      84.1±1.6/9.32  49.4±0.2/9.32  86.8±2.0/9.32  68.5±0.1/9.32  53.0±2.7/9.32  68.4/9.32

Problem formulation by sampling principles. Given a client dataset $D_p$ and an open-source dataset $D_q$, the goal of open-source sampling is to find a proper subset $S$ from $D_q$ whose distribution matches $D_p$. In [65], Model-Aware K-center (MAK) formulated the sampling as a principled optimization:

$\min_{S \subset D_q} \Delta(S, D_p) - H(S \cup D_p; D_q)$,    (5.1)

where $\Delta(S, D_p) := \mathbb{E}_{x' \in D_p}[\min_{x \in S} \|x - x'\|_2]$ measures proximity as the set difference between $S$ and $D_p$, and $H(S \cup D_p; D_q) := \max_{x' \in D_q} \min_{x \in S \cup D_p} \|x - x'\|_2$ measures diversity by contrasting $S \cup D_p$ against $D_q$ (supposing $D_q$ is the most diverse set). Solving Eq. (5.1) is an NP-hard, intractable problem [23], and MAK provides an approximate solution by a coordinate-wise greedy strategy.
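For reference, the proximity and diversity measures in Eq. (5.1) can be written compactly over feature matrices (rows are samples). The sketch below is purely illustrative and not part of MAK or ECOS.

```python
import numpy as np

def pairwise_dist(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def proximity(S, Dp):
    """Delta(S, D_p): mean distance from each client sample to its nearest selected sample."""
    return pairwise_dist(Dp, S).min(axis=1).mean()

def diversity(S, Dp, Dq):
    """H(S u D_p; D_q): worst-case distance from open-source samples to the selected pool."""
    pool = np.concatenate([S, Dp], axis=0)
    return pairwise_dist(Dq, pool).min(axis=1).max()
```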
MAK first pre-trains the model representations on $D_p$ and finds a large candidate set with the best proximity to the extracted features. Then, it incrementally selects the most diverse samples from the candidate set until the sampling budget is used up. Though MAK is successful in the centralized setting, it is not applicable when $D_p$ is isolated from the cloud open-source data and resides at a resource-constrained client, for two reasons. 1) Communication inefficiency. Uploading client data may result in privacy leakage; sending public data to the client is a direct alternative, but the cost can be prohibitive. 2) Computation inefficiency. Pre-training a model on $D_p$ or proximal sampling (which computes the distances between paired samples from $D_q$ and $D_p$) induces unaffordable computation overheads for the low-energy client.

5.3.2 Proposed Solution: Efficient Collaborative Open-Source Sampling (ECOS)

To address the above challenges, we design a new strategy that 1) uses compressed queries to reduce the communication and computation overhead and 2) uses a novel principled objective to effectively sample from open-source data with the client's responses to the compressed queries.

Construct a communication-efficient and informative query set $\hat{D}_q$ at the cloud. Let $d$ be the number of pixels of an image; the communication overhead of transmitting $D_q$ to the client is $O(d|D_q|)$. For communication efficiency, we optimize the following two factors. i) Data dimension $d$. First, we transmit extracted features instead of images to reduce the communication overhead to $O(d_e|D_q|)$, where $d_e$ is a much smaller embedding dimension. For accurate estimation of the distance $\Delta$, a discriminative feature space is essential. Depending on resources, one may consider hand-crafted features such as HOG [25], or an off-the-shelf pre-trained model such as a ResNet pre-trained on ImageNet. ii) Data size $|D_q|$. Even with compression, sending all data for querying is inefficient due to the large size of the open-source data $|D_q|$. Meanwhile, too many queries would incur unacceptable privacy costs for the client. As querying similar samples yields redundant information, we propose to reduce such redundancy by selecting informative samples. We use the classic clustering method KMeans [44] to compress similar samples by clustering them, and collect the $R$ mean vectors, or centroids, into $\hat{D}_q = \{c_r\}_{r=1}^R$. We denote $R$ as the compression size. The selected centroids can later be decompressed via the cluster assignment into the corresponding original samples with rich features.

New sampling objective. We propose to optimize a communication-efficient surrogate loss:

$\min_{\hat{S} \subset \hat{D}_q, S \subset D_q} \underbrace{\Delta(\hat{S}, \hat{D}_p) + \Delta(\hat{S}, S)}_{\text{proximity}} - \underbrace{H(S; D_q)}_{\text{diversity}}$,    (5.2)

where $\hat{S}$ (or $\hat{D}_q$) is the compressed centroid substitute of $S$ (or $D_q$), and $\hat{D}_p$ consists of the features of $D_p$. Different from Eq. (5.1), we decompose the proximity term into two parts in order to facilitate communication efficiency by leveraging $\hat{D}_q$. We solve the optimization problem in a greedy manner through two steps, at the client and the cloud, respectively. i) At the client step, we optimize $\Delta(\hat{S}, \hat{D}_p)$ to find a set of centroids ($\hat{S} \subset \hat{D}_q$) that are proximal to the client set $\hat{D}_p$. Noticing that $\hat{D}_q$ contains the cluster centroids, we take advantage of this property to define a novel proximity measure for cluster $r$: Centroid Coverage (CC), denoted as $v_r$.
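Before turning to the client-side computation, the cloud-side construction of the compressed query set described above can be sketched as follows. The feature extractor is assumed to be an off-the-shelf pre-trained backbone; the snippet is illustrative rather than the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_query_set(open_source_features, R=100, seed=0):
    """Compress open-source features into R centroids (the query set D_hat_q)."""
    km = KMeans(n_clusters=R, random_state=seed, n_init=10).fit(open_source_features)
    return km.cluster_centers_, km.labels_

# Example with stand-in features of embedding dimension d_e = 512.
features = np.random.randn(5000, 512)
centroids, cluster_of = build_query_set(features, R=100)
```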
Upon receiving the centroids from the cloud, the client uses them to partition the local data into $\{C_r^p\}_{r=1}^R$, where $C_r^p$ denotes the $r$-th cluster partition of the local data. We compute the CC score as the cardinality of the neighbor samples of centroid $r$, i.e., $v_r = |C_r^p|$. To augment the sensitivity to proximal clusters, we scale the CC score by a non-linear function, $v_r' = \psi_s(|C_r^p|)$, where the scale function $\psi_s(x) = x^s$ is parameterized by $s$. ii) At the cloud step, we optimize the proximity of $S$ w.r.t. the proxy set $\hat{S}$, i.e., $\Delta(\hat{S}, S)$, and remove redundant and irrelevant samples from the candidate set to encourage diversity, i.e., $-H(S; D_q)$. As samples across clusters are already diversified by KMeans, we only need to promote in-cluster diversity. To this end, we reduce the per-cluster sample redundancy at the cloud with K-Center [122], which heuristically finds the most diverse samples. This design transfers the diversity operation to the cloud and thus reduces the local computation overhead. To maintain proximity, K-Center is applied within each cloud cluster, and the sampling budget per cluster is proportional to the cluster's vote count and its original size. With the normalized scores, we compute the sampling budget per cluster, which is upper bounded by the cluster's ratio in the cloud set.

Algorithm 5.1: Efficient collaborative open-source sampling (ECOS)
Require: Client dataset $D_p$, cloud query dataset $D_q$, sampling budget $B$, compression size $R$, distance function $\Delta(x, S) = \min_{y \in S} \|x - y\|$, feature extractor $\phi(\cdot)$, initial sample set $S = \emptyset$, score scale function $\psi_s(x) = x^s$.
1: Cloud creates a compressed dataset $\hat{D}_q = \mathrm{KMeans}_R(\{\phi(x) \mid x \in D_q\})$   ▷ Compress R centroids
2: . . . Client End . . .
3: Download the feature extractor $\phi$ and $\hat{D}_q$
4: Use the centroids $\hat{D}_q$ to partition $\hat{D}_p = \{\phi(x) \mid x \in D_p\}$ into clusters $\{C_r^p\}_{r=1}^R$
5: Compute the Centroid Coverage (CC) scores: $v_r = |C_r^p|, \forall r \in \{1, \ldots, R\}$
6: Upload the scaled cluster scores $\{v_r' = \psi_s(v_r)\}_{r=1}^R$   ▷ Proximity
7: / / / Cloud End / / /
8: Partition $D_q$ into clusters $\{C_r^q\}_{r=1}^R$ by the centroids $\hat{D}_q$
9: Compute the per-cluster sampling budget $b_r = \min\!\left( \frac{|C_r^q|}{\sum_j |C_j^q|}, \frac{v_r'}{\sum_j v_j'} \right) \cdot B$
10: for $r$ in $\{1, \ldots, R\}$ do   ▷ Decompress centroids
11:   Initialize $S' = \{x\}$ with a random sample from $C_r^q$
12:   while $|S'| < b_r$ do   ▷ Diverse sampling
13:     $u = \arg\max_{x \in C_r^q} \Delta(x, S')$
14:     $S' = \{u\} \cup S'$
15:   $S = S' \cup S$
16: return $S$

We summarize our algorithm in Algorithm 5.1, which allows the client to enjoy better computation efficiency than local training and better communication efficiency than centralized sampling (e.g., MAK). 1) Computation efficiency. Our method only requires inference operations on the client device, which should already be efficient for the device's standard predictive functions, and is training-free for the client; the main cost of ECOS lies in computing the centroid coverage, which is much lower than that of gradient-based algorithms whose complexity scales with the model size and the number of training iterations. As computing the CC scores only requires nearest-centroid estimation and ranking, the filtering can be done efficiently. The total time complexity is $O(C_\phi |D_p| + (d_e + 1)R|D_p|)$, dominated by the first term, where $C_\phi$ is the complexity of extracting features, depending on the specific method. The second term, $(d_e + 1)R|D_p|$, accounts for computing the pairwise distances between $\hat{D}_p$ and $\hat{D}_q$ and estimating the nearest centroid per sample (i.e., partitioning the client data).
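For concreteness, a minimal NumPy sketch of the client scoring and the cloud's per-cluster budget allocation in Algorithm 5.1 is given below; the scale s and noise level sigma are assumed hyper-parameters, and the code is illustrative rather than the exact thesis implementation.

```python
import numpy as np

def client_cc_scores(client_features, centroids, s=1.0, sigma=0.0):
    """Centroid Coverage: count client samples per nearest centroid, scale, and add noise."""
    d = np.linalg.norm(client_features[:, None, :] - centroids[None, :, :], axis=-1)
    counts = np.bincount(d.argmin(axis=1), minlength=len(centroids)).astype(float)
    scores = counts ** s                                         # psi_s(x) = x^s
    return scores + np.random.normal(0.0, sigma, scores.shape)   # Gaussian mechanism

def per_cluster_budgets(scores, cloud_cluster_sizes, total_budget):
    """b_r = min(|C_r^q| / sum_j |C_j^q|, v'_r / sum_j v'_j) * B, as in line 9 of Algorithm 5.1."""
    scores = np.clip(scores, 0.0, None)
    ratio_score = scores / max(scores.sum(), 1e-12)
    ratio_size = cloud_cluster_sizes / cloud_cluster_sizes.sum()
    return np.floor(np.minimum(ratio_score, ratio_size) * total_budget).astype(int)
```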
In a brief comparison, the complexity of local $T$-iteration gradient-descent training would be approximately $O(T C_\phi |D_p|)$, which is much more expensive since typically $T C_\phi \gg d_e$. To complete the analysis, the space complexity is $O(C_\phi' + (d + d_e)|D_p| + d_e R + |D_p| R)$ for the memory footprint of $\phi$, the images ($d$) or features ($d_e$) of the client data and cloud centroids, and the distance matrix. 2) Communication efficiency. The downloading complexity is $O(d_e R)$ for the $R$ $d_e$-dimensional centroid features, and the uploading complexity is $O(R)$, including the indexes of samples. Thus, the data communicated between the client and the cloud is approximately of $O(d_e R + R)$ complexity in total. In comparison, downloading the whole open-source dataset for centralized sampling (e.g., MAK) requires $O(d|D_q| + B)$ complexity. As $R < d_e R \ll d_e |D_q| \ll d|D_q|$, our method significantly scales down the communication cost.

Privacy protection and accounting. If the cloud server is compromised by an attacker, the uploaded CC scores leak private information about the local data samples, for example, the presence of an identity [126]. To mitigate this privacy risk, we protect the uploaded scores with a Gaussian noise mechanism, i.e., $\tilde{v}_r = v_r + \mathcal{N}(0, \sigma^2)$, and account for the privacy cost in the notion of differential privacy (DP) [38]. DP quantifies the numerical influence of the absence of a private sample on $[v_1, \cdots, v_R]$, which is connected to the chance of exposing the sample to the attacker. To obtain a tight bound on the privacy cost, we utilize the tool of Rényi Differential Privacy (RDP) [99] and leverage Poisson sampling to further amplify the privacy [167]. With the noise mechanism governed by $\sigma$, the resulting privacy cost in the sense of $(\epsilon, \delta)$-DP can be accounted as $\epsilon = O(\gamma \sqrt{\log(1/\delta)} / \sigma)$, where $\gamma$ is the Poisson subsampling rate and $\delta$ is a user-specified parameter. A larger $\epsilon$ implies a higher risk of privacy leakage with probability $\delta$. Formal proofs can be found in Section D.1.3.

5.4 Empirical Results

Datasets. We use datasets from two tasks: digit recognition and object recognition. Distinct from prior work [168], in our work the open-source data contains samples outside the client's distribution. Sharing the same classes as the client dataset, the open-source data are assumed to come from different environments and therefore include different feature distributions, for example, DomainNet [110] and Digits [80]. DomainNet includes large 244 × 244 everyday images from 6 domains, and Digits consists of 28 × 28 digit images from 5 domains. Instead of using an overly large volume of data from a single domain as in [168], we tailor each per-domain subset to contain fewer images than the standard digit datasets, for instance, MNIST with 7438 images, a setting previously adopted in distributed learning [80] that mitigates the difficulty of collecting enormous data. In practice, collecting tens of thousands of images at a single client, e.g., 50000 images from the MNIST domain, would be unrealistic. Similarly, DomainNet is tailored to include only 10 classes with 2000-5000 images per domain.
Even without ID data, Digits makes it possible to transfer knowledge across domains.

Baselines. For a fair comparison, we compare our method to baselines with the same sampling budget. Each experimental case is repeated three times with seeds {1, 2, 3}. We account for the privacy cost by Poisson-subsampled RDP [167] and translate the cost into the general privacy notion, (ε, δ)-DP with δ = 10^-5. Here, we use random sampling as a naive baseline. We also adopt a coreset selection method, K-Center [122], to select informative samples within the limited budget. Both baselines are perfectly private, as they do not access private information from clients. Details of hyper-parameters are deferred to Section D.2.1.

5.4.1 Evaluations on Training Outsourcing

To demonstrate the general applicability of ECOS, we present three practical use cases of outsourcing, categorized by the form of supervision. The conceptual illustrations are in Fig. 5.2. For each use case, we train a model on partially labeled cloud data (outsourced training), and the accuracy on the client test set is reported together with standard deviations.

Figure 5.2: Our method is applicable to various cloud training cases, where ECOS filters the open-source samples to improve the performance of models trained with (a) manual, (b) pre-trained model (teacher), and (c) pseudo supervision. Panels: (a) selective manual labeling; (b) adaptive model compression; (c) automated client labeling.

Table 5.2: Test accuracy (%) by selective labeling on Digits (top) and DomainNet (bottom).

  Sampling  Method    MNIST          SVHN           USPS           SynthDigits    MNIST-M        Average
  Budget              Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑
  2000      Ours      97.3±0.1/0.22  68.7±0.3/0.22  90.8±0.1/0.22  84.4±0.6/0.22  70.4±0.6/0.22  82.3
            K-Center  96.7±0.3/0.00  65.1±1.3/0.00  90.1±0.7/0.00  80.2±1.1/0.00  70.1±0.3/0.00  80.4
            Random    96.5±0.3/0.00  64.0±0.3/0.00  91.6±1.0/0.00  83.8±0.3/0.00  70.9±0.6/0.00  81.4
  5000      Ours      98.1±0.2/0.22  74.6±1.0/0.22  93.5±0.3/0.22  91.2±0.4/0.22  74.5±0.5/0.22  86.4
            K-Center  97.9±0.2/0.00  72.3±0.6/0.00  92.7±0.9/0.00  89.6±0.3/0.00  74.0±0.5/0.00  85.3
            Random    97.6±0.3/0.00  70.0±0.3/0.00  93.0±0.6/0.00  89.7±0.4/0.00  73.9±0.7/0.00  84.8

  Sampling  Method    Clipart        Infograph      Painting       Quickdraw      Real           Sketch         Average
  Budget              Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑
  1000      Ours      88.4±1.5/0.58  52.6±0.9/0.58  90.4±1.7/0.58  84.3±1.6/0.58  92.1±1.2/0.58  87.2±0.5/0.58  82.5
            K-Center  86.8±0.3/0.00  50.5±0.9/0.00  89.1±1.4/0.00  27.2±1.8/0.00  92.5±0.1/0.00  85.6±1.4/0.00  72.0
            Random    86.9±0.8/0.00  47.4±2.7/0.00  88.6±0.1/0.00  77.9±2.4/0.00  91.4±0.3/0.00  86.9±0.5/0.00  79.9
  3000      Ours      93.2±0.4/0.58  58.1±0.6/0.58  92.5±1.1/0.58  89.2±0.9/0.58  94.4±0.2/0.58  92.8±0.2/0.58  86.7
            K-Center  93.5±1.1/0.00  56.3±0.3/0.00  92.9±0.3/0.00  60.5±8.7/0.00  94.1±0.2/0.00  92.1±0.7/0.00  81.6
            Random    92.5±0.6/0.00  53.6±1.4/0.00  91.7±0.8/0.00  86.1±0.4/0.00  93.5±0.3/0.00  93.0±0.2/0.00  85.1

We present the results in Tables 5.2 to 5.4 case by case, where the client's domain varies by column. In each column, we highlight the best result unless the difference is not statistically significant.

Case 1: Selective manual labeling.
We assume that the cloud will label the filtered in-domain samples by using a third-party labeling service, e.g., Amazon Mechanical Turk [107], or by asking the end clients for manual labeling. As the selected samples are non-private, they can be freely shared with a third party. But the high cost of manual labeling or labeling services is the pain point, and it should be carefully constrained within a finite budget of demanded labels.

Given a specified sampling budget, we compare the test accuracy of semi-supervised learning (FixMatch) on the sampled data in Table 5.2. Since ECOS tends to select in-distribution samples, it eases the transfer of the cloud-trained model to the client data. On the Digits dataset, we find that our method attains larger accuracy gains as the budget increases, demonstrating that more effective labels are selected. On the DomainNet dataset, our method outperforms the baselines on 5 out of 6 domains, is stable in most domains given a small budget (1000), and is superior on average. Given a larger budget, the accuracy of all methods improves, with our method standing out on the Infograph and Quickdraw domains and being comparable to the baselines on the other domains.

Table 5.3: Evaluation of adaptive model compression on DomainNet by test accuracy, Acc (%), and differential-privacy cost ε.

  Sampling  Method    Clipart        Infograph      Painting       Quickdraw      Real           Sketch         Average
  Budget              Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑
  1000      Ours      82.9±0.7/0.58  48.5±2.7/0.58  85.4±1.0/0.58  81.3±2.5/0.58  91.4±0.6/0.58  82.4±1.1/0.58  78.6
            K-Center  81.2±0.9/0.00  44.6±0.9/0.00  84.5±2.2/0.00  41.7±1.9/0.00  92.4±0.5/0.00  80.1±2.1/0.00  70.8
            Random    83.8±0.7/0.00  44.4±2.1/0.00  83.8±1.6/0.00  76.3±2.2/0.00  90.1±0.6/0.00  80.1±0.7/0.00  76.4
  3000      Ours      90.6±0.6/0.58  51.4±1.9/0.58  89.6±1.2/0.58  87.6±0.3/0.58  93.6±0.7/0.58  88.4±1.5/0.58  83.5
            K-Center  88.9±2.3/0.00  51.2±0.4/0.00  89.4±0.9/0.00  57.6±4.4/0.00  94.5±0.5/0.00  86.9±0.6/0.00  78.1
            Random    88.4±0.8/0.00  47.6±1.9/0.00  89.4±1.0/0.00  84.7±0.2/0.00  93.0±1.0/0.00  86.3±1.2/0.00  81.6

Case 2: Adaptive model compression. Due to the large volume of the open-source dataset, a larger model is desired to better capture the various features; such a model, however, is too inefficient to fit into resource-constrained client devices or to specialize to the client's data distribution. Confronting this challenge, model compression [17] is a conventional approach to forge a memory-efficient model by transferring knowledge from large models to small ones. Specifically, we first pre-train a large teacher model $f_t$ on all cloud data by supervised learning, assuming labels are available in advance. As before, we use an ImageNet-pre-trained model to initialize the feature extractor $\phi(\cdot)$ of a student model $f_s$. The client then uses the downloaded feature extractor $\phi$ to filter samples. Here, we utilize knowledge distillation [52] to fine-tune $f_s$ with an additional classifier head on top of $\phi$. On the selected samples, we train a linear classifier head for 30 epochs under the supervision of the true labels and the teacher model $f_t$, and then fine-tune the full network $f_s$ for 500 epochs. The major challenge comes from the distributional biases between the multi-source open-source data and the client data. Leveraging ECOS, we can sample data near the client distribution and reduce the bias in the follow-up compression process. We simulate this case on the large-image dataset, DomainNet, which demands large-scale networks, e.g., ResNet50 here, to effectively learn the complicated features. Here, we compress ResNet50 into a smaller network, ResNet18, using an adaptively selected subset of the cloud dataset. We omit the experiment for digit images, where a large model may not be necessary for such a small image size.
In Table 5.3, we present the test accuracy of the compressed ResNet18 using 3000 samples from DomainNet for fine-tuning. With a small privacy cost (ε < 0.6), our method improves the accuracy on the Clipart and Real domains against the baselines. Note that the model accuracy here is lower than that of label outsourcing in Table 5.2; the reason is that the supervision from the larger model is only an approximation of the full dataset. Without using the full dataset for compression, the training can be completed quickly and responsively on the demand of a client.

Table 5.4: Test accuracy (%) and privacy cost ε of client labeling on two datasets: Digits (top) and DomainNet (bottom).

  Sampling  Method    MNIST          SVHN           USPS           SynthDigits    MNIST-M        Average
  Budget              Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑
  5000      Ours      84.2±2.3/5.35  47.9±3.1/5.32  86.1±1.0/5.35  68.6±1.6/5.35  58.4±1.9/5.35  69.0
            K-Center  81.9±3.4/5.34  48.4±1.2/5.33  82.1±1.2/5.34  69.4±1.9/5.34  55.4±2.0/5.34  67.4
            Random    81.8±4.1/5.34  45.3±3.0/5.29  81.2±2.3/5.34  65.9±2.7/5.34  55.5±2.6/5.34  65.9

  Sampling  Method    Clipart        Infograph      Painting       Quickdraw      Real           Sketch         Average
  Budget              Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑ / ε↓      Acc↑
  3000      Ours      33.2±5.9/4.46  23.8±2.2/3.50  47.4±3.3/4.51  39.8±7.7/4.87  62.9±1.9/4.92  51.7±1.2/4.94  43.2
            K-Center  39.3±3.6/4.57  18.2±2.7/3.61  46.6±2.5/4.52  36.1±2.9/4.94  63.8±1.9/4.96  47.7±2.8/4.96  42.0
            Random    30.7±2.2/4.41  21.6±4.9/3.43  44.0±3.1/4.53  39.0±5.6/4.82  59.9±3.6/4.75  47.2±3.4/4.94  40.4

Case 3: Automated client labeling. When the client has labeled samples, for example, photos labeled by phone users, the cloud only needs to collect an unlabeled public dataset. Therefore, we may automate the labeling process by leveraging the client's supervision knowledge to reduce the cost or the users' effort of manual labeling. Specifically, we let the client generate pseudo labels for the cloud data based on their neighbor relations, as previously studied in [168] (private kNN). However, private kNN assumes that the client and the cloud follow the same distribution, which weakens its applicability when confronting the heterogeneity and the large scale of open-source data. Thus, we utilize ECOS to pre-process the open-source data before it is labeled by private kNN. Therefore, we have two rounds of communication for transferring client knowledge: proximal-data sampling by ECOS and client labeling by private kNN [168]. To compose the privacy costs from these two steps, we utilize the analytical moments accountant technique to obtain a tight privacy bound [144]. Interested readers can refer to Section D.1.1 for a brief introduction to private kNN and our implementation.

In the experiments, we use the state-of-the-art private pseudo-labeling method, private kNN [168], to label the subsampled open-source data with assistance from the labeled client dataset. To reduce the sensitivity of private kNN w.r.t. the threshold, we instead enforce the number of selected labels to be 600 and balance the class sizes by selecting the top-60 samples with the highest confidence per class. Then, we adopt the popular semi-supervised learning method, FixMatch [127], to train the classifier.
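The class-balanced selection used here can be sketched as follows, assuming an (N, C) array of pseudo-label probabilities produced by the labeling step; this is an illustrative helper, not the private-kNN implementation itself.

```python
import numpy as np

def select_balanced_labels(probs, k_per_class=60):
    """Keep the top-k most confident pseudo-labeled samples per class."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    selected = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        top = idx[np.argsort(conf[idx])[::-1][:k_per_class]]
        selected.extend(top.tolist())
    return np.array(selected), preds
```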
In Table 5.4, we report the results when the cloud features are distributionally biased from the client ones but share the same class set. For each domain chosen as the client data, we use the other domains as the cloud dataset. Distinct from prior studies, e.g., [168] or [109], we assume 80-90% of the cloud data are out of the client distribution and are heterogeneously aggregated from different domains, casting greater challenges in learning. On Digits, we eliminate all in-distribution data from the cloud set to harden the task. On both the Digits and DomainNet datasets, our method consistently outperforms the two sampling baselines under similar privacy costs. The variance of privacy costs mainly results from the actual sampling sizes. Though simply adopting K-Center outperforms random sampling, it still presents larger gaps compared to our method in multiple domains. For instance, in the Quickdraw domain, given a 5000-sample budget, the K-Center method performs poorly and is even worse than random sampling.

Figure 5.3: Demonstration of the centroids qualified by private CC. We use blue dots to represent the client data from the Real domain of DomainNet. For the data of the cloud domains, larger circles represent centroids with higher CC, and orange crosses are rejected OoD centroids.

5.4.2 Qualitative Study To better understand the proposed method, we conduct a series of qualitative studies on client labeling. We use the DomainNet dataset as a representative benchmark in these studies. 1) Ablation study. In Table 5.5, we conduct ablation studies to evaluate the effect of the different objectives introduced by ECOS, following the client-labeling benchmark with a 3000-sample budget. Without the proximity and diversity objectives, we let the baseline be naive random sampling. We first include the proximity objective, where we greedily select samples from top-scored clusters until the budget is fulfilled. However, we find that the naive proximity objective has a rather negative effect compared to the random baseline. The failure can be attributed to the nature of clustering, which tends to include many similar samples, i.e., it lacks diversity. When diversity is encouraged and combined with the proximal votes, we find that the performance is improved significantly in multiple domains and on average. Now with diversity, the proximity objective can further improve the sampling significantly in the Painting, Quickdraw, and Sketch domains. 2) Visualize cluster selection. In Fig. 5.3, we demonstrate that the CC can effectively reject OoD centroids. It is also interesting to observe that when multiple centroids are distributed closely, they compete with each other and consequently reject the redundant ones. 3) Efficiency. In Section 5.3, we studied the communication efficiency theoretically. Empirically, ECOS only needs to upload 100 bytes of CC scores in all experiments, while traditional outsourcing needs to upload the dataset, which is about 198MB for the lowest load in DomainNet (50% of the Sketch domain data for the client). Even counting the download load, ECOS only needs to download 45MB for the pre-trained ResNet18 feature extractor together with 51KB of centroid features, which is much less than uploading the data. A more detailed evaluation of efficiency is provided in Section D.2.4. Table 5.5: Ablation study of the proposed method on DomainNet. Test accuracy of the client labeling case is reported.
Proximity  Diversity  Clipart     Infograph   Painting    Quickdraw   Real        Sketch      Average
✗          ✗          30.7±2.2    21.6±4.9    44.0±3.1    39.0±5.6    59.9±3.6    47.2±3.4    40.4
✓          ✗          25.5±5.7    21.1±0.5    41.4±5.3    31.3±1.3    60.3±2.3    31.9±2.9    35.2
✗          ✓          39.3±3.6    18.2±2.7    46.6±2.5    36.1±2.9    63.8±1.9    47.7±2.8    42.0
✓          ✓          33.2±5.9    23.8±2.2    47.4±3.3    39.8±7.7    62.9±1.9    51.7±1.2    43.2

CHAPTER 6 OVERVIEW In this chapter, we summarize our contributions to data-centric privacy-preserving learning. Centralized learning. When a privacy budget is provided for a certain learning task, one has to carefully schedule the privacy usage throughout the learning process. Uniformly scheduling the budget has been widely used in the literature, whereas increasing evidence suggests that dynamic schedules can empirically outperform uniform ones. Our theoretical work on dynamic differentially-private learning provided a principled analysis of the problem of optimal budget allocation and connected the advantages of dynamic schedules to both the loss structure and the learning behavior. We further validated our results through empirical studies. Federated learning with heterogeneous devices. In this work, we proposed a novel federated learning approach for in-situ and on-demand customization to address challenges arising from resource heterogeneity and inference dynamics. We proposed a Split-Mix strategy that efficiently transfers clients' knowledge to collaboratively learn a customizable model. Extensive experiments demonstrate the effectiveness of the principle in adjusting model widths and robustness while using far fewer parameters than the baselines. Outsourcing training without uploading data. In this work, we explore the possibility of outsourcing model training without access to client data. To reconcile the data shortage in the target domain, we propose to find a surrogate dataset from a source-agnostic public dataset. We find that the heterogeneity of the open-source data greatly compromises the performance of trained models. To tackle this practical challenge, we propose a collaborative sampling solution, ECOS, that can efficiently and effectively filter open-source samples and thus benefits follow-up learning tasks. We envision this work as a milestone for private and efficient outsourcing from low-power and cost-effective end devices. We also recognize open questions of the proposed solution for future studies. For example, the public dataset may require additional data processing, e.g., aligning and cropping for improved prediction accuracy. In our empirical studies, we only consider computer vision tasks, though no assumption was made on the data structure. We expect the principles to be adapted to other data types with minimal effort. More data types, including tabular and natural-language data, will be considered in follow-up work. BIBLIOGRAPHY [1] General data protection regulation. [2] Facebook–cambridge analytica data scandal. Wikipedia, October 2022. [3] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In CCS: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 308–318, New York, NY, USA, 2016. ACM. [4] Accountability Act. Health insurance portability and accountability act of 1996. Public law, 104:191, 1996. [5] P. S. Aisen, S. Andrieu, C. Sampaio, M. Carrillo, Z. S. Khachaturian, B. Dubois, H. H. Feldman, R. C. Petersen, E. Siemers, R. S. Doody, S. B. Hendrix, M.
Grundman, L. S. Schneider, R. J. Schindler, E. Salmon, W. Z. Potter, R. G. Thomas, D. Salmon, M. Donohue, M. M. Bednar, J. Touchon, and B. Vellas. Report of the task force on designing clinical trials in early (predementia) ad. Neurology, 76(3):280–286, January 2011. [6] Zeyuan Allen-Zhu, Yuanzhi Li, and Zeyuan Allen-Zhu. Towards understanding en- semble, knowledge distillation and self-distillation in deep learning. arXiv:2012.09816 [cs.LG], December 2020. [7] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. Advances in neural information processing systems, 32, 2019. [8] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. January 2017. [9] Borja Balle and Yu-Xiang Wang. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 394–403, July 2018. [10] Jeff Barnes. Azure machine learning. Microsoft Azure Essentials. 1st ed, Microsoft, 2015. [11] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Effi- cient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473, October 2014. [12] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems 32, pages 11282–11291. Curran Associates, Inc., 2019. 86 [13] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Language, 79(1-2):151–175, May 2010. [14] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In ICML, 2012. [15] Ekaba Bisong. Google cloud machine learning engine (cloud mle). In Building machine learning and deep learning models on google cloud platform, pages 545–579. Springer, 2019. [16] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In CCS: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 1175–1191, New York, NY, USA, 2017. ACM. [17] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge dis- covery and data mining, KDD ’06, pages 535–541, New York, NY, USA, August 2006. Association for Computing Machinery. [18] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography, volume 9985, pages 635– 658. Springer Berlin Heidelberg, Berlin, Heidelberg, 2016. [19] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv:1812.01097 [cs, stat], December 2019. [20] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069– 1109, 2011. [21] Fei Chen, Tao Xiang, Xinyu Lei, and Jianyong Chen. Highly efficient linear regression outsourcing to a cloud. IEEE transactions on cloud computing, 2(4):499–508, 2014. [22] Junhong Cheng, Wenyan Liu, Xiaoling Wang, Xingjian Lu, Jing Feng, Yi Li, and Chao- fan Duan. 
Adaptive distributed differential privacy with sgd. Workshop on Privacy- Preserving Artificial Intelligence, page 6, 2020. [23] William J. Cook, W. H. Cunningham, W. R. Pulleyblank, and A. Schrijver. Combi- natorial optimization. Oberwolfach Reports, 5(4):2875–2942, 2009. [24] Rachel Cummings, Sara Krehbiel, Kevin A Lai, and Uthaipon Tantipongpipat. Dif- 87 ferential privacy for growing databases. In Advances in Neural Information Processing Systems 31, pages 8864–8873. Curran Associates, Inc., 2018. [25] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, pages 886–893. Ieee, 2005. [26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. [27] Damien Desfontaines and Balázs Pejó. Sok: Differential privacies. arXiv:1906.01337 [cs], June 2019. [28] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2021. [29] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learn- ing with moreau envelopes. In Advances in Neural Information Processing Systems, June 2020. [30] Jiahua Dong, Yang Cong, Gan Sun, Zhen Fang, and Zhengming Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021. [31] Jiahua Dong, Yang Cong, Gan Sun, Zhen Fang, and Zhengming Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021. [32] Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 4022–4031, June 2020. [33] Alexey Dosovitskiy and Josip Djolonga. You only train once: Loss-conditional training of deep networks. In International Conference on Learning Representations, September 2019. [34] Cynthia Dwork. Differential privacy. In Automata, Languages and Programming, Lecture Notes in Computer Science, pages 1–12. Springer Berlin Heidelberg, 2006. [35] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications 88 of Models of Computation, Lecture Notes in Computer Science, pages 1–19. Springer Berlin Heidelberg, 2008. [36] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Com- puter Science Conference, ITCS ’12, pages 214–226, Cambridge, Massachusetts, Jan- uary 2012. Association for Computing Machinery. [37] Cynthia Dwork, Alan Karr, Kobbi Nissim, and Lars Vilhuber. On privacy in the age of covid-19. Journal of Privacy and Confidentiality, 10(2), June 2020. [38] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Lecture Notes in Computer Science, pages 265–284. Springer Berlin Heidelberg, 2006. [39] Cynthia Dwork and Aaron Roth. 
The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2013. [40] Eric Enge. Mobile vs. desktop usage in 2020. https://www.perficient.com/insights/research-hub/mobile-vs-desktop-usage, March 2021. [41] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learn- ing: A meta-learning approach. In Advances in Neural Information Processing Systems, June 2020. [42] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimiza- tion: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 439–449, New York, NY, USA, June 2020. Association for Computing Machinery. [43] Hao-Zhe Feng, Zhaoyang You, Minghao Chen, Tianye Zhang, Minfeng Zhu, Fei Wu, Chao Wu, and Wei Chen. Kd3a: Unsupervised multi-source decentralized domain adaptation via knowledge distillation. arXiv:2011.09757, 2021. [44] Edward W Forgy. Cluster analysis of multivariate data: efficiency versus interpretabil- ity of classifications. Biometrics. Journal of the International Biometric Society, 21:768–769, 1965. [45] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In CCS: Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1322–1333, New York, NY, USA, 2015. ACM. [46] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backprop- 89 agation. In International Conference on Machine Learning, pages 1180–1189. PMLR, June 2015. [47] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661 [cs, stat], June 2014. [48] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fair- ness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, July 2018. [49] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge trans- fer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, November 2020. [50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015. [51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition, 2016. [52] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531 [cs, stat], March 2015. [53] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep models under the gan: Information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 603–618, New York, NY, USA, 2017. ACM. [54] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adap- tation. In International Conference on Machine Learning, pages 1989–1998. PMLR, July 2018. [55] Junyuan Hong, Jeffrey Kaye, Hiroko H. Dodge, and Jiayu Zhou. 
Detecting mci using real-time, ecologically valid data capture methodology: How to improve scientific rigor in digital biomarker analyses. Alzheimer’s & Dementia, 16(S5):e044371, 2020. [56] Junyuan Hong, Haotao Wang, Zhangyang Wang, and Jiayu Zhou. Feder- ated robustness propagation: Sharing adversarial robustness in federated learning. arXiv:2106.10196 [cs, stat], June 2021. [57] Junyuan Hong, Haotao Wang, Zhangyang Wang, and Jiayu Zhou. Learning model- based privacy protection under budget constraints. In AAAI, page 9, 2021. 90 [58] Junyuan Hong, Zhangyang Wang, and Jiayu Zhou. Dynamic privacy budget alloca- tion improves data efficiency of differentially private gradient descent. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pages 11–35, New York, NY, USA, June 2022. Association for Computing Machinery. [59] Junyuan Hong, Zhuangdi Zhu, Shuyang Yu, Zhangyang Wang, Hiroko H. Dodge, and Jiayu Zhou. Federated adversarial debiasing for fair and transferable representations. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, pages 617–627, New York, NY, USA, August 2021. Association for Computing Machinery. [60] Gao Huang, Danlu Chen, Tianhong Li, and Felix Wu. Multi-scale dense networks for resource efficient image classification. International Conference on Learning Represen- tations, page 14, 2018. [61] Xixi Huang, Jian Guan, Bin Zhang, Shuhan Qi, Xuan Wang, and Qing Liao. Differ- entially private convolutional neural networks with adaptive gradient descent. In 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), pages 642–648, June 2019. [62] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 16(5):550–554, May 1994. [63] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. Ai benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018. [64] Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752– 1755, June 2019. [65] Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. Improving contrastive learning on imbalanced data via open-world sampling. In Advances in Neural Infor- mation Processing Systems, May 2021. [66] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning | The MIT Press. MIT Press, 2008. [67] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mo- hammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016. [68] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and 91 proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learn- ing and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 795–811, Cham, 2016. Springer International Publishing. [69] Jeffrey A. Kaye, Shoshana A. Maxwell, Nora Mattek, Tamara L. Hayes, Hiroko Dodge, Misha Pavel, Holly B. Jimison, Katherine Wild, Linda Boise, and Tracy A. Zitzelberger. Intelligent systems for assessing aging changes: Home-based, unobtru- sive, and continuous assessment of aging. 
The Journals of Gerontology: Series B, 66B(suppl_1):i180–i190, July 2011. [70] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proceedings of the 25th Annual Con- ference on Learning Theory, COLT ’12, page 40, 2012. [71] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In the 3rd International Conference for Learning Representations, San Diego, CA, 2015. [72] Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimiza- tion:distributed optimization beyond the datacenter. arXiv:1511.03575 [cs, math], November 2015. [73] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. [74] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [75] Jaewoo Lee and Daniel Kifer. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pages 1656–1665, New York, NY, USA, 2018. ACM. [76] Xinyu Lei, Xiaofeng Liao, Tingwen Huang, Huaqing Li, and Chunqiang Hu. Outsourc- ing large matrix inversion computation to a public cloud. IEEE Transactions on cloud computing, 1(1):1–1, 2013. [77] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv:1910.03581 [cs, stat], October 2019. [78] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Systems and Machine Learning Foundation (MLSys), April 2020. [79] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource alloca- tion in federated learning. In International Conference on Learning Representations, 92 September 2019. [80] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Fed- erated learning on non-iid features via local batch normalization. In International Conference on Learning Representations, September 2020. [81] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. ICML, October 2020. [82] Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, and Jiashi Feng. Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer. ArXiv, 2020. [83] Edo Liberty, Zohar Karnin, Bing Xiang, Laurence Rouesnel, Baris Coskun, Ramesh Nallapati, Julio Delgado, Amir Sadoughi, Yury Astashonok, Piali Das, et al. Elastic machine learning algorithms in amazon sagemaker. In Proceedings of the 2020 ACM SIGMOD international conference on management of data, pages 731–737, 2020. [84] Ming Lin, Pinghua Gong, Tao Yang, Jieping Ye, Roger L. Albin, and Hiroko H. Dodge. Big data analytical approaches to the nacc dataset: Aiding preclinical trial enrichment. Alzheimer Disease & Associated Disorders, 32(1):18–27, 2018. [85] Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. In Advances in Neural Information Processing Systems, June 2020. [86] Aishan Liu, Shiyu Tang, Xianglong Liu, Xinyun Chen, Lei Huang, Zhuozhuo Tu, Dawn Song, and Dacheng Tao. Towards defending multiple adversarial perturbations via gated batch normalization. arXiv:2012.01654 [cs], December 2020. [87] Lanlan Liu and Jia Deng. 
Dynamic deep neural networks: Optimizing accuracy- efficiency trade-offs by selective execution. arXiv:1701.00299 [cs, stat], March 2018. [88] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10, Vancouver, Canada, July 2017. Association for Computational Linguistics. [89] Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, and Steven Wu. Leveraging public data for practical private query release. In International conference on machine learning, pages 6968–6977. PMLR, 2021. [90] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of international conference on computer vision (ICCV), December 2015. 93 [91] Mohammad Reza Loghmani, Markus Vincze, and Tatiana Tommasi. Positive- unlabeled learning for open set domain adaptation. Pattern Recognition Letters, 136, June 2020. [92] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. arXiv:1705.10667 [cs], December 2018. [93] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adver- sarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, pages 3384–3393. PMLR, July 2018. [94] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In- ternational Conference on Learning Representations, 2018. [95] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single- tasking of multiple tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1851–1860, 2019. [96] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, April 2017. [97] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. [98] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differ- entially private recurrent language models. In International Conference on Learning Representations, February 2018. [99] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foun- dations Symposium (CSF), pages 263–275, Santa Barbara, CA, USA, August 2017. IEEE. [100] P. Mohassel and Y. Zhang. Secureml: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages 19–38, May 2017. [101] Yurii Nesterov and B.T. Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, August 2006. [102] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. 94 [103] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4467–4477, 2017. [104] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Syn- thesizing the preferred inputs for neurons in neural networks via deep generator net- works. In Advances in Neural Information Processing Systems 29, pages 3387–3395. Curran Associates, Inc., 2016. [105] Seyed Ali Osia, Ali Taheri, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Had- dadi, and Hamid R. Rabiee. Deep private-feature extraction. IEEE Transactions on Knowledge and Data Engineering, 32(1):54–66, January 2020. [106] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010. [107] Gabriele Paolacci, Jesse Chandler, and Panagiotis G Ipeirotis. Running experiments on amazon mechanical turk. Judgment and Decision making, 5(5):411–419, 2010. [108] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In International conference on learning representations, 2016. [109] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with pate. International Conference on Learning Representations, February 2018. [110] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Mo- ment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019. [111] Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated adversarial do- main adaptation. In International Conference on Learning Representations, September 2019. [112] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv:1710.06924 [cs], November 2017. [113] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. Adaclip: Adaptive clipping for private sgd. arXiv:1908.07643 [cs, stat], October 2019. [114] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computa- 95 tional Mathematics and Mathematical Physics, 3(4):864–878, January 1963. [115] B. T. Polyak. Some methods of speeding up the convergence of iteration meth- ods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, January 1964. [116] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, January 1999. [117] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008. [118] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Confer- ence on Machine Learning, pages 314–323, June 2016. [119] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contribu- tions to the Theory of Statistics. The Regents of the University of California, 1961. [120] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. Journal of Privacy and Confidentiality, 2012. [121] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 
Adapting visual category models to new domains. In Proceedings of the 11th European conference on Com- puter vision: Part IV, ECCV’10, pages 213–226, Berlin, Heidelberg, September 2010. Springer-Verlag. [122] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International conference on learning representations, 2018. [123] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, May 2014. [124] Shai Shalev-Shwartz, Nathan Srebro, and Karthik Sridharan. Stochastic convex opti- mization. In Proceedings of the 22nd Annual Conference on Learning Theory, COLT ’09, page 11, 2009. [125] Naichen Shi, Fan Lai, Raed Al Kontar, and Mosharaf Chowdhury. Fed- ensemble: Improving generalization through model ensembling in federated learning. arXiv:2107.10663 [cs, stat], July 2021. [126] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy 96 (SP), pages 3–18, May 2017. [127] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi- supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, November 2020. [128] Kunal Talwar, Abhradeep Guha Thakurta, and Li Zhang. Nearly optimal private lasso. In Advances in Neural Information Processing Systems 28, pages 3025–3033. Curran Associates, Inc., 2015. [129] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6105–6114. PMLR, May 2019. [130] Om Thakkar, Galen Andrew, and H. Brendan McMahan. Differentially private learning with adaptive clipping. arXiv:1905.03871 [cs, stat], May 2019. [131] Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. Dp-cgan: Differen- tially private synthetic data and label generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. [132] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Alek- sander Madry. Robustness may be at odds with accuracy. International Conference on Learning Representations, September 2019. [133] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discrimina- tive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017. [134] B. Vellas, R. Bateman, K. Blennow, G. Frisoni, K. Johnson, R. Katz, J. Langbaum, D. Marson, R. Sperling, A. Wessels, S. Salloway, R. Doody, and P. Aisen. Endpoints for pre-dementia ad trials: A report from the eu/us/ctad task force. The journal of prevention of Alzheimer’s disease, 2(2):128–135, June 2015. [135] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Pan- chanathan. Deep hashing network for unsupervised domain adaptation. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017. [136] Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. Split learning for health: Distributed deep learning without sharing raw patient data. ICLR 2019 Workshop on AI for social good, December 2018. [137] Christina Wadsworth, Francesca Vera, and Chris Piech. 
Achieving fairness through 97 adversarial learning: an application to recidivism prediction. arXiv:1807.00199 [cs, stat], June 2018. [138] Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk min- imization with non-convex loss functions. In International Conference on Machine Learning, pages 6526–6535, May 2019. [139] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems 30, pages 2722–2731. Curran Associates, Inc., 2017. [140] Haotao Wang, Tianlong Chen, Shupeng Gui, Ting-Kuei Hu, Ji Liu, and Zhangyang Wang. Once-for-all adversarial training: In-situ tradeoff between robustness and ac- curacy for free. Advances in Neural Information Processing Systems, November 2020. [141] Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. Augmax: Adversarial composition of random augmentations for robust training. arXiv:2110.13771 [cs], October 2021. [142] Xiaofei Wang, Yiwen Han, Chenyang Wang, Qiyang Zhao, Xu Chen, and Min Chen. In-edge ai: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Network, 33(5):156–165, 2019. [143] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. ECCV, July 2018. [144] Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled renyi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235, April 2019. [145] Michael W. Weiner, Dallas P. Veitch, Paul S. Aisen, Laurel A. Beckett, Nigel J. Cairns, Robert C. Green, Danielle Harvey, Clifford R. Jack, William Jagust, Enchi Liu, John C. Morris, Ronald C. Petersen, Andrew J. Saykin, Mark E. Schmidt, Leslie Shaw, Li Shen, Judith A. Siuciak, Holly Soares, Arthur W. Toga, and John Q. Trojanowski. The alzheimer’s disease neuroimaging initiative: A review of papers published since its inception. Alzheimer’s & Dementia, 9(5):e111–e194, September 2013. [146] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):1–40, 2016. [147] Martin A Weiss and Kristin Archick. US-EU data privacy: from safe harbor to privacy shield, 2016. [148] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based 98 analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1307–1322, New York, NY, USA, 2017. ACM. [149] Zhenyu Wu, Haotao Wang, Zhaowen Wang, Hailin Jin, and Zhangyang Wang. Privacy- preserving deep action recognition: An adversarial learning framework and a new dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [150] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. arXiv:1711.08393 [cs], January 2019. [151] Cihang Xie and Alan Yuille. Intriguing properties of adversarial training at scale. International Conference on Learning Representations, December 2019. [152] Yun Xie, Peng Li, Chao Wu, and Qiuling Wu. Differential privacy stochastic gradient descent with adaptive privacy budget allocation. 
In 2021 IEEE International Confer- ence on Consumer Electronics and Computer Engineering (ICCECE), pages 227–231, January 2021. [153] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. A first look at deep learning apps on smartphones. arXiv:1812.05448 [cs], January 2021. [154] Yixing Xu, Yunhe Wang, Hanting Chen, Kai Han, Chunjing XU, Dacheng Tao, and Chang Xu. Positive-unlabeled compression on the cloud. In Advances in Neural In- formation Processing Systems, volume 32. Curran Associates, Inc., 2019. [155] Zhiying Xu, Shuyu Shi, Alex X. Liu, Jun Zhao, and Lin Chen. An adaptive and fast convergent approach to differentially private deep learning. the Proceedings of IEEE International Conference on Computer Communications, 2020. [156] Jiangchao Yao, Shengyu Zhang, Yang Yao, Feng Wang, Jianxin Ma, Jianwei Zhang, Yunfei Chu, Luo Ji, Kunyang Jia, Tao Shen, Anpeng Wu, Fengda Zhang, Ziqi Tan, Kun Kuang, Chao Wu, Fei Wu, Jingren Zhou, and Hongxia Yang. Edge-cloud polarization and collaboration: A comprehensive survey. arXiv:2111.06061 [cs], November 2021. [157] Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Gradient pertur- bation is underrated for differentially private convex optimization. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3117–3123, Yokohama, Japan, July 2020. International Joint Conferences on Artificial Intelligence Organization. [158] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv:1812.08928 [cs], December 2018. 99 [159] Lei Yu, Ling Liu, Calton Pu, Mehmet Emre Gursoy, and Stacey Truex. Differentially private model publishing for deep learning. proceedings of 40th IEEE Symposium on Security and Privacy, April 2019. [160] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv:1605.07146 [cs], June 2017. [161] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 335–340, New York, NY, USA, December 2018. Association for Computing Machinery. [162] Jun Zhang, Xiaokui Xiao, Yin Yang, Zhenjie Zhang, and Marianne Winslett. Privgene: Differentially private model fitting using genetic algorithms. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 665–676, New York, NY, USA, 2013. ACM. [163] Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021. [164] Xinyue Zhang, Jiahao Ding, Maoqiang Wu, Stephen T. C. Wong, Hien Van Nguyen, and Miao Pan. Adaptive privacy preserving deep learning algorithms for medical data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1169–1178, 2021. [165] Yingxue Zhou, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Arindam Banerjee. Private stochastic non-convex optimization: Adaptive algorithms and tighter general- ization bounds. arXiv:2006.13501 [cs, stat], August 2020. [166] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems 32, pages 14774–14784. Curran Associates, Inc., 2019. [167] Yuqing Zhu and Yu-Xiang Wang. Poission subsampled rényi differential privacy. In International Conference on Machine Learning, pages 7634–7642. PMLR, May 2019. 
[168] Yuqing Zhu, Xiang Yu, Manmohan Chandraker, and Yu-Xiang Wang. Private-knn: Practical differential privacy for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11854–11862, 2020. [169] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the 38th International Conference on Machine Learning, June 2021. APPENDIX A DYNAMIC PRIVACY BUDGET ALLOCATION A.1 Social Impact The wide usage of personal data in training machine learning models has led to huge successes in many application domains, but it is also accompanied by rising concerns about privacy protection due to the sensitive information in the data. The development of privacy-preserving algorithms has become one of the critical research areas of machine learning, in which the key challenge is to train high-performance models under the constraint of a given privacy budget, i.e., how much sensitive information can be accessed during the training phase. Differential privacy provides a principled framework to quantify the privacy budget, under which researchers have proposed various schemes to schedule the budget usage during a learning process, yet there is a lack of systematic studies on when and why some schedules are better than others. Our efforts in this work are among the first to study and compare the effectiveness of these schedules from a rigorous optimization perspective. Our theoretical results can help privacy-preserving machine learning practitioners efficiently and effectively choose proper privacy schedules tailored to their learning tasks. A.2 Comparison of algorithms Here we elaborate on the meaning of the algorithm names in Table 2.1. Asymptotic upper bounds are achieved when the sample size N approaches infinity. Both R and R_{ε,δ}, with R_{ε,δ} < R, are the privacy budgets of the corresponding algorithms. Specifically, R_{ε,δ} = ε²/ln(1/δ) < R when the private algorithm is (ε, δ)-DP with ε ≤ 2 ln(1/δ). PGD+Adv. Adv denotes the Advanced Composition method [11]. The method assumes that the loss function is 1-strongly convex, which implies the PL condition, and that the optimized variable lies in a convex set of diameter 1 w.r.t. the l2 norm. PGD+MA and the adjusted-utility version. MA denotes the Moments Accountant [3], which improves the composed privacy bound over the Advanced Composition. The improvement in the privacy bound leads to an enhanced utility bound as a result. PGD+Adv+BBImp. The dynamic method assumes that the loss is 1-strongly convex and that data come in a stream with n ≤ N samples at each round. Their utility upper bound holds with some probability p for any positive c. Adam+MA. The authors prove a convergence bound for the gradient norms, which is extended to a loss bound by using the PL condition. They also present results for AdaGrad and GD, which have essentially the same upper bound. Our theorems improve their bound by using a recursive derivation based on the PL condition, while their bound is a simple application of the condition to the gradient norm bound. GD, Non-Private. This method does not inject noise into gradients but limits the number of iterations. With this bound, we can see that our utility bounds are optimal with the dynamic schedule. GD+zCDP. We discussed the static and dynamic schedules for the gradient descent method, where the dynamic noise influence is the key to tightening the bound. Momentum+zCDP. Different from GD+zCDP, momentum methods have a two-phase utility upper bound.
When T is smaller than some positive constant T̂, the bound is as tight as the non-private one. Afterward, the momentum bound degrades to the GD bound. A.3 Preliminaries A.3.1 Privacy Lemma 2 (Composition & Post-processing). Let two mechanisms be M : D^n → Y and M′ : D^n × Y → Z. Suppose M satisfies (ρ1, a)-zCDP and M′(·, y) satisfies (ρ2, a)-zCDP for all y ∈ Y. Then, the mechanism M″ : D^n → Z defined by M″(x) = M′(x, M(x)) satisfies (ρ1 + ρ2)-zCDP. Definition 7 (Sensitivity). The sensitivity of a gradient query ∇t to the dataset {xi}_{i=1}^N is

∆₂(∇t) = max_n ‖ (1/N) Σ_{j=1, j≠n}^N ∇t^{(j)} − (1/N) Σ_{j=1}^N ∇t^{(j)} ‖₂ = (1/N) max_n ‖∇t^{(n)}‖₂,   (A.1)

where ∇t^{(n)} denotes the gradient of the n-th sample. Lemma 3 (Gaussian mechanism [18]). Let f : D^n → Z have sensitivity ∆. Define a randomized algorithm M : D^n → Z by M(x) ← f(x) + N(0, ∆²σ²I). Then M satisfies (1/(2σ²))-zCDP. Lemma 4 ([18]). If M is a mechanism satisfying ρ-zCDP, then M is (ρ + 2√(ρ ln(1/δ)), δ)-DP for any δ > 0. By solving ρ + 2√(ρ ln(1/δ)) = ε, we get ρ = ε + 2 ln(1/δ) − 2√(ln(1/δ)(ε + ln(1/δ))). A.4 Proofs For detailed proofs, please refer to [58]. APPENDIX B FEDERATED ADVERSARIAL DEBIASING FOR FAIR AND TRANSFERABLE REPRESENTATIONS B.1 Proofs Proof of Theorem 9. Without loss of generality, we can write σg ξt + Gσt νt as σ̃t ζt, where σ̃t ≜ √(σg² + (Gσt)²) and ζt is a random vector with E ζt = 0 and E ‖ζt‖² ≤ D. Therefore, we replace νt by ζt and σt² by σ̃t²/G² = σg²/G² + σt². Now, we only need to update U3(σ, T) as

Ũ3 = (1/G²) Σ_{t=1}^T γ^{T−t} (1−β)²/(1−β^t)² Σ_{i=1}^t β^{2(t−i)} σ̃i²
   = Σ_{t=1}^T γ^{T−t} (1−β)²/(1−β^t)² Σ_{i=1}^t β^{2(t−i)} (σg²/G² + σi²)
   = U3g + U3,

where we define

U3g ≜ (σg²/G²) Σ_{t=1}^T γ^{T−t} (1−β)²/(1−β^t)² Σ_{i=1}^t β^{2(t−i)}.

We can upper bound U3g by

U3g = (σg²/G²) Σ_{t=1}^T γ^{T−t} (1−β)²/(1−β^t)² · (1−β^{2t})/(1−β²)
    = (σg²/G²) Σ_{t=1}^T γ^{T−t} (1−β)/(1−β^t) · (1+β^t)/(1+β)
    ≤ (σg²/G²) Σ_{t=1}^T γ^{T−t}
    ≤ (σg²/G²) · 1/(1−γ)
    = (κ/G²) σg².

Combining with the factors of U3 in the PGD bounds:

αR0 U3g ≤ (αR0/G²) κσg² = Dσg²/(2µn²(f(θ1) − f(θ*))) ≤ Dσg²/(2µN²R(f(θ1) − f(θ*))).

Proof of Theorem 13. Substituting D*_{α1,α2}(z) into Eq. (3.4):

D̃p1,p2 = Ep1[α1 log(α1 p1(z)/(α1 p1(z) + α2 p2(z)))] + Ep2[α2 log(α2 p2(z)/(α1 p1(z) + α2 p2(z)))]
        = α1 KL[p1 ‖ (α1 p1 + α2 p2)/(α1 + α2)] + α2 KL[p2 ‖ (α1 p1 + α2 p2)/(α1 + α2)] + α1 log α1 + α2 log α2 − (α1 + α2) log(α1 + α2)
        ≥ α1 log α1 + α2 log α2 − (α1 + α2) log(α1 + α2),

where the last inequality follows from the non-negativity of the KL divergence. Note that when p1 = p2, both KL divergences are 0. Thus, we can conclude that p1 = p2 is a sufficient condition for attaining the lower bound. Proof of Theorem 14. For ease of derivation, we assume α1 and α2 are normalized s.t. α1 + α2 = 1. From |log p1(x) − log p2(x)| ≤ ε, we get e^{−ε} ≤ p1(x)/p2(x) ≤ e^{ε} and e^{−ε} ≤ p2(x)/p1(x) ≤ e^{ε}. Thus,

KL[p1 ‖ α1 p1 + α2 p2] = ∫_x p1 log(p1/(α1 p1 + α2 p2)) ≤ ∫_x p1 log(1/(α1 + α2 e^{−ε})) = ε − log(α1 e^{ε} + α2).

Similarly,

KL[p2 ‖ α1 p1 + α2 p2] = ∫_x p2 log(p2/(α1 p1 + α2 p2)) ≤ ∫_x p2 log(1/(α1 e^{−ε} + α2)) = ε − log(α1 + α2 e^{ε}).

Therefore,

D̃p1,p2 ≤ α1 [ε − log(α1 e^{ε} + α2)] + α2 [ε − log(α1 + α2 e^{ε})] + α1 log α1 + α2 log α2
        = ε − α1 log(e^{ε} + α2/α1) − α2 log(e^{ε} + α1/α2)
        ≤ O((1 − α2) ε) = O(α1 ε/(α1 + α2)),

where we manually add (α1 + α2) to normalize α1. B.2 Experiment Details B.2.1 Dynamic schedules We use dynamic schedules for the learning rate and the adversarial parameter λ following previous work [46].
Specifically,

ηt = 1/(1 + 10 t/(Tmax K))^{0.75},    λt = 2/(1 + exp(−10 t/Tmax)) − 1,

where K is the number of local iterations and Tmax is the number of global rounds. Notably, ηt is scheduled locally and λt is scheduled globally.

Figure B.1: Network architectures for the digit and object datasets. WN denotes the weight-norm layer [81] and FC 256 denotes a fully-connected layer with 256 units. GRL is the gradient reversal layer [46].

B.2.2 Network architectures Federated UDA. The network architectures are presented in Fig. B.1. Batch normalization in FADE. During training, we share the parameters of ResNet between users. Notably, in ResNet, batch normalization (BN) layers are densely embedded at different depths. The BN layer is known to be important for transferring between distinct domains, because the hidden representations are normalized with the mean and variance estimated from a batch. Because such estimation can easily be biased by a small batch, running estimation that accumulates results from previous batches is a common practice. Thus, it is also important for all users to obtain a global estimate of the mean and variance by communication. However, sharing such running estimates of the representation mean and variance may leak private information [103, 104]. For example, given a feature vector at a specific layer, the input image can be recovered using a conditional generative network [103, 104]. Instead of sharing the mean and variance (BN states), we keep these values the same as the values pre-trained on ImageNet. Fair federated learning. We depict the network architectures for the Adult and MCI datasets in Fig. B.2. For the Adult dataset, we aim to evaluate the performance of deep networks; thus, we use a deeper network rather than the shallow one used for central algorithms [93]. Because of the small size of the MCI dataset, we adopt a small network architecture where only two LSTM layers are used for feature extraction and one layer is used for classification or group identification.

Figure B.2: Network architectures for the Adult and MCI datasets. LSTM 100 indicates a Long Short-Term Memory (LSTM) cell with 100 hidden units.

B.2.3 Details of the MCI dataset Dataset. Due to the mild symptoms and the expensive cost of clinical diagnosis, early detection of MCI is a hard task. To address the challenge, MCI detection models are built on an MCI dataset collected with Intelligent Systems for Assessing Aging Change (ISAAC), a longitudinal cohort study [69, 55]. A total of 152 participants were enrolled beginning in 2017. 12 variables are extracted from the participants' sensor data, and clinical diagnoses were done once a year. Meanwhile, four kinds of demographic information are also recorded, including age, gender, education, and ethnicity, which are potentially unfair features for each patient. Though prior work has shown the effectiveness of machine learning methods in diagnosis prediction [84, 55], the possibility of training such a model fairly in a distributed framework remains unknown.
We assume the sensor data can be used for training immediately on the local device, and only the trained models are sent to the server. The distributed framework brings several new challenges. First, users' data are kept locally, and many users only have one-class data, which makes the local model less discriminative. For example, 13 users are always diagnosed as MCI throughout their recordings. Second, it is difficult to do adversarial learning as in Fig. 3.1b: because a user's group information, e.g., gender, cannot be revealed to others, the server has no idea which users form the adversarial group. Therefore, we utilize the FADE framework to tackle these issues, as illustrated in Fig. 3.1c. As far as privacy is concerned, in the ISAAC protocol the sensor data were collected periodically by engineers such that the user data are kept away from others. Nevertheless, we argue that our extension to the federated setting is practical because the data are not directly shared. Preprocessing. Since some patients' records are missing due to sensor systems occasionally being offline, and these incomplete samples can introduce uncertainty into our experiments, we remove samples according to their ratio of missing values. To generate samples, the hundreds of days of records for each patient are sliced by a moving window, and each slice is used as a sample for training or prediction. The slicing is done inside each person's sequence without overlap, and the time window is moved in steps of 7 days. Only subsequences with a small enough ratio of missing values are kept for the current study. The number of sequences for each patient is related to the amount of data the patient has. Some patients have only a small number of records; we also remove the samples of those patients to avoid inaccurate predictions. We have 12 variables in total, including gender (Rsex), years of education (Ryrschool), race/ethnicity (Rethnic), age at each date (ageyrs), total computer use (compuse), computer sessions (numcsess), track sensor line (linenum), walks (numwalks), mean walking speed (meanws), upper quartile of walking speed (wsq3), coefficient of variation of walking speed (wscv), and standard deviation of walking speed (wsstddev). We preprocess special variables as follows. For linenum, which is a sensor line identity value, the integer values are transformed into a one-hot encoding, where the position of the single one indicates the ID value. The Rsex and Rethnic variables are encoded in the same way. The ages are transformed by 3-bin discretization. All continuous variables are normalized within [−1, 1] by min-max scaling, so that the scales of different variables do not differ significantly and their coefficients can be trained in a numerically robust way. The data features are collected in a relatively redundant way, so they should be carefully selected for better prediction performance. We select features using mutual information, which measures the dependency between variables: it equals zero if and only if two random variables are independent, and a higher value means higher dependency. A special case is the linenum variable, which only makes sense when other walking-speed features are used; as a result, when a walking feature is selected according to the above metric, the linenum variable is automatically included.
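As a concrete illustration of the preprocessing and feature-selection pipeline described above, the following is a hedged sketch using pandas and scikit-learn. The DataFrame df, the label column name "diagnosis", and the number of selected features top_k are illustrative assumptions; only the variable names follow the list given in the text.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif

def preprocess(df: pd.DataFrame, label_col: str = "diagnosis", top_k: int = 8):
    # One-hot encode identity-like variables (sensor line ID, gender, ethnicity).
    df = pd.get_dummies(df, columns=["linenum", "Rsex", "Rethnic"])
    # Discretize age into 3 bins.
    df["ageyrs"] = pd.cut(df["ageyrs"], bins=3, labels=False)
    # Min-max scale continuous variables into [-1, 1].
    continuous = ["compuse", "numcsess", "numwalks", "meanws", "wsq3", "wscv", "wsstddev"]
    df[continuous] = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df[continuous])

    # Rank features by mutual information with the diagnosis label.
    X, y = df.drop(columns=[label_col]), df[label_col]
    mi = mutual_info_classif(X.fillna(0), y)
    ranked = X.columns[mi.argsort()[::-1]]
    keep = list(ranked[:top_k])
    # The linenum one-hot columns only make sense alongside walking-speed features,
    # so include them whenever a walking-speed feature is selected.
    if any(f in keep for f in ["meanws", "wsq3", "wscv", "wsstddev"]):
        keep += [c for c in X.columns if c.startswith("linenum_") and c not in keep]
    return df[keep + [label_col]]

This sketch omits the missing-value filtering and window slicing, which operate on the raw daily records before a table like df is formed.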
APPENDIX C EFFICIENT FEDERATED LEARNING FOR ON-DEMAND AND IN-SITU CUSTOMIZATION C.1 Experiments In this section, we provide more details about our experiments and additional evaluation results. C.1.1 Parallel implementation of Split-Mix and convergence In this section, we elaborate on the implementation and efficiency of Split-Mix. In Fig. C.1, we conceptually compare the training of the three methods when two clients, capable of training the ×1 net and the ×0.5 net respectively, are considered. FedAvg can train the ×1 net on the ×1-capable client but not on the ×0.5-capable client. Through parameter sharing among different model widths, SHeteroFL can train multiple widths in a sequential manner and can fit into ×0.5-capable clients. Unlike SHeteroFL, Split-Mix trains four base models in one parallel pass and therefore is the most efficient and flexible method. Remarkably, in the worst case, training the ×1 net with Split-Mix is as efficient as FedAvg and more efficient than SHeteroFL, if all the base weights of layer l, w^l_{0,r}, ..., w^l_{3,r} ∈ R^{r×r} (where r = 0.25 here), are embedded into a 4r × 4r weight matrix (for simplicity of notation, we assume the width of layer l is r, though the width could vary by layer in general). When base models are embedded into a full net, non-trainable parameters can be masked out (grey areas in Fig. C.1) to avoid interference between base models.

Figure C.1: Illustration of training weight matrices on a ×1-net-capable or ×0.5-net-capable client for (a) FedAvg, (b) SHeteroFL, and (c) Split-Mix. (1) Download the global weight matrix W^l of layer l or a selected subset W^l_k. (2) Train weights on a batch of data (x, y). (3) Upload the trained weight matrix Ŵ^l_k.

With the aforementioned implementation, we compare the convergence versus the wall-clock time in Fig. C.2. We implement all algorithms in PyTorch 1.4.1 and run them on a single NVIDIA RTX A5000 GPU and a 104-thread CPU. Fig. C.2 shows the elapsed computation time from initialization to a maximal number of iterations. The maximal number of iterations is set to be the same for all methods, such that it is easier to compare the stopping time. In Fig. C.2, Split-Mix is much more efficient than SHeteroFL. Note that FedAvg can only train the ×0.125 net, which includes far fewer parameters. For this reason, Split-Mix is slightly slower than FedAvg, but the mild loss of efficiency is traded for better accuracy than FedAvg.

Figure C.2: Validation accuracy (y-axis) of the budget-compatibly-widest nets versus wall-clock time in minutes (x-axis) on Digits, comparing FedAvg, SHeteroFL, and SplitMix. All algorithms are run with the same number of iterations (200).

C.1.2 Experimental configurations In Algorithm C.1, we elaborate on the local training with DBN. Specifically, each BN is trained independently with different input samples. The maximization problem in Algorithm C.1 can be solved by n-step projected gradient descent, which is commonly known as the PGD attack [94].

Algorithm C.1: LocalTrain(Wk, Dk, E, η) with DBN and adversarial training
1: Initialize models Ŵk by Wk
2: for e ∈ {1, · · · , E} do
3:   for mini-batch B = {(x, y)} in Dk do
4:     for ŵ_{i,r} ∈ Ŵk in parallel do
5:       Set f to use the clean BN
6:       L ← (1/|B|) Σ_{(x,y)∈B} L_CE(f(x; ŵ_{i,r}), y)
7:       Set f to use the noise BN
8:       B̃ ← ∅
9:       for x ∈ B do
10:        Perturb x̃ = x + δ with δ ← arg max_{‖δ‖∞ ≤ ε} L_CE(f(x + δ; ŵ_{i,r}), y)
11:        B̃ ← B̃ ∪ {(x̃, y)}
12:      L ← (1/2) [L + (1/|B̃|) Σ_{(x̃,y)∈B̃} L_CE(f(x̃; ŵ_{i,r}), y)]
13:      ŵ_{i,r} ← ŵ_{i,r} − η ∂L/∂ŵ_{i,r}
14: Return Ŵk

Data. Both CIFAR10 and Digits are 10-way classification tasks.
Data. Both CIFAR10 and Digits are 10-way classification tasks. We follow the non-i.i.d benchmark of [80] to extract 10 classes from DomainNet, which is publicly available in the FedBN codebase. To illustrate the multi-domain datasets, we show sample images from the Digits and DomainNet datasets in Fig. C.3.

Figure C.3: Sample images from the multi-domain datasets (Digits: mnist, svhn, usps, syn, mnistm; DomainNet: clipart, infograph, painting, quickdraw, real, sketch).

Table C.1: Network architecture for the Digits dataset.
  Layer    Details
  (feature extractor)
  conv1    Conv2D(64, kernel size=5, stride=1, padding=2)
  bn1      DBN2D, ReLU, MaxPool2D(kernel size=2, stride=2)
  conv2    Conv2D(64, kernel size=5, stride=1, padding=2)
  bn2      DBN2D, ReLU, MaxPool2D(kernel size=2, stride=2)
  conv3    Conv2D(128, kernel size=5, stride=1, padding=2)
  bn3      DBN2D, ReLU
  (classifier)
  fc1      FC(2048)
  bn4      DBN2D, ReLU
  fc2      FC(512)
  bn5      DBN1D, ReLU
  fc3      FC(10)

Hyper-parameters. In general, for local optimization we use stochastic gradient descent (SGD) with momentum 0.9 and weight decay 5 × 10⁻⁴. Dataset-specific settings are as follows. CIFAR10: Following HeteroFL [28], we train with 5 local epochs and 400 global communication rounds. Globally, we initialize the learning rate at 0.01 and decay it by a factor of 0.1 at communication rounds 150 and 250. Locally, we use a larger batch size of 128 to speed up training in simulation. Digits: We use a cosine-annealing learning rate decaying from 0.1 to 0 across 400 global communication rounds; SGD is executed for one epoch on each local client. DomainNet: We use a constant learning rate of 0.01 and run 400 communication rounds in total; similar to Digits, SGD is executed for one epoch on each local client.

Network architectures. Architectures of the modified AlexNet (for DomainNet) and the CNN (for Digits) can be found in [80] and the public code.² For the reader's reference, we provide the layer details in Tables C.1 and C.2. For a convolutional layer (Conv2D or Conv1D), the first argument is the number of output channels; for a fully connected layer (FC), the first argument is the number of hidden units. The implementation of the preactivated ResNet can be found in the repository of HeteroFL [28].

²https://github.com/med-air/FedBN

Table C.2: Network architecture for the DomainNet dataset.
  Layer     Details
  (feature extractor)
  conv1     Conv2D(64, kernel size=11, stride=4, padding=2)
  bn1       DBN2D, ReLU, MaxPool2D(kernel size=3, stride=2)
  conv2     Conv2D(192, kernel size=5, stride=1, padding=2)
  bn2       DBN2D, ReLU, MaxPool2D(kernel size=3, stride=2)
  conv3     Conv2D(384, kernel size=3, stride=1, padding=1)
  bn3       DBN2D, ReLU
  conv4     Conv2D(256, kernel size=3, stride=1, padding=1)
  bn4       DBN2D, ReLU
  conv5     Conv2D(256, kernel size=3, stride=1, padding=1)
  bn5       DBN2D, ReLU, MaxPool2D(kernel size=3, stride=2)
  avgpool   AdaptiveAvgPool2D(6, 6)
  (classifier)
  fc1       FC(4096)
  bn6       DBN1D, ReLU
  fc2       FC(4096)
  bn7       DBN1D, ReLU
  fc3       FC(10)

Batch normalization for customizable model sizes. As pointed out by [28], HeteroFL relies on mini-batch batch normalization to stabilize training with multiple model widths and computes the statistics afterwards. Another advantage of this strategy arises in the federation of multi-domain clients: [80] showed that using local batch normalization helps model personalization for non-i.i.d local features, and since mini-batch BN statistics mimic the local-BN idea, they should enjoy a similar benefit. To eliminate potential biases in personalization or convergence caused by different estimation methods for BN statistics, we let all compared algorithms use the same mini-batch strategy.
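In PyTorch terms, the mini-batch strategy corresponds to disabling running-statistics tracking so that normalization always uses the current batch. The sketch below illustrates this, together with a simplified dual-BN (clean/noise) wrapper; the wrapper is an illustration of the idea, not the thesis implementation.

# Sketch: BN that always normalizes with mini-batch statistics
# (track_running_stats=False), wrapped as an illustrative dual BN.
import torch.nn as nn

class DualBN2d(nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        # With track_running_stats=False, each BN uses the statistics of the
        # current batch in both training and evaluation.
        self.clean_bn = nn.BatchNorm2d(num_channels, track_running_stats=False)
        self.noise_bn = nn.BatchNorm2d(num_channels, track_running_stats=False)
        self.use_noise_bn = False  # toggled by the training loop

    def forward(self, x):
        return self.noise_bn(x) if self.use_noise_bn else self.clean_bn(x)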
To reveal the effect of different BN-statistics solutions for Split-Mix, we provide a detailed ablation study regarding the estimation of BN statistics in Section C.1.3.

Loss function for class non-i.i.d FL. As some classes are missing locally in the non-i.i.d setting, class-specific parameters in the classifier head may be updated without proper supervision, resulting in essentially random updates. To mitigate the effect of locally missing classes, we use the same masked cross-entropy loss introduced by HeteroFL, where absent classes are masked out.

C.1.3 Ablation study of network scaling and BN statistics

In this section, we evaluate how network rescaling and BN statistics affect performance. For rescaling, we consider the parameter initialization (rescale init) and the layer outputs (rescale layer). For BN statistics, we consider four options. The batch average option estimates the statistics from a single batch of data. The post average option uses the batch-average strategy during training but re-estimates the statistics on client data afterward, as adopted by HeteroFL; to obtain better statistics, we run the model on the training set for 20 epochs. The tracked option tracks running statistics during training. The locally tracked option also tracks statistics, but the statistics are not shared with the server for averaging, which benefits clients' privacy and personalization [80].

We report the full ablation results in Table C.3. 1) First, we compare the BN options without in-training tracking. The post-average BN performs best among these choices; the batch-average BN reaches similar performance without the extra rounds of statistics re-estimation and is therefore more efficient for inference. 2) Next, we compare the BN options tracked during training. Consistent with the prior study [80], the locally tracked BN performs better than the globally averaged one. 3) Rescaling layer outputs barely affects the accuracy, but it can be harmful to tracked BN statistics. 4) Rescaling the initialization greatly improves performance regardless of the choice of BN statistics.

In conclusion, either locally tracked or batch-averaged BN statistics can yield both efficient and accurate performance. Post-averaged BN statistics may be preferred if their extra estimation cost is acceptable, especially for large-scale datasets. Rescaled initialization is an essential ingredient for Split-Mix to perform well. Layer rescaling is not recommended when batch-average or tracked BN statistics are used.

Table C.3: Ablation study of network scaling and BN statistics on the Digits dataset. Accuracy for different customized widths is presented (✓/✗ indicate whether rescaling is enabled).
  BN stat          rescale init  rescale layer  ×0.125  ×0.25  ×0.5   ×1
  batch average    ✗             ✗              81.1%   84.3%  86.2%  87.3%
                   ✗             ✓              81.1%   84.2%  86.2%  87.2%
                   ✓             ✗              84.5%   87.5%  88.9%  89.8%
                   ✓             ✓              84.6%   87.5%  89.0%  89.8%
  post average     ✗             ✗              81.2%   84.5%  86.2%  87.4%
                   ✗             ✓              81.2%   84.4%  86.2%  87.3%
                   ✓             ✗              84.5%   87.5%  89.0%  89.9%
                   ✓             ✓              84.9%   87.8%  89.3%  90.2%
  tracked          ✗             ✗              79.6%   82.8%  84.8%  85.7%
                   ✗             ✓              9.4%    10.6%  10.6%  10.6%
                   ✓             ✗              83.5%   86.4%  87.9%  88.7%
                   ✓             ✓              8.8%    8.8%   8.8%   10.6%
  locally tracked  ✗             ✗              81.1%   84.3%  86.3%  87.3%
                   ✗             ✓              9.8%    10.6%  10.1%  11.7%
                   ✓             ✗              84.9%   87.7%  89.1%  90.0%
                   ✓             ✓              10.5%   10.6%  12.1%  11.1%

C.1.4 Experiments with i.i.d FL

In addition to the non-i.i.d FL settings, we experiment with i.i.d FL, where each client owns data from all 10 classes of the CIFAR10 dataset. Results using 100% and 50% of the training data are included in Table C.4.
Compared to the non-i.i.d experiments, we observe a large increase in accuracy, reflecting the common observation that non-i.i.d FL performs worse globally; a similar degradation was observed in [28] as well. In Table C.4, our method performs better at larger widths with fewer training data, approaching the performance of unconstrained individual FedAvg. Our method provides a monotonic relation between model size and accuracy (larger models are more accurate) and uses fewer parameters and MACs even than wider baseline networks, though it performs worse at smaller widths because the slimmest networks are updated less frequently on budget-insufficient clients than in SHeteroFL or FedAvg. Notably, our method uses far fewer parameters and operations for the same accuracy: in the 50% CIFAR10 i.i.d configuration, for example, Split-Mix requires 7.2M MACs and 1.4M parameters for 81.1% accuracy, while SHeteroFL needs twice that complexity.

Table C.4: Test results of customizing model width on the i.i.d CIFAR10 dataset.
                 Individual FedAvg        SHeteroFL                Split-Mix (ours)
  width          Acc    MACs   #Params    Acc    MACs   #Params    Acc    MACs   #Params
  CIFAR10 i.i.d FL (100%)
  ×0.125         82.2%  0.9M   0.2M       81.9%  0.9M   0.2M       80.9%  0.9M   0.2M
  ×0.25          86.1%  3.5M   0.7M       85.2%  3.5M   0.7M       83.4%  1.8M   0.4M
  ×0.5           89.8%  14.0M  2.8M       86.5%  14.0M  2.8M       85.2%  3.6M   0.7M
  ×1             91.0%  55.7M  11.2M      85.9%  55.7M  11.2M      86.0%  7.2M   1.4M
  CIFAR10 i.i.d FL (50%)
  ×0.125         77.3%  0.9M   0.2M       77.2%  0.9M   0.2M       74.2%  0.9M   0.2M
  ×0.25          79.9%  3.5M   0.7M       79.7%  3.5M   0.7M       77.9%  1.8M   0.4M
  ×0.5           83.2%  14.0M  2.8M       80.1%  14.0M  2.8M       79.5%  3.6M   0.7M
  ×1             84.6%  55.7M  11.2M      75.5%  55.7M  11.2M      81.1%  7.2M   1.4M

C.1.5 More budget distributions

In our experiments, we generally use an exponential budget distribution: R_k = (1/2)^⌈4k/K⌉. Though this distribution represents the imbalance between budget-sufficient and budget-insufficient clients, real-world applications may encounter a wider variety of budget distributions. Thus, we extend our problem assumption to budget distributions with more budget-sufficient clients, where we let more groups have ×1 or ×0.5 net training capability, and with step-increase budgets, where we increase budgets by a fixed step (e.g., ×0.25). In addition, we consider a log-normal distribution that concentrates around a 0.45 budget with a few wider or extremely budget-insufficient clients. We partition the budget distribution into 0.125-width bins, and each client only trains the maximal compatible width, varying from ×0.125 to ×1, which greatly increases the number of slimmable subnetworks (8 now, compared to 4 previously). To reduce the overhead of slimmable training, we use HeteroFL instead of SHeteroFL.

Fig. C.4 reports the per-width accuracy for Split-Mix and SHeteroFL on the Digits dataset. For SHeteroFL, we only report the evaluated performance on trained widths; in other words, if the maximal width is ×0.5, we do not report results for the ×1 net. For individually trained FedAvg (Ind. FedAvg), models individually trained for each width are reported; this ignores users' width constraints and therefore only serves as a reference upper bound. Regardless of the budget distribution, for instance with more budget-sufficient clients (Fig. C.4a) or non-exponential distributions (Fig. C.4b), Split-Mix outperforms SHeteroFL at larger widths.
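The sketch below generates the three kinds of budget distributions described above (exponential, step-increase, and log-normal binned to 0.125-wide widths); the number of groups, the step size, the log-normal spread, and the indexing of the exponential formula are assumptions of the sketch, not the thesis' exact values.

# Illustrative generation of client budget distributions (width constraints).
# Group counts, step size, and log-normal spread are assumptions.
import numpy as np

def exponential_budgets(num_clients, num_groups=4):
    # Equal-sized groups with budgets x1, x1/2, x1/4, x1/8 (the exponential
    # distribution R_k = (1/2)^ceil(4k/K) above, under one plausible indexing).
    group = np.minimum(np.arange(num_clients) * num_groups // num_clients, num_groups - 1)
    return 0.5 ** group

def step_budgets(num_clients, step=0.25, num_groups=4):
    # Budgets increase by a fixed step across groups, e.g., 0.25, 0.5, 0.75, 1.0.
    group = np.minimum(np.arange(num_clients) * num_groups // num_clients, num_groups - 1)
    return step * (group + 1)

def lognormal_budgets(num_clients, sigma=0.3, seed=0):
    # Budgets concentrated around ~0.45, binned to multiples of 0.125 in (0, 1];
    # each client trains the maximal width compatible with its budget.
    rng = np.random.default_rng(seed)
    raw = np.clip(rng.lognormal(mean=np.log(0.45), sigma=sigma, size=num_clients), 0.125, 1.0)
    return np.maximum(np.floor(raw / 0.125), 1) * 0.125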
C.1.6 Effect of lower contact rates

Because of varying communication conditions in deployment, clients may actively and successfully upload their models less often than expected. To evaluate the robustness of the customizable federated algorithms, we conduct federated experiments with a varying number of active clients per round in Fig. C.5. Because of the limited number of communication rounds (within 300 rounds), the test accuracy decreases as fewer clients are contacted per round. This is a common phenomenon: lower contact rates require more communication rounds to reach the same performance as full-contact competitors. Though global performance generally decreases, we find that the wider networks are more accurate when trained by Split-Mix. One source of this advantage is the modular base models in Split-Mix, which can easily be distributed across different rounds for training. In contrast, the wider integrated networks in SHeteroFL have a lower chance of being aggregated globally, and therefore their global performance declines.

C.1.7 Negative impact of constrained budgets on wider networks

In Fig. 4.2, we show how SHeteroFL fails to train wider networks with budget-constrained clients. To show the generality of this problem, we extend the experiments to the Digits and CIFAR10 datasets in Fig. C.6. Though wider networks converge faster at the beginning, they soon overfit the limited data held by a few clients, and their validation accuracy stops improving.

Figure C.4 (panels: (a) more budget-sufficient clients, (b) step-increase budget distributions, (c) log-normal distributions): Vary the budget distribution. The training budgets, i.e., width constraints, are depicted in the upper figures by group. A budget distribution name such as 8-4-2-1 means ×1/8, ×1/4, ×1/2, and ×1 width constraints for the respective groups. The lower figures compare the performance of trained models with customized widths.

Figure C.5: Vary the number of clients uploading models per round.

Figure C.6: Convergence of different-width models.

APPENDIX D

OUTSOURCING TRAINING WITHOUT UPLOADING DATA

D.1 More Method Details

In this section, we elaborate on additional technical details of our paper.

D.1.1 Automated client labeling via private kNN

Here, we briefly introduce the main idea of client labeling via private kNN.
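Before the formal description that follows, a minimal NumPy sketch of this noisy nearest-neighbor voting; the number of neighbors, the noise scale, and the feature representation are assumptions of the sketch.

# Sketch of private kNN labeling: the client answers a query by noisy
# nearest-neighbor voting and returns the label with a (noisy) confidence.
# k and sigma are illustrative values.
import numpy as np

def private_knn_label(query_feat, client_feats, client_labels, num_classes,
                      k=100, sigma=25.0, rng=None):
    rng = rng or np.random.default_rng()
    # K nearest neighbors of the query inside the client dataset.
    dist = np.linalg.norm(client_feats - query_feat, axis=1)
    neighbor_labels = client_labels[np.argsort(dist)[:k]]
    votes = np.bincount(neighbor_labels, minlength=num_classes).astype(float)
    noisy = votes + rng.normal(scale=sigma, size=num_classes)  # Gaussian mechanism
    # The caller screens pseudo labels per class by the returned confidence.
    return int(noisy.argmax()), float(noisy.max())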
Given C classes, labeling is done by nearest-neighbor voting: f(x) = arg max_{c∈[C]} A(v_c), with v_c = |{(x′, y′) ∈ N_K(x) : y′ = c}|, where N_K(x) is the set of the K nearest neighbors of x in the client dataset and A is a privatization mechanism complying with the notion of ε-differential privacy (DP) [38]. In brief, privatization is done by adding Gaussian noise to a value with finite sensitivity. To filter out potentially wrong labels, the client only returns highly confident samples after screening. Let the confidence of a pseudo label be s(x) = max_{c∈[C]} A(v_c), which is also privatized by the Gaussian noise mechanism. We find that the original version of screening may suffer from a large imbalance of pseudo labels; therefore, per class, we screen the pseudo labels by selecting the top-60 most confident samples given a labeling budget of 600.

D.1.2 Improving client labeling

Because of the noise mechanism for privacy protection, the client labeling may be quite random if the selected samples are hard to discriminate. Thus, we propose to improve the discriminability of the selected samples in advance, when the ECOS selects samples for labeling. First, we estimate the discriminability by the confidence in the ECOS. The ECOS confidence is defined as the vote count of the highest-voted class minus that of the second-highest one, denoted v_r^conf. To maintain balancedness of the samples, we filter the clusters to keep the top-70% samples with the highest confidence per class. When decompressing the clusters on the cloud, we incorporate the confidence into the sampling score by v_r′ = ψ((v_r^conf + v_r)/2), where v_r is the original score.

D.1.3 Privacy Accountant for ECOS

To understand the privacy cost of ECOS, we review the techniques that are essential to establish the privacy bound.

Definition 8 (Differential Privacy [38]). Suppose ε and δ are two positive constants. A randomized algorithm M : 𝒳 → 𝒴 is (ε, δ)-DP if for every pair of neighboring datasets X, X′ ∈ 𝒳 and every measurable output set Y ⊆ 𝒴, the following inequality holds: Pr[M(X) ∈ Y] ≤ e^ε Pr[M(X′) ∈ Y] + δ, where Pr[·] denotes the probability of a given event.

DP quantifies the privacy risk (the difference between the two output probabilities given a pair of similar but different inputs) up to a failure probability δ. Though DP is a simple notion of risk, estimating a tight privacy bound is still challenging. For this reason, Rényi DP (RDP) was proposed as an alternative tool.

Definition 9 (Rényi Differential Privacy (RDP) [99]). A randomized algorithm M : 𝒳 → 𝒴 is (α, ε)-RDP with order α > 1 if for all neighboring datasets X, X′ the following holds: D_α(M(X) ‖ M(X′)) ≤ ε, where D_α(·‖·) is the Rényi divergence between two distributions.

RDP and DP can be connected by Lemma 5.

Lemma 5. If a mechanism M satisfies (α, ε)-RDP, then it also satisfies (ε + log(1/δ)/(α − 1), δ)-DP.

To reveal the potential privacy risks, we give a theoretical bound on the privacy cost based on DP in Lemma 6. The proof of Lemma 6 is similar to that of Theorem 8 in [168], without confidence screening.

Lemma 6. Suppose the subsampling rate γ and the noise magnitude σ of the ECOS are positive values such that γ ≤ min{0.1, σ√(log(1/δ))/6} and σ ≥ 2√5. The total privacy bound of the ECOS scoring m = |D̂_q| query samples with n = |D̂_p| private client samples is (ε, δ)-DP with δ > 0 and

    ε = O((γ/σ)√(log(1/δ))),    (D.1)

if v_r in Algorithm 5.1 is estimated using ⌈γn⌉ samples randomly subsampled from D̂_p with replacement and is processed by ṽ_r = v_r + N(0, σ²I).
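Before the proof, a small numeric sketch of where the bound in Eq. (D.1) comes from: the RDP guarantee of the subsampled Gaussian mechanism is converted to (ε, δ)-DP via Lemma 5 and evaluated at the order α chosen in the proof below; the parameter values are illustrative.

# Numeric sketch of the conversion behind Lemma 6: RDP bound
# eps(alpha) <= 24*gamma^2*alpha/sigma^2, converted to (eps, delta)-DP
# via Lemma 5 at the alpha chosen in the proof. Values are illustrative.
import math

def dp_epsilon(gamma=0.1, sigma=25.0, delta=1e-5):
    rdp = lambda a: 24 * gamma**2 * a / sigma**2              # RDP at order a
    to_dp = lambda a: rdp(a) + math.log(1 / delta) / (a - 1)  # Lemma 5
    # alpha = 1 + sqrt(sigma^2 * log(1/delta) / (24 * gamma^2)), as in the proof.
    alpha = 1 + math.sqrt(sigma**2 * math.log(1 / delta) / (24 * gamma**2))
    return to_dp(alpha)

# The result scales like (gamma / sigma) * sqrt(log(1 / delta)), matching Eq. (D.1).
print(dp_epsilon())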
Proof. When one sample is absent from the private client dataset, the scores [v_1, ..., v_R] differ by at most 2 without subsampling. By Lemma 11 of [168], the subsampled Gaussian mechanism accounted by RDP satisfies

    ε(α) ≤ 24γ²α/σ²

for all 0 < α ≤ σ² log(1/γ)/2, γ ≤ 0.1, and σ ≥ 2√5. By Lemma 5, we can convert the RDP inequality to standard (ε, δ)-DP as

    ε = 24γ²α/σ² + log(1/δ)/(α − 1).

Let α = 1 + √(σ² log(1/δ)/(24γ²)). Thus,

    ε = 24γ²/σ² + 4(γ/σ)√(6 log(1/δ)) ≤ 8(γ/σ)√(6 log(1/δ)),

where the last inequality is derived from the given range of γ. This completes the proof.

The above bound implies that the privacy cost of our method is invariant w.r.t. the size of the query dataset D̂_q and only depends on the DP parameters. Note that Lemma 6 is an asymptotic bound that requires some strict conditions on γ and other variables. In practice, we leverage an analytical privacy accountant based on a numerical method [144], with which we can relax these strict conditions.

D.1.4 Social Impacts and Limitations

The conflict between concerns about data privacy and the demand for intensive computation resources in machine learning constitutes the main challenge in training outsourcing. In this work, we devote our efforts to outsourcing without uploading data by leveraging authorized or public datasets. As the public datasets commonly available in many applications are collected from multiple data sources and thus tend to be distributed differently from the client data, using public data in place of the client dataset poses new challenges. Our method addresses this problem with an accountable privacy cost and low communication and computation complexity. Therefore, the proposed ECOS offers a promising way to mitigate the aforementioned conflict between the privacy and computation desiderata, and users from a broader spectrum can benefit from such a method to conduct cloud training confidentially.

We also recognize open questions about the proposed solution for future studies. For example, the public dataset may require additional data processing, e.g., aligning and cropping, for better prediction accuracy. In our empirical studies, we only consider tasks in the computer vision area; however, since no assumption is made on the data structure, we expect the principles can be adapted to other data types with minimal effort. More data types, including tabular and text data, will be considered in follow-up work.

D.1.5 Connection to Federated Learning

Both our method and federated learning (FL) [97] protect data privacy by not sharing data with the cloud. The key difference between FL and the problem we consider (training outsourcing) is that FL requires clients to perform training, while ECOS outsources the training to the cloud server. Since ECOS does not require local training, it can be plugged ad hoc into FL to obtain an auxiliary open-source dataset for enhancing federated training. ECOS can be used either before federated training (when a pre-trained model is required) or during federated training (when the pre-trained model can be replaced by the model being trained).

D.2 Experimental Details and More Experiments

Complementary to the main content, we provide the details of the experiment configurations to facilitate reproducibility. We also conduct more qualitative experiments to understand the efficiency and effectiveness of the proposed method.
D.2.1 Experimental Details

We organize the case-specific configurations into three cases and first discuss the general setup. For Digits, we train the model for 150 epochs. We adopt a convolutional neural network for Digits (Table D.1) and ResNet18 for DomainNet. To solve the learning problems, including FixMatch and distillation-based compression, we use stochastic gradient descent with momentum 0.9 and weight decay 5 × 10⁻⁴. We use s = 5 for the scale function ψ_s on DomainNet and s = 1 on Digits. When not specified, we noise the ECOS query with a magnitude of 25. In Cases 1 and 2, we reduce the noise magnitude to 10 for DomainNet, since these two cases can afford a higher privacy cost in exchange for higher accuracy.

Case 1: Selective manual labeling. We use an off-the-shelf ResNet18 pre-trained on ImageNet, which is widely accessible online. We adopt FixMatch for semi-supervised learning with a coefficient of 0.1 on the pseudo-label loss, a moving-average factor of 0.9, and a batch size of 64 for DomainNet and 128 for Digits. To avoid feature distortion, we warm up the fine-tuning by freezing all layers except the last linear layer with a learning rate of 0.01. After 30 epochs, we fine-tune the model end to end until epoch 80 to avoid overfitting biased data distributions.

Case 3: Adaptive model compression. We first pre-train a ResNet50 using all labeled open-source data for 100 epochs with a cosine-annealed learning rate starting from 0.1. The same warm-up strategy as in Case 1 is used here. To extract the knowledge from the ResNet50, we combine the knowledge-distillation (KD) loss L_KD and the cross-entropy loss L_CE as 0.1 × L_KD + 0.9 × L_CE and compute the losses on the selected samples only. The temperature of the KD loss is set to 10.

Case 2: Automated client labeling. For the cloud training, we adopt the same configuration as in selective manual labeling (Case 1). For private kNN, we let the client release 600 labels with the class-wise confidence screening described in the last section. We noise the label votes with a magnitude of 25 and the confidence with a magnitude of 75. For both datasets, we subsample 80% of the client data per labeling query to reduce the privacy cost.

Table D.1: The structure of the convolutional neural network for the Digits dataset.
  Layer name   PyTorch pseudo code
  conv1        Conv2d(1, 64, kernel_size=(5, 5), stride=(1, 1))
  bn1          BatchNorm2d(64, eps=1e-05, momentum=0.1)
  conv1_drop   Dropout2d(p=0.5, inplace=False)
  conv2        Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
  bn2          BatchNorm2d(128, eps=1e-05, momentum=0.1)
  conv2_drop   Dropout2d(p=0.5, inplace=False)
  fc1          Linear(in_features=2048, out_features=384, bias=True)
  fc2          Linear(in_features=384, out_features=192, bias=True)
  fc3          Linear(in_features=192, out_features=11, bias=True)

We conduct our experiments on the Amazon Web Services platform with 4 Tesla T4 GPUs (16GB memory each) and a 48-thread Intel CPU. All code is implemented with PyTorch 1.11. To account for the privacy cost, we utilize the open-source autodp package, following private kNN.
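As a concrete reading of the Case 3 objective above, the sketch below combines the KD and cross-entropy losses as 0.1 × L_KD + 0.9 × L_CE with temperature 10; the function and tensor names, as well as the T² scaling convention, are assumptions of the sketch.

# Sketch of the Case 3 compression loss: 0.1 * L_KD + 0.9 * L_CE with
# distillation temperature T = 10. Names and the T^2 scaling are illustrative.
import torch.nn.functional as F

def compression_loss(student_logits, teacher_logits, targets,
                     T=10.0, kd_weight=0.1, ce_weight=0.9):
    # KL divergence between temperature-softened distributions; the T^2 factor
    # is a common convention that keeps gradient magnitudes comparable.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return kd_weight * kd + ce_weight * ce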
D.2.2 Effect of Parameters

To better understand the proposed method, we study the effect of the important hyper-parameters. To this end, we consider the selective labeling task with Digits, keeping 50% of the SVHN dataset at the client end. Both the ID+OoD and OoD cases are evaluated to reveal the method's effectiveness under conditions of varying difficulty. Also, we study how the score scale s affects the ID ratio in the selected set (denoted as the ID TPR) and the number of effective samples. We only examine the ID TPR corresponding to the proximity objective in Eq. (5.2) when known ID samples are present on the cloud, namely in the ID+OoD case. In the middle panes of Figs. D.1 and D.2, we show that our method can effectively improve the ID TPR relative to the inherent ratio of ID samples on the cloud.

Effect of the compression size R. In Fig. D.1, we evaluate R in terms of the test accuracy.

Figure D.1: Vary the compression size R and evaluate the test accuracy, the ID ratio (%) in selected samples, and the number of effectively selected samples. The red horizontal line indicates the ratio of ID samples in the whole cloud dataset.

When the budget is small (the 1k and 2k budgets in the ID+OoD case), it is essential for the cloud server to sense the client distribution with higher accuracy via more queries. Therefore, a larger R is desired, which can increase the portion of ID samples in the selected set, as shown in the middle pane of Fig. D.1. Since a larger R also brings a higher communication burden, a value of 100 offers a fair trade-off with accuracy, at which point the ID TPR reaches its peak. Given a larger budget, e.g., 8000, increasing R may lower the ID TPR. We attribute this decline to the limited size of the client dataset and the privacy constraints: with more clusters (i.e., larger R), the expected number of votes (proportional to the score) per cluster is reduced and is badly blurred by the DP noise. Thus, the ID TPR decreases accordingly, although the test accuracy is not significantly affected. For the OoD case, which is relatively harder for sampling due to the lack of true ID data, the parameter sensitivity is weaker, though a compression size of 100 is still a fair choice, bringing, for example, 1–2% gains over the worst cases at the 5k and 8k budgets.

Effect of the score scale s. The score scale s decides the sensitivity of the sampling in the sense of proximity: a larger s means that the ECOS prioritizes proximity more during sampling. In Fig. D.2, we present the ablation study of s. A larger s is preferred when the budget becomes limited because it effectively increases the ID TPR. However, an overly large s can negatively affect the accuracy, especially for the OoD case.

Figure D.2: Vary the score scale s in terms of test accuracy, the ID ratio (%) in selected samples, and the number of effectively selected samples (which can be smaller than the budget). The red horizontal line indicates the ratio of ID samples in the whole cloud dataset.

The reason for the negative impact of s at a large budget can be understood by probing the number of effective samples.
For budgets larger than 2000, the number of effectively selected samples drops when the scores are heavily scaled (e.g., s ≥ 3), because the ECOS then concentrates its selection into very few clusters and strictly eliminates the rest.

D.2.3 Evaluation of Sample and Privacy Efficiency

In Fig. D.3, we compare the sample efficiency on the selective labeling task with Digits, keeping 50% of the SVHN dataset at the client end. We obtain the upper-bound accuracy in the ideal case via random sampling when the cloud dataset is identically distributed (ID) with the client dataset. When OoD data are included in the open-source cloud dataset (ID+OoD), the training demands more labeled samples. If no ID samples are present in the cloud set (OoD), the accuracy decreases quickly for the same number of labeled samples. In comparison, informative sampling by K-Center slightly improves the accuracy across budgets, while the proposed ECOS significantly improves the sample efficiency. With ECOS, 8 × 10³ labeled samples in the ID+OoD case achieve accuracy comparable to the ideal case, while the baselines leave large gaps. In both the ID+OoD and OoD cases, our method at the 4 × 10³ budget yields accuracy competitive with the best baseline results at the 1 × 10⁴ budget, i.e., using only half of the labeled data, dramatically cutting down the cost of manual labeling.

Figure D.3: Evaluation of the sample efficiency on selective labeling. The green horizontal line indicates the ideal case when all ID cloud data are labeled.

Alongside the gains in sample efficiency, readers may notice that our method incurs an additional privacy cost compared to the baselines. We point out that this cost is constant w.r.t. the sampling budget and is as negligible as (0.22, 10⁻⁵)-DP. It is worth noting that the cost is independent of the hyper-parameters of the ECOS, because the ECOS communication involves a single query per private sample (and thus per private dataset), even if we increase the size of the query set (i.e., the compression size R). In practice, the client can flexibly control the privacy risk (namely, the privacy cost) by adjusting the noise magnitude and the subsampling rate.

D.2.4 Evaluation of Communication and Computation Efficiency

While improving the sample efficiency, we also need to account for the communication and computation overheads introduced by the ECOS. We examine both kinds of efficiency under the same experimental configuration as in the last section.

Computation efficiency. In Fig. D.4, we compare the computation efficiency of our method to local training (LT). We use multiplication-and-addition counts (MACs) as the metric of computation (time) complexity, which is hardware-agnostic and therefore preferred here. For a fair comparison, we tune the learning rate of LT in {0.1, 0.01, 0.001} with cosine annealing during training and its number of epochs in {20, 50, 100} to achieve a fair trade-off between computation cost and test accuracy. For ECOS, since the computation cost increases linearly with the compression size (as shown in Fig.
D.5), we vary the compression size to check the performance as the computation cost increases.

Figure D.4 (panels: (a) Digits, (b) DomainNet): With the 5000 budget, we evaluate the computation efficiency. The efficiency of LT on DomainNet is enhanced by linear fine-tuning.

Figure D.5: The computation cost grows linearly with the compression size on the Digits dataset.

On Digits, we observe large computation savings from our method, even though its cost gradually increases with the size R of the compressed query set. Similar experiments are also run on the larger images of the DomainNet dataset (ID+OoD case), where the cost of feature extraction increases steeply due to the use of a deep network (ResNet18). Recently, the most popular strategy for cloud training has been two-phase learning: pre-training a model on the cloud using ImageNet and fine-tuning the linear classifier head on the client. Considering the large cost of feature extraction, we let the client pre-extract features only once; thus, the local training is as efficient as training a linear layer on the extracted features. In Fig. D.4, our method greatly outperforms local training while using far fewer MACs for data matching. Because all training is outsourced to the cloud, our method enables end-to-end fine-tuning of the model, resulting in better test accuracy. Even when LT trains longer at higher computation cost, the test accuracy of ECOS with the fewest MACs is comparable to the best performance of LT at around 10⁹ MACs, where ECOS uses only 10% of the MACs of LT.

Figure D.6: Evaluation of the communication efficiency.

Communication efficiency. We also compare the communication efficiency with full cloud training (uploading the whole client dataset) and fully client-side training (downloading the cloud dataset) in Fig. D.6. For the ECOS, we set the size of the query set to 100, which is the default configuration in our experiments. Because the ECOS only communicates a few low-dimensional features (for example, 512-dimensional ResNet-extracted features for DomainNet and 72-dimensional HOG features for Digits), it costs far fewer bytes than traditional outsourcing by uploading the client data. For completeness, we also present the cost of downloading the cloud data, which is far more expensive than the other two approaches.