EFFICIENT TRANSFER LEARNING FOR HETEROGENEOUS MACHINE LEARNING DOMAINS By Zhuangdi Zhu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2022 ABSTRACT EFFICIENT TRANSFER LEARNING FOR HETEROGENEOUS MACHINE LEARNING DOMAINS By Zhuangdi Zhu Recent advances in deep machine learning hinge on a large amount of labeled data. Such heavy dependence on supervision data impedes the broader application of deep learning in more practical scenarios, where data annotation and labeling can be expensive (e.g. high- frequency trading) or even dangerous (e.g. training autonomous-driving models.) Transfer Learning (TL), equivalently referred to as knowledge transfer, is an effective strategy to confront such challenges. TL, by its definition, distills the external knowledge from relevant domains into the target learning domain, hence requiring fewer supervision resources than learning-from-scratch. TL is beneficial for learning tasks for which the supervision data is limited or even unavailable. It is also an essential property to realize Generalized Artificial Intelligence. In this thesis, we propose sample-efficient TL approaches using limited, sometimes unre- liable resources. We take a deep look into the setting of Reinforcement Learning (RL) and Supervised Learning, and derive solutions for the two domains respectively. Especially, for RL, we focus on a problem setting called imitation learning, where the supervision from the environment is either non-available or scarcely provided, and the learning agent must transfer knowledge from exterior resources, such as demonstration examples of a previously trained expert, to learn a good policy. For supervised learning, we consider a distributed machine learning scheme called Federated Learning (FL), which is a more challenging scenario than traditional machine learning, since the training data is distributed and non-sharable during the learning process. Under this distributed setting, it is imperative to enable TL among distributed learning clients to reach a satisfiable generalization performance. We prove by both theoretical support and extensive experiments that our proposed algorithms can fa- cilitate the machine learning process with knowledge transfer to achieve higher asymptotic performance, in a principled and more efficient manner than the prior arts. Copyright by ZHUANGDI ZHU 2022 This thesis is dedicated to my parents, Yi Zhang and Jun Zhu, as well as my fiancé Tianxudong Tang. v ACKNOWLEDGEMENTS This piece of writing marks the end of a marvelous journey and the beginning of a new one, none of which would be possible without the support of my advisor, colleagues, friends, and loved ones. First and foremost, I would like to express my deep gratitude to my research advisor, Dr. Jiayu Zhou. I could not have undertaken this journey without his guidance and help. I thank him for his vision, support, his kind advice on research and beyond. I am also grateful to my other committee members: Dr. Anil K. Jain, Dr. Panning Tan, Dr. Sijia Liu, and Dr. Zhaojian Li, for providing their expertise and informative feedback. It is my great pleasure to have worked with members of our Illidan Lab, especially Kaixiang Lin, Junyuan Hong, and Mengying Sun. I thank them for their selfless help in improving my skills in research. I would like to extend my sincere thanks to Dr. Philip Quinn for providing me with patient mentoring and warm friendship. 
Thanks should also go to Dr. Bo Dai, a superb researcher who has been influential to my research. They have all encouraged me for being a qualified PhD candidate. I would like to shout out to my fiancé, Tang, a kind and loving man who constantly gives me emotional support. He loves me for who I really am and has inspired me to discover the better part of myself. I also thank my friends, especially Xin H, Michelle, Mengmeng, Mimi, Han M, Junyuan H, He L, Yan X, Chuan-Pin C, and Fei Z. They are sources of joyful distractions from my research and work, and together we have shared many treasured memories. Thanks to them, I began to understand that slowly is the fastest way to complete a journey. I am deeply indebted to my parents who give me a lifetime of love and care. Their belief in me has spiritually supported me during the ups and downs of this PhD journey. They love me deeply by not holding the reins but setting me off on the road to thrive. Special thanks to my younger sister HaoHao, who has grown up so quickly with a good sense of vi responsibility. Thanks to her for accompanying our parents when I am not home. I am fortunate to have her as my sibling. Lastly, I would like to thank my grandfather. His kindness and integrity have influenced me deeply. Hope that I have made him proud. Dear grandpa, may your good soul rest in peace. I miss you so much, and I will always, always love you. vii TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Background and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Reinforcement Learning (RL) . . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Federated Learning (FL) . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.3 Generative Adversarial Learning . . . . . . . . . . . . . . . . . . . . . 7 1.3.4 Notations and Terminologies . . . . . . . . . . . . . . . . . . . . . . . 7 CHAPTER 2 PROBLEM OVERVIEW: TRANSFER LEARNING . . . . . . . . . 10 2.1 Defitinion of Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Transfer Learning in the Context of Reinforcement Learning . . . . . 10 2.1.2 Transfer Learning in the Context of Supervised Learning . . . . . . . 11 2.2 A Glance of Prior Arts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 CHAPTER 3 KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING FROM SUBOPTIMAL DEMONSTRATIONS . . . . . . . . . . . . . 14 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.1 Exploration-Driven Objective . . . . . . . . . . . . . . . . . . . . . . 18 3.4.2 Adaptive Learning Target . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4.3 Off-Policy Adversarial TD Learning . . . . . . . . . . . . . . . . . . . 21 3.4.4 Self-Adaptive Imitation Learning . . . . . . . . . . 
. . . . . . . . . . 22 3.4.4.1 Reasoning of sampling from a mixture of distributions: . . . 24 3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.2 Performance on Continuous Action-Space Tasks . . . . . . . . . . . . 27 3.6.3 Effects of Off-Policy Exploration in SAIL . . . . . . . . . . . . . . . 28 3.6.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 CHAPTER 4 OFF-POLICY LEARNING FROM OBSERVATIONS: KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING FROM INCOMPLETE SUPERVISION . . . . . . . . . . . . . . . . 32 viii 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 OPOLO: Off-Policy Learning from Observations . . . . . . . . . . . . . . . . 36 4.3.1 Surrogate Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.3.2 Off-Policy Transformation . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3.3 Adversarial Training with Off-Policy Experience . . . . . . . . . . . . 38 4.3.4 Policy Regularization as Forward Distribution Matching . . . . . . . 39 4.3.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.5.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.2 Sample Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.5.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.5.4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 CHAPTER 5 DATA FREE KNOWLEDGE TRANSFER FOR HETEROGENEOUS FEDERATED LEARNING . . . . . . . . . . . . . . . . . . . . . . . 49 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Notations and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.3 FeDGen: Data Free Federated Distillation via Generative Learning . . . . . 53 5.3.1 Knowledge Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.3.2 Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.3.3 Extensions for Flexible Parameter Sharing . . . . . . . . . . . . . . . 55 5.4 FeDGen Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.4.1 Knowledge Distillation for Inductive Bias . . . . . . . . . . . . . . . . 57 5.4.2 Knowledge Distillation for Distribution Matching . . . . . . . . . . . 57 5.4.3 Knowledge Distillation for Improved Generalization . . . . . . . . . . 58 5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.6.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.6.2 Performance Overview: . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.6.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
65 5.6.4 Extensions to Flexible Parameter Sharing . . . . . . . . . . . . . . . 67 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 CHAPTER 6 FEDRESCUE: SELF-KNOWLEDGE DISTILLATION FOR RESILIENT AND COMMUNICATION EFFICIENT FEDERATED LEARNING . 69 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.2.1 Prelimnaries of Federated Learning . . . . . . . . . . . . . . . . . . . 71 6.2.2 Learning with System Heterogeneity . . . . . . . . . . . . . . . . . . 71 6.2.3 Learning with Unstable Network Connection . . . . . . . . . . . . . . 72 6.3 Resilient and Communication Efficient FL . . . . . . . . . . . . . . . . . . . 72 6.3.1 Learning Self-Distilled Local Models . . . . . . . . . . . . . . . . . . 73 ix 6.3.2 Effective Optimization via Progressive Learning . . . . . . . . . . . . 75 6.3.3 Proposed Federated Algorithm: FedResCuE . . . . . . . . . . . . . . 77 6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.5.2 Performance Under System Heterogeneity . . . . . . . . . . . . . . . 82 6.5.3 Performance Under Unstable Connections . . . . . . . . . . . . . . . 83 6.5.4 Performance Given Insufficient Training Data . . . . . . . . . . . . . 84 6.5.5 Evaluation of Communication Efficiency . . . . . . . . . . . . . . . . 85 6.5.6 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.5.6.1 Effects of Knowledge Distillation . . . . . . . . . . . . . . . 86 6.5.6.2 Impacts of Submodel Sampling . . . . . . . . . . . . . . . . 86 6.5.6.3 Effects of Progressive Learning . . . . . . . . . . . . . . . . 87 6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 CHAPTER 7 OVERVIEW AND OPEN QUESTIONS . . . . . . . . . . . . . . . . 89 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 APPENDIX A APPENDIX FOR OPOLO . . . . . . . . . . . . . . . . . . 93 APPENDIX B APPENDIX FOR FEDGEN . . . . . . . . . . . . . . . . . 106 APPENDIX C APPENDIX FOR FEDRESCUE . . . . . . . . . . . . . . . . 118 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 x LIST OF TABLES Table 1.1: Overview of Abbreviations. . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Table 1.2: An overview of mathematical notations. A notation can be applicable to either the context Supervised Learning (SL) or Reinforcement Learning (RL). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Table 3.1: Off-policy exploration (SAIL) achieves higher performance than other exploration approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Table 3.2: Using 106 interaction samples, performance of SAIL is robust regardless of the quality of sub-optimal teacher demonstrations. . . . . . . . . . . . . 29 Table 4.1: Summarization on different stationary distributions, with µπt (s) = p(st = s|s0 ∼ p0 (·), ai ∼ π(·|si ), si+1 ∼ P (·|si , ai )), ∀i < t). . . . . . . . . . . . . 36 Table 4.2: Evaluated performance of off-policy approaches. Results are averaged over 50 trajectories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Table 5.1: Performance overview. 
For Mnist and EMnist, a smaller α indicates higher heterogeneity. For CelebA, r denotes the ratio between active users and total users. T denotes the number of local training steps. . . . . 61 Table 5.2: Effects of the generator’s network structure, using EMnist dataset with α = 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Table 5.3: Effects of the number of synthetic samples, using EMnist dataset with α = 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Table 5.4: Performance overview on the Mnist dataset, by only sharing the last prediction layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Table 6.1: Progressive vs. overwriting learning. . . . . . . . . . . . . . . . . . . . . 77 Table 6.2: We report best performance from different S for applicable approaches (See Section 6.5.6.2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Table 6.3: Performance under faulty connections, given 100% of training data and 0.1 ≤ er ≤ 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Table 6.4: FedResCuE is the most robust algorithm given heterogeneous data and domain-dependent connection error. . . . . . . . . . . . . . . . . . . . . . 83 xi Table 6.5: FedResCuE requires notably fewer communication rounds to reach the predefined accuracy (Acc). . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Table 7.1: An overview comparison of the proposed TL approaches. . . . . . . . . . 90 Table A.1: A deterministic but non-injective MDP. . . . . . . . . . . . . . . . . . . . 96 Table A.2: Learning Policy π. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Table A.3: Expert Policy πE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Table A.4: Hyper-parameters for Different Algorithms. . . . . . . . . . . . . . . . . . 104 Table B.1: Accuracy (%) before and after KD. . . . . . . . . . . . . . . . . . . . . . 113 Table B.2: Network architecture for the generator Gw . . . . . . . . . . . . . . . . . . 114 Table B.3: Network architecture for the classification model. . . . . . . . . . . . . . 115 Table B.4: Performance overview under different data heterogeneity settings. For Mnist and EMnist, user data follows the Dirichlet distribution with hyperparameter α, with a smaller α indicating higher heterogeneity. For CelebA, r denotes the ratio between active users and total users. T denotes the local training steps (communication delay). All the above experiments use batch size B=32. . . . . . . . . . . . . . . . . . . . . . . 117 Table C.1: Progressive vs. overwriting learning. . . . . . . . . . . . . . . . . . . . . 119 Table C.2: The overwriting learning approach leads to severe forgetting on the pre- viously learned knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . 119 Table C.3: We adopted FedAvg ∗ as the baseline implementation in the main paper. . 121 Table C.4: Communication Efficiency Overview, where ‘-’ indicates that predefined performance is not reached before training ends. . . . . . . . . . . . . . . 122 Table C.5: FL Performance with i.i.d. user statistical distributions. . . . . . . . . . 123 Table C.6: Performance under faulty connections, given 100% of training data. . . . 124 Table C.7: FedResCuE is the most robust algorithm given heterogeneous data and domain-dependent connection errors. . . . . . . . . . . . . . . . . . . . . 124 Table C.8: Performance with and without knowledge-distillation. . . . . . . . . . . . 
127 xii Table C.9: ResNet Architecture for Learning CelebA, omitting BatchNorm and ReLU layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Table C.10: Model Architecture for Learning DigitsFive. . . . . . . . . . . . . . . . 133 Table C.11: Configurations of Hyper-parameters. . . . . . . . . . . . . . . . . . . . . 135 xiii LIST OF FIGURES Figure 1.1: Overview of RL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 1.2: An absorbing state (sT −1 ) denotes the termination of a task in an infinite-horizon MDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 2.1: TL approaches in RL organized by the format of transferred knowledge. . 11 Figure 3.1: Illustraion of SAIL: Navigations in red arrows follow the exploration driven IL objective, which approaches to teacher’s density distribution while deviating from previous learned ones. It explores more efficiently to reach expertise, compared with random explorations (green arrows). . 16 Figure 3.2: Learning curves of SAIL and other previous work using one suboptimal demonstration trajectory. . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.3: Comparing SAIL with other on-policy baselines using one suboptimal demonstration trajectory. . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.4: Ablation study by removing different algorithmic components from SAIL. Only one teacher trajectory is used as demonstration. . . . . . . . . . . . 29 Figure 4.1: Interaction steps versus learning performance. Compared with others, our proposed approach (OPOLO) is the most sample-efficient to reach expert-level performance (Grey horizontal line). . . . . . . . . . . . . . . 45 Figure 4.2: Compared with strong off-policy baselines, OPOLO is the only one that consistently achieves competitive performance across all tasks, without accessing expert actions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Figure 4.3: Removing the inverse action regularization (OPOLO-x ) results in slight efficiency drop, although its performance is still comparable to DAC and OPOLO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Figure 4.4: Performance given different f -functions. . . . . . . . . . . . . . . . . . . 47 Figure 5.1: Overview of FeDGen: a generator Gw (·|y) is learned by the server to aggregate information from different local clients without observing their data. The generator is then sent to local users, whose knowledge is distilled to user models to adjust their interpretations of a good feature distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 xiv Figure 5.2: After KD, accuracy has improved from 81.2% to 98.4% for one user ( Fig 5.2a - Fig 5.2b), while a global model obtained by parameter-averaging (without KD) has 93.2% accuracy (Fig 5.2c), compared with an oracle model with 98.6% accuracy (Fig 5.2d). . . . . . . . . . . . . . . . . . . . 56 Figure 5.3: Samples from the generator gradually approaches to real distribution, where each user model (teacher) sees limited, disjoint local data. The background color indicates oracle decision boundaries learned over the global data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 5.4: Visualization of statistical heterogeneity among users on Mnist dataset, where the x-axis indicates user IDs, the y-axis indicates class labels, and the size of scattered points indicates the number of training samples for a label available to that user. . . . 
. . . . . . . . . . . . . . . . . . . . . 61 Figure 5.5: Visualized performance w.r.t data heterogeneity. . . . . . . . . . . . . . . 62 Figure 5.6: Selected learning curves, averaged over 3 random seeds. . . . . . . . . . . 62 Figure 5.7: Effects of synthetic samples. . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 5.8: Learning curves on Mnist with limited parameter sharing. . . . . . . . . 66 Figure 6.1: Model parameters are divided and learned as columns, which are then transmitted sequentially between the server and clients, until than an interruption occurs to one column, or when all columns are transmitted successfully. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 6.2: A self-distilled network is learned via progressively updating columns of parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 6.3: Evaluation curves for the ×1.0 model, with 0.1 ≤ er ≤ 0.2 (left) and 20% training data (right). . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Figure 6.4: KD in FedResCuE benefits smaller submodels. . . . . . . . . . . . . . . . . . 87 Figure 6.5: Impacts of sampling frequency S. . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 6.6: Effects of progressive learning. . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure A.1: Transition of an non-injective MDP. . . . . . . . . . . . . . . . . . . . . . 97 Figure B.1: Knowledge distillation process for the prototype experiment. . . . . . . . 114 xv Figure B.2: Performance curves on Mnist dataset, where a smaller α denotes larger data heterogeneity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Figure B.3: Performance curves on CelebA dataset. . . . . . . . . . . . . . . . . . 116 Figure B.4: Performance curves on EMnist dataset, under different kinds of data heterogeneity and communication frequencies. . . . . . . . . . . . . . . . 116 Figure C.1: Illustration of progressive learning. . . . . . . . . . . . . . . . . . . . . . 119 Figure C.2: Illustration of overwriting learning. . . . . . . . . . . . . . . . . . . . . 119 Figure C.3: Impacts of the submodel sampling frequency. . . . . . . . . . . . . . . . 127 Figure C.4: Compared with FedSeq, knowledge-distillation strategy in FedResCuE can benefit smaller submodels. . . . . . . . . . . . . . . . . . . . . . . . . 128 Figure C.5: Compared to FedRush, FedResCuE maintains a high performance for the ×1.0 model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Figure C.6: 100% training data, CelebA, uniform architecture, er = 0. . . . . . . . 130 Figure C.7: 100% training data, CelebA, cluster architecture, er = 0. . . . . . . . . 130 Figure C.8: 20% training data, CelebA, uniform architecture, er = 0. . . . . . . . . 131 Figure C.9: 20% training data, CelebA, cluster architecture, er = 0. . . . . . . . . . 131 Figure C.10: 100% training data, CelebA, uniform architecture, 0.1 ≤ er ≤ 0.2. . . . 132 Figure C.11: 100% training data, CelebA, cluster architecture, 0.1 ≤ er ≤ 0.2. . . . . 132 Figure C.12: 100% training data, CelebA, uniform architecture, 0.2 ≤ er ≤ 0.3. . . . 134 Figure C.13: 100% training data, CelebA, cluster architecture, 0.2 ≤ er ≤ 0.3. . . . . 134 xvi LIST OF ALGORITHMS Algorithm 3.1: Self-Adaptive Imitation Learning . . . . . . . . . . . . . . . . . . . . 23 Algorithm 4.1: Off-Policy Learning from Observations . . . . . . . . . . . . . . . . . 41 Algorithm 5.1: Data Free Federated Distillation via Generalized Learning . . . . . . 
53 Algorithm 6.1: Progressive Self-Distillation . . . . . . . . . . . . . . . . . . . . . . . 76 Algorithm 6.2: Resilient and Communication-Efficient Federated Learning . . . . . . 78 Algorithm B.1: FeDGen with Partial Parameter Sharing . . . . . . . . . . . . . . . 115 Algorithm C.1: Slimmable-Training ([196]) . . . . . . . . . . . . . . . . . . . . . 120 Algorithm C.2: Local Model Update for FjORD([69]) . . . . . . . . . . . . . . . . . . 121 Algorithm C.3: Local Model Update for FedSeq . . . . . . . . . . . . . . . . . . . . . 127 Algorithm C.4: Local Model Update for FedRush . . . . . . . . . . . . . . . . . . . . 128 xvii CHAPTER 1 INTRODUCTION 1.1 Motivation Recent advances in deep machine learning hinges on a large amount of labeled data. Such heavy dependence on supervision data impedes the broader application of deep learning in more practical scenarios, where data annotation and labeling can be expensive (e.g. high- frequency trading) or even dangerous (e.g. training autonomous-driving models.) Transfer Learning (TL), which is equivalently referred to as knowledge transfer , serves as an effective strategy to confront such challenges. TL, by its definition, distills the external knowledge from relevant domains into the target learning domain, hence requiring fewer supervision resources than learning-from-scratch. It is beneficial for learning tasks for which the super- vision data is limited or even unavailable. For example, assisted with knowledge transfer, an auto-driving model trained in a simulation system can be readily deployed to the real world. The performance of such models might be otherwise mitigated due to the distribution drift from simulation to reality. In fact, the ability to transfer knowledge is essential to realize Generalized Artificial Intel- ligence. Learning from external expertise has been widely observed among humankind. For instance, toddlers learn to wobble around from following their parents [92], and a beginner player in video gaming may grow into an expert by watching footage of senior players. Such inherent ability, however, can be challenging for AI models to acquire. An AI model that is well trained on a source domain may perform poorly in a target domain of interest, given only a slight change in the data distribution [92]. Therefore, enabling knowledge transfer across different learning scenarios remains an intriguing research topic in machine learning. Observing the merits of transfer learning, we tackle knowledge transfer in two major machine learning settings: Reinforcement Learning (RL) and Supervised Learning. Pio- 1 neer efforts in knowledge transfer have been made for both problem settings. For example, transferring knowledge from external demonstrations to perform imitation learning is shown effective to realize plenty of RL tasks, including robot navigation [90, 85] and game playing [107, 156]. On the other hand, learning domain-transferrable feature representations has enabled domain adaptation in various supervised learning tasks, such as computer vision [179, 120] and natural language processing [37]. Along with their promising results, we notice some non-negligible limitations of prior efforts in transfer learning, mainly in fwo-folds: First, most transfer learning approaches hinges on sufficient data from the source domain or a relevant domain. 
For instance, when transferring the knowledge from a teacher model to the student model, prior approaches usually require a large set of data from the source domain, or from a domain with resemblant data distribution. Otherwise, the process of TL is non-feasible due to the lack of such a proxy dataset. Similarly, in the domain of RL, previous TL approaches usually require a large bunch of teacher demonstrations [151], or frequent feedbacks from the teacher policy [145], in order to provide enough guidance for the learning agent. Second, the learning efficiency (in the context of supervised learning), or sampling efficiency (in the context of RL), can be further improved for existing TL approaches. Low sampling efficiency can be a crucial problem for RL, which means a learning agent may take a prohibitively long time, exploring trajectories randomly with low returns. Especially, existing TL approaches in RL may still require on-policy learning, which is a costly sampling strategy. Analogously, in supervised learning, low learning efficiency could lead to more training iterations before convergence, which are computationally costly, or even incur extra communication burdens to distributed machine learning applications, such as Federated Learning [116]. Observing the potential limitations of prior arts, in this thesis, we propose sample-efficient TL approaches using limited, sometimes unreliable resources in a principled manner. We take a deep look into the setting of RL and Supervised Learning, and derive solutions for the two domains respectively. Especially, for RL, we focus on a problem setting called imitation 2 learning, where the supervision from the environment is either non-available or scarcely provided, and the learning agent must transfer knowledge from exterior resources, such as demonstration examples of a previously trained expert, to learn a good policy. For supervised learning, we consider a distributed machine learning scheme called Federated Learning (FL) [116], which is a more challenging scenario than traditional machine learning, since the training data is distributed and non-sharable during the learning process. Under this distributed setting, it is therefore imperative to enable TL among distributed learning clients to reach a satisfiable generalization performance. 1.2 Overview of Thesis Structure This section summarizes each of the chapters in this thesis. In Section 1.3, we present the preliminaries about two different problem settings, i.e. RL and supervised federated learning, under which we proposed our transfer learning ap- proaches. We also introduce the principle of generative adversarial learning which provides indispensable foundations of our work. In Section 2.1, we clarify the definition of TL in the related machine learning scenarios, then list the terminologies that will be frequently used through this thesis. We also briefly review the recent advances of TL in both RL and supervised learning domains, and point out the distinction between our proposed work and the prior arts before introducing our techniques. Chapter 3 elaborates the approach SAIL , which is an imitation learning approach in a RL problem setting. It realizes knowledge transfer from limited, suboptimal demonstrations to assist the agent learning to achieve expert-level performance with high sample efficiency. In Chapter 4, we present the work of OPOLO , which extends the idea of TL from demon- stration to a more challenging yet practical setting in RL, i.e. 
to transfer knowledge from incomplete observations of expert behavior. In Chapter 5, we shift our focus from RL to an intriguing supervised learning scenario and introduce our work dubbed FedGen, which enables knowledge transfer in FL without the need to access training data from the knowledge source. In Chapter 6, we continue the TL study in the context of FL and introduce FedResCuE, which enables self-knowledge distillation in FL to tackle system heterogeneity and unstable network connections. During FL, knowledge is transferred and preserved in arbitrarily prunable sub-networks of the global model. In Chapter 7, we summarize the connections among our works by looking into the following key questions: 1) under which problem setting the proposed approach is applicable, 2) what are the teacher and student in the TL process, 3) what kind of knowledge has been transferred, and 4) what can be achieved once TL is complete. Next, we explore the potential challenges faced by our TL approaches, point out the open questions that await future research progress, and briefly discuss our ongoing and future work.

1.3 Background and Preliminaries

In this section, we introduce the basic concepts in RL, (supervised) FL, and adversarial generative learning, which compose the cornerstones of our transfer learning research.

1.3.1 Reinforcement Learning (RL)

A typical RL problem can be considered as training an agent to interact with an environment that follows a Markov Decision Process (MDP) [15]. In an MDP, the agent starts from an initial state and performs an action accordingly, which yields a reward to guide the agent's actions. Once the action is taken, the MDP transits to the next state by following its underlying transition dynamics. The agent accumulates the time-discounted rewards along with its interactions with the MDP. A subsequence of interactions is referred to as an episode. For MDPs with infinite horizons, one can assume that there are absorbing states, such that any action taken upon an absorbing state will only lead to itself and yield zero reward. All above-mentioned components of the MDP can be represented using a tuple M = (µ_0, S, A, P, γ, R, S_0), in which:

• µ_0 is the initial state distribution.
• S is the state space.
• A is the action space.
• P : S × A × S → R is the transition probability distribution, where P(s′|s, a) specifies the probability of the state transitioning to s′ upon taking action a from state s.
• R : S × A × S → R is the reward distribution, where R(s, a, s′) is the reward an agent gets by taking action a from state s with the next state being s′.
• γ is a discount factor, with γ ∈ (0, 1].
• S_0 is the set of absorbing states.

Figure 1.1 illustrates the above components, and Figure 1.2 is an example of trajectories in an infinite-horizon MDP. An RL agent behaves in M by following its policy π, which is a mapping from states to actions: π : S → A. For stochastic policies, π(a|s) denotes the probability for the agent to take action a from state s. Given an MDP M and a policy π, one can derive a value function V^π_M(s), which is defined over the state space:

V^π_M(s) = E[ r_0 + γ r_1 + γ^2 r_2 + · · · ; π, s ],

where r_i = R(s_i, a_i, s_{i+1}) is the reward that the agent receives by taking action a_i in the i-th state s_i, after which the state transits to s_{i+1}. The expectation E is taken over the following distribution: s_0 ∼ µ_0, a_i ∼ π(·|s_i), s_{i+1} ∼ P(·|s_i, a_i).
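To make the value-function definition concrete, the short sketch below (added for illustration; it is not part of the original dissertation) estimates V^π_M(s) by averaging the discounted returns of Monte Carlo rollouts. The environment interface, with reset(state) and step(action) returning (next_state, reward, done), as well as the rollout hyper-parameters, are hypothetical assumptions.

```python
import numpy as np

def mc_value_estimate(env, policy, state, gamma=0.99, n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of V^pi(state): average the discounted return of
    rollouts that start from `state` and follow `policy` until an absorbing state."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset(state)              # hypothetical: reset the MDP to a chosen state
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                 # a ~ pi(.|s)
            s, r, done = env.step(a)      # s' ~ P(.|s, a), r = R(s, a, s')
            ret += discount * r
            discount *= gamma
            if done:                      # absorbing state: all later rewards are zero
                break
        returns.append(ret)
    return float(np.mean(returns))
```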
The value function estimates the quality of being in state s, by evaluating the expected rewards that an agent can get from s, given that the agent follows policy π in the environment M afterward. Similar to the value function, each policy also carries a Q-function, which is defined over the state-action space to estimate the quality of taking action a from state s:

Q^π_M(s, a) = E_{s′∼P(·|s,a)}[ R(s, a, s′) + γ V^π_M(s′) ].

The objective of RL is to learn an optimal policy π*_M that maximizes the expectation of accumulated rewards, so that:

∀s ∈ S, π*_M(s) = argmax_{a∈A} Q*_M(s, a), where Q*_M(s, a) = sup_π Q^π_M(s, a).

Figure 1.1: Overview of RL.
Figure 1.2: An absorbing state (s_{T−1}) denotes the termination of a task in an infinite-horizon MDP.

1.3.2 Federated Learning (FL)

Without ambiguity, in this section, we present a typical FL setting for supervised learning, i.e., the general problem of multi-class classification. Let X ⊂ R^p be an instance space, Z ⊂ R^d be a latent feature space with d < p, and Y ⊂ R be an output space. T denotes a domain, which consists of a data distribution D over X and a ground-truth labeling function c* : X → Y, i.e. T := ⟨D, c*⟩. Note that we will use the terms domain and task equivalently. A model parameterized by θ := [θ^g; θ^p] consists of two components: a representation learning module g : X → Z parameterized by θ^g, and a predictor h : Z → ∆^Y parameterized by θ^p, where ∆^Y is the simplex over Y. Given a non-negative, convex loss function l : ∆^Y × Y → R, the risk of a model parameterized by θ on domain T is defined as L_T(θ) := E_{x∼D}[ l(h(g(x; θ^g); θ^p), c*(x)) ]. The objective of FL is to learn a global model parameterized by θ that minimizes its risk on each of the user tasks T_k [116]:

min_θ E_{T_k∈T}[ L_k(θ) ],   (1.1)

where T = {T_k}_{k=1}^K is the collection of user tasks. We consider all tasks sharing the same labeling rule c* and loss function l, i.e., T_k = ⟨D_k, c*⟩. In practice, Equation (1.1) is empirically optimized by min_θ (1/K) Σ_{k=1}^K L̂_k(θ), where L̂_k(θ) := (1/|D̂_k|) Σ_{x_i∈D̂_k}[ l(h(g(x_i; θ^g); θ^p), c*(x_i)) ] is the empirical risk over an observable dataset D̂_k. An implied assumption for FL is that the global data D̂ is distributed to each of the local domains, with D̂ = ∪_{k=1}^K D̂_k.

1.3.3 Generative Adversarial Learning

Generative adversarial learning can be traced back to the work of the Generative Adversarial Network (GAN) [56], which is a representative solution to generative learning. The goal of a GAN is to learn an unknown distribution p(x) using a set of samples from that distribution. More concretely, a GAN learns two modules, a discriminator D and a generator G, to jointly optimize the following objective J(G, D):

min_G max_D J(G, D) := E_{x∼p(x)}[ log D(x) ] + E_{z∼p(z)}[ log(1 − D(G(z))) ].   (1.2)

The generator G can later be used to synthesize samples to assist downstream model training tasks. GAN performs minimax optimization on the Jensen-Shannon divergence between the ground-truth distribution p(x) and the generated distribution. Recent extensions of GAN have explored other forms of distributional-discrepancy measures, such as the f-divergences [129] or the Wasserstein divergence [9]. Among different adversarial generative learning approaches, distribution matching and game theory are at the heart of their foundations. Generative adversarial learning has been applied to solve a variety of problems, such as synthetic data generation [204], adversarial attacks and defenses [192], domain adaptation [174], imitation learning [68], etc.
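As a concrete illustration of the minimax objective in Equation (1.2), the following PyTorch sketch performs one alternating GAN update. It is a minimal example added for this section only: the network sizes, optimizer settings, and the use of the common non-saturating generator loss (in place of the saturating form in Equation (1.2)) are illustrative assumptions, not choices made in this thesis.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real_batch):
    """One alternating update of the GAN game: D ascends, G descends."""
    b = real_batch.size(0)
    z = torch.randn(b, latent_dim)

    # Discriminator step: push D(x) -> 1 on real samples, D(G(z)) -> 0 on fakes.
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D by pushing D(G(z)) -> 1 (non-saturating variant).
    g_loss = bce(D(G(z)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```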
1.3.4 Notations and Terminologies

We list the following terminologies, which will be frequently used in this thesis:

1. Domain: we use the terms domain and task equivalently. In the context of RL, a domain denotes a Markov Decision Process M = (µ_0, S, A, P, γ, R, S_0). In the context of supervised learning, a domain T = ⟨D, c*⟩ is a composition of a data distribution D(x) over the input space X and its target labeling function c* : X → ∆^Y, where ∆^Y is the simplex of the label space Y. Without ambiguity, in this dissertation, we consider a typical supervised learning problem of multi-class classification.

2. Teacher (Source) and Student (Target): We refer to teacher and source equivalently, and likewise refer to student and target equivalently. Transfer learning is all about transferring knowledge from the teacher (source) domains to the student (target) domains. In the context of RL, a teacher (student) denotes a policy π : S → A. In the context of supervised learning, a teacher (student) denotes a predictive model f = h ◦ g : X → ∆^Y.

We also present the abbreviations and key notations used in this thesis in Tables 1.1 and 1.2.

Abbreviation   Definition
TL             Transfer Learning
RL             Reinforcement Learning
SL             Supervised Learning
KD             Knowledge Distillation
FL             Federated Learning

Table 1.1: Overview of Abbreviations.

Symbol   Definition                                                                         Context
p        Dimension of input space                                                           SL
d        Dimension of latent space                                                          SL
X        Input space, X ⊂ R^p                                                               SL
x        A vector of input variables, x ∼ X                                                 SL
Z        Latent space, Z ⊂ R^d                                                              SL
z        A vector of latent variables, z ∼ Z                                                SL
Y        Output space, Y ⊂ R                                                                SL
y        An output variable, y ∼ Y                                                          SL
D        Data distribution                                                                  SL
c*       Labeling function, c* : X → Y                                                      SL
T        Task, T := ⟨D, c*⟩                                                                 SL
g        Representation learning module                                                     SL
h        Prediction module                                                                  SL
θ        Network parameter set                                                              SL & RL
K        Number of (source) domains                                                         SL & RL
M        Markov Decision Process                                                            RL
π        Policy, π : S → A                                                                  RL
τ_π      A trajectory obtained by following policy π, τ := {s_0, a_0, s_1, a_1, · · · | a_t = π(s_t)}   RL

Table 1.2: An overview of mathematical notations. A notation can be applicable to either the context of Supervised Learning (SL) or Reinforcement Learning (RL).

CHAPTER 2
PROBLEM OVERVIEW: TRANSFER LEARNING

In this chapter, we first present definitions of transfer learning (TL) in two different contexts: reinforcement learning (RL) and supervised learning, respectively. Next, we briefly overview the recent advances of transfer learning approaches in both settings.

2.1 Definition of Transfer Learning

2.1.1 Transfer Learning in the Context of Reinforcement Learning

In the context of reinforcement learning, we use M_s = {M_s^i}_{i=1}^K to denote a set of source domains, and M_t to denote the target domain. In a minimal case, knowledge can be transferred between two agents within the same domain, resulting in |M_s| = 1 and M_s = {M_t}.

Definition 1. [Transfer Learning in the Context of Reinforcement Learning] Given a set of source domains M_s = {M_s^i}_{i=1}^K and a target domain M_t, where each domain is a Markov decision process, transfer learning occurs when an algorithm A leverages exterior information I_s from M_s and interior information I_t from M_t as inputs to generate a policy π = A(I_s ∼ M_s, I_t ∼ M_t), with π : S_t → A_t, which achieves higher performance in the target domain M_t compared with not utilizing the source-domain knowledge I_s.

In the above definition, we use A(I_s; I_t) to denote the output of a transfer learning algorithm, i.e. the learned policy based on information I_s and I_t.
One can consider regular RL without transfer learning as an extreme case of the above definition by treating M_s = ∅ and I_s = ∅, so that a policy π is learned purely from the feedback provided by the target domain, i.e. π = A(I_t). TL is beneficial for improving RL performance, especially when the supervision I_t from a target domain is lacking or sub-optimal. We illustrate some eligible types of source-domain knowledge in Figure 2.1, although knowledge from the source domain I_s can also take other forms of supervision.

Figure 2.1: TL approaches in RL organized by the format of transferred knowledge.

2.1.2 Transfer Learning in the Context of Supervised Learning

In the context of supervised learning (SL), we use T_s = {T_s^i}_{i=1}^K to denote a set of source domains, and T_t to denote the target domain. Below we provide a high-level definition of TL for a supervised learning problem setting:

Definition 2. [Transfer Learning in the Context of Supervised Learning] Given a set of source domains T_s = {T_s^i}_{i=1}^K and a target domain T_t, where each domain is a supervised learning task, with T_s^i = ⟨D_s^i, c_s^{i*}⟩ and T_t = ⟨D_t, c^{t*}⟩, transfer learning occurs when an algorithm A leverages exterior information I_s from T_s and interior information I_t from T_t as inputs to generate a predictive model h ◦ g = A(I_s ∼ T_s, I_t ∼ T_t), with h ◦ g : X_t → ∆^Y, which achieves lower risk in the target domain T_t compared with not utilizing the source-domain knowledge I_s.

Remark 1 (Domain of Multi-Class Classification). Without loss of generality, we also consider that each domain is a multi-class classification problem, although the proposed approaches in this dissertation have the potential to be extended to other scenarios, including regression or multi-label classification.

2.2 A Glance of Prior Arts

Transfer learning in RL has gained ever-increasing attention due to the recent success of RL in various applications, such as game playing [14] and robotics learning [85]. Traditional transfer learning for RL traces back to a scenario called behavior cloning, in which a target policy is trained in a supervised learning manner by leveraging state-action samples from a pretrained teacher [138]. Later approaches were developed to involve the transferred knowledge in the RL learning loop. For instance, in a knowledge transfer approach called reward shaping [125], the exterior knowledge can be distilled into a synthetic reward function, which provides extra guidance besides the reward signal from the environment. Based on the form of transferred knowledge, approaches along this line can be mainly categorized into the following: reward shaping [182, 38], learning from demonstrations [105, 25, 175, 123, 68, 79], policy transfer [146, 191, 168, 12], inter-domain mapping [60, 7], representation transfer [20], etc. There have been prior efforts in summarizing the knowledge transfer techniques in RL [167, 91]. Interested readers are referred to our survey [206] for more recent advances along this line. Our proposed knowledge transfer approaches stand out from prior arts in the following aspects: i) our approaches improve the sample efficiency of RL by learning from the exterior knowledge in an off-policy manner, and ii) we enable knowledge transfer from accessible resources, such as sub-optimal or limited teacher demonstrations.

Transfer learning in Supervised Learning is an important machine learning problem that has been studied extensively [132, 207].
This technique has benefited practical supervised learning applications in computer vision, auto-navigation, natural language processing [19, 104], healthcare, etc. While a comprehensive overview of TL in the SL domain requires multi-fold criteria, one angle to investigate TL approaches is to answer the key question of what knowledge to transfer. Specifically, approaches along this line can be mainly organized into three categories: 1) instance-based transfer [76, 101, 180, 152], 2) feature representation transfer [174, 203], and 3) model parameter transfer [193, 117]. In instance-based transfer, the training data from the source domain can directly help to learn the target domain by re-weighting the training samples. In feature representation transfer, the core idea is to learn and leverage latent features that are robust across different domains. In model parameter transfer, the knowledge is conveyed by the parameter set of a teacher model, which is usually more complex in structure than the student model. In Chapter 5, we introduce one of our proposed FL approaches, which can be considered a hybrid scheme that transfers both model parameters and latent representations. Moreover, in Chapter 6, we investigate how to transfer and preserve knowledge in different model channels of the same model architecture.

As a specific supervised learning scenario, FL has achieved wide success in practical applications such as mobile computing and healthcare. As a result, TL in FL has recently emerged as an effective approach to tackle the user heterogeneity brought by FL. Most existing work along this line is data-dependent [103, 159, 58, 30]. Particularly, [103] proposed FedDFusion, which performs TL to refine the global model, assuming that an unlabeled dataset is available with samples from the same or similar domains. Complementary TL efforts have been made to confront data heterogeneity [96, 150]. Specifically, [96] transmits a proxy dataset instead of the model parameters. FedAUX [150] performs data-dependent distillation by leveraging an auxiliary dataset to initialize the server model and to weighted-ensemble user models, while FeDGen performs knowledge distillation in a data-free manner. FedMix [194] is a data-augmented FL framework, where users share their batch-averaged data among others to assist local training. In contrast, in Chapter 5, we propose FedGen, which extracts knowledge from the existing user model parameters and hence faces fewer privacy risks. From a different yet practical perspective, in Chapter 6, we also investigate the role of self-knowledge distillation in achieving robust and efficient FL that confronts heterogeneous systems and faulty network connections.

CHAPTER 3
KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING FROM SUBOPTIMAL DEMONSTRATIONS

This chapter is based on the following work: Zhuangdi Zhu, Kaixiang Lin, Bo Dai, and Jiayu Zhou. Self Adaptive Imitation Learning: Learning Delayed Rewarded Tasks from Suboptimal Demonstrations. Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022.

3.1 Introduction

Reinforcement Learning (RL) is notably advantageous in learning sequential decision-making problems in simulated environments, such as game-playing [119, 156], where massive samples with dense rewards can be accessed at a negligible cost. However, it is challenging to upscale RL to real-world scenarios due to its dependence on immediate reward feedback.
For practical applications where rewards are usually delayed in time and sparse in value, RL agents may struggle with high sample complexity, facing the difficulty of connecting a long sequence of actions to feedback received in the far future.

In fact, the ability to learn from delayed feedback is crucial for realizing advanced artificial intelligence [34, 142]. On the one hand, reducing the frequency of reward sampling contributes to a lower interaction complexity for practical applications, such as autonomous driving [137] and UAV navigation [84]. On the other hand, learning from coarse-grained supervision, such as human preference [90], is rather useful when it is easy to recognize the desired behavior but difficult to explain its rationale by designing delicate reward functions [131].

Recent advances in Imitation Learning (IL) can effectively provide remedies when the environment feedback is delayed or even unavailable, by referencing expert demonstrations [68, 87, 88] or policies [145, 162]. In spite of their success, a major limitation of such IL approaches is that the learned performance is bounded by the given expert. Consequently, when the provided demonstrations are sub-optimal, which is a practical yet more challenging scenario, these IL approaches will induce a sub-optimal policy. In the meantime, some work has been proposed for learning from sub-optimal guidance in delayed-reward tasks [79, 160, 185, 202, 54]. A shared rationale among them is to augment the environment rewards with synthetic rewards derived from the demonstrations, after which an actor-critic algorithm can take over the policy learning. Although technically effective, these approaches come with two inherent limitations. First, the sub-optimality of the teacher demonstrations has not been fully resolved. Once the learning agent reaches a reasonable performance, the demonstrations become a bottleneck, leading to negative guidance that contradicts the environment feedback [112]. Second, the environment feedback is not well leveraged. Learning a critic function that relies on delayed environment rewards can be sample-costly, and may provide only weak signals to compensate for the sub-optimal demonstrations.

In this chapter, we formally consider a problem setting where an RL agent only has access to a limited number of sub-optimal demonstrations in a task with highly delayed rewards. Our goal hence is to combine the merits of RL and IL, by exploiting the sub-optimal demonstrations that are easier to access in practice, while preserving the chance to explore for better policies guided by the coarse-grained environment feedback.

Noticing the challenges of the proposed problem and the limitations of prior arts, we propose Self-Adaptive Imitation Learning (SAIL), an off-policy imitation learning approach that strikes a balance between exploitation and exploration to reach high performance. More concretely, we formulate our objective as exploration-driven IL. On the one hand, our approach minimizes the discrepancy between the teacher and the learning policy; on the other hand, it encourages the learning policy to deviate from its previously learned predecessors for better exploration. Specifically, we leverage the delayed feedback from the environment to explore superior self-generated trajectories that surpass the teacher's performance. Those self-generated trajectories are used to replace the suboptimal teacher demonstrations to construct a dynamic target distribution that gradually converges to optimality.
An overview of our proposed approach is provided in Figure 3.1. Extensive empirical studies have shown that SAIL achieves significant improvements in both sample efficiency and asymptotic performance on various popular benchmarks.

Figure 3.1: Illustration of SAIL: Navigations in red arrows follow the exploration-driven IL objective, which approaches the teacher's density distribution while deviating from previously learned ones. It explores more efficiently to reach expertise, compared with random explorations (green arrows).

3.2 Background

A Markov Decision Process (MDP) is an ideal environment to formulate RL, and can be defined by a tuple M = (S, A, T, r, γ, µ_0, S_0), where S and A are the state and action spaces, T(s′|s, a) denotes the probability of the environment transitioning from state s to s′ when action a is taken, r(s, a) is the environment reward received by taking action a in state s, γ ∈ (0, 1] is a discount factor, µ_0 is the initial state distribution, and S_0 is the set of terminal states or absorbing states. Any absorbing state always transits to itself and yields a reward of zero [164]. Given a trajectory τ = {(s_t, a_t)}_{t=0}^∞, we define its return as R(τ) = Σ_{k=0}^∞ γ^k r(s_k, a_k). For an episodic task with a finite horizon, its return can be written as R(τ) = Σ_{k=0}^T γ^k r(s_k, a_k), where T is the number of steps to reach an absorbing state. The objective of RL is to learn a policy π : S → A that maximizes the expected return of its trajectories. Equivalently, this objective can be rephrased as finding a distribution d_π(s, a):

max_π η(π) := E_{(s,a)∼d_π(s,a)}[ r(s, a) ],   (3.1)

in which d_π(s, a) is the normalized stationary state-action distribution of π: d_π(s, a) = (1 − γ) µ_π(s, a), and µ_π(s, a) is the occupancy measure of a policy π, defined as:

µ_π(s, a) = Σ_{t=0}^∞ γ^t Pr( s_t = s, a_t = a | s_0 ∼ µ_0, a_t ∼ π(s_t), s_{t+1} ∼ T(s_t, a_t) ) [68].

Without ambiguity, we use density and normalized stationary state-action distribution interchangeably to refer to d_π(s, a) in this chapter.

Adversarial Imitation Learning addresses IL from the perspective of distribution matching. A representative work along this line is Generative Adversarial Imitation Learning (GAIL) [68]. Given a set of demonstrations from an unknown expert policy π_E, GAIL aims to learn a policy π that minimizes the Jensen-Shannon divergence between d_π and d_E:

min_π D_JS[ d_π(s, a) || d_E(s, a) ] − λ H(π),

where d_π and d_E are the densities derived from the learning policy π and the expert policy π_E, and H(π) is an entropy regularization term [209, 208]. GAIL applies a saddle-point optimization strategy: it jointly trains a discriminator D and a policy π to optimize the following minimax objective:

min_π max_D E_{d_π(s,a)}[ log(1 − D(s, a)) ] + E_{d_E(s,a)}[ log D(s, a) ].

In practice, a fixed set of demonstrations from the expert density d_E is given, while samples from d_π are obtained by on-policy interactions with the environment.

3.3 Problem Setting

In this chapter, we address the problem of learning in an MDP with highly delayed feedback. More concretely, in this MDP, an agent learns from the trajectory-wise reward r_e, which is non-zero only upon reaching an absorbing (terminal) state: r_e(s_t, a_t, s_{t+1}) ≠ 0 ⇔ s_t ∉ S_0, s_{t+1} ∈ S_0. Without ambiguity, we use r_e(τ) to denote the trajectory-wise reward obtained by a trajectory τ. For any two trajectories τ_i, τ_j, their relative ranking of r_e should align with the task objective:

Assumption 1 (Legitimacy of the Trajectory Rewards).
∀ τ_i, τ_j:  r_e(τ_i) ≥ r_e(τ_j)  ⟹  P_{π*}(τ_i) ≥ P_{π*}(τ_j),

where

P_{π*}(τ) = Σ_{i=0}^T γ^i log π*(a_i|s_i),   τ := {(s_0, a_0), (s_1, a_1), · · ·, (s_T, a_T)},

is the extent to which an oracle policy π* agrees with a trajectory τ.

Our problem setting provides a generalized framework for a variety of prior arts, including preference-based RL [51, 183] and learning from human feedback [119]. Prior work on learning sparse-reward tasks with K-step feedback [160, 79, 175, 112] can also be reduced to our problem setting, with the advantage that their reward signals are finer-grained and more frequently provided. Compared with an elaborate reward function, trajectory-wise rewards are easier to access and more intuitive to human perception [34, 51], which, however, makes regular RL more challenging.

To alleviate the learning difficulty, we assume that an agent learning a policy π is allowed to leverage external demonstrations R_T from an unknown teacher policy π_T, which are sub-optimal but more accessible than expert demonstrations. For the remainder of this chapter, we use d_π(s, a) and d_T(s, a) to denote the density distributions derived from policy π and π_T, respectively. In practice, d_T is usually approximated from the demonstration data R_T [209, 49, 68]. Moreover, the learning agent can also access its self-generated transitions cached in a replay buffer R_B, whose density distribution is d_B(s, a).

3.4 Methodology

3.4.1 Exploration-Driven Objective

We propose an exploration-driven IL objective to learn from sub-optimal demonstrations, which is formulated as below:

Objective 1 (Exploration-Driven Imitation Learning).

max_π J(π) := −D_KL[ d_π(s, a) || d_T(s, a) ]  +  D_KL[ d_π(s, a) || d_B(s, a) ],
                      (imitation term)                 (exploration term)

in which D_KL denotes the KL-divergence between two distributions:

D_KL[p || q] = E_{p(x)}[ log ( p(x) / q(x) ) ].

Objective (1) can be interpreted as joint motivations for imitation and exploration. The first term −D_KL[d_π(s, a) || d_T(s, a)] encourages distribution matching between d_π(s, a) and d_T(s, a). The second term D_KL[d_π(s, a) || d_B(s, a)], though counter-intuitive at first sight, serves as an objective for self-exploration. Since d_B(s, a) is the density derived from previously learned policies, maximizing D_KL[d_π(s, a) || d_B(s, a)] favors visiting state-actions that are rarely seen by previously learned policies, which acts as a repulsive force from d_B(s, a).

Specifically, the proposed objective encourages exploration, as opposed to a conventional IL objective that solely pursues distribution matching between d_π and d_T:

max_π J_IL(π) := −D_KL[ d_π(s, a) || d_T(s, a) ].   (3.2)

An optimal solution to Eq (3.2) is a policy that exactly recovers the teacher's density distribution, with d_π(s, a) = d_T(s, a) [209]. Given this objective, π is restricted from further exploring density distributions that deviate from d_T, which impedes its potential to generate superior trajectories. We will verify by empirical studies that optimizing Objective (1) achieves more efficient exploration compared with a pure imitation-driven objective.

3.4.2 Adaptive Learning Target

Following Objective (1), the learning policy has the potential to yield trajectories whose performance surpasses the teacher's. To fully utilize this self-generated resource, SAIL adaptively adjusts the teacher's buffer to replace teacher demonstrations with superior trajectories sampled from the learning agent, by leveraging the trajectory-wise feedback from the environment.
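To make this buffer-adaptation step concrete, the minimal Python sketch below mirrors the routing rule used in SAIL (lines 5-9 of Algorithm 3.1, presented later in this chapter): a freshly sampled trajectory is promoted to the teacher buffer when its trajectory-wise reward exceeds the lowest return currently stored there, and is cached in the self-replay buffer otherwise. The buffer representation and the optional eviction of the worst demonstration are illustrative assumptions, not the exact implementation.

```python
def route_trajectory(traj, traj_return, teacher_buffer, self_buffer, evict_worst=False):
    """Promote a trajectory to the teacher buffer R_T if its trajectory-wise reward
    r_e beats the current threshold C_{d_T} = min over R_T; otherwise cache it in R_B.
    Buffers are lists of (trajectory, return) pairs (an illustrative representation)."""
    threshold = min(ret for _, ret in teacher_buffer)      # C_{d_T}
    if traj_return > threshold:
        teacher_buffer.append((traj, traj_return))
        if evict_worst:                                    # optional: realize the
            worst = min(range(len(teacher_buffer)),        # "replace" reading of Sec. 3.4.2
                        key=lambda i: teacher_buffer[i][1])
            self_buffer.append(teacher_buffer.pop(worst))  # keep the old demo as ordinary data
    else:
        self_buffer.append((traj, traj_return))
```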
This strategy dynamically improves the lower bound of the teacher’s performance. As a result, the density dT (s, a) of the teacher buffer is approaching an oracle distribution: 19 Theorem 1. For a deterministic policy, rewards of its generated trajectories indicate the policy’ agreement with an oracle: ∀πi , πj , Eτ ∼πi [re (τ )] > Eτ ∼πj [re (τ )] =⇒ DKL [πi (a|s)||π ∗ (a|s)] < DKL [πj (a|s)||π ∗ (a|s)] . Proof. Given arbitrary deterministic policies πi and πj , s.t. Eτ ∼πi [re (τ )] > Eτ ∼πj [re (τ )]. Based on Assumption 1, one can derive that: Eτ ∼πi [Pπ∗ (τ )] − Eτ ∼πj [Pπ∗ (τ )] > 0. (3.3) Next, one can derive that: DKL [πi (a|s)||π ∗ (a|s)] − DKL [πj (a|s)||π ∗ (a|s)] πi (a|s) πj (a|s) =Edπi (s)πi (a|s) [log ∗ ] − Edπj (s)πj (a|s) [log ∗ ] π (a|s) π (a|s) = −H[πi (a|s)] −Edπi [log π ∗ (a|s)] + H[πj (a|s)] +Edπj [log π ∗ (a|s)] | {z } | {z } 0 for deterministic πi 0 X = − (1 − γ) Est ∼µπi ,at ∼πi (st ) [γ t log π ∗ (at |st )] t t X + (1 − γ) Es πj [γ t log π ∗ (at |st )] t ∼µt ,at ∼πj (st ) t   = − (1 − γ) Eτ ∼πi [Pπi (τ )] − Eτ ∼πj [Pπj (τ )] < 0 . | {z } based on Eq equation 3.3 Therefore, when the teacher buffer is updated with more superior trajectories generated by a deterministic policy over time, as in our case, the distribution derived by the teacher buffer is approaching to optimality. Unlike prior art that bundles their critic learning process with environment rewards, we leverage this delayed and coarse-grained feedback to construct a dynamic learning target with increasing superiority, which relieves the bottleneck brought by sub-optimal demonstrations. 20 3.4.3 Off-Policy Adversarial TD Learning While our proposed approach is appealing in combining the merits of exploitation and explo- ration, it is challenging to directly optimize Objective (1). To make it more approachable, we draw a connection from Objective (1) to a conventional RL problem: Remark 2. Objective (1) can be rephrased as the following, which is equivalent to a max- return RL objective with log ddBT (s,a) (s,a) in place of the environment rewards:   dT (s, a) max J(π) := Edπ (s,a) log , (3.4) π dB (s, a) One can consider the optimization of Equation equation 3.4 as a process of policy se- lection: for the support of (s, a) where the teacher has visited more frequently than the previously-learned policies, π is encouraged to build positive densities on those state-actions, leading to dπ (s, a) > 0 wherever dT (s, a) > dB (s, a). Intuitively, this process implies that the agent trusts the teacher more than the previously learned policies. Based on this insight, we can relate Objective (1) to Temporal-Difference (TD) learning, and solve it under an actor-critic framework. To obtain the reward function log ddBT (s,a) (s,a) , we build upon prior arts [68] to learn a discriminator D that optimizes the following: max EdB (s,a) [log(1 − D(s, a))] + EdT (s,a) [log(D(s, a))]. (3.5) D:S×A→(0,1) D aims to distinguish between the self-generated data from dB and the teacher demon- strations from dT . A well-learned discriminator shall satisfy the following [56]: dT (s, a) D∗ (s, a) = . dT (s, a) + dB (s, a) The output of D with a constant shift, which we found to be more empirically effective, is used to render synthetic rewards to the agent: r0 (s, a) = − log(1 − D(s, a)) ≈ log( ddBT (s,a) (s,a) + 1). 21 In the initial training stage, a well-trained discriminator renders higher rewards to teacher demonstrations with D(s, a) → 1, and lower rewards for self-generated samples with D(s, a) → 0. 
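To make the reward-shaping step concrete, the sketch below shows one way the discriminator of Eq (3.5) and the shaped reward r'(s, a) = −log(1 − D(s, a)) could be implemented. The network architecture, helper names, and PyTorch-based setup are our own illustrative assumptions, not the exact implementation used in this thesis.

```python
# Minimal sketch of the off-policy discriminator in Eq (3.5) and the shaped reward
# r'(s, a) = -log(1 - D(s, a)). Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D: S x A -> (0, 1); trained toward 1 on teacher data, 0 on replay data."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def discriminator_loss(D, s_b, a_b, s_t, a_t, eps=1e-8):
    """Negative of the objective in Eq (3.5):
    maximize E_{d_B}[log(1 - D)] + E_{d_T}[log D]."""
    loss_replay = -torch.log(1.0 - D(s_b, a_b) + eps).mean()   # replay-buffer samples
    loss_teacher = -torch.log(D(s_t, a_t) + eps).mean()        # teacher samples
    return loss_replay + loss_teacher

def shaped_reward(D, s, a, eps=1e-8):
    """r'(s, a) = -log(1 - D(s, a)); larger on teacher-like state-actions."""
    with torch.no_grad():
        return -torch.log(1.0 - D(s, a) + eps)
```

In SAIL, the samples labeled 0 are drawn from the self-replay buffer RB and the samples labeled 1 from the teacher buffer RT, matching Eq (3.5).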
The learning policy is therefore designed to confuse the discriminator by maximizing the shaped accumulated rewards. To improve sample efficiency, we adopt an off-policy learning framework. Our objective is accordingly rephrased to maximize the expectation of Q-values over distributions of a behavior policy β [155, 102]: Z maxθ Jβ (πθ ) := dβ (s)Q(s, πθ (s))ds = Edβ (s) [Q(s, πθ (s))], (3.6) s where dβ (s) is the normalized stationary state distribution of β, analogous to the state- action distribution. The Q-function is a fixed point solution to the Bellman operation based on the shaped rewards: Q(s, a) = r0 (s, a) + γEs0 ∼T (s0 |s,a),a0 ∼π(s0 ) [Q(s0 , a0 )]. (3.7) Accordingly, the policy-gradient for the actor can be derived as [102]: ∇θ Jβ (πθ ) ≈ Es∼dβ [∇θ πθ (s)∇a Q(s, a)|a=πθ (s) ]. (3.8) In the next section, we introduce an algorithm that realizes our objective via the above- mentioned off-policy TD learning. It adopts an even more effective sampling approach that further accelerates the learning procedure. 3.4.4 Self-Adaptive Imitation Learning Combing all the building blocks, we now introduce our approach, dubbed as Self-Adaptive Imitation Learning (SAIL), as described in Algorithm 3.1. SAIL maintains two replay-buffers RT and RB , for caching teacher demonstrations and self-generated transitions, respectively. It jointly learns three components: a discriminator D that serves as a reward provider, a critic Q that minimizes the Bellman error based on the shaped rewards, and an actor π that maximizes the shaped returns. During iterative training, high-quality trajectories 22 Algorithm 3.1: Self-Adaptive Imitation Learning 1: Input: teacher replay buffer RT with demonstrations, 2: self-replay-buffer RB with random transitions. 3: policy πθ , discriminator Dw , critic Qφ , batch size N > 0, coefficient α > 0 4: for n = 1, . . . do 5: sample trajectory τ ∼ πθ 6: if re (τ ) > minτ 0 {(re (τ 0 ) | τ 0 ∈ RT } ) then 7: RT ← RT ∪ τ ; α ← 0 8: else 9: RB ← RB ∪ τ 10: if n mod discriminator-update = 0 then  N  N 11: (si , ai , · · · ) i=1 ∼ RB , (sTi , aTi , · · · ) i=1 ∼ RT 12: update Dw by ascending gradient : 1 PN   T T  13: Ow N i=1 log D si , ai + log 1 − D si , ai 14: if n mod Q-update = 0 then 0T N 15: {si , ai , s0i }N T T i=1 ∼ RB , {si , ai , si }i=1 ∼ RT 16: yi ← − log(1 − D(si , ai )) + γ Q̄(s0i , π(s0i )) 0 0 0 17: yiT ← − log(1 − D(sTi , aTi )) + γ Q̄(siT , π(siT )) 18: update Qφ byP minimizing critic loss: P 1−α 2 α T T 0T 2 19: J(Qφ ) = N i [(Qφ (si , ai ) − yi ) ] + N i [(Qφ (si , ai ) − yi ) ] 20: if n mod policy-update = 0 then  N  N 21: (si , ., ., ., .) i=1 ∼ RB , (sTi , ., ., ., .) i=1 ∼ RT 22: update π by sampled policy gradient: 1−α P 23: P Oθ J(πTθ ) ≈ N i [Oθ πθ (si )Oa Q(si , ai )|ai =πθ (si ) ] + α T T N i [Oθ πθ (si )Oa Q(si , ai )|aT T ] i =πθ (si ) generated by the actor are selected to refill the teacher demonstration buffer RT , while other trajectories are cached in the self-replay buffer RB . We highlight three key aspects of SAIL: (1) Leveraging delayed environment feedback to update teacher buffer RT : High-quality trajectories with reward re above a threshold CdT are selected to update the teacher buffer RT . In practice, CdT is simultaneously updated as the lowest reward in the teacher buffer RT : CdT = minτ 0 {re (τ 0 )|τ 0 ∈ RT }, which guarantees the increasing quality of trajectories in the teacher’s buffer. 
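As an illustration of this buffer-update rule, the following sketch (with hypothetical names, and leaving capacity management aside) mirrors lines 6-7 of Algorithm 3.1.

```python
# Minimal sketch of the adaptive teacher buffer in key aspect (1): a self-generated
# trajectory is promoted to R_T whenever its episodic reward exceeds the threshold
# C_dT, the lowest trajectory-wise reward currently stored. Names are our own.
class AdaptiveTeacherBuffer:
    def __init__(self, demo_trajectories, demo_rewards):
        # each entry is a (trajectory, episodic_reward) pair
        self.buffer = list(zip(demo_trajectories, demo_rewards))

    @property
    def threshold(self):
        """C_dT: the minimum trajectory-wise reward in the teacher buffer."""
        return min(reward for _, reward in self.buffer)

    def maybe_add(self, trajectory, episodic_reward):
        """Add a self-generated trajectory if it beats the current threshold.
        Returns True if added, in which case the caller resets / anneals the
        mixing coefficient alpha, as in lines 6-7 of Algorithm 3.1."""
        if episodic_reward > self.threshold:
            self.buffer.append((trajectory, episodic_reward))
            return True
        return False
```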
(2) Realizing exploration-driven IL with an off-policy discriminator : Prior art such as GAIL relies on on-policy training of a discriminator to estimate the ratio of dT (s,a) dπ (s,a) . On the contrary, we learn an off-policy discriminator D that aligns with our proposed objective and 23 encourages efficient exploration, whose effectiveness will be elaborated in the Experiment section. (3) Sampling from teacher demonstrations for boosted learning efficiency: In the initial learning step, we sample from both the teacher dataset RT and the self-generated dataset RB to construct a mixed density distribution, which plays the role of dβ in Eq (3.8). More concretely, we derive a mixture distribution: dmix = αdT + (1 − α)dB , where α is the ratio of samples from teacher demonstrations. In practice, we initialize α = 0.5. Once the learning policy generates trajectories with performance comparable to the teacher, we anneal the value of α to zero. 3.4.4.1 Reasoning of sampling from a mixture of distributions: Sampling from teacher demonstrations has been studied by other prior arts [175, 142]. Along with the same spirit, the mixture sampling distribution in our case accelerates the IL process by a behavior-cloning strategy. To see the rationale, one can rephrase the objective in Equation 3.6 as the following: maxθ Jβ (πθ ) := α EdT (s) [Q(s, πθ (s))] +(1 − α)EdB (s) [Q(s, πθ (s))] . | {z } Behavior Cloning In the early training stage, the discriminator will favor teacher trajectories by assigning them with highest rewards: EdT (s) [max Q(s, a)] = EdT (s,a) [Q(s, a)] ≡ EdT (s) [Q(s, πT (s))] , a which encourages the learning policy to imitate πT on teacher-visted states dT (s). We will verify by ablation study that sampling from teacher demonstrations accelerates the process of IL. Given a problem-setting with sub-optimal demonstrations, once the learning agent reaches the teacher-level performance, we relieve this behavior-cloning regularization by annealing α to zero, in order to reinforce the effects of exploration as proposed in our objective. 24 3.5 Related Work Our work shares close connections with the following topics: Imitation Learning (IL) aims to learn from expert demonstrations without accessing environment feedbacks, among which representative examples include GAIL [68] and its on- policy extensions [79, 185, 49]. Later IL favors off-policy RL frameworks [148, 88]. Especially, DAC learns a discriminator by off-policy learning and corrects the distribution shifts by importance sampling [87]. In contrast to our approach, the above prior arts are motivated to exactly recover the teacher policy. Our work also draws a subtle connection to Self- Imitation Learning (SIL) [130, 59], in that they both utilize self-generated trajectories to build a learning target. However, SIL requires timely feedbacks from the environment to learn a delicate critic, which is in essence on-policy RL, while SAIL addresses a different setting by performing exploration-driven IL in an off-policy manner. Learning from Demonstrations (LfD) facilitates RL by augmenting environment feed- backs with external demonstrations. Prior work relies on demonstrations that are sufficient and optimal [66, 175]. Especially, DDPGfD leverages a DDPG framework [102] to enable off-policy LfD in continuous spaces [175]. Later approaches, such as POfD [79], learn from sub-optimal demonstrations and trust the environment rewards to learn a critic, whereas demonstrations are only used as auxiliary guidance [160, 202, 54]. 
In contrast, our approach learns a critic without using environmental rewards, which is more robust especially when environment feedbacks are highly delayed. Some leverage the suboptimal guidance to en- force a policy regularization term, whose effects are gradually decayed to tackle the imperfect guidance [112]. The above problem settings can be considered as relaxed versions of ours with finer-grained feedbacks. Preference-based RL is a problem setting where the agent learns from the preference of an expert, which saves the necessity of designing elaborated numeric rewards [181, 183]. The preference relations can be over state-actions pairs [51] or over a pair of trajectories τi  τj [23], while the former provides more supervision information than the later. Few 25 prior arts address IL in preference-based RL, except for [23, 21], which tackle IL in an MDP provided with only trajectory-ranked demonstrations but no environment feedbacks. Focusing on a different problem setting, SAIL utilizes the self-generated trajectories to build an increasing teacher distribution, which therefore requires fewer teacher demonstrations. Exploration itself is an independent topic in RL. Classical exploration approaches work by involving randomness into its learning loop [47, 163, 61]. More recent approaches propose to use intrinsic rewards for exploration [13, 133]. Especially, [133] proposed curiosity-driven exploration, a model-based approach which leverages the prediction loss of a transition model as a reward bonus to encourage surprising behavior. Another exploration approach pursues a maximized information gain about the agent’s belief of the environment [70]. Readers are referred to [115] for a comprehensive discussion on the exploration techniques in RL. 3.6 Experiments In this section, we study how SAIL achieves the objective of imitation learning and explo- ration in an environment with delayed rewards. Extensive experiments have been conducted to answer the following key questions: Q1. Is SAIL sample-efficient? Q2. Can SAIL surpass the demonstration performance via off-policy exploration? Q3. Which components in SAIL contribute to the exploration or sample efficiency? Q4: Is SAIL robust against different sub-optimal teachers? 3.6.1 Setup We built SAIL on a TD3 framework [50] based on stable-baselines 1 implementations. It is tested on 4 popular MuJoCo2 tasks: Walker2d-v2, Hopper-v2, HalfCheetah-v2, and Swimmer- v2. For each task, we generate teacher demonstrations from a deterministic policy that was pre-trained to be sub-optimal. All experiments are conducted using one imperfect demon- 1 https://stable-baselines.readthedocs.io/en/master/ 2 https://github.com/openai/mujoco-py 26 stration trajectory on 5 random seeds, with each trajectory containing no more than 1000 transitions. Models are evaluated after training using 106 interaction samples. We defer more details and additional experimental results to the Supplementary. Note that the original benchmarks are all in dense-reward settings. To construct the delayed rewarded environment as proposed in our chapter, we omit the original rewards such that only episodic feedback is provided upon the completion of a trajectory. To align with Assumption 1, we cache the original return of each trajectory R(τ ) = i r(si , ai ), and P downscale it to get a coarser grained supervision, with re (τ ) = b0.1 ∗ R(τ )c. 
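To make this reward construction concrete, a minimal environment wrapper in the style below suffices; it assumes the classic 4-tuple Gym step API, and the wrapper name is ours.

```python
# Minimal sketch of the delayed-reward setup: dense rewards are hidden during the
# episode, and only the coarse episodic feedback r_e(tau) = floor(0.1 * R(tau)) is
# revealed when the trajectory terminates.
import math
import gym

class DelayedEpisodicReward(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self._return = 0.0

    def reset(self, **kwargs):
        self._return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._return += reward              # cache the original dense return R(tau)
        if done:
            coarse = math.floor(0.1 * self._return)   # r_e(tau) = floor(0.1 * R(tau))
            return obs, coarse, done, info
        return obs, 0.0, done, info         # zero reward on all intermediate steps
```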
We compare SAIL with 5 popular baselines that are mostly applicable to our problem setting: DAC, GAIL, POfD, DDPGfD, and BC, as discussed in the Related Work. For baselines that utilize environment rewards, such as POfD and DDPGfD, we provide them with the modified rewards re(s, a) upon the completion of each trajectory, instead of the original dense reward r(s, a).

Figure 3.2: Learning curves of SAIL and other previous work using one suboptimal demonstration trajectory. (Panels: Walker2d, Swimmer, HalfCheetah, Hopper; returns versus steps (10K) for SAIL-Dynamic, DAC, GAIL, POfD, DDPGfD, BC, and the demonstration.)

Figure 3.3: Comparing SAIL with other on-policy baselines using one suboptimal demonstration trajectory. (Panels: Walker2d, Swimmer, HalfCheetah, Hopper; returns versus steps (10K) for SAIL-Dynamic, SAIL-Onpolicy, GAIL, BC, and the demonstration.)

3.6.2 Performance on Continuous Action-Space Tasks

Sample efficiency: As shown in Figure 3.2, SAIL is the only method that performs consistently better in all tasks in terms of both sample efficiency and asymptotic performance. At the initial stage of learning, SAIL can quickly exploit the suboptimal demonstrations and approach the demonstration's performance with significantly fewer samples.

Table 3.1: Off-policy exploration (SAIL) achieves higher performance than other exploration approaches.

Benchmark              HalfCheetah        Swimmer       Hopper          Walker
SAIL                   10660.59±105.53    309.47±3.00   3302.06±14.22   5868.53±108.82
Curiosity Explore      9043.07±165.97     30.87±6.52    3075.46±15.61   5361.78±58.24
Entropy Explore        8839.32±280.47     65.04±7.66    3079.29±53.25   2792.76±830.20
Teacher Demonstration  5646.71            121.16        1480.69         1675.01

Exploration ability: Besides sample efficiency, another advantage of SAIL is that it can effectively explore the environment to achieve expert-level performance, even with highly sparse rewards. We observe that prior solutions that learn from environment rewards for exploration, such as POfD and DDPGfD, cannot effectively address our proposed problem setting, as it is sample-costly to learn a meaningful critic from the delayed feedback. Unlike other imitation learning baselines whose performance is limited by the demonstrations, SAIL can rapidly surpass the imperfect teacher by constructing a better demonstration buffer.
As illustrated in Figure 3.3, compared to GAIL (green), which is an on-policy baseline, SAIL-OnPolicy (orange) still enjoys the benefits of an off-policy learning scheme in general. However, it is less effective than our proposed approach. Even when π and Q are learned off-policy, SAIL-OnPolicy is slower to surpass the teacher demonstration (dashed gray line), due to its pure imitation-driven objective. SAIL enjoys fast improvement in performance not only because of an adaptive teacher demonstration buffer, but also because it realizes the exploration-driven optimization.

Figure 3.4: Ablation study by removing different algorithmic components from SAIL. Only one teacher trajectory is used as demonstration. (Panels: Walker2d, Swimmer, HalfCheetah, Hopper; returns versus steps (10K) for SAIL-Dynamic, SAIL-without-LfD, SAIL-without-Adaptation, BC, and the demonstration.)

Table 3.2: Using 10^6 interaction samples, the performance of SAIL is robust regardless of the quality of the sub-optimal teacher demonstrations. Each cell reports evaluated performance / demonstration performance; columns correspond to teachers of decreasing demonstration quality.

Benchmark    Evaluated / Demonstration    Evaluated / Demonstration    Evaluated / Demonstration
HalfCheetah  10660.59±105.53 / 5646.71    10217.00±104.08 / 3598.90    9264.10±163.45 / 875.39
Swimmer      309.47±3.00 / 121.16         367.02±1.11 / 46.82          361.17±1.37 / 33.61
Hopper       3302.06±14.22 / 1480.69      3814.07±10.32 / 665.16       3589.92±12.24 / 282.91
Walker       5868.53±108.82 / 1675.01     4819.20±1240.84 / 484.96     4574.61±71.22 / 255.73

Comparison with Other Exploration Approaches: We also compared SAIL with two of its variants to evaluate the effects of different exploration approaches. In particular, we integrated SAIL with a soft actor-critic [61] RL framework to enable entropy-based exploration. For the other variant, we adopted the idea of random-distillation [27] to create a curiosity reward in addition to the reward provided by the discriminator. For both variants, we train the discriminator with on-policy samples in order to remove the effects of off-policy exploration. Comparison results in Table 3.1 indicate that adopting an off-policy exploration approach (SAIL) is more effective given a fixed number of environment interactions. Entropy-based exploration is prone to high variance, while curiosity-based exploration requires learning a forward transition model and achieves lower final performance.

3.6.4 Ablation Study

We further evaluate SAIL by ablation and sensitivity studies to analyze the following aspects:

Effects of learning from expert demonstrations: As shown in Figure 3.4, we observed that sampling from a mixture of teacher data and self-generated data accelerates learning in the early training stages. Specifically, SAIL-Dynamic (blue) refers to our proposed approach, and SAIL-without-LfD (orange) only uses self-generated data to learn the policy by setting α = 0 constantly. We see that SAIL is superior to SAIL-without-LfD in terms of initial performance, which is ascribed to a learning strategy resembling behavior cloning when sampling from teacher demonstrations.

Effects of updating teacher demonstration buffers: As shown in Figure 3.4, SAIL-without-Expert-Adaptation (green) refers to a variant of SAIL which never updates the teacher's replay buffer, even when a better trajectory is collected.
We can observe that its asymptotic performance is bounded by the teacher’s demonstration, which reveals the limitation of most existing IL approaches. One key insight from these results is that, instead of learning critics based on sparse rewards, leveraging the sparse guidance to improve the quality of the teacher can be much more effective in improving the ultimate performance. Robustness of SAIL on different teacher qualities: To evaluate the robustness of SAIL against different teacher performance, we pre-trained a group of teacher policies with varying qualities, ranging from near-randomness to sub-optimality, then used their generated trajectories as demonstrations. For each experiment, we only used one teacher trajectory as demonstrations. Results in Table 3.2 show that SAIL can achieve robust performance no matter how sub-optimal the teacher behaves. Powered by both an exploration-driven objec- tive and a self-adaptive learning strategy, SAIL can constantly explore with more superior 30 trajectories to improve its learning target, which results in improving learning performance. 3.7 Summary In this chapter, we address the problem of reinforcement learning in environments with highly delayed rewards given sub-optimal demonstrations. To address this challenging problem, we propose a novel objective that encourages exploration-based imitation learning. Towards this objective, we design an effective algorithm called Self Adaptive Imitation Learning (SAIL). The proposed approach is validated to (1) accelerate the agent learning process by fully utiliz- ing teacher demonstration for knowledge transfer, (2) address sample efficiency by off-policy imitation learning, and (3) surpass the imperfect teacher with a large margin by iteratively performing imitation and exploration. Experimental results on challenging locomotion tasks indicate that SAIL significantly surpasses state-of-the-arts in terms of both sample efficiency and asymptotic performance. 31 CHAPTER 4 OFF-POLICY LEARNING FROM OBSERVATIONS: KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING FROM INCOMPLETE SUPERVISION Proposed approach in this chapter is based on the following paper: Zhuangdi Zhu, Kaixiang Lin, Bo Dai, and Jiayu Zhou. Off-Policy Imitation Learning from Observations. Advances in Neural Information Processing Systems 33 (2020). 4.1 Introduction Imitation Learning (IL) has been widely studied in the reinforcement learning (RL) domain to assist in learning complex tasks by leveraging the experience from expertise [145, 68, 88, 87, 136]. Unlike conventional RL that depends on environment reward feedbacks, IL can purely learn from expert guidance and is therefore crucial for realizing robotic intelligence in practical applications, where demonstrations are usually easier to access than a delicate reward function [156, 119]. Classical IL, or more concretely, Learning from Demonstrations (LfD), assumes that both states and actions are available as expert demonstrations [1, 68, 88]. Although expert actions can benefit IL by providing elaborated guidance, requiring such information for IL may not always accord with the real world. Actually, collecting demonstrated actions can sometimes be costly or impractical, whereas observations without actions are more accessible resources, such as the camera or sensory logs. Consequently, Learning from Observations (LfO) has been proposed to address the scenario without expert actions [171, 189, 172]. 
On one hand, LfO is more challenging compared with conventional IL, due to missing finer- grained guidance from actions. On the other hand, LfO is a more practical setting for IL, not only because it capitalizes on previously unusable resources, but also because it reveals the potential to realize advanced artificial intelligence. In fact, learning without action guidance is an inherent ability for human beings. For instance, a novice game player can improve his 32 skill purely by watching video records of an expert, without knowing what actions have been taken [10]. Among popular LfD and LfO approaches, distribution matching has served as a principled solution [68, 88, 171, 189, 49], which works by interactively estimating and minimizing the discrepancy between two stationary distributions: one generated by the expert, and the other generated by the learning agent. To correctly estimate the distribution discrepancy, traditional approaches require on-policy interactions with the environment whenever the agent policy gets updated. This inefficient sampling strategy impedes wide applications of IL to scenarios where accessing transitions are expensive [147, 118]. The same challenge is aggravated in LfO, as more explorations by the agent are needed to cope with the lack of action guidance. Towards sample efficiency, some off-policy IL solutions have been proposed to leverage transitions cached in a replay buffer. Mostly designed for LfD, these methods either lack theoretical guarantee by ignoring a potential distribution drift [87, 149, 169], or hinge on the knowledge of expert actions to enable off-policy distribution matching [88], which makes their approach inapplicable to LfO. To address the aforementioned limitations, in this work, we propose a LfO approach that improves sample efficiency in a principled manner. Specifically, we derive an upper bound of the LfO objective which dispenses with the need of knowing expert actions and can be fully optimized with off-policy learning. To further accelerate the learning procedure, we combine our objective with a regularization term, which is validated to pursue distribution matching between the expert and the agent from a mode-covering perspective. Under a mild assump- tion of a deterministic environment, we show that the regularization can be enforced by learning an inverse action model. We call our approach OPOLO (Off POlicy Learning from Observations). Extensive experiments on popular benchmarks show that OPOLO achieves state-of-the-art in terms of both asymptotic performance and sample efficiency. 33 4.2 Background We consider learning an agent in an environment of Markov Decision Process (MDP) [164], which can be defined as a tuple: M = (S, A, P, r, γ, p0 ). Particularly, S and A are the state and action spaces; P is the state transition probability, with P (s0 |s, a) indicating the probability of transitioning from s to s0 upon action a; r is the reward function, with r(s, a) the immediate reward for taking action a on state s; Without ambiguity, we consider an MDP with infinite horizons, with 0 < γ < 1 as a discounted factor; p0 is the initial state distribution. An agent follows its policy π : S → A to interact with this MDP with an objective of maximizing its expected return: hX∞ i h i t max JRL (π) := Es0 ∼p0 ,ai ∼π(·|si ),si+1 ∼P (·|si ,ai ),∀0≤i≤t γ r(st , at ) = E(s,a)∼µπ (s,a) r(s, a) , t=0 in which µ (s, a) is the stationary state-action distribution induced by π, as defined in π Table 4.1. 
Learning from demonstrations (LfD) is a problem setting in which an agent is pro- vided with a fixed dataset of expert demonstrations as guidance, without accessing the envi- ronment rewards. The demonstrations RE contain sequences of both states and actions gen- erated by an expert policy πE : RE = {(s0 , a0 ), (s1 , a1 ), · · · |ai ∼ πE (·|si ), si+1 ∼ P (·|si , ai )}. Without ambiguity, we assume that the expert and agent are from the same MDP. Among LfD approaches, distribution matching has been a popular choice, which min- imizes the discrepancy between two stationary state-action distributions: one is µE (s, a) induced by the expert, and the other is µπ (s, a) induced by the agent. Without loss of generality, we consider KL-divergence as the discrepancy measure for distribution matching, although any f -divergences can serve as a legitimate choice [68, 186, 129] : min JLfD (π) := DKL [µπ (s, a)||µE (s, a)]. (4.1) Learning from observations (LfO) is a more challenging scenario where expert guid- ance RE contains only states. Accordingly, applying distribution matching to solve LfO 34 yields a different objective that involves state-transition distributions [189, 105, 171]: min JLfO (π) := DKL [µπ (s, s0 )||µE (s, s0 )]. (4.2) There exists a close connection between LfO and LfD objectives. In particular, the discrepancy between two objectives can be derived precisely as follows: DKL [µπ (a|s, s0 )||µE (a|s, s0 )] = DKL [µπ (s, a)||µE (s, a)] − DKL [µπ (s, s0 )||µE (s, s0 )]. (4.3) Remark 3. In a non-injective MDP, the discrepancy of DKL [µπ (a|s, s0 )||µE (a|s, s0 )] cannot be optimized without knowing expert actions. In a deterministic and injective MDP, it satisfies that ∀ π : S → A, DKL [µπ (a|s, s0 )||µE (a|s, s0 )] = 0. Despite the potential gap between these two objectives, the LfO objective in Eq (4.2) is still intuitive and valid, as it emphasizes on recovering the expert’s influence on the en- vironment by encouraging the agent to yield the desired state-transitions, regardless of the immediate behavior that leads to those transitions. In this work, we follow this rationale and consider Eq (4.2) as our learning objective, which has also been widely adopted by prior art [171, 170, 161, 22]. We will show later that pursuing this objective is sufficient to recover expertise for various challenging tasks. A common limitation of existing LfO and LfD approaches relies in their inefficient op- timization. Work along this line usually adopts a GAN-style strategy [56] to perform dis- tribution matching. Take the representative work of GAIL [68] as an example, in which a discriminator x : S × A → R and a generator π : S → A are jointly learned to optimize a dual form of the original LfD objective: min max JGAIL (π, x) := EµE (s,a) [log(x(s, a))] + Eµπ (s,a) [log(1 − x(s, a))]. π x During optimization, on-policy transitions in the MDP are used to estimate expectations over µπ . It requires new environment interactions whenever π gets updated and is thus sample inefficient. This inconvenience is echoed in the work of LfO, which inherits the same spirit of on-policy learning [189, 171]. 
In pursuit of sample efficiency, some off-policy solutions 35 State-Action Transition Inverse-Action State Distribution Joint Distribution Distribution Distribution Distribution Notation µπ (s) µπ (s, a) µπ (s, a, s0 ) µπ (s, s0 ) µπ (a|s, s0 ) Support S S ×A S ×A×S S ×S A×S ×S P∞ µπ (s,a)P (s0 |s,a) Definition µπ (s, a)P (s0 |s, a) (s, a, s0 )da R π (1 − γ) t=1 γ t µπt (s) µπ (s)π(a|s) A µ µπ (s,s0 ) Table 4.1: Summarization on different stationary distributions, with µπt (s) = p(st = s|s0 ∼ p0 (·), ai ∼ π(·|si ), si+1 ∼ P (·|si , ai )), ∀i < t). have been proposed. These methods, however, either lack theoretical guarantee [169, 87], or rely on the expert actions [87, 88], which makes them inapplicable to LfO. To improve the sample efficiency of LfO with a principled solution, in the next section we show how we explicitly introduce an off-policy distribution into the LfO objective, from which we derive a feasible upper-bound that enables off-policy optimization without the need of accessing expert actions. 4.3 OPOLO: Off-Policy Learning from Observations 4.3.1 Surrogate Objective The idea of re-using cached transitions to improve sample efficiency has been adopted by many RL algorithms [119, 61, 50, 102]. In the same spirit, we start by introducing an off-policy distribution µR (s, a), which is induced by a dataset R of historical transitions. Choosing KL-divergence as a discrepancy measure, we obtain an upper-bound of the LfO objective by involving µR (s, a): µR (s, s0 )   0 0 (4.4)  π E + DKL µπ (s, a)||µR (s, a) .    DKL µ (s, s )||µ (s, s ) ≤ Eµπ (s,s0 ) log E 0 µ (s, s ) As a result, the LfO objective can be optimized by minimizing the RHS of Eq (4.4). Although widely adopted for its interpretability, KL divergence can be tricky to estimate due to issues of biased gradients [41, 88]. To avoid the potential difficulty in optimization, we further substitute the term DKL [µπ (s, a)||µR (s, a)] in Eq (4.4) by a more aggressive f - 36 divergence, with f (x) = 12 x2 , which serves as an upper-bound of the KL-divergence: DKL [P ||Q] ≤ Df [P ||Q]. (4.5) Our choice of f -divergence can be considered as a variant of Pearson χ2 -divergence with a constant shift, which has also been adopted as a valid measure of distribution discrepan- cies [121, 122]. Compared with KL-divergence, this f -divergence enables unbiased estimation without deteriorating the optimality, whose advantages will become increasingly visible in Section 4.3.2. Built upon the above transformations, we reach an objective that serves as an effective upper-bound of DKL [µπ (s, s0 )||µE (s, s0 )]: µR (s, s0 )   min Jopolo (π) := Eµπ (s,s0 ) log E + Df [µπ (s, a)||µR (s, a)]. (4.6) π µ (s, s0 ) 4.3.2 Off-Policy Transformation Optimization Eq (4.6) is still on-policy and induces additional challenges through the term Df [µπ (s, a)||µR (s, a)]. However, we show that it can be readily transformed into off-policy learning. We first leverage the dual-form of an f -divergence [127]: −Df [µπ (s, a)||µR (s, a)] = inf E(s,a)∼µπ [−x(s, a)] + E(s,a)∼µR [f∗ (x(s, a))], x:S×A→R and use this dual transformation to rewrite Eq (4.6): µR (s, s0 )   min Jopolo (π) ≡ max Eµπ (s,s0 ) − log E − Df [µπ (s, a)||µR (s, a)] π π µ (s, s0 ) µE (s, s0 )   ≡ max min Jopolo (π, x) := Eµπ (s,a,s0 ) log R − x(s, a) + EµR (s,a) [f∗ (x(s, a))]. 
(4.7) π x:S×A→R µ (s, s0 ) E 0 If we consider a synthetic reward as r(s, a, s0 ) = log µµR (s,s 0 ) − x(s, a), the first term in (s,s ) Eq (4.7) resembles an RL return function: J(π) ˆ = E(s,a,s0 )∼µπ (s,a,s0 ) [r(s, a, s0 )]. Observing this similarity, we turn to learn a Q-function by applying a change of variables: µE (s, s0 )   0 0 Q(s, a) = Es0 ∼P (·|s,a),a0 ∼π(·|s0 ) −x(s, a) + log R + γQ(s , a ) . µ (s, s0 ) 37 Equivalently, this Q function is a fixed point of a variant Bellman operator B π Q: µE (s, s0 )   0 0 Q(s, a) = −x(s, a) + Es0 ∼P (·|s,a),a0 ∼π(·|s0 ) log R + γQ(s , a ) = −x(s, a) + B π Q(s, a). µ (s, s0 ) Rewriting x(s, a) = (B π Q − Q)(s, a) and applying it back to Eq (4.7), we finally remove the on-policy expectation by a series of telescoping: max min Jopolo (π, x) ≡ max min Jopolo (π, Q) π x:S×A→R π Q:S×A→R µE (s, s0 ) := E(s,a,s0 )∼µπ (s,a,s0 ) [log − (B π Q − Q)(s, a)] + E(s,a)∼µR (s,a) [f∗ ((B π Q − Q)(s, a))] µR (s, s0 ) = (1 − γ)Es0 ∼p0 ,a0 ∼π(·|s0 ) [Q(s0 , a0 )] + E(s,a)∼µR (s,a) [f∗ ((B π Q − Q)(s, a))]. (4.8) A similar rationale has also been the key component of distribution error correction (DICE) [121, 122, 201]. Based on the above transformation, we propose our main objective: max min Jopolo (π, Q) := (1 − γ)Es0 ∼p0 ,a0 ∼π(·|s0 ) [Q(s0 , a0 )] + EµR (s,a) [f∗ ((B π Q − Q)(s, a))]. π Q:S×A→R (4.9) Specifically, when f (x) = f ∗ (x) = 12 x2 , the second term EµR (s,a) [f∗ ((B π Q − Q)(s, a))] is reminiscent of an Bellman error, for which we can have unbiased estimation by mini-batch gradients. Given access to the off-policy distribution µR (s, a) and the initial distribution p0 , opti- E 0 mization equation 4.9 can be efficiently realized once we resolve the term log µµR (s,s 0 ) contained (s,s ) in B π Q(s, a). 4.3.3 Adversarial Training with Off-Policy Experience E 0 We can take the advantage of GAN training [56] to estimate the term log µµR (s,s 0 ) inside (s,s ) B π Q(s, a), by learning a discriminator D: h i h i 0 0 max E(s,s0 )∼µE (s,s0 ) log(D(s, s )) + E(s,s0 )∼µR (s,s0 ) log(1 − D(s, s )) , D:S×S→R 0 which upon training to optimality, satisfies log( µµER (s,s (s,s ) ∗ 0 0 ) ) = log D (s, s )−log(1−D (s, s )). ∗ 0 E Unlike prior art [68, 171, 87] that requires estimating the ratio of log µµπ , the discriminator 38 in our case is designed to be off-policy in accordance with our proposed objective. Up to this step, optimization equation 4.9 can be achieved by interactively optimizing Q, π, and D with pure off-policy learning. 4.3.4 Policy Regularization as Forward Distribution Matching Optimization 4.9 essentially minimizes an upper-bound of the inverse KL divergence: DKL [µπ (s, s0 )||µE (s, s0 )], which is known to encourage a mode-seeking behavior [55]. Although mode-seeking is more robust to covariate-drift than mode-covering (such as behavior cloning), it requires sufficient explorations to find a reasonable state-distribution, especially at early learning stages. On the other hand, a mode-covering strategy has merits in quickly minimizing discrepancies on the expert distribution, by optimizing a forward KL-divergence such as DKL [πE (a|s)||π(a|s)]. To combine the advantages of both, in this section we show how we further speed up the learning procedure from a mode-covering perspective, without deteriorating the efficacy of our main objective. 
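Before turning to that regularization, the saddle-point objective of Eq (4.9) can be summarized with a short sketch. The notation is our own: we assume deterministic, batched policy and critic callables, use f*(y) = ½y² as above, and let a learned reward function stand in for the density ratio log(µE/µR), as later instantiated in Algorithm 4.1.

```python
# Minimal sketch of Eq (4.9):
#   J(pi, Q) = (1 - gamma) * E_{s0}[Q(s0, pi(s0))]
#            + E_{(s,a,s') ~ mu^R}[ f*( r(s, s') + gamma * Q(s', pi(s')) - Q(s, a) ) ]
# where r(s, s') approximates log(mu^E(s, s') / mu^R(s, s')).
import torch

def opolo_objective(policy, critic, reward_fn, s0, s, a, s_next, gamma=0.99):
    # first term: initial-state value under the current policy
    initial_term = (1.0 - gamma) * critic(s0, policy(s0)).mean()

    # second term: f* of the Bellman-residual-like quantity on replay transitions
    r = reward_fn(s, s_next)                           # proxy for log(mu^E / mu^R)
    residual = r + gamma * critic(s_next, policy(s_next)) - critic(s, a)
    residual_term = (0.5 * residual ** 2).mean()       # f*(y) = 0.5 * y^2

    return initial_term + residual_term

# The critic descends this objective while the policy ascends it (plus the
# inverse-action regularizer introduced below), yielding the saddle point of Eq (4.9).
```

We now return to the goal of accelerating the learning procedure from a mode-covering perspective.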
To achieve this goal, we first derive an optimizable lower-bound from a mode-covering objective: DKL [πE (a|s)||π(a|s)] = DKL [µE (s0 |s)||µπ (s0 |s)] + DKL [µE (a|s, s0 )||µπ (a|s, s0 )], (4.10) in which we define µπ (s0 |s) = π(a|s)P (s0 |s, a)da as the conditional state transition dis- R A tribution induced by π, likewise for µE (s0 |s). Similar to Remark 3, the discrepancy DKL [µE (a|s, s0 )||µπ (a|s, s0 )] is not optimizable with- out knowing expert actions. However, under some mild assumptions, we found it feasible to optimize the other term DKL [µE (s0 |s)||µπ (s0 |s)] by enforcing a policy regularization: Remark 4. In a deterministic MDP, assuming the support of µE (s, s0 ) is covered by µR (s, s0 ), s.t. µE (s, s0 ) > 0 =⇒ µR (s, s0 ) > 0, then regulating policy using µR (·|s, s0 ) minimizes the forward distributional divergence DKL [µE (s0 |s)||µπ (s0 |s)]: ∃π̃ : S → A, s.t. ∀(s, s0 ) ∼ µE (s, s0 ), π̃(·|s) ∝ µR (·|s, s0 ) =⇒ π̃ = arg min DKL [µE (s0 |s)||µπ (s0 |s)]. π 39 Intuitively, when expert labels are unavailable, this regularization can be considered as performing states matching, by encouraging the policy to yield actions that lead to desired footprints. Given a transition s → s0 from the expert observations, a conditional distribution µR (·|s, s0 ) only has support on actions that yield this transition s → s0 . Therefore, following this regularization avoids the policy from drifting to undesired states. In practice, we can estimate µR (·|s, s0 ) by learning an inverse action model PI using off-policy transitions from µR (s, a, s0 ) to optimize the following: max −DKL [µR (a|s, s0 )||PI (a|s, s0 )] ≡ max E(s,a,s0 )∼µR (s,a,s0 ) [log PI (a|s, s0 )]. (4.11) PI :S×S→A PI :S×S→A 4.3.5 Algorithm Based on all the abovementioned building blocks, we now introduce OPOLO in Algorithm 4.1. OPOLO involves learning a policy π, a critic Q, a discriminator D, and an inverse action regularizer PI , all of which can be done through off-policy training. In particular, π and Q is jointly learned to find a saddle-point solution to optimiza- tion equation 4.9. The discriminator D assists this process by estimating a density ratio E 0 log µµR (s,s 0 ) . For better empirical performance, we adopt − log(1 − D(s, s )) as the discrim- (s,s ) 0 inator’s output, which corresponds to a constant shift inside the logarithm term, in that E 0 log( µµR (s,s 0 ) + 1) = − log(1 − D (s, s )). The inverse action model PI serves as a regularizer (s,s ) ∗ 0 to infer proper actions on the expert observation distribution to encourage mode-covering . 4.4 Related Work The recent development on imitation learning can be divided into two categories: Learning from Demonstrations (LfD) traces back to behavior cloning (BC ) [138], in which a policy is pre-trained to minimize the prediction error on expert demonstrations. This approach is inherent with issues such as distribution shift and regret propagations. To address these limitations, [145] proposed a no-regret IL approach called DAgger, which however requires online access to oracle corrections. More recent LfD approaches favor Inverse reinforcement learning (IRL) [1], which work by seeking a reward function that 40 Algorithm 4.1: Off-Policy Learning from Observations 1: Input: expert observations RT , off-policy-transitions R, initial states S0 , f - function, 2: policy πθ , critic Qφ , discriminator Dw , inverse action model PI ϕ , learning rate α. 3: for n = 1, . . . 
do 4: sample trajectory τ ∼ πθ , R ← R ∪ τ 5: update Dw : w ← w + αÊ(s,s0 )∼RE [Ow log(Dw (s, s0 )] + Ê(s,s0 )∼R [Ow log(1 − Dw (s, s0 ))]. 6: set r(s, s0 ) = − log(1 − Dw (s, s0 )). 7: update PI ϕ : ϕ ← ϕ + αÊ(s,a,s0 )∼R [Oϕ log(PI ϕ (a|s, s0 ))]. 8: update πθ and Qφ : h  9: J(πθ , Qφ ) = (1 − γ)Ês∼S0 [Qφ (s, πθ (s))] + Ê(s,a,s0 )∼R f ∗ r(s, s0 ) + γQφ (s0 , πθ (s0 )) − i Qφ (s, a) . 10: JReg (πθ ) = E(s,s0 )∼RE ,a∼PI ϕ (·|s,s0 ) [log πθ (a|s)].  11: φ ← φ − αJOφ (πθ , Qφ ); θ ← θ + α JOθ (πθ , Qφ ) + JOθ JReg (πθ ) . guarantees the superiority of expert demonstrations, based on which regular RL algorithms can be used to learn a policy [165, 209]. A representative instantiation of IRL is Generative Adversarial Imitation Learning (GAIL) [68]. It defines IL as a distribution matching problem and leverages the GAN technique [56] to minimize the Jensen-Shannon divergence between distributions induced by the expert and the learning policy. The success of GAIL has inspired a variety of other work, including those adopting different RL frameworks [87], or choosing different divergence measures [49, 136, 9] to enhance the effectiveness of imitation learning. Most work along this line focuses on on-policy learning, which is a sample-costly strategy. As an off-policy extension of GAIL , DAC [87] improves the sample efficiency by re-using previous samples stored in a relay buffer rather than on-policy transitions. Similar ideas of reusing cached transitions can be found in [149]. One limitation of these approaches is that they neglected the discrepancy induced when replacing the on-policy distribution with off- policy approximations, which results in a deviation from their proposed objective. Another off-policy imitation learning approach is ValueDICE [88], which inherits the idea of DICE [121] to transform an on-policy LfD objective to an off-policy one. This approach, however, requires the information of expert actions, which otherwise makes off-policy estimation un- reachable in a model-free setting. Therefore, their approach is not directly applicable to 41 LfO. Learning from Observations (LfO) tackles a more challenging scenario where ex- pert actions are unavailable. Work alone this line falls into model-free and model-based approaches. GAIfO [171] is a model-free solution that applies the principle of GAIL to learn a discriminator with state-only inputs. IDDM [189] further analyzed the theoretical gap between the LfD and LfO objectives, and proved that a lower-bound of this gap can be somewhat alleviated by maximizing the mutual-information between (s, (a, s0 )), given an on-policy distribution µπ (s, a, s0 ). Its performance is comparable to GAIL. [22] assumed that the given observation sequences are ranked by superiority, based on which a reward function is designed for policy learning. Similar to GAIL, the sample efficiency of these approaches is suboptimal due to their on-policy strategy. Model-based LfO can be further organized into learning a forward [161, 44] dynamics model or an inverse action model [169, 105]. Especially, [161] proposed a forward model solution to learn time-dependent policies for finite-horizon tasks, in which the number of policies to be learned equals the number of transition steps. This approach may not be suit- able for tasks with long or infinite horizons. Behavior cloning from observations (BCO) [169] learns an inverse model to infer actions missing from the expert dataset, after which behavior cloning is applied to learn a policy. 
Besides the common issues faced by BC, this strategy does not guarantee that the ground-truth expert actions can be recovered, unless is a deter- ministic and injective MDP is assumed. Some other recent work focused on different problem settings than ours, in which the expert observations are collected with different transition dynamics [52] or from different viewpoints [105, 107, 157]. Readers are referred to [172] for further discussions of LfO. 4.5 Experiments We compare OPOLO against state-of-the-art LfD and LfO approaches on MuJuCo bench- marks, which are locomotion tasks in continuous state-action space. In accordance with our assumption in Sec 4.3.4, these tasks have deterministic dynamics. Original rewards are re- 42 moved from all benchmarks to fit into an IL scenario. For each task, we collect 4 trajectories from a pre-trained expert policy. All illustrated results are evaluated across 5 random seeds. Baselines: We compared SAIL against 7 baselines. We first selected 5 representative approaches from prior work: GAIL (on-policy LfD), DAC (off-policy LfD), ValueDICE (off- policy LfD), GAIfO (on-policy LfO), and BCO (off-policy LfO). We further designed two strong off-policy approaches, Specifically, we built DACfO, which is a variation of DAC that learns the discriminator on (s, s0 ) instead of (s, a), and ValueDICEfO, which is built based on ValueDICE. Instead of using ground-truth expert actions, ValueDICEfO learns an inverse model by optimizing Eq (4.11), and uses the approximated actions generated by the inverse model to fit an LfO problem setting. To the best of our knowledge, DACfO and ValueDICEfO have not been investigated by any prior art. Among these baselines, GAIL, DAC, and ValueDICE are provided with both expert states and actions, while all other approaches only have access to expert states. More experimental details can be found in the supplementary material. Our experiments focus on answering the following important questions: 1. Asymptotic performance: Is OPOLO able to achieve expert-level performance given a limited number of expert observations? 2. Sample efficiency: Can OPOLO recover expert policy using less interactions with the environment, compared with the state-of-the-art? 3. Effects of the inverse action regularization: Does the inverse action regularization useful in speeding up the imitation learning process? 4. Sensitivity of the choice of f -divergence: Can OPOLO perform well given different f functions? 4.5.1 Performance Comparison OPOLO can recover expert performance given a fixed budget of expert observations. As shown in Figure 4.1, OPOLO reaches (near) optimal performance in all benchmarks. For 43 simpler tasks such as Swimmer and InvertedPendulum, most baselines can successfully re- cover expertise. For other complex tasks with high state-action space, on-policy baselines, such as GAIL and GAIfO, are struggling to reach their asymptotic performance within a limited number of interactions, As shown in Figure 4.2, the off-policy baseline BCO is prone to sub-optimality due to its behavior cloning-like strategy, On the other hand, the perfor- mance of ValueDICEfO can be deteriorated by potential action-drifts, as the inferred actions are not guaranteed to recover expertise. For fair comparison, performance of all off-policy approaches are summarized in Table 4.2 given a fixed number of interaction steps. 
The asymptotic performance of OPOLO is 1) superior to DACfO and ValueDICEfO, 2) comparable to DAC, and 3) more robust against overfitting than ValueDICE, even though both DAC and ValueDICE enjoy the advantage of off-policy learning and extra action guidance.

Table 4.2: Evaluated performance of off-policy approaches. Results are averaged over 50 trajectories. (S, A) denotes the state and action dimensions of each task.

Env          HalfCheetah        Hopper            Walker             Swimmer        Ant
BCO          3881.10±938.81     1845.66±628.41    421.24±135.18      256.88±4.52    1529.54±980.86
OPOLO-x      7632.80±128.88     3581.85±19.08     3947.72±97.88      246.62±1.56    5112.04±321.42
OPOLO        7336.96±117.89     3517.39±25.16     3803.00±979.85     257.38±4.28    5783.57±651.98
DAC          6900.00±131.24     3534.42±10.27     4131.05±174.13     232.12±2.04    5424.28±594.82
DACfO        7035.63±444.14     3522.95±93.15     3033.02±207.63     185.28±2.67    4920.76±872.66
ValueDICE    5696.94±2116.94    3591.37±8.60      1641.58±1230.73    262.73±7.76    3486.87±1232.25
ValueDICEfO  4770.37±644.49     3579.51±10.23     431.00±140.87      265.05±3.45    75.08±400.87
Expert       7561.78±181.41     3589.88±2.43      3752.67±192.80     259.52±1.92    5544.65±76.11
(S, A)       (17, 6)            (11, 3)           (17, 6)            (8, 2)         (111, 8)

4.5.2 Sample Efficiency

OPOLO is comparable with, and sometimes superior to, DAC in all evaluated tasks, and is much more sample-efficient than on-policy baselines. As shown in Figure 4.1, the sample efficiency of OPOLO is emphasized by benchmarks with high state-action dimensions. In particular, for tasks such as Ant or HalfCheetah, the performance curves of on-policy baselines barely improve at early learning stages. One intuition is that they need more explorations to build the current support of the learning policy, which cannot benefit from cached transitions. For these challenging tasks, OPOLO is even more sample-efficient than DAC, which has the guidance of expert actions. We ascribe this improvement to the mode-covering regularization of OPOLO enforced by its inverse action model, whose effect is further analyzed in the ablation study of Sec 4.5.3. Meanwhile, other off-policy approaches such as BCO and ValueDICEfO are prone to overfitting and performance degradation (as shown in Figure 4.2), which indicates that the effect of the inverse model alone is not sufficient to recover expertise. On the other hand, the ValueDICE algorithm, although sample-efficient, is not designed to address LfO and requires expert actions.

Figure 4.1: Interaction steps versus learning performance. Compared with others, our proposed approach (OPOLO) is the most sample-efficient to reach expert-level performance (grey horizontal line). (Panels: Swimmer-v2, InvertedPendulum-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Hopper-v2; returns versus interaction steps for OPOLO, BCO, DAC, GAIfO, GAIL, and the expert.)

4.5.3 Ablation Study

In this section, we further analyze the effects of the inverse action regularization through a group of ablation studies. Specifically, we implement a variant of OPOLO that does not learn an inverse action model to regulate the policy update. We compare this approach, dubbed OPOLO-x, against our original approach as well as the DAC algorithm.
Effects on sample efficiency: Performance curves in Figure 4.3 show that removing the inverse action regularization from OPOLO slightly affects its sample efficiency, although the degraded version is still comparable to DAC. This impact is more visible in challenging tasks such as HalfCheetah and Ant. From another perspective, the same phenomenon indicates that an inverse action regularization is beneficial for accelerating the IL process, especially for tasks with high-dimensional observation spaces. An intuitive explanation is that, while our main objective serves as a driving force for mode-seeking, the regularization term assists by encouraging the policy to perform mode-covering. Combining these two motivations leads to a more efficient learning strategy.

Figure 4.2: Compared with strong off-policy baselines, OPOLO is the only one that consistently achieves competitive performance across all tasks, without accessing expert actions. (Panels: Swimmer-v2, Ant-v2, Walker2d-v2, Hopper-v2, Humanoid-v2, HalfCheetah-v2; returns versus interaction steps for OPOLO, ValueDICE, DICEfO, DACfO, BCO, DAC, and the expert.)

Figure 4.3: Removing the inverse action regularization (OPOLO-x) results in a slight efficiency drop, although its performance is still comparable to DAC and OPOLO. (Panels: Swimmer-v2, InvertedPendulum-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Hopper-v2; returns versus interaction steps for OPOLO, OPOLO-x, DAC, and the expert.)

Effects on performance: Given a reasonable number of transition steps, the effect of the inverse-action model is less obvious regarding the asymptotic performance. As shown in Table 4.2, OPOLO-x is mostly comparable to OPOLO and DAC. This implies that the effect of the state-covering regularization gradually fades out once the policy learns a reasonable state distribution. From another perspective, it indicates that following our main objective alone is sufficient to recover expert-level performance. Compared with BCO, which uses the inverse model solely for behavior cloning, we find the inverse model more effective when serving as a regularization that assists distribution matching from a forward direction.

4.5.4 Sensitivity Analysis

To analyze the effects of different f-functions on the performance of the proposed approach, we explored a family of f-divergences where f(x) = (1/p)|x|^p and f*(y) = (1/q)|y|^q, s.t. 1/p + 1/q = 1, p, q > 1, as adopted by DualDICE [121].

Figure 4.4: Performance given different f-functions. (Ant-v2; returns versus interaction steps for OPOLO with q = 3/2, q = 3, q = 4, and the expert.)

Evaluation results show that OPOLO yields reasonable performance across
different f -functions, although our choice (q = p = 2 ) turns out to be most stable. Results using the Ant task is illustrated in Figure 4. 4.6 Summary Towards sample-efficient imitation learning from observations (LfO), we proposed a princi- pled approach that performs imitation learning by accessing only a limited number of expert observations. We derived an upper bound of the original LfO objective to enable efficient off-policy optimization, and augment the objective with an inverse action model regulariza- 47 tion to speeds up the learning procedure. Extensive empirical studies are done to validate the proposed approach. 48 CHAPTER 5 DATA FREE KNOWLEDGE TRANSFER FOR HETEROGENEOUS FEDERATED LEARNING This chapter is based on the work accompanying the following paper: Zhu, Zhuangdi, Junyuan Hong, and Jiayu Zhou. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. 5.1 Introduction Federated Learning (FL) is an effective machine learning approach that enables the decen- tralization of computing and data resources. Classical FL, represented by FedAvg [116], obtains an aggregated model by iteratively averaging the parameters of distributed local user models, therefore omits the need of accessing their data. Serving as a communication- efficient and privacy-preserving learning scheme, FL has shown its potential to facilitate real-world applications, including healthcare [154], biometrics [3], and natural language pro- cessing [63, 6], to name just a few. Along with its promising prospect, FL faces practical challenges from data heterogene- ity [98], in that user data from real-world is usually non-iid distributed, which inherently induces deflected local optimum [80]. Moreover, the permutation-invariant property of deep neural networks has further increased the heterogeneity among user models [198, 178]. Thus, performing element-wise averaging of local models, as adopted by most existing FL ap- proaches, may not induce an ideal global model [99, 98]. A variety of efforts have been made to tackle user heterogeneity, mainly from two comple- mentary perspectives: one focuses on stabilizing local training, by regulating the deviation of local models from a global model over the parameter space [98, 40, 80]. This approach may not fully leverage the underlying knowledge across user models, whose diversity suggests 49 informative structural differences of their local data and thus deserves more investigation. Another aims to improve the efficacy of model aggregation [198, 30], among which knowledge distillation has emerged as an effective solution [103, 96]. Provided with an unlabeled dataset as the proxy, knowledge distillation alleviates the model drift issue induced by heterogene- ity, by enriching the global model with the ensemble knowledge from local models, which is shown to be more effective than simple parameter-averaging. However, the prerequisite of a proxy data can leave such an approach infeasible for many applications, where a carefully engineered dataset may not always be available on the server. Moreover, by only refining the global model, the inherent heterogeneity among user models is not fully addressed, which may in turn affect the quality of the knowledge ensemble, especially if they are biased due to limited local data [82], which is a typical case for FL. 
Observing the challenge in the presence of user heterogeneity and the limitations of prior art, in this work we propose a data-free knowledge distillation approach for FL, dubbed FeDGen (Federated Distillation via Generative Learning). Specifically, FeDGen learns a generative model derived solely from the prediction rules of user models, which, given a target label, can yield feature representations that are consistent with the ensemble of user predictions. This generator is later broadcast to users, escorting their model training with augmented samples over the latent space, which embody the distilled knowledge from peer users. Given a latent space with a dimension much smaller than the input space, the generator learned by FeDGen can be lightweight, introducing minimal overhead to the current FL framework.

The proposed FeDGen enjoys multifold benefits: i) it extracts the knowledge of users that would otherwise be diluted by model averaging, without depending on any external data; ii) contrary to certain prior work that only refines the global model, our approach directly regulates local model updates using the extracted knowledge, and we show that such knowledge imposes an inductive bias on local models, leading to better generalization performance under non-iid data distributions; iii) furthermore, the proposed approach is ready to address more challenging FL scenarios where sharing entire model parameters is impractical due to privacy or communication constraints, since it only requires the prediction layer of local models for knowledge extraction.

Extensive empirical studies, echoed by theoretical elaborations, show that our proposed approach yields a global model with better generalization performance using fewer communication rounds, compared with the state-of-the-art.

5.2 Notations and Preliminaries

Without ambiguity, in this work we discuss a typical FL setting for supervised learning, i.e., the general problem of multi-class classification. Let X ⊂ R^p be an instance space, Z ⊂ R^d a latent feature space with d < p, and Y ⊂ R an output space. T denotes a domain, which consists of a data distribution D over X and a ground-truth labeling function c*: X → Y, i.e., T := ⟨D, c*⟩. Note that we use the terms domain and task interchangeably. A model parameterized by θ := [θ^f; θ^p] consists of two components: a feature extractor f: X → Z parameterized by θ^f, and a predictor h: Z → Δ^Y parameterized by θ^p, where Δ^Y is the simplex over Y. Given a non-negative, convex loss function l: Δ^Y × Y → R, the risk of a model parameterized by θ on domain T is defined as L_T(θ) := E_{x∼D}[ l( h(f(x; θ^f); θ^p), c*(x) ) ].

Federated Learning aims to learn a global model parameterized by θ that minimizes its risk on each of the user tasks T_k [116]:

\min_\theta \; \mathbb{E}_{T_k \in \mathcal{T}} \left[ \mathcal{L}_k(\theta) \right], \quad (5.1)

where T = {T_k}_{k=1}^K is the collection of user tasks. We consider all tasks sharing the same labeling rule c* and loss function l, i.e., T_k = ⟨D_k, c*⟩. In practice, Equation 5.1 is empirically optimized by

\min_\theta \; \frac{1}{K}\sum_{k=1}^{K} \hat{\mathcal{L}}_k(\theta), \quad \text{where} \quad \hat{\mathcal{L}}_k(\theta) := \frac{1}{|\hat{\mathcal{D}}_k|} \sum_{x_i \in \hat{\mathcal{D}}_k} l\big(h(f(x_i; \theta^f); \theta^p), c^*(x_i)\big)

is the empirical risk over an observable dataset D̂_k. An implied assumption for FL is that the global data D̂ is distributed across the local domains, with D̂ = ∪_{k=1}^K D̂_k.
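As a concrete reference for the empirical form of Equation 5.1, the following PyTorch-style sketch (function and variable names such as fedavg_round and local_loaders are illustrative, not taken from the released implementation) performs one round of local empirical-risk minimization followed by element-wise parameter averaging, which is the FedAvg baseline that the rest of this chapter builds on.

import copy
import torch

def fedavg_round(global_model, local_loaders, loss_fn, lr=0.01, local_steps=20):
    """One FedAvg-style round: every user minimizes its empirical risk on its own
    dataset D_k, then the server averages the resulting parameters element-wise."""
    local_states = []
    for loader in local_loaders:                 # one DataLoader per user dataset
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        it = iter(loader)
        for _ in range(local_steps):
            try:
                x, y = next(it)
            except StopIteration:
                it = iter(loader)
                x, y = next(it)
            opt.zero_grad()
            loss_fn(local(x), y).backward()      # empirical risk on D_k
            opt.step()
        local_states.append(local.state_dict())
    avg_state = {}
    for key, ref in local_states[0].items():
        if ref.dtype.is_floating_point:          # average weights and float buffers
            avg_state[key] = torch.stack([s[key] for s in local_states]).mean(0)
        else:                                    # keep integer buffers as-is
            avg_state[key] = ref.clone()
    global_model.load_state_dict(avg_state)
    return global_model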
Knowledge Distillation (KD) is also referred to as a teacher-student paradigm, with the goal of learning a lightweight student model using knowledge distilled from one or more powerful teachers [26, 11]. Typical KD leverages a proxy dataset D̂^P to minimize the discrepancy between the logit outputs of the teacher model θ_T and the student model θ_S. A representative choice is to use the Kullback-Leibler divergence to measure such discrepancy [67]:

\min_{\theta_S} \; \mathbb{E}_{x \sim \hat{\mathcal{D}}^P}\Big[ D_{KL}\big[\sigma\big(g(f(x;\theta_T^f);\theta_T^p)\big) \,\|\, \sigma\big(g(f(x;\theta_S^f);\theta_S^p)\big)\big] \Big],

where g(·) is the logit output of a predictor h, and σ(·) is the non-linear activation applied to such logits, i.e., h(z; θ^p) = σ(g(z; θ^p)). The idea of KD has been extended to FL to tackle user heterogeneity [103, 30], by treating each user model θ_k as a teacher, whose information is aggregated into the student (global) model θ to improve its generalization performance:

\min_{\theta} \; \mathbb{E}_{x \sim \hat{\mathcal{D}}^P}\Big[ D_{KL}\Big[\sigma\Big(\tfrac{1}{K}\textstyle\sum_{k=1}^{K} g(f(x;\theta_k^f);\theta_k^p)\Big) \,\|\, \sigma\big(g(f(x;\theta^f);\theta^p)\big)\Big] \Big].

One primary limitation of the above approach resides in its dependence on a proxy dataset D̂^P, the choice of which needs delicate consideration and plays a key role in the distillation performance [103]. Next, we show how we make KD feasible for FL in a data-free manner.

Figure 5.1: Overview of FeDGen: a generator G_w(·|y) is learned by the server to aggregate information from different local clients without observing their data. The generator is then sent to local users, whose knowledge is distilled to user models to adjust their interpretations of a good feature distribution.

Algorithm 5.1: Data-Free Federated Distillation via Generative Learning
1: Require: Tasks {T_k}_{k=1}^K;
2: Global parameters θ, local parameters {θ_k}_{k=1}^K;
3: Generator parameter w; p̂(y) uniformly initialized;
4: Learning rates α, β, local steps T, batch size B, local label counter c_k.
5: repeat
6:   Server selects active users A uniformly at random, then broadcasts w, θ, p̂(y) to A.
7:   for all users k ∈ A in parallel do
8:     θ_k ← θ
9:     for t = 1, ..., T do
10:      Sample {x_i, y_i}_{i=1}^B ∼ T_k and {ẑ_i ∼ G_w(·|ŷ_i), ŷ_i ∼ p̂(y)}_{i=1}^B.
11:      Update label counter c_k.
12:      θ_k ← θ_k − β∇_{θ_k} J(θ_k).    ▷ Optimize Equation 5.5
13:    User sends θ_k, c_k back to the server.
14:  Server updates θ ← (1/|A|) Σ_{k∈A} θ_k, and p̂(y) based on {c_k}_{k∈A}.
15:  w ← w − α∇_w J(w).    ▷ Optimize Equation 5.4
16: until training stop

5.3 FeDGen: Data Free Federated Distillation via Generative Learning

In this section, we elaborate on our proposed approach, with a summary shown in Algorithm 5.1. An overview of its learning procedure is illustrated in Figure 5.1.

5.3.1 Knowledge Extraction

Our core idea is to extract knowledge about the global view of the data distribution, which is otherwise non-observable by conventional FL, and distill such knowledge to local models to guide their learning. We first consider learning a conditional distribution Q*: Y → X to characterize such knowledge, which is consistent with the ground-truth data distributions:

Q^* = \arg\max_{Q:\,\mathcal{Y}\to\mathcal{X}} \; \mathbb{E}_{y\sim p(y)}\,\mathbb{E}_{x\sim Q(x|y)}\big[\log p(y|x)\big], \quad (5.2)

where p(y) and p(y|x) are the ground-truth prior and posterior distributions of the target labels, respectively, both of which are unknown. To make Equation 5.2 optimizable w.r.t. Q, we replace p(y) and p(y|x) with their empirical approximations. First, we estimate p(y) as

\hat{p}(y) \propto \sum_{k} \mathbb{E}_{x\sim\hat{\mathcal{D}}_k}\big[\mathbb{I}(c^*(x) = y)\big],

where I(·) is an indicator function and D̂_k is the observable data for domain T_k.
In practice, p̂(y) can be obtained by collecting the training label counts from users during the model uploading phase. Next, we approximate p(y|x) using the ensemble wisdom of the user models:

\log \hat{p}(y|x) \propto \frac{1}{K}\sum_{k=1}^{K} \log p(y|x;\theta_k).

Equipped with the above approximations, directly optimizing Equation 5.2 over the input space X can still be prohibitive: it brings computational overhead when X is of high dimension, and it may also leak information about the user data profile. A more approachable idea is hence to recover an induced distribution G*: Y → Z over a latent space, which is more compact than the raw data space and can alleviate certain privacy-related concerns:

G^* = \arg\max_{G:\,\mathcal{Y}\to\mathcal{Z}} \; \mathbb{E}_{y\sim\hat{p}(y)}\,\mathbb{E}_{z\sim G(z|y)}\Big[\sum_{k=1}^{K} \log p(y|z;\theta_k^p)\Big]. \quad (5.3)

Following the above reasoning, we aim to perform knowledge extraction by learning a conditional generator G, parameterized by w, to optimize the following objective:

\min_{w} \; J(w) := \mathbb{E}_{y\sim\hat{p}(y)}\,\mathbb{E}_{z\sim G_w(z|y)}\Big[l\Big(\sigma\Big(\tfrac{1}{K}\textstyle\sum_{k=1}^{K} g(z;\theta_k^p)\Big), y\Big)\Big], \quad (5.4)

where g and σ are the logit output and the activation function as defined in Section 5.2. Given an arbitrary sample y, optimizing Equation 5.4 only requires access to the predictor modules θ_k^p of the user models. Specifically, to enable diversified outputs from G(·|y), we introduce a noise vector ε ∼ N(0, I) to the generator, which resembles the reparameterization technique proposed by prior art [83], so that z ∼ G_w(·|y) ≡ G_w(y, ε | ε ∼ N(0, I)). We discuss more implementation details in the supplementary.

Given arbitrary target labels y, the proposed generator can yield feature representations z ∼ G_w(·|y) that induce ideal predictions from the ensemble of user models. In other words, the generator approximates an induced image of a consensual distribution, which is consistent with the user data from a global view.

5.3.2 Knowledge Distillation

The learned generator G_w is then broadcast to local users, so that each user model can sample from G_w to obtain augmented representations z ∼ G_w(·|y) over the feature space. As a result, the objective of a local model θ_k is altered to maximize the probability that it yields ideal predictions for the augmented samples:

\min_{\theta_k} \; J(\theta_k) := \hat{\mathcal{L}}_k(\theta_k) + \hat{\mathbb{E}}_{y\sim\hat{p}(y),\, z\sim G_w(z|y)}\big[l\big(h(z;\theta_k^p),\, y\big)\big], \quad (5.5)

where L̂_k(θ_k) := (1/|D̂_k|) Σ_{x_i∈D̂_k} l( h(f(x_i; θ_k^f); θ_k^p), c*(x_i) ) is the empirical risk given local data D̂_k. We show later that the augmented samples introduce an inductive bias to local users, reinforcing their model learning with better generalization performance.

To this end, our proposed approach realizes data-free knowledge distillation, by interactively learning a lightweight generator that primarily depends on the prediction rules of local models, and by leveraging the generator to convey consensual knowledge to local users. We justify in Section 5.6.2 that our approach can effectively handle user heterogeneity in FL, and it also enjoys theoretical advantages, as analyzed in Section 5.4.
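The two objectives above can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration under our reading of Equations 5.4 and 5.5 (class and function names such as ConditionalGenerator, generator_step, and local_step are ours, not from the released code; the diversity loss and label-count bookkeeping of Algorithm 5.1 are omitted).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    """Lightweight MLP generator G_w(z | y): one-hot label + Gaussian noise -> latent feature."""
    def __init__(self, num_classes, noise_dim, hidden_dim, latent_dim):
        super().__init__()
        self.noise_dim = noise_dim
        self.num_classes = num_classes
        self.net = nn.Sequential(
            nn.Linear(num_classes + noise_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, y):
        eps = torch.randn(y.size(0), self.noise_dim, device=y.device)
        y_onehot = F.one_hot(y, self.num_classes).float()
        return self.net(torch.cat([y_onehot, eps], dim=1))

def generator_step(generator, user_predictors, label_prior, batch_size, opt):
    """One server update in the spirit of Equation 5.4: synthetic features should be
    classified as their target labels by the averaged logits of the user predictor
    heads. `opt` optimizes only the generator; the heads stay fixed."""
    y = torch.multinomial(label_prior, batch_size, replacement=True)
    z = generator(y)
    logits = torch.stack([head(z) for head in user_predictors]).mean(dim=0)
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def local_step(model_f, model_h, generator, label_prior, x, y, batch_size, opt):
    """One user update in the spirit of Equation 5.5: empirical risk on local data
    plus the distillation term on generator-augmented latent samples."""
    y_aug = torch.multinomial(label_prior, batch_size, replacement=True)
    with torch.no_grad():
        z_aug = generator(y_aug)                 # generator is kept fixed locally
    loss = F.cross_entropy(model_h(model_f(x)), y) + F.cross_entropy(model_h(z_aug), y_aug)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()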
5.3.3 Extensions for Flexible Parameter Sharing

In addition to tackling data heterogeneity, FeDGen can also handle a challenging FL scenario where sharing the entire model conflicts with communication or privacy prerequisites. On one hand, advanced networks with deep feature extraction layers typically contain millions of parameters [65, 24], which bring significant burdens to communication. On the other hand, it has been shown feasible to backdoor regular FL approaches [177]. For practical FL applications such as healthcare or finance, sharing entire model parameters may be associated with considerable privacy risks, as discussed in prior work [64].

FeDGen is ready to alleviate those problems by sharing only the prediction layer θ_k^p of local models, which is the primary information needed to optimize Equation 5.4, while keeping the feature extractor θ_k^f localized. This partial-sharing paradigm is more efficient and, at the same time, less vulnerable to data leakage, compared with a strategy that shares the entire model. The empirical study in Section 5.6.4 shows that FeDGen significantly benefits local users even without sharing feature extraction modules. We defer the algorithmic summary of this variant approach to the supplementary.

Figure 5.2: After KD, accuracy has improved from 81.2% to 98.4% for one user (Fig 5.2a to Fig 5.2b), while a global model obtained by parameter averaging (without KD) has 93.2% accuracy (Fig 5.2c), compared with an oracle model with 98.6% accuracy (Fig 5.2d). Panels: (a) decision boundary before KD; (b) decision boundary after KD; (c) global model decision boundary; (d) oracle decision boundary.

Figure 5.3: Samples from the generator gradually approach the real distribution, where each user model (teacher) sees limited, disjoint local data. The background color indicates oracle decision boundaries learned over the global data. Panels: r(x|y) randomly initialized, and r(x|y) learned after 50, 150, and 250 batches.

5.4 FeDGen Analysis

In this section, we provide multiple perspectives to understand our proposed approach. We first visualize what knowledge is learned and distilled by FeDGen, then analyze why the distilled knowledge is favorable, from the viewpoints of distribution matching and domain adaptation, respectively. We primarily focus on interpreting the rationale behind FeDGen and leave detailed discussion and derivations to the supplementary.

5.4.1 Knowledge Distillation for Inductive Bias

We illustrate the KD process in FeDGen on an FL prototype, which contains three users, each assigned a disjoint dataset D̂_k, k ∈ {1, 2, 3}. When trained using only the local data, a user model is prone to learning biased decision boundaries (see Figure 5.2a). Next, a generator G_w(·|y) is learned based on the prediction rules of the user models. For clear visualization, we learn G_w(·|y) on the raw feature space Y → X ⊂ R^2 instead of a latent space. As shown in Figure 5.3, r(x|y), which denotes the distribution derived from G_w(x|y), gradually coincides with the ground-truth p(x|y) (Figure 5.2d), even when the individual local models are biased. In other words, G_w(x|y) can fuse the aggregated information from user models to approximate a global data distribution. We then let users sample from G_w(x|y), which serves as an inductive bias for users with limited data. As a result, each user can observe beyond its own training data and adjust its decision boundaries to approach the ensemble wisdom (Figure 5.2b).

5.4.2 Knowledge Distillation for Distribution Matching

A notable difference between FeDGen and prior work is that the knowledge is distilled to user models instead of the global model. As a result, the distilled knowledge, which conveys inductive bias to users, can directly regulate their learning by performing distribution matching over the latent space Z:

Remark 5.
Let p(y) be the prior distribution of labels, and r(z|y): Y → Z be the conditional distribution derived from the generator G_w. Then regulating a user model θ_k using samples from r(z|y) minimizes the conditional KL-divergence between the two distributions derived from the generator and the user, respectively:

\max_{\theta_k} \; \mathbb{E}_{y\sim p(y),\, z\sim r(z|y)}\big[\log p(y|z;\theta_k)\big] \;\equiv\; \min_{\theta_k} \; D_{KL}\big[r(z|y)\,\|\,p(z|y;\theta_k)\big], \quad (5.6)

where we define p(z|y; θ_k) as the probability that the input feature to the predictor θ_k is z given that it yields label y. In practice, Equation 5.6 is optimized using empirical samples from the generator, {(z, y) | y ∼ p̂(y), z ∼ G_w(z|y)}, which is consistent with the second term of the local model objective (Equation 5.5), in that ∀ y ∈ Y:

\max_{\theta_k} \; \mathbb{E}_{z\sim r(z|y)}\big[\log p(y|z;\theta_k)\big] \;\approx\; \min_{\theta_k} \; \mathbb{E}_{z\sim G_w(z|y)}\big[l\big(h(z;\theta_k^p),\, y\big)\big].

Distinguished from prior work that applies weight regularization to local models [98, 40], FeDGen serves as an alternative and compatible solution to address user heterogeneity, which inherently bridges the gap among user models w.r.t. their interpretations of an ideal feature distribution.

5.4.3 Knowledge Distillation for Improved Generalization

One can also draw a theoretical connection from the knowledge learned by FeDGen to an improved generalization bound. To see this, we first present a performance bound for the aggregated model in FL, which is built upon prior arts from domain adaptation [17, 16]:

Theorem 2. (Generalization Bounds for FL) Consider an FL system with K users. Let T_k = ⟨D_k, c*⟩ and T = ⟨D, c*⟩ be the k-th local domain and the global domain, respectively. Let R: X → Z be a feature extraction function that is simultaneously shared among users. Denote by h_k the hypothesis learned on domain T_k, and by h = (1/K) Σ_{k=1}^K h_k the global ensemble of user hypotheses. Then with probability at least 1 − δ:

\mathcal{L}_T(h) \equiv \mathcal{L}_T\Big(\frac{1}{K}\sum_k h_k\Big) \;\le\; \frac{1}{K}\sum_k \hat{\mathcal{L}}_{T_k}(h_k) + \frac{1}{K}\sum_k \Big(d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k, \tilde{\mathcal{D}}) + \lambda_k\Big) + \sqrt{\frac{4}{m}\Big(d\log\frac{2em}{d} + \log\frac{4K}{\delta}\Big)},

where L̂_{T_k}(h_k) is the empirical risk on T_k, and λ_k := min_h (L_{T_k}(h) + L_T(h)) denotes an oracle performance. d_{H∆H}(D̃_k, D̃) denotes the divergence measured over a symmetric-difference hypothesis space. D̃_k and D̃ are the induced images of D_k and D over R, respectively, s.t. E_{z∼D̃_k}[B(z)] = E_{x∼D_k}[B(R(x))] for a probability event B, and analogously for D̃.

Specifically, L_T(h) is usually considered an ideal upper bound for the global model in FL [135, 103]. Two key implications can be derived from Theorem 2: i) large user heterogeneity leads to high distribution divergence (d_{H∆H}(D̃_k, D̃)), which undermines the quality of the global model; ii) more empirical samples (m) are favorable to the generalization performance, which softens the numerical constraints. In other words, the generalization performance can be improved by enriching local users with augmented data that aligns with the global distribution:

Corollary 1. Let T, T_k, R be defined as in Theorem 2. Let D_A denote an augmented distribution, and D'_k = ½(D_k + D_A) a mixture of distributions. Accordingly, D̃_A and D̃'_k denote the induced images of D_A and D'_k over R, respectively. Let D̂'_k = D̂_k ∪ D̂_A be an empirical dataset of D'_k, with |D̂_k| = m and |D̂'_k| = m' > m.
If d_{H∆H}(D̃_A, D̃) is bounded, s.t. ∃ ε > 0 with d_{H∆H}(D̃_A, D̃) ≤ ε, then with probability at least 1 − δ:

\mathcal{L}_T(h) \;\le\; \frac{1}{K}\sum_k \mathcal{L}_{T_k'}(h_k) + \frac{1}{K}\sum_k \Big(d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k', \tilde{\mathcal{D}}) + \lambda_k'\Big) + \sqrt{\frac{4}{m'}\Big(d\log\frac{2em'}{d} + \log\frac{4K}{\delta}\Big)}, \quad (5.7)

where T'_k = ⟨D'_k, c*⟩ is the updated local domain, d_{H∆H}(D̃'_k, D̃) ≤ d_{H∆H}(D̃_k, D̃) when ε is small, and \sqrt{\tfrac{4}{m'}\big(d\log\tfrac{2em'}{d} + \log\tfrac{4K}{\delta}\big)} < \sqrt{\tfrac{4}{m}\big(d\log\tfrac{2em}{d} + \log\tfrac{4K}{\delta}\big)}.

Such an augmented distribution D_A can facilitate FL from multiple aspects: not only does it relax the numerical constraints with more empirical samples (m' > m), but it also reduces the discrepancy between the local and global feature distributions (d_{H∆H}(D̃'_k, D̃)). This finding coincides with the merits of FeDGen: since the generator G_w(z|y) is learned to recover an aggregated distribution over the feature space, one can treat samples from the generator, {z | y ∼ p̂(y), z ∼ G_w(z|y)}, as augmented data from D̃_A, which naturally has a small deviation from the global induced distribution D̃. More rigorous analysis along this line is left to our future work. We elaborate on the role of such an augmentation distribution D_A in the supplementary.

5.5 Related Work

Federated Learning (FL) was first proposed by [116] as a decentralized machine learning paradigm. Subsequent work along this line tackles different challenges faced by FL, including heterogeneity [80, 98, 113], privacy [43, 2], communication efficiency [58, 86], and convergence analysis [77, 139, 197]. Specifically, a wealth of work has been proposed to handle user heterogeneity, by regularizing model weight updates [98], allowing personalized user models [45, 40], or introducing new model aggregation schemes [198, 113]. We refer readers to [97] for an organized discussion of recent progress on FL.

Knowledge Distillation (KD) is a technique to compress knowledge from one or more teacher models into a student model [67, 26, 11, 74]. Conventional KD hinges on a proxy dataset [67]. More recent work enables KD with less data involved, such as dataset distillation [180] or core-set selection [173, 152]. Later, data-free KD approaches emerged which aim to reconstruct the samples used for training the teacher [193, 117]. In particular, [110] extracts meta-data from the teacher's activation layers, and [193] learns a conditional generator that yields samples which maximize the teacher's prediction probability of a target label. In the same spirit, [117] learns a generator by adversarial training. Different from prior work, we learn a generative model that is tailored for FL, by ensembling the knowledge of multiple user models over the latent space, which is more lightweight for learning and communication.

Knowledge Distillation in Federated Learning has recently emerged as an effective approach to tackle user heterogeneity. Most existing work is data-dependent [103, 159, 58, 30]. In particular, [103] proposed FedDFusion, which performs KD to refine the global model, assuming that an unlabeled dataset is available with samples from the same or similar domains. Complementary KD efforts have been made to confront data heterogeneity [96, 150]. Specifically, [96] transmits the proxy dataset instead of the model parameters. FedAUX [150] performs data-dependent distillation by leveraging an auxiliary dataset to initialize the server model and to weighted-ensemble user models, while FeDGen performs knowledge distillation in a data-free manner.
FedMix [194] is a data-augmented FL framework, where users share their batch-averaged data among others to assist local training. On the contrary, FeDGen extracts knowledge from the existing user model parameters, which faces fewer privacy risks. FedDistill (Federated Distillation), proposed by [153], extracts from user models the statistics of the logit-vector outputs and shares this meta-data with users for KD. We provide detailed comparisons with work along this line in Section 5.6.

5.6 Experiments

In this section, we compare the performance of our proposed approach with other key related work. We leave implementation details and extended experimental results to the supplementary.

Figure 5.4: Visualization of statistical heterogeneity among users on the Mnist dataset for α = 0.05, 0.1, 1, and 10, where the x-axis indicates user IDs, the y-axis indicates class labels, and the size of the scattered points indicates the number of training samples for a label available to that user.

Top-1 Test Accuracy (%).

Dataset        Setting     FedAvg       FedProx      FedEnsemble   FedDistill    FedDistill+   FedDFusion    FeDGen
Mnist, T=20    α = 0.05    87.70±2.07   87.49±2.05   88.85±0.68    70.56±1.24    86.70±2.27    90.02±0.96    91.30±0.74
               α = 0.1     90.16±0.59   90.10±0.39   90.78±0.39    64.11±1.36    90.28±0.89    91.11±0.43    93.03±0.32
               α = 1       93.84±0.25   93.83±0.29   93.91±0.28    79.88±0.66    94.73±0.15    93.37±0.40    95.52±0.07
CelebA, T=20   r = 5/10    87.48±0.39   87.67±0.39   88.48±0.23    76.68±1.23    86.37±0.41    87.01±1.00    89.70±0.32
               r = 5/25    89.13±0.25   88.84±0.19   90.22±0.31    74.99±1.57    88.05±0.43    88.93±0.79    89.62±0.34
               r = 10/25   89.12±0.20   89.01±0.33   90.08±0.24    75.88±1.17    88.14±0.37    89.25±0.56    90.29±0.47
EMnist, T=20   α = 0.05    62.25±2.82   61.93±2.31   64.99±0.35    60.49±1.27    61.56±2.15    70.40±0.79    68.53±1.17
               α = 0.1     66.21±2.43   65.29±2.94   67.53±1.19    50.32±1.39    66.06±3.18    70.94±0.76    72.15±0.21
               α = 10      74.83±0.69   74.24±0.81   74.90±0.80    54.77±0.33    75.55±0.94    74.36±0.40    78.43±0.74
EMnist, α=1    T = 20      74.83±0.99   74.12±0.88   75.12±1.07    46.19±0.70    75.41±1.05    75.43±0.37    78.48±1.04
               T = 40      77.02±1.09   75.93±0.95   77.68±0.98    46.72±0.73    78.12±0.90    77.58±0.37    78.92±0.73

Table 5.1: Performance overview. For Mnist and EMnist, a smaller α indicates higher heterogeneity. For CelebA, r denotes the ratio between active users and total users. T denotes the number of local training steps.

5.6.1 Setup

Baselines: In addition to FedAvg [116], FedProx regularizes the local model training with a proximal term in the model objective [98]. FedEnsemble extends FedAvg to ensemble the prediction outputs of all user models.

Figure 5.5: Visualized performance w.r.t. data heterogeneity. Panels: (a) EMnist dataset; (b) CelebA dataset; (c) Mnist on CNN; (d) Mnist on MLP.
Figure 5.6: Selected learning curves (test classification accuracy vs. communication rounds), averaged over 3 random seeds. Panels: (a) CelebA, r = 0.4; (b) CelebA, r = 0.5; (c) EMnist, α = 1; (d) Mnist, α = 1.

FedDFusion is a data-based KD approach [103], for which we provide unlabeled training samples as the proxy dataset. FedDistill [75] is a data-free KD approach which shares label-wise averages of logit vectors among users. It does not share network parameters and therefore experiences non-negligible performance drops compared with other baselines. For a fair comparison, we derive a baseline from FedDistill which shares both model parameters and the label-wise logit vectors. We name this stronger baseline FedDistill+.

Dataset: We conduct experiments on three image datasets: Mnist [94], EMnist [35], and CelebA [109], as suggested by the Leaf FL benchmark [29]. Among them, Mnist and EMnist are for digit and character image classification, and CelebA is a celebrity-face dataset used to learn a binary classification task, i.e., to predict whether the celebrity in the picture is smiling.

Configurations: Unless otherwise mentioned, we run 200 global communication rounds, with 20 user models in total and an active-user ratio r = 50%. We adopt local updating steps T = 20, and each step uses a mini-batch of size B = 32. We use at most 50% of the total training dataset, distributed to user models, and use all testing data for performance evaluation. For the classifier, we follow the network architecture of [116], and treat the last MLP layer as the predictor θ_k^p and all previous layers as the feature extractor θ_k^f. The generator G_w is MLP-based. It takes a noise vector ε and a one-hot label vector y as input, which, after a hidden layer with dimension d_h, outputs a feature representation with dimension d. To further increase the diversity of the generator output, we also leverage the idea of a diversity loss from prior work [114] to train the generator model.

User heterogeneity: For the Mnist and EMnist datasets, we follow prior arts [103, 71] to model non-iid data distributions using a Dirichlet distribution Dir(α), in which a smaller α indicates higher data heterogeneity, as it makes the distribution of p_k(y) more biased for a user k. We visualize the effects of different α on the statistical heterogeneity of the Mnist dataset in Figure 5.4. For CelebA, the raw data is naturally non-iid distributed. We further increase the data heterogeneity by aggregating pictures belonging to different celebrities into disjoint groups, with each group assigned to one user.
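For reproducibility, the following NumPy sketch (the function name dirichlet_partition is ours; the exact partitioning script used for the experiments may differ) illustrates how a Dirichlet prior Dir(α) can be used to split class labels across K users, where a smaller α yields a more skewed p_k(y).

import numpy as np

def dirichlet_partition(labels, num_users, alpha, seed=0):
    """Split sample indices across users so that each class's samples are
    distributed according to a draw from Dir(alpha); smaller alpha -> more skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    user_indices = [[] for _ in range(num_users)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_users))
        # Convert proportions into split points over this class's samples.
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for u, part in enumerate(np.split(idx, splits)):
            user_indices[u].extend(part.tolist())
    return [np.array(sorted(u)) for u in user_indices]

# Example: 10-class toy labels split across 20 users with alpha = 0.1 (highly non-iid).
toy_labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(toy_labels, num_users=20, alpha=0.1)
assert sum(len(p) for p in parts) == len(toy_labels)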
5.6.2 Performance Overview

From Table 5.1, we can observe that FeDGen outperforms other baselines by a considerable margin.

Impacts of data heterogeneity: FeDGen is the only algorithm that is robust against different levels of user heterogeneity while consistently performing well. As shown in Figure 5.5, the gain of FeDGen is more notable when the data distributions are highly heterogeneous (with a small α). This result verifies our motivation, since the advantage of FeDGen is induced from the knowledge distilled to local users, which mitigates the discrepancy of latent distributions across users. This knowledge is otherwise not accessible by baselines such as FedAvg or FedProx.

As one of the most competitive baselines, FedDFusion loses its advantage as data heterogeneity is mitigated, and it gradually becomes comparable to FedAvg, as shown in Figure 5.5a and Figure 5.5c. Unlike FedDFusion, the performance gain of our approach is consistently significant, and it outperforms FedDFusion in most cases. This discrepancy implies that our proposed approach, which directly distills knowledge to user models, can be more effective than fine-tuning the global model using a proxy dataset, especially when the distilled knowledge contains inductive bias to guide local model learning.

As a data-free KD baseline, FedDistill experiences non-negligible performance drops, which implies the importance of parameter sharing in FL. FedDistill+, on the other hand, is vulnerable to data heterogeneity. As shown in Table 5.1, it can outperform FedAvg when data distributions are near-iid (e.g., when α ≥ 1), thanks to the shared logit statistics as the distilled knowledge, but it performs worse than FedAvg when α gets smaller, which indicates that sharing such meta-data alone may not be effective enough to confront user heterogeneity. FedEnsemble enjoys the benefit of ensemble predictions from all user models, although its gain is less significant compared with FeDGen. We ascribe the leading performance of our approach to the better-generalized performance of local models. Guided by the distilled knowledge, a user model in FeDGen can quickly jump out of its local optimum, and the aggregation of such models can be better than the ensemble of potentially biased models as in FedEnsemble.

Learning efficiency: As shown in Figure 5.6, FeDGen exhibits the most rapid learning curves, reaching a given accuracy in fewer communication rounds than other baselines. Although FedDFusion enjoys higher learning efficiency than the remaining baselines under certain data settings, owing to the advantages induced from proxy data, our approach can directly benefit each local user with actively learned knowledge, whose effect is more explicit and consistent (more illustrations in the supplementary).

Comments on sharing the generative model: Given a compact latent space, the generative model can be lightweight for learning and downloading. In practice, we use a generator network with 2 compact MLP layers, whose parameter size is small compared with the user classification model. The above empirical results also indicate that the leading performance gain, combined with a faster convergence rate, can trade off the communication load brought by sharing a generative model.

Figure 5.7: Effects of synthetic samples (test accuracy of FedAvg vs. FeDGen for synthetic sampling sizes B_G = 8, 16, 32, 64, 128).

Effects of the Generator Network Structure.

[d_ε, d_h]     [64, 256]     [32, 256]     [32, 128]     [16, 128]     [32, 64]
FeDGen (%)     71.61±0.25    72.09±0.46    72.43±0.57    72.01±0.76    70.98±0.85
(FedAvg accuracy: 66.22±2.58%)

Table 5.2: Effects of the generator's network structure, using the EMnist dataset with α = 0.1.
Performance w.r.t. synthetic sample sizes.

G sampling size          B_G = 8       B_G = 16      B_G = 32      B_G = 64      B_G = 128
FeDGen, training
time per update (ms)     57.20±2.22    57.39±2.21    58.17±2.24    58.91±2.29    60.06±2.32
(FedAvg training time per update: 47.66±1.68 ms)

Table 5.3: Effects of the number of synthetic samples, using the EMnist dataset with α = 0.1.

5.6.3 Sensitivity Analysis

Impacts of straggler users: We explore different numbers of total users versus active users on the CelebA dataset, with the active ratio r ranging from 0.2 to 0.9. Figure 5.5b shows that our approach is second only to FedEnsemble when the number of straggler users is high (r = 0.2, with 5 out of 25 active users per learning round), and it is consistently better than all baselines w.r.t. the asymptotic performance given a moderate number of active users. Combined with Figure 5.6a and Figure 5.6b, one can observe that our approach requires far fewer communication rounds to reach high performance, regardless of the straggler-user setting.

Effects of different network architectures: We conduct this analysis on the Mnist dataset, using both CNN and MLP network architectures. As shown in Figure 5.5d and Figure 5.5c, the outstanding performance of FeDGen is consistent across the two network settings, although the overall performance trained with CNN networks is noticeably higher than that with MLP networks.

Effects of communication frequency: We explore different local updating steps T on EMnist, so that a higher T means longer communication delays before global communication. Results in Table 5.1 indicate that our approach is robust against different levels of communication delay (see supplementary for more results).

Figure 5.8: Learning curves on Mnist with limited parameter sharing, for α = 0.05, 0.1, 1, and 10.

Top-1 Accuracy (%)

Algorithms   FedAvg        FedDistill+   FedDFusion    FeDGen
α = 0.05     59.67±0.76    58.83±0.62    59.62±0.84    63.60±0.63
α = 0.1      58.39±0.74    56.25±0.98    58.38±0.81    65.42±0.29
α = 1        74.49±0.57    74.24±0.60    74.51±0.55    79.72±0.52
α = 10       86.35±0.60    86.89±0.26    86.28±0.69    87.92±0.46

Table 5.4: Performance overview on the Mnist dataset, by only sharing the last prediction layer.

Effects of the generator's network architecture and sampling size: Extended analysis has verified that FeDGen is robust across different generator network architectures (Table 5.2). Moreover, sampling synthetic data from the generator only adds a minor training workload to local users (Table 5.3). The gain of FeDGen over FedAvg is consistently remarkable given different synthetic sample sizes, and a sufficient number of synthetic samples brings even better performance (Figure 5.7). Specifically, in Table 5.2, we explored different dimensions for the input noise (d_ε) and the hidden layer (d_h) of the generator while keeping its output-layer dimension fixed (i.e., the dimension of the feature space Z). Table 5.3 shows the training time for one local update, averaged across users and communication rounds. B_G denotes the number of synthetic samples used for each mini-batch optimization.
By default, we set B_G = B, where B is the number of real samples drawn from the local dataset (see Algorithm 5.1).

5.6.4 Extensions to Flexible Parameter Sharing

Motivated to alleviate privacy and communication concerns, FeDGen is ready to benefit distributed learning without sharing entire model parameters. To explore this potential, we conduct a case study on FedAvg, FedDistill+, and FeDGen, where user models share only the last prediction layer and keep their feature extraction layers localized. Note that FedDFusion is not designed to address FL with partial parameter sharing, as it requires entire user models for KD. For a fair comparison, we modify FedDFusion to let it upload entire user models during the model aggregation phase, but disable the downloading of feature extractors, so that the server model can still be fine-tuned using the proxy data.

Results in Table 5.4 show that our approach consistently outperforms other baselines by a remarkable margin, and this trend is more significant given high data heterogeneity (Figure 5.8). Its distinguished performance over FedDFusion verifies the efficacy of data-free distillation under this challenging scenario. These promising results show that FeDGen has the potential to further reduce communication workload, not only by fast convergence but also by a flexible parameter-sharing strategy.

5.7 Summary

In this chapter, we propose an FL paradigm that enables efficient knowledge distillation to address user heterogeneity without requiring any external data. Extensive empirical experiments, guided by theoretical implications, have shown that our proposed approach can benefit federated learning with better generalization performance using fewer communication rounds, compared with the state-of-the-art.

CHAPTER 6
FEDRESCUE: SELF-KNOWLEDGE DISTILLATION FOR RESILIENT AND COMMUNICATION EFFICIENT FEDERATED LEARNING

This chapter is based on our research accompanying the following paper: Zhu, Zhuangdi, Junyuan Hong, Steve Drew, and Jiayu Zhou. Resilient and Communication Efficient Learning for Heterogeneous Federated Learning. Proceedings of the 39th International Conference on Machine Learning, 2022.

6.1 Introduction

Federated Learning (FL) is a decentralized machine learning scheme that eliminates private data sharing on participating devices. Recent years have witnessed the rapid development of FL in varied domains, including healthcare [144], computer vision [106], natural language processing [63, 116], and the Internet of Things (IoT) [81, 42], to name just a few. The rapid adoption and deployment of edge computing have enabled computation to move even closer to the source of data and users. The growing demand for privacy-aware and low-latency machine learning at the edge makes FL a natural fit.

The diversity among participating devices and their network topologies is substantial, which imposes significant challenges of statistical and system heterogeneity on FL. Statistical heterogeneity in FL has been extensively tackled by techniques such as regularized optimization [40], customized model aggregation [178], and data augmentation [205]. In comparison, system heterogeneity is induced by significant gaps in memory capacity and transmission bandwidth among edge devices, and its impact is under-explored. Moreover, traditional FL hinges on reliable connections, where model parameters are transmitted between edge devices and a central server without packet loss.
This prerequisite can be prohibitive for practical edge-based applications, including autonomous driving and IoT, where devices such as wearables and vehicles can frequently opt in, opt out, or move around. Under faulty network connections, prior FL solutions may become fragile when model parameters fail to be intactly shared among active users due to transmission interruptions. This connection uncertainty is bidirectional: it exists both when a participant downloads and when it uploads parameter updates, leading to non-negligible information loss for the participating devices. To the best of our knowledge, few pioneering efforts have been made to address transmission uncertainty in FL, leaving most FL learning schemes at potential risk of undermined performance.

Observing the system heterogeneity and connection uncertainty in FL, in this paper we propose an FL framework that addresses both challenges simultaneously. In our approach, edge devices learn a prunable neural network by self-distillation, such that a model can be structurally pruned into submodels that contain adequate knowledge of the learning domain. Towards effective optimization, we propose progressive learning, which articulates the knowledge representation of the model in a nested and progressive structure. This strategy enables FL participants with diverging model architectures to fully devote their knowledge to model aggregation. Furthermore, powered by a sequential model transmission paradigm, our approach is especially beneficial in amortizing the risk of connection interruptions, since the model parameters partially transmitted before an interruption can still contribute self-contained domain knowledge to the recipient.

To the best of our knowledge, we are the first to address both system heterogeneity and connection uncertainty in FL by progressively learning self-distilled networks. Extensive empirical studies have verified that our proposed approach, dubbed FedResCuE, is Resilient to unstable transmission connections while Communication Efficient under system heterogeneity, and it achieves high asymptotic performance compared with the state-of-the-art.

6.2 Problem Setting

6.2.1 Preliminaries of Federated Learning

Without loss of generality, we consider a learning setting that addresses a representative problem of multi-class classification. Let T = {T_k}_{k=1}^K denote the learning domains of edge devices, where a domain T_k = ⟨X_k × Y_k⟩ is defined by a joint input and output distribution. Let θ_k represent local model parameters, and w global model parameters, which are usually obtained via parameter-wise averaging over {θ_k}_{k=1}^K [116]. Denote by L: Δ^Y × Y → R^+ the loss function recognized by all domains, where Δ^Y is a simplex over Y. The objective of FL is to learn a global model that generalizes well on all devices:

w^* = \arg\min_{w} \; \mathbb{E}_{T_k \sim \mathcal{T}}\big[\mathcal{L}(f(\mathcal{X}_k; w), \mathcal{Y}_k)\big],

which is approximated with empirical data in practice:

\hat{w}^* = \arg\min_{w} \; \frac{1}{K}\sum_{k=1}^{K}\Big(\frac{1}{n_k}\sum_{i=1}^{n_k} \mathcal{L}\big(f(x_k^i; w),\, y_k^i\big)\Big),

where {x_k^i, y_k^i}_{i=1}^{n_k} ⊂ T_k. An FL system typically involves four iterative phases: i) the downloading phase, when the server broadcasts a global model to active users; ii) the local learning phase, when active users update their local model parameters; iii) the uploading phase, when active users send parameters back to the server; and iv) the aggregation phase, when the server derives a global model using user-uploaded parameters.
6.2.2 Learning with System Heterogeneity

In this paper, we tackle FL under system heterogeneity, where edge devices can learn local models θ_k with various network architectures due to different capacities in memory and transmission bandwidth [69, 39]. To enable FL with system heterogeneity, participants first agree on the largest model architecture (i.e., a ×1 network), while smaller models in this system are treated as pruned versions (i.e., a ×p network) with a pruning ratio p < 1. In Deep Neural Networks (DNNs), such pruning is manifested as reducing the number of channels or filters. For instance, the weight matrix w ∈ R^{m×n} in a Convolution or Linear layer l will be pruned to w[: m × p, : n × p] by ratio p. (Without loss of generality, in this paper we unify the pruning ratio p for all layers of a model, although such pruning can be extended to choose different ratios p_l for different layers l.)

This strategy leaves one potential drawback for model aggregation, in that the global model is obtained via parameter-wise averaging of edge models with heterogeneous sizes. Naively learning and aggregating such model parameters may ignore the divergence in their feature extraction patterns induced by architecture heterogeneity, leading to impaired global model performance. Consequently, the knowledge uploaded by users with diverging model sizes might not be absorbed well by their peers.

6.2.3 Learning with Unstable Network Connection

Another challenge tackled in this paper is connection instability in FL, which differs from the straggler issue discussed in prior art [143, 98]. The former refers to active users transmitting partial instead of complete model parameters due to connection interruption, while the latter results from inactive users that did not participate in model learning. Connection interruption may occur bidirectionally, either during the downloading phase or the uploading phase. It is a common issue that can be induced by multiple factors, including bandwidth, transmission power, noise density, and interference [31, 126], yet not enough effort has been made to address it. A naive solution to connection interruption is to ignore the faulty-connected devices and treat them as stragglers [32, 98], which may waste the local learning of those devices that could otherwise be leveraged to improve the global model.

6.3 Resilient and Communication Efficient FL

Towards addressing the challenges of system heterogeneity and unstable network connections, we aim to learn neural networks that can be structurally decomposed for learning, inference, and transmission. Observing the natural property of deep neural networks, we propose to vertically decompose a model into a sequence of columns, where a column can be one or more of the model channels described in Section 6.2.2, depending on the desired granularity.
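As a concrete illustration of the ×p slicing described in Section 6.2.2, the sketch below (the function name prune_linear_by_ratio is ours; a real convolution layer would slice its filter dimensions analogously) keeps only the first fraction p of output and input units of a Linear layer's weight matrix.

import torch
import torch.nn as nn

def prune_linear_by_ratio(layer: nn.Linear, p: float) -> nn.Linear:
    """Structurally prune a Linear layer to a x p submodel by keeping the first
    int(out_features * p) output units and int(in_features * p) input units,
    i.e. w[: m * p, : n * p] in the notation of Section 6.2.2."""
    m = max(1, int(layer.out_features * p))
    n = max(1, int(layer.in_features * p))
    pruned = nn.Linear(n, m, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[:m, :n])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[:m])
    return pruned

full = nn.Linear(128, 64)
sub = prune_linear_by_ratio(full, p=0.25)   # a x0.25 slice of the full layer
print(sub.weight.shape)                     # torch.Size([16, 32])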
This proposed paradigm enjoys twofold benefits:

• During local learning, the predictive knowledge is progressively captured in model columns and can be incrementally enriched with more column connections. This benefits FL with system heterogeneity, in that the knowledge from heterogeneous models can be structurally aligned in the global model.

• During FL synchronization, model columns are sequentially transmitted between the server and the edge device. This makes FL resilient to unstable connections, since losing some of the tailing columns upon interruption does not lead to a catastrophic undermining of domain representations, and a lightweight submodel that is successfully transmitted still contains intact predictive knowledge. Moreover, this divide-and-transmit strategy is also in accord with lower-level transmission protocols. We demonstrate this process in Figure 6.1.

Orthogonal to our work, there are personalized FL approaches, which either divide a model horizontally and then transmit the feature extraction layers [8], or selectively transmit parameters with unordered structures [158]. Contrarily, in our FL paradigm, the received columns can be readily assembled for learning and inference. Therefore, instead of discarding the partially received parameters upon connection interruption, they can be effectively utilized for global aggregation or local model initialization.

Figure 6.1: Model parameters are divided and learned as columns, which are then transmitted sequentially between the server and clients, until an interruption occurs to one column, or until all columns are transmitted successfully.

6.3.1 Learning Self-Distilled Local Models

We aim to learn a model that can be structurally decomposed, is arbitrarily prunable with reduced columns, and dispenses with the need for fine-tuning. We name such a model self-distilled; more concretely, it shall satisfy the following objective:

\theta^* = \arg\min_{\theta\in\Theta} \; \mathcal{L}(f(\mathcal{X};\theta), \mathcal{Y}) + \mathbb{E}_{p\sim\mathcal{P}}\Big[\mathcal{L}(f(\mathcal{X};\theta_{\times p}), \mathcal{Y}) + D_{KL}\big[f(\mathcal{X};\theta)\,\|\,f(\mathcal{X};\theta_{\times p})\big]\Big], \quad (6.1)

where Θ denotes the parameter space; X × Y is the learning domain; P = {p | p ≤ 1, ∀θ ∈ Θ, f(X; θ_{×p}) ⊆ Δ^Y} is the set of legitimate pruning ratios; and D_KL[p‖q] denotes the KL-divergence between distributions p and q.

The notion of self-distillation is embodied by two components in Equation 6.1. The first is L(f(X; θ_{×p}), Y), which induces the largest model θ to maintain arbitrary submodels θ_{×p} that are effective for the learning domain. The other is D_KL[f(X; θ)‖f(X; θ_{×p})], which encourages the predictive knowledge captured by θ (the teacher), manifested as the learned posterior p(·|X; θ) ∝ f(X; θ), to be distilled to the submodel θ_{×p} (the student) by distribution matching. We verify in Section 6.5.6 that the D_KL term is especially beneficial when a submodel itself is not representative enough to capture sophisticated features due to a limited number of channels; the posterior distribution from the teacher then serves as finer-grained guidance in addition to the label supervision. Moreover, not only is the learned self-distilled network robust against connection loss, it also benefits FL under system heterogeneity, as the knowledge of an edge model with a larger structure can be adequately conveyed by its submodels, which will be aggregated by the server in the next round and shared with users of smaller model capacities as an inductive bias.
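The objective in Equation 6.1 can be written down directly once a model exposes a ratio-restricted forward pass. The sketch below is our simplified reading of it; the model(x, ratio=p) interface is an assumption of this sketch rather than a standard PyTorch API, and it differs from the column-wise optimization of Section 6.3.2, which is illustrated separately later.

import torch
import torch.nn.functional as F

def self_distillation_loss(model, x, y, prune_ratios):
    """One-batch estimate of Equation 6.1: cross-entropy for the full model, plus,
    for each sampled ratio p, cross-entropy of the x p submodel and a KL term that
    matches the submodel to the (detached) full-model posterior."""
    full_logits = model(x, ratio=1.0)
    loss = F.cross_entropy(full_logits, y)
    teacher = F.softmax(full_logits.detach(), dim=1)         # teacher posterior, no gradient
    sub_losses = []
    for p in prune_ratios:                                    # e.g. [0.25, 0.5, 0.75]
        sub_logits = model(x, ratio=p)
        ce = F.cross_entropy(sub_logits, y)
        kl = F.kl_div(F.log_softmax(sub_logits, dim=1), teacher,
                      reduction="batchmean")                  # D_KL[teacher || student]
        sub_losses.append(ce + kl)
    return loss + torch.stack(sub_losses).mean()              # Monte-Carlo estimate of E_{p~P}[.]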
Specifically, we first sample a batch of ordered pruning ratios P̂ = [pi |pi ∈ P, pi < pi+1 ∀i < S, pS = 1]Si=1 , then adaptively optimize each sampled submodel towards its objec- tive function. Once a smaller submodel is updated (e.g. θ×p1 ), we fix its parameters and update parameters in the subsequent model columns (e.g. θ×p2 \θ×p1 ). This learning scheme leverages the idea of coordinate descent [184], which works by successively optimizing one co- ordinate while fixing the others. As illustrated in Figure 6.2, predictive knowledge is learned progressively by adding more lateral connections. More concretely, we update a sampled θ×pi as the following: θ×pi ← θ×pi − η∇{θ×pi \θ×pi−1 } J(x; θ×pi ), (6.2) where J(x; θ×pi ) is the objective function for the current submodel θ×pi ; η is the learning rate, and ∇{ θpi \θ×pi−1 } denotes the gradients w.r.t. parameters that are included in θ×pi but not in θ×pi−1 . In particular, we tailor the objective for each submodel θ×pi as the following: J(x; θ×pi ) = L(f (x; θ×pi ), y) + αi DKL [f (x; θ̄)kf (x; θ×pi )], (6.3) where θ̄ is a constant cache of the largest network learned from the last iteration, which serves as the teacher; αi := I[pi < 1] indicates the necessity of knowledge distillation, which renders 0 if the tail of the model columns is sampled (i.e. pi = 1), and 1 otherwise. Once all sampled submodels have been visited, we perform an one-time back-propagation to update the teacher. We summarize this model learning approach in Algorithm 6.1. Besides being readily prunable, the merits of progressive learning are multifold. First, it accelerates training by adaptively reaching a good initialization, which resembles meta- 75 learning techniques [46]. Second, it alleviates the overfitting issue especially when the teacher model structure is surplus given insufficient training data, which we verify in Section 6.5.4. Furthermore, it also alleviates the permutation-invariant issue in deep neural networks [178] by inducing nested and ordered domain representations captured in submodels. Figure 6.2: A self-distilled network is learned via progressively updating columns of param- eters. Algorithm 6.1: Progressive Self-Distillation 1: Inputs: Training dataset D ⊂ X × Y; model with parameter set θ ∈ Θ; pruning ratios P, learning rate η, loss function L, constant S ≤ |P|. 2: repeat 3: Sample batch of x, y ∼ D. 4: Sample ordered ratios P̂ = [pi |pi ∈ P, pi < pi+1 ∀i < S, pS = 1]Si=1 5: θ̄ ← stop_gradient(θ), θ×0 = ∅. 6: for pi ∼ P̂ do 7: gi ← ∇{θ×pi \θ×pi−1 } J(x; θ×pi ) .(Equation 6.3) 8: θ×pi ← θ×pi − η ∗ gi . 9: θ ← θ − η ∗ ∇θ L(f (x; θ), y). 10: until training stop 11: Return θ Case Study on the Effects of Progressive Learning: We illustrate with a prelimi- nary study that, progressively learning a model enjoys the benefits of (1) finding a good initialization for the learning domain and (2) alleviating knowledge forgetting. In this prototype, we divide a convolution model evenly into two columns and use the first half θ×0.5 to learn on a small subset of the Mnist image data. Next, we learn on the Synthetic images with a same data size, by updating θ×1.0 \θ×0.5 while keeping parame- 76 ters in θ×0.5 fixed. This approach is compared with another variant (overwriting), which learns Mnist using θ×0.5 then learns Synthetic using the entire model θ×1.0 . As shown in Table 6.1, where results are averaged over 6 random seeds, the progressively learned ×1.0 model outperforms its counterpart on both domains. 
6.3.3 Proposed Federated Algorithm: FedResCuE

Before introducing our FL paradigm, we refine our algorithm with two more techniques to further tackle system heterogeneity and connection instability.

Heterogeneous Model Aggregation: In our FL system, edge users initially agree on the maximal network architecture and the legitimate pruning ratios P. Next, edge users can choose their maximal local model size with a capacity ratio p_k ∈ P. During the downloading (uploading) phases, user k with ratio p_k will receive (send) at most ×p_k of the model parameters, depending on the network connection. During the aggregation phase, active users only contribute to global parameters that are within their uploading ratios. Accordingly, in the next learning round t + 1, ∀ p_i ∈ P, the global model parameters are derived as follows:

w^{t+1}_{\times p_i} \setminus w^{t+1}_{\times p_{i-1}} = \frac{1}{|A^t_{p_i}|} \sum_{k \in A^t_{p_i}} \Big( \theta^{k,t}_{\times p_i} \setminus \theta^{k,t}_{\times p_{i-1}} \Big), \quad (6.4)

where A^t_{p_i} = {k | k ∈ A^t, R(θ^{k,t}) > p_i}; A^t denotes the active users from the last round, and R(θ^{k,t}) is the uploaded network size of user k.

Local Model Padding: Due to unstable network connections, a local device is prone to receiving only part of the global model during the downloading phase. To compensate for the missing global parameters, we leverage the local parameters cached from the last round to pad the initialized model and avoid catastrophic information loss. More concretely, at learning round t, we initialize the local model for the k-th user as follows:

\theta^{k,t} \leftarrow w^{t}_{\times p^{d,t}_k} \cup \big\{ \theta^{k,t-1}_{\times p_k} \setminus \theta^{k,t-1}_{\times p^{d,t}_k} \big\}, \quad (6.5)

where w^t_{×p^{d,t}_k} denotes the global parameters downloaded by user k, and p^{d,t}_k is the downloading ratio, which is possibly smaller than p_k.

Built upon the above techniques, we now present the proposed FedResCuE in Algorithm 6.2. In our algorithm, the workload introduced by progressive learning is lightweight, since the column-wise gradient update omits repeated computation on small submodels, as opposed to some prior art [196]. Moreover, as elaborated in Section 6.5.6.2, FedResCuE can obtain superior performance with a small sampling frequency (S). Our approach is also communication efficient: it not only provides flexibility for users to select affordable model architectures but also requires far fewer communication rounds than prior work to reach a predefined performance, as verified in Section 6.5.5.

Algorithm 6.2: Resilient and Communication-Efficient Federated Learning
1: Inputs: Tasks {D_k}_{k=1}^K; global model w with parameter space Θ; legitimate pruning ratios P; user capacity ratios {p_k}_{k=1}^K; models initialized as {θ^{k,0} := w_{×p_k}}_{k=1}^K; learning rate η; loss function L; sampling frequency S; epochs T.
2: for t ∈ [T] do
3:   Aggregate global parameters w^t via Equation 6.4.
4:   Broadcast w^t to active users A^t.
5:   for user k ∈ A^t in parallel do
6:     Download w^t with downloading ratio p^{d,t}_k ≤ p_k, depending on the connection quality.
7:     θ^{k,t} = w^t_{×p^{d,t}_k} ∪ {θ^{k,t−1}_{×p_k} \ θ^{k,t−1}_{×p^{d,t}_k}}    ▷ model initialization via Equation 6.5
8:     θ^{k,t} ← Algorithm 6.1(D_k, θ^{k,t}, P, η, L, S)
9:     Upload θ^{k,t} with uploading ratio p^{u,t}_k ≤ p_k, depending on the connection quality.
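To make Equations 6.4 and 6.5 concrete, the sketch below represents a model simply as an ordered list of column slices (one entry per legitimate ratio increment); this layout and the function names are illustrative choices of ours, not the layout used in the released implementation.

import torch

# In this sketch a "model" is a list of column slices: entry i holds the parameters
# added when growing from ratio p_{i-1} to p_i.

def aggregate_columns(uploaded, num_columns):
    """Equation 6.4 (sketch): slice i of the new global model is the average of that
    slice over the active users who managed to upload more than i columns."""
    global_cols = []
    for i in range(num_columns):
        contributors = [cols[i] for cols in uploaded if len(cols) > i]
        if contributors:
            global_cols.append(torch.stack(contributors).mean(0))
        else:
            global_cols.append(None)        # no user covered this slice this round
    return global_cols

def pad_local_model(downloaded, cached_local, capacity_columns):
    """Equation 6.5 (sketch): initialize from whatever global columns arrived, and
    pad the remaining columns (up to the user's capacity) with last round's cache."""
    init = list(downloaded[:capacity_columns])
    for i in range(len(init), capacity_columns):
        init.append(cached_local[i].clone())
    return init

# Toy usage: 4 column slices; two users upload 4 and 2 columns respectively.
u1 = [torch.ones(3) * (i + 1) for i in range(4)]
u2 = [torch.zeros(3) for _ in range(2)]
g = aggregate_columns([u1, u2], num_columns=4)
local = pad_local_model(g[:3], cached_local=u1, capacity_columns=4)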
model initialization via Equation 6.5) k k 8: θ k,t ← Algorithm 6.1 (Dk , θ k,t , P, η, L, S ) 9: Upload θ k,t , with uploading ratio pu,t k ≤ pk depending on the connection quality. 78 6.4 Related Work Systematic Heterogeneity in FL is a rising challenge induced by the emergence of FL applications to wireless communications and IoT, where participating devices have vary- ing capacities in computation and transmission. Some work enables heterogeneous model architectures by sharing model predictions on a public dataset instead of sharing model pa- rameters [166, 75], at the cost of non-negligible performance degradation. Some approaches allow users to share partial network layers, leaving potential opportunities for adopting differ- ent architectures for unshared layers [205, 8]. In general, yet enough efforts have been made to effectively tackle FL with heterogeneous model architectures, except for a few pioneers such as FedHetero [39] and FjORD [69], which are extensively analyzed in Section 6.5. Network Pruning has long been studied in non-FL scenarios, which aims to prune a lightweight model from a larger one with the maximal knowledge reserved. Prior approaches usually require fine-tuning using label supervisions [62, 95, 111, 36], Later there emerge zero-short pruning [196, 28]. Approaches include structured pruning that reduces model channels [190, 196, 195, 28]. Orthogonal approaches include early exit [199, 200], which learns horizontally pruned networks with reduced number of neural layers. Other pruning strategies follow the lottery ticket hypothesis [48, 140, 108]. Most prior work derives one submodel after pruning, whereas the slimmable learning [196] and [28] makes arbitrarily submodels prunable. The idea of prunable models has been applied to FL to tackle system heterogeneity by FjORD [69]. To the best of our knowledge, we are the first to apply progressive learning to address both system heterogeneity and connection instability in FL. Resilient and Communication Efficient FL addresses FL under restricted or unstable connection bandwidths [143, 57, 69]. FedProx [98] tackles stragglers via a regularized objec- tive. Other scheduling-based approaches assume that the server has control over the active users [143, 32, 128]. Not much work has mentioned the faulty connections issues. Some chooses to naively drop the faulty connected edge devices [32, 143]. [57] and [188] suggest to use stale model parameters for the disconnected users. Meanwhile, there are complementary 79 efforts that compress transmitted model parameters by quantization [4, 5] or sketching [73]. Some approaches focus on asynchronous communication [33, 187]. Our algorithm can be potentially combined with related work to further improve resiliency and communication efficiency in FL. 6.5 Evaluation In this section, we conduct extensive experiments to answer the following key questions, leaving more experimental details to the supplementary: 1. Is FedResCuE resilient to system heterogeneity and unstable network connections? 2. Is FedResCuE communication-efficient to reach satisfactory performance with fewer syn- chronization rounds, compared with the state-of-the-art? 3. Which components of FedResCuE have contributed to its resiliency and communication efficiency? Results: Experiments below show that FedResCuE notably outperforms related work in communication efficiency and asymptotic performance. 
Its superiority is consistent across different FL settings, and becomes more prominent under insufficient training data, heterogeneous model architectures, and unstable network connections.

6.5.1 Experiment Setup

Dataset: We use CelebA [89] to simulate edge users with i.i.d. data distributions. We also apply DigitsFive [134] to simulate users with statistical heterogeneity, which is a multi-domain benchmark with five image datasets: MNIST [93], SVHN [124], USPS [72], Synthetic, and MNIST-M [53].

Models: We build a ResNet neural network [65] for learning the CelebA domain, and build a model consisting of 3 Conv2d layers followed by 3 Linear layers for learning the DigitsFive domains. To enable effective model aggregation under system heterogeneity, we perform careful treatments on the BatchNorm layers.

Compared Approaches: In addition to FedAvg [116], we compare FedResCuE against the following approaches that tackle system heterogeneity: i) FedHetero [39], which extends FedAvg to allow edge devices with different model sizes; ii) FjORD [69], which learns prunable local models without progressive learning; iii) FedSlim, a baseline proposed in this paper, in which models are locally updated following slimmable training [196] and globally aggregated as in FedAvg.

Training: Active users sync with the server after a complete epoch of local training. For faulty connection settings (Section 6.5.3), the local padding strategy is applied to all evaluated algorithms for fair comparisons. For the CelebA domain, training data is i.i.d. sampled and assigned to 20 total users. For the DigitsFive domains, we assign each domain's data to 2 unique users. For both types of experiments, 5 active users are randomly selected per communication round. We also evaluate the algorithmic performance given different sizes of training data in Section 6.5.4.

Evaluation: Unless otherwise specified, the performance is reported using the global model on all available testing data, which is evaluated every 2 communication rounds. Results are averaged over 3 random seeds. Asymptotic performance is reported after 300 rounds for CelebA, and 100 rounds for DigitsFive.

Global Model Accuracy (%) Evaluated on CelebA, Stable Network Connection.
Training Data | User Capacity          | Evaluated Model | FedAvg      | FedHetero   | FjORD       | FedSlim     | FedResCuE
100%          | ∀k p_k = 1 (uniform)   | w×1             | 81.06±0.63  | -           | 80.57±0.91  | 81.14±0.76  | 81.39±0.20
100%          | ∀k p_k = 1 (uniform)   | w×0.25          | 18.57±0.64  | -           | 69.94±0.65  | 70.47±0.61  | 71.19±0.19
100%          | ∀k p_k ∼ PC (cluster)  | w×1             | -           | 76.80±0.53  | 75.71±0.47  | 77.49±0.40  | 78.22±0.41
100%          | ∀k p_k ∼ PC (cluster)  | w×0.25          | -           | 68.56±0.51  | 70.98±0.75  | 73.22±0.34  | 73.25±0.47
20%           | ∀k p_k = 1 (uniform)   | w×1             | 68.03±0.50  | -           | 67.89±1.47  | 67.96±0.72  | 71.27±0.27
20%           | ∀k p_k = 1 (uniform)   | w×0.25          | 16.47±2.24  | -           | 61.38±1.69  | 59.56±1.39  | 61.12±1.35
20%           | ∀k p_k ∼ PC (cluster)  | w×1             | -           | 59.38±0.41  | 62.43±1.65  | 59.53±0.86  | 64.53±1.06
20%           | ∀k p_k ∼ PC (cluster)  | w×0.25          | -           | 55.41±0.39  | 61.86±1.21  | 58.31±0.23  | 61.98±0.85
Table 6.2: We report the best performance over different S for applicable approaches (see Section 6.5.6.2).

6.5.2 Performance Under System Heterogeneity

We apply the CelebA data to explore two system settings: 1) the uniform setting, where all edge devices maintain the same network architecture; and 2) the cluster setting, where user model capacities are randomly sampled from a set PC ⊂ P to represent system heterogeneity.

Results: As shown in Table 6.2 and Table 6.3, FedResCuE consistently exceeds other approaches in asymptotic performance, especially under a heterogeneous (cluster) system.
In particular, the advantage of FedResCuE on the ×0.25 global model demonstrates its benefit to small-capacity users compared to related work: FedResCuE helps distill predictive knowledge from larger models into their nested sub-models, which are eventually shared by users with smaller model sizes. In the meantime, FedResCuE is also more effective than FjORD and FedSlim, which we largely ascribe to the benefit of its progressive learning, as opposed to the batch-gradient update scheme in prior art. Moreover, when system heterogeneity is bundled with connection instability (Table 6.3), enabling heterogeneous model architectures in FL naturally introduces some resiliency to connection loss, as revealed by the performance gain of FedHetero over FedAvg in Table 6.3. In fact, FedHetero can be treated as macro-level prunable training, although the resiliency brought by diversified model architectures is less effective than learning self-distilled networks.

Global Model Accuracy (%) Evaluated on CelebA Under Connection Loss.
User Capacity | Evaluated Model | FedAvg      | FedHetero   | FjORD       | FedSlim     | FedResCuE
uniform       | w×1             | 50.36±2.17  | -           | 61.79±1.62  | 57.31±1.27  | 70.02±0.40
uniform       | w×0.25          | 12.58±0.51  | -           | 60.20±1.67  | 55.33±0.89  | 67.40±0.84
cluster       | w×1             | -           | 60.92±1.33  | 64.52±0.60  | 62.35±1.76  | 69.78±0.74
cluster       | w×0.25          | -           | 59.70±0.64  | 64.11±0.41  | 61.77±1.62  | 68.83±1.00
Table 6.3: Performance under faulty connections, given 100% of training data and 0.1 ≤ er ≤ 0.2.

Accuracy (%) on DigitsFive, Given 5% Training Data.
Domain     | SVHN   | Syn    | USPS   | MNIST  | MNIST-M
Local      | 45.88  | 62.04  | 89.95  | 85.84  | 61.99
FedAvg     | 69.54  | 80.17  | 94.38  | 93.61  | 75.22
FedSlim    | 66.77  | 78.95  | 94.00  | 93.80  | 73.95
FjORD      | 49.72  | 57.12  | 68.60  | 68.01  | 56.29
FedResCuE  | 69.32  | 80.57  | 95.17  | 95.05  | 76.89
Table 6.4: FedResCuE is the most robust algorithm given heterogeneous data and domain-dependent connection error.

6.5.3 Performance Under Unstable Connections

To simulate the unstable connection scenario, a connection error rate er is generated dynamically to denote the probability that the current column transmission is interrupted. When transmitting one network, we traverse all columns until one column is interrupted based on probability er, or until all columns have been successfully transmitted. Note that the connection loss occurs bidirectionally; hence an edge device may receive or upload models with a smaller size than its assigned architecture. In practice, we set the size of a transmission column to be ×0.125 of the global network. We apply both CelebA and DigitsFive to explore unstable connection scenarios. For experiments on the CelebA domain, er is dynamically sampled, with er uniformly drawn from [0.1, 0.2]. For experiments on the DigitsFive domains, er is set to a constant that depends on the specific domain.

Results: Under faulty network connections, FedResCuE consistently outperforms other approaches under both i.i.d. and heterogeneous data distributions. On the CelebA domain (Table 6.3), FedResCuE achieves higher accuracy than FedSlim and FjORD by a significant margin, which we ascribe to both its progressive learning procedure and the proposed optimization objective. In fact, given unstable connections, a smaller network architecture turns out to be more reliable, since its transmission is less likely to be interrupted. FedResCuE can facilitate FL in this scenario, as it follows a progressive scheme to gradually learn the larger network, which captures the complementary representations built upon its nested smaller submodels.
On the other hand, both FedResCuE and FjORD are more competent than FedSlim, which indicates that solely performing distillation from the teacher (×1.0 model) to the student (submodel), as FedSlim does, may be insufficient to deliver reliable submodels, especially given a model with information staleness caused by transmission loss. Contrarily, the optimization objective of FedResCuE (Equation 6.1), which encourages both knowledge distillation and submodel learning with label supervision, is a more robust strategy.

As shown in Table 6.4, under domain-dependent connection errors, FedResCuE also shows consistent robustness against heterogeneous statistical distributions. Note that a small training dataset from DigitsFive is applied to ensure the necessity of FL; hence Local learning without sharing parameters yields worse performance than FL. The performance gain of FedResCuE resides in both the small (×0.25) and the large (×1.0) model. Contrarily, FjORD and FedSlim may underperform FedAvg when evaluated using the ×1.0 model, indicating their potential drawback given insufficient data, which we investigate further in Section 6.5.4.

6.5.4 Performance Given Insufficient Training Data

To analyze the impact of data sufficiency, we assign 100% and 20% of the CelebA training data to users, respectively, assuming a stable network connection (er = 0).

Results: As shown in Table 6.2, when training data is sufficient with i.i.d. distributions, all algorithms perform comparably well. However, given only 20% of the training data, both FjORD and FedSlim slightly underperform FedAvg when evaluated using the ×1.0 model, while FedResCuE remarkably outperforms all others. In fact, the progressive parameter update in FedResCuE can make a subnetwork a good initialization for the encompassing larger submodel, which is analogous to meta-learning that delivers an effective model with fewer shots of training. This is especially beneficial when training data is lacking. Contrarily, FjORD and FedSlim, which adopt batch-gradient updates for learning prunable models, may struggle with the interference of noisy gradients, which can be further amplified in a potentially overfit model.

6.5.5 Evaluation of Communication Efficiency

We analyze the communication efficiency via the number of FL synchronization rounds needed for the global model to reach a reasonable performance. As shown in Table 6.5, under system heterogeneity (i.e. the cluster setting), FedResCuE constantly learns faster to obtain a predefined accuracy, requiring fewer communication rounds than all its peers. The communication efficiency of FedResCuE also resides in its flexibility in user model architectures, in that devices choosing a small model architecture benefit further from transmitting fewer parameters per communication round. Although other baselines, e.g. FedHetero, also enable system heterogeneity, they require notably more communication rounds to perform comparably to FedResCuE. Accompanying performance curves of the ×1.0 model are visualized in Figure 6.3.

Communication Efficiency (rounds) on the CelebA dataset.
Setting                              | Acc  | Model Size | FedHetero | FjORD | FedSlim | FedResCuE
100% training data, 0.1 ≤ er ≤ 0.2   | 60%  | w×0.5      | 256.7     | 218.0 | 253.3   | 124.7
20% training data, er = 0            | 55%  | w×0.5      | 180.7     | 156.0 | 192.0   | 96.0
Table 6.5: FedResCuE requires notably fewer communication rounds to reach the predefined accuracy (Acc).

Figure 6.3: Evaluation curves for the ×1.0 model, with 0.1 ≤ er ≤ 0.2 (left) and 20% training data (right).
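As a concrete reference for the unstable-connection protocol described in Section 6.5.3, the snippet below gives a minimal simulation sketch. It is an illustration under the stated assumptions (columns of width ×0.125, interruption probability er per column, er drawn uniformly from [0.1, 0.2]), not the exact implementation used in this thesis; function and variable names are ours.

```python
import random

def simulate_transmission(capacity_ratio: float, er: float,
                          column_width: float = 0.125) -> float:
    """Simulate column-wise model transmission under an unstable connection.

    capacity_ratio: largest width ratio (<= 1.0) the device intends to transmit.
    er:             probability that any single column transmission is interrupted.
    column_width:   width of one transmission column (x0.125 of the global model).

    Returns the width ratio that was successfully delivered.
    """
    delivered = 0.0
    while delivered + column_width <= capacity_ratio + 1e-9:
        if random.random() < er:      # this column is interrupted; stop here
            break
        delivered += column_width     # column arrived intact, move to the next one
    return delivered

# Example: a device with capacity ratio 0.5 downloading under er ~ U[0.1, 0.2]
er = random.uniform(0.1, 0.2)
received_ratio = simulate_transmission(capacity_ratio=0.5, er=er)
print(f"er={er:.2f}, received x{received_ratio:.3f} of the global model")
```

Because the loss occurs per column, a device that requests a larger width is interrupted more often, which matches the observation above that smaller architectures are transmitted more reliably.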
6.5.6 Sensitivity Analysis

6.5.6.1 Effects of Knowledge Distillation

To analyze the role of knowledge distillation in our model learning, we design a variant of our approach called FedSeq to compare against FedResCuE. Particularly, the optimization objective of FedSeq does not require minimizing the KL-divergence between the teacher θ̄ and a student θ×p_i, i.e. it always sets the term α_i to 0 in Equation 6.3.

Results: Knowledge distillation is especially beneficial to smaller submodels, whereas the gap between FedSeq and FedResCuE gradually diminishes when evaluating larger submodels. As illustrated in Figure 6.4, where 20% of the CelebA training data is given, both approaches learn comparably well in the initial training stages, while FedResCuE converges to higher asymptotic performance. The learning curves demonstrate that it is beneficial to distill representation knowledge from a large, complete model to smaller submodels, in that the larger model has more channels to capture refined domain knowledge.

6.5.6.2 Impacts of Submodel Sampling

Sampling frequency, denoted as S = |P̂| in Algorithm 6.1, is the number of submodels sampled per batch update. A key question regarding FedResCuE is: how does S affect the learning performance? This question is equally intriguing for FedSlim and FjORD, both of which require submodel sampling. To answer it, we traverse different choices of S, while the sampling granularity is set to be ×0.05 of the largest model width.

Results: As shown in Figure 6.5, FedResCuE is consistently the most robust under different S. In the meantime, a moderate sampling frequency (e.g. S = 4) benefits most evaluated algorithms, while oversampling with a large S causes non-negligible performance degradation for FedSlim and FjORD. Their over-sensitivity to S can be induced by their batch-gradient update scheme, which may cause the gradients w.r.t. larger submodels to interfere with those of smaller ones when aggregating all gradients in one batch, hence undermining model performance. On the contrary, FedResCuE alleviates such issues by using a progressive learning scheme that decouples such mutual impacts.

Figure 6.4: KD in FedResCuE benefits smaller submodels ((a) ×0.25 model; (b) ×0.5 model).
Figure 6.5: Impacts of sampling frequency S.
Figure 6.6: Effects of progressive learning.

6.5.6.3 Effects of Progressive Learning

To evaluate the efficacy of the progressive learning scheme, we compare FedResCuE against an intuitive alternative named FedRush, which directly updates the sampled submodel without freezing the preceding parameters.

Results: As shown in Figure 6.6, a rush gradient update as in FedRush leads to notably undermined performance. Particularly, when updating the θ×p_{i+1} model, FedRush overwrites the parameters in θ×p_i, which could otherwise serve as a good initialization to support the subsequent submodels. Contrarily, FedResCuE learns more effectively by avoiding this potential information forgetting.

6.6 Summary

In this chapter, we propose a self-distillation approach that enables resilient knowledge sharing in heterogeneous federated learning settings. Our proposed approach, dubbed FedResCuE, addresses both system heterogeneity and unstable network connections by learning self-distilled networks in a progressive manner, which proves to be communication-efficient and achieves higher performance than the state-of-the-art in the proposed FL settings.
88 CHAPTER 7 OVERVIEW AND OPEN QUESTIONS In this chapter, we first present a systematic comparison of the proposed transfer learning approaches, discuss their contributions as well as potential limitations. We also provide a few thoughts on the limitations of our transfer learning and list a few research directions that are worth more effort in the near future. We compare the recipes of each of our TL works in Table 7.1. Although these approaches are proposed to address heterogeneous machine learning problems, they share the same core idea of TL from accessible, usually limited or unreliable knowledge source to assist target domain learning in a sample-efficient manner. In spite of their promising results, one limitation of the proposed approaches is that they are not designed to assist learning for unseen domains. In fact, for the RL approaches OPOLO and SAIL , we assume that the source and target Markov decision process share the same underlying components, such as reward logics and state transition probabilities. This problem setting is analogous to the one where the child learns to walk by imitating adults in a world that follows the same physical rules. For FedGen which tackles a Federated Learn- ing setting, each client task is served as the source and the target domain simultaneously. The data distributions of different clients are heterogeneous but remain stable through the learning process. Another limitation of our work is that our approach might be vulnerable to negative transfer . By now we assume that the knowledge source is beneficial and relevant to the target domain learning. Effective schemes to detect irrelevant or even adversarial source domains remain to be proposed to improve the robustness of our transfer learning approaches. 89 Algorithm: OPOLO SAIL FedGen FedResCuE Problem RL without exter- RL with sparse and Federated Super- Federated Super- setting: nal rewards. delayed rewards. vised Learning vised Learning, without extra with system het- training data on erogeneity and the server. unstable network connections. Difference |Ms | = 1, Ms = |Ms | = 1, Ms = Each FL client Tk ∈ Similar to Fed- between Mt . Mt . i=1 is a source {Ts }K Gen , with Ti not source and and target domain necessarily equal target do- at the same time. Tj ∀i, j ∈ [K], i 6= mains Ti 6= Tj ∀i, j ∈ j. [K], i 6= j Carrier of Examples of ex- Examples of a conditional gen- Parameter of sub- transferred pert visited states sub-optimal erator G : Y → Z model θp learned knowledge without expert demonstrations learned from the by self-distillation, actions: RT = of both states network param- with p ≤ 1. (s0 , s1 , s2 , · · · ). and actions: eters of client RT = (s0 , a0 , s1 , models. a1 , s2 , a2 , · · · ). Challenges No action guid- Teacher demon- Local user data 1) FL clients learn of transfer- ance is available; strations are distribution is het- models with dif- ring such Inefficient sampling suboptimal , hence erogeneous and ferent architectures. knowledge due to on-policy traditional ap- unavailable to 2) Unstable Net- learning. praoches lead to the server. work connection. suboptimal perfor- mance. Solutions Optimizing to- An iterative Learning and 1) Model chan- wards an off- process of ex- transmitting a nels are vertically policy objective ploitation and generator purely decomposed and that omits the need exploration: ex- out of the client sequentailly trans- of knowing expert ploiting teacher model parameters mitted. 2) Learn actions. 
dmonstrations to to approximate self-distilled models reach reasonable p(z|y), i.e. the that are knowledge performance; ex- global latent fea- preserving, and ro- ploring for better ture distribution. bust to connection self-generated drop. demonstrations to repalce the teacher. Table 7.1: An overview comparison of the proposed TL approaches. 90 Table 7.1 (cont’d) Distilled Near-optimal state- Sub-optimal state- Latent distribution Feature representa- knowledge transisin distribu- action distributions p(z|y) of the global tions reserved in tions of the teacher of the teacher do- data. the 1.0× model. domain: µE (s, s0 ) main µT (s, a). Transfer Recover expert- Supass teacher Ensemble knowl- Enabled system learning level performance demonstrations to edge from hetero- heterogeneity in FL results via efficient off- reach near-optimal geneous user data system. Learned policy learning. performance with distribution to self-distilled mod- high sample effi- achieve high global els that are readily ciency. performance. prunable and ro- bust to connection loss. 91 APPENDICES 92 APPENDIX A APPENDIX FOR OPOLO For all the following derivations, we use DKL [P (X)||Q(X)] to denote the KL-divergence between two distributions P and Q: Z p(x) p(x) DKL [P (X)||Q(X)] = Ex∼p(x) log = p(x) log dx. q(x) X q(x) Accordingly, when P (X|Z) and Q(X|Z) are conditional distributions, DKL [P ||Q] denotes their conditional KL-divergence: Z p(x|z) DKL [P (X|Z)||Q(X|Z)] = p(z)p(x|z) log dxdz. Z×X q(x|z) For simplicity, we will equivalently use Ex∼p(x) [·] and Ep(x) [·] to denote certain expectation in which x is sampled from the distribution P (X). A.0.1 Derivation of the Surrogate Objective We first refer Lemma 1 from [189] for a complete presentation: Lemma 1. DKL [µπ (s, a, s0 )||µE (s, a, s0 )] = DKL [µπ (s, a)||µE (s, a))]. Proof. µπ (s, a) · P (s0 |s, a) 0 Z 0 0 π E DKL [µ (s, a, s )||µ (s, a, s )] = µπ (s, a, s0 ) log ds dads S×A×S µE (s, a) · P (s0 |s, a) µπ (s, a) 0 Z = µπ (s, a, s0 ) log E ds dads S×A×S µ (s, a) µπ (s, a) Z = µπ (s, a) log E dads S×A µ (s, a) = DKL [µπ (s, a)||µE (s, a)]. 93 Lemma 2. DKL [µπ (s, s0 )||µE (s, s0 )] ≤ DKL [µπ (s, a)||µE (s, a)]. Proof. As defined in Table 4.1, µπ (a|s, s0 ) is the inverse-action transition probability induced by policy π: π 0 µπ (s, a, s0 ) Hπ µH (s)π(a|s)P (s0 |s, a) π(a|s)P (s0 |s, a) µ (a|s, s ) = π = H = R . µ (s, s0 ) π(ā|s)P (s0 |s, ā)dā R HπH A µ (s)π(ā|s)P H (s0 |s, ā)dā A Based on this notion, we can derive: DKL [µπ (s, a)||µE (s, a)] = DKL [µπ (s, a, s0 )||µE (s, a, s0 )] | {z } Lemma 1 µπ (s, a, s0 ) 0 Z = µπ (s, a, s0 ) log E ds dads S×A×S µ (s, a, s0 ) µπ (s, s0 ) × µπ (a|s, s0 ) 0 Z = µπ (s, s0 )µπ (a|s, s0 ) log E ds dads S×A×S µ (s, s0 ) × µE (a|s, s0 ) µπ (s, s0 ) 0 Z = µπ (s, s0 )µπ (a|s, s0 ) log E ds dads S×A×S µ (s, s0 ) µπ (a|s, s0 ) 0 Z + µπ (s, s0 )µπ (a|s, s0 ) log E ds dads S×A×S µ (a|s, s0 ) µπ (s, s0 ) 0 Z = µπ (s, s0 ) log E ds ds + DKL [µπ (a|s, s0 )||µE (a|s, s0 )] S×A×S µ (s, s0 ) =DKL [µπ (s, s0 )||µE (s, s0 )] + DKL [µπ (a|s, s0 )||µE (a|s, s0 )] (A.1) ≥DKL [µπ (s, s0 )||µE (s, s0 )]. Based on Lemma2, we can derive the upper-bound of our original objective: Theorem 3 (Surrogate Objective as the Divergence Upper-bound). µR (s, s0 ) DKL [µπ (s, s0 )||µE (s, s0 )] ≤ Eµπ (s,s0 ) [log ] + DKL [µπ (s, a)||µR (s, a)]. µE (s, s0 ) 94 Proof. 
µπ (s, s0 ) Z π 0 E 0 DKL [µ (s, s )||µ (s, s )] = µπ (s, s0 ) log E (s, s0 ) dsds0 S×S µ Z  µR (s, s0 ) µπ (s, s0 )  = µπ (s, s0 ) log E 0) × R 0) dsds0 S×S µ (s, s µ (s, s µR (s, s0 ) Z = µπ (s, s0 ) log E 0 dsds0 µ (s, s ) Z S×S µπ (s, s0 ) + µπ (s, s0 ) log R 0) dsds0 S×A µ (s, s µR (s, s0 ) = Eµπ (s,s0 ) [log E ] + DKL [µπ (s, s0 )||µR (s, s0 )] µ (s, s0 ) µR (s, s0 ) ≤ Eµπ (s,s0 ) [log E 0 ] + DKL [µπ (s, a)||µR (s, a)]. µ (s, s ) | {z } derived from Lemma 2 A.0.2 Connections between LfO and LfD Theorem 4. DKL [µπ (a|s, s0 )||µE (a|s, s0 )] = DKL [µπ (s, a)||µE (s, a)] − DKL [µπ (s, s0 )||µE (s, s0 )]. Proof. We can refer Eq (A.1) from the proof of Lemma 2: DKL [µπ (s, a)||µE (s, a)] = DKL [µπ (s, s0 )||µE (s, s0 )] + DKL [µπ (a|s, s0 )||µE (a|s, s0 )]. A.0.3 An Unoptimizable Gap Between LfO and LfD Remark 3: In a non-injective MDP, the discrepancy of DKL [µπ (a|s, s0 )||µE (a|s, s0 )] cannot be optimized without knowing expert actions. Proof. We provide proof with a counter-example. Consider a non-injective MDP in a tabular case, whose transition dynamics is shown in Table A.1, with |S| = 3, and |A| = 4. Especially, 95 P a0 a1 a2 a3 P (s1 |s1 , ·) 0 1 0 0 P (s2 |s1 , ·) 1 0 1 0 P (s3 |s1 , ·) 0 0 0 1 P (s1 |s2 , ·) 0 1 0 0 P (s2 |s2 , ·) 0 0 1 0 P (s3 |s2 , ·) 0 0 0 1 P (s1 |s3 , ·) 0 1 0 0 P (s2 |s3 , ·) 0 0 1 0 P (s3 |s3 , ·) 0 0 0 1 Table A.1: A deterministic but non-injective MDP. π s1 s2 s3 πE s 1 s 2 s 3 a0 0.5 0 0 a0 0 0 0 a1 0 0 1 a1 0 0 1 a2 0.5 0 0 a2 1 0 0 a3 0 1 0 a3 0 1 0 Table A.2: Learning Policy π. Table A.3: Expert Policy πE . there exists two actions which lead to the same deterministic transition, i.e. for s1 , s2 ∈ S, ∃ a0 , a2 ∈ A, s.t. P (s2 |s1 , a2 ) = P (s2 |s1 , a0 ) = 1, as illustrated in Figure A.1. In this MDP, there is an expert policy πE as listed in Table A.3. Trajectories generated by this expert are illustrated as blue lines in Figure A.1. In a LfO scenario, a learning agent only has access to sequences of states visited by the expert: RT = {s1 , s2 , s3 , s1 , s2 , s3 , · · · }, without knowing what actions have been taken by the expert. Based on the given observations RT , a policy π can only satisfy the state distribution matching with DKL [µπ (s, s0 )||µE (s, s0 )] = 0, but unable to optimize DKL [µπ (a|s, s)||µE (a|s, s)], as both a0 and a2 lead to a deterministic transition of s1 → s2 . In lack of expert actions, the best guess for a learning policy is to equally distribute action probabilities with π(a0 |s1 ) = (a2 |s1 ) = 0.5. which results in µπ (a0 |s1 , s2 ) = µπ (a2 |s1 , s2 ) = 0.5, whereas µE (a2 |s1 , s2 ) = 1, µE (a0 |s0 , s1 ) = 0. Consequently, we reach at DKL [µπ (a|s, s0 )||µE (a|s, s0 )] > 0. Remark: In a deterministic and injective MDP, it satisfies that ∀ π : S → A, DKL [µπ (a|s, s0 )||µE (a|s, s0 )] = 0. 96 a1 s1 a3 a0 a1 a2 a1 a3 a3 a2 s3 s2 a2 Figure A.1: Transition of an non-injective MDP. We provide proof in a finite, discrete state-action space, although the conclusion is valid to extend to continuous cases. Proof. In a deterministic and injective MDP, we can interpret the transition dynamics with a deterministic function g: ∃g : S × A → S, s.t. ∀ (s, a, s0 ), g(s, a) = s0 ⇐⇒ P (s0 |s, a) = 1, and g(s, a) 6= s0 ⇐⇒ P (s0 |s, a) = 0. since this MDP is also injective, given arbitrary policy π and a transition s → s0 , (s, s0 ) ∼ µπ (s, s0 ), there exists one and only action a which satisfies g(s, a) = s0 , P (s0 |s, a) = 1. 
Accordingly, µπ (a|s, s0 ) = π(a|s)P (s0 |s,a) Eā∼π(·|s) [P (s0 |s,a)] = 1[g(s, a) = s0 ] depends only on the transition dynamics, where 1(x) is an indicator function. The same conclusion applies to µE (a|s, s0 ) as well. Therefore, we reach at: ∀ π : S → A, DKL [µπ (a|s, s0 )||µE (a|s, s0 )]  1[g(s, a) = s0 ]  =Eµπ (s,a,s0 ) log 1[g(s, a) = s0 ]  1 =Eµπ (s,a,s0 ) log = 0. 1 97 A.0.4 Upper Bound of the KL-Divergence Theorem 5. For two arbitrary distributions P and Q, and an f -divergence with f (x) = 12 x2 , it satisfies that DKL [P ||Q] ≤ Df [P ||Q] . Proof. Given two distributions P and Q, their density ratio is denoted as wp|q , with wp|q = p(x) q(x) ≥ 0. If we consider a function g(w) = w log(w) − 12 w2 , g(w) is constantly decreasing when w ∈ (0, ∞), as ∂g ∂w = log w + 1 − w ≤ 0 ∀w ≥ 0. Since KL-Divergence is a special case of f -divergence with fDKL (x) = x log x, it is sufficient to show that: Z  1 2  DDKL [P ||Q] − Df [P ||Q] = q(x) wp/q log(wp/q ) − (wp/q ) dx 2 ZX 1 ≤ q(x) sup (w log(w) − w2 )dx X w∈(0,+∞) 2 Z 1 = q(x) lim+ (w log(w) − w2 )dx X w→0 2 = 0. A.0.5 Forward Distribution Matching A.0.5.1 Lower-bound of the BC Objective Theorem 6. DKL [πE (a|s)||π(a|s)] = DKL [µE (s0 |s)||µπ (s0 |s)] + DKL [µE (a|s, s0 )||µπ (a|s, s0 )]. Proof. Based on the definition of µπ (a|s, s0 ) in Table 4.1: π(a|s)P (s0 |s, a) π(a|s)P (s0 |s, a) µπ (a|s, s0 ) = R = , (A.2) A π(ā|s)P (s0 |s, ā)dā µπ (s0 |s) 98 and similar for µE (a|s, s0 ), we can derive at the following: DKL [πE (a|s)||π(a|s)] Z πE (a|s) = µE (s)πE (a|s) log dads S×A π(a|s) Z πE (a|s) = µE (s, a) log dads S×A π(a|s) πE (a|s)P (s0 |s, a) 0 Z = µE (s, a)P (s0 |s, a) log ds dads S×A×S π(a|s)P (s0 |s, a) πE (a|s)P (s0 |s, a) 0 Z = µE (s, a, s0 ) log ds dads S×A×S π(a|s)P (s0 |s, a) µE (a|s, s0 )µE (s0 |s) 0 Z = µE (s, a, s0 ) log π ds dads S×A×S µ (a|s, s0 )µπ (s0 |s) | {z } Eq (A.2) µE (a|s, s0 ) µE (s0 |s)  0 Z  = µE (s, a, s0 ) log π + log ds dads S×A×S µ (a|s, s0 ) µπ (s0 |s) µE (a|s, s0 ) 0 µE (s0 |s) 0 Z Z E 0 E 0 = µ (s, a, s ) log π ds dads + µ (s, a, s ) log ds dads S×A×S µ (a|s, s0 ) S×A×S µπ (s0 |s) =DKL [µE (a|s, s0 )||µπ (a|s, s0 )] + DKL [µE (s0 |s)||µπ (s0 |s)]. A.0.5.2 Policy Regularization as A Forward Distribution Matching Without loss of generality, in this section we provide proof based on a finite, discrete state- action space. Assumption 2 (Deterministic MDP). ∃g : S×A → S a deterministic function, s.t. ∀ (s, a, s0 ), g(s, a) 6= s0 ⇐⇒ P (s0 |s, a) = 0, and g(s, a) = s0 ⇐⇒ P (s0 |s, a) = 1. Based on Assumption 2, we have the following: Corollary 2. In a deterministic MDP, ∀ π : S → A, µπ (a|s, s0 ) > 0 =⇒ P (a|s, s0 ) = 1. Proof. µπ (a|s, s0 ) ∝ π(a|s)P (s0 |s, a) > 0 =⇒ P (s0 |s, a) > 0. Based on Assumption 2, it holds that g(s, a) = s0 , therefore P (s0 |s, a) = 1. 99 Assumption 3 (Support Coverage). The support of expert transition distribution µE (s, s0 ) is covered by µR (s, s0 ): µE (s, s0 ) > 0 =⇒ µR (s, s0 ) > 0. Combing Corollary 2 and Assumption 3, we can reach at the following: Corollary 3. ∀(s, s0 ) ∼ µE (s, s0 ), µR (a|s, s0 ) > 0 =⇒ P (a|s, s0 ) = 1. Lemma 3. Given a policy π̂, s.t. ∀(s, s0 ) ∼ µE (s, s0 ), π̂(a|s) ∝ µR (a|s, s0 ), then it satisfies that: ∀π : S → A, DKL [µE (s0 |s)||µπ (s0 |s)] ≥ DKL [µE (s0 |s)||µπ̂ (s0 |s)]. Proof. 
In a discrete state-action space, µπ (s0 |s) can be denoted as µπ (s0 |s) = Ea∼π(·|s) [P (s0 |s, a)], and the similar for µπ̂ (s0 |s): DKL [µE (s0 |s)||µπ̂ (s0 |s)] − DKL [µE (s0 |s)||µπ (s0 |s)] µE (s0 |s) µE (s0 |s)   =EµE (s,s0 ) log π̂ 0 − log π 0 µ (s |s) µ (s |s) =EµE (s,s0 ) log µπ (s0 |s)] − log µπ̂ (s0 |s)   =EµE (s,s0 ) log Ea∼π(·|s) [P (s0 |s, a)] − EµE (s,s0 ) log Ea∼π̂(·|s) [P (s0 |s, a)]     =EµE (s,s0 ) log Ea∼π(·|s) [P (s0 |s, a)] − EµE (s,s0 ) log Ea∼µR (·|s,s0 ) [P (s0 |s, a)]     =EµE (s,s0 ) log Ea∼π(·|s) [P (s0 |s, a)] − EµE (s,s0 ) log Ea∼µR (·|s,s0 ) [1]     | {z } Corollary 3 =EµE (s,s0 ) log Ea∼π(·|s) [P (s0 |s, a)]     ≤EµE (s,s0 ) log Ea∼π(·|s) [1] =0. Remark 4. In a deterministic MDP, assuming the support of µE (s, s0 ) is covered by µR (s, s), s.t. µE (s, s0 ) > 0 =⇒ µR (s, s0 ) > 0, then regulating policy using µR (·|s, s0 ) can 100 minimize DKL [µE (s0 |s)||µπ (s0 |s)]: ∃π̃ : S → A, s.t. ∀(s, s0 ) ∼ µE (s, s0 ), π̃(·|s) ∝ µR (·|s, s0 ) =⇒ π̃ = arg min DKL [µE (s0 |s)||µπ (s0 |s)]. π Proof. Based on Lemma 3, we have that: ∀π : S → A, DKL [µE (s0 |s)||µπ (s0 |s)] ≥ DKL [µE (s0 |s)||µπ̃ (s0 |s)]. Therefore, π̃ = arg minπ DKL [µE (s0 |s)||µπ (s0 |s)]. A.0.5.3 Estimating the Inverse Action Distribution Theorem 7. max −DKL [µR (a|s, s0 )||PI (a|s, s0 )] ≡ max E(s,a,s0 )∼µR (s,a,s0 ) [log PI (a|s, s0 )]. PI :S×S→A PI :S×S→A Proof. − DKL [µR (a|s, s0 )||PI (a|s, s0 )] µR (a|s, s0 ) Z =− µR (s, s0 )µR (a|s, s0 ) log 0) dadsds0 P I (a|s, s ZS×S×A   =− µR (s, s0 )µR (a|s, s0 ) log µR (a|s, s0 ) − log PI (a|s, s0 ) dadsds0 S×S×A Z 0 R = H[µ (a|s, s )] + µR (s, s0 )µR (a|s, s0 ) log PI (a|s, s0 )dadsds0 | {z } S×S×A fixed w.r.t. PI = H[µR (a|s, s0 )] +EµR (s,a,s0 ) [log PI (a|s, s0 )]. | {z } fixed w.r.t. PI Note that we use H[µR (a|s, s0 )] to denote the conditional entropy of µR (a|s, s0 ), with H[µR (a|s, s0 )] = EµR (s,a,s0 ) [− log µR (a|s, s0 )]. 101 A.0.6 Derivation of Eq (4.8): Jopolo (π, Q) = E(s,a,s0 )∼µπ (s,a,s0 ) [r(s, s0 ) − (B π Q − Q)(s, a)] + E(s,a)∼µR (s,a) [f∗ ((B π Q − Q)(s, a))], h i E (s,s0 ) where B π Q(s, a) = Es0 ∼P (·|s,a),a0 ∼π(·|s0 ) r(s, s0 ) + γQ(s0 , a0 ) , and r(s, s0 ) = log µµR (s,s 0) . Proof. The first term in the RHS of the above equation can be reduced to the following: E(s,a,s0 )∼µπ (s,a,s0 ) [r(s, s0 ) − (B π Q − Q)(s, a)] h i =E(s,a)∼µπ (s,a) Es0 ∼P (·|s,a) r(s, s0 ) − (B π Q − Q)(s, a)  h i =E(s,a)∼µπ (s,a) Es0 ∼P (·|s,a) [r(s, s0 )] + Q(s, a) − Es0 ∼P (·|s,a) [B π Q(s, a)] h i XXX 0 XXX0 0 0 =E(s,a)∼µπ (s,a) Es0 ∼P (·|s,a) [r(s, sX)] X + Q(s, a) − E 0 0 0 s ∼P (·|s,a),a ∼π(·|s ) [ r(s, sX) + γQ(s , a )] h i =E(s,a)∼µπ (s,a) Q(s, a) − γEs0 ∼P (·|s,a),a0 ∼π(·|s0 ) [Q(s0 , a0 )] X ∞ X ∞ = (1 − γ) γ t Es∼µπt (s),a∼π(s) [Q(s, a)] − (1 − γ) γ t+1 Es∼µπt ,a∼π(·|s),s0 ∼P (·|s,a),a0 ∼π(·|s0 ) [Q(s0 , a0 )]] t=0 t=0 | {z } see Table 4.1 X∞ X∞ =(1 − γ) t γ Es∼µπt ,a∼π(s) [Q(s, a)] − (1 − γ) γ t+1 Es∼µπt+1 ,a∼π(·|s) [Q(s, a)]] t=0 t=0 =(1 − γ)Es∼p0 ,a0 ∼π(·|s0 ) [Q(s0 , a0 )]. Therefore: Jopolo (π, Q) = (1 − γ)Es∼p0 ,a0 ∼π(·|s0 ) [Q(s0 , a0 )] + E(s,a)∼µR [f∗ ((B π Q − Q)(s, a))]. A.0.7 Implementation Details of OPOLO A.0.7.1 Practical Considerations for Algorithm Implementation We provide some practical considerations to effectively implement our algorithm: 102 Initial state sampling: To increase the diversity of initial samples, we use state samples from an off-policy buffer and treat them as virtual initial states. A similar strategy is adopted by [88]. 
Constant shift on synthetic rewards: In practice, we adopt the same strategy of prior art [189] to use r(s, s0 ) = − log(1−D(s, s0 )), instead of log(D)−log(1−D) as the discriminator µE (s,s0 ) output. A fully optimized discriminator D∗ satisfies − log(1 − D∗ (s, s0 )) = log(1 + µR (s,s0 ) ), µE (s,s0 ) which corresponds to a constant shift on µR (s,s0 ) before the log term. Q and π network update: We follow the advice of AlgeaDICE [122] by using a target Q network and policy gradient clipping. Especially, when taking the gradients of Jopolo (π, Q, α) w.r.t.Q, we use the value from a target Q network to calculate B π Q(s, a) in order to stabilize training; on the other hand, since an optimal x∗ (s, a) = (B π Q∗ −Q∗ )(s, a) = µπ (s,a) µR (s,a) represents a density ratio and should always be non-negative, we clip (B π Q − Q)(s, a) to above 0 when taking gradients w.r.t.π. A.0.7.2 Hyper-parameters Table A.4 lists the hyper-parameters for GAIL [68], GAIfO [171], BCO [169], DAC [87], and our proposed approach OPOLO. Specifically, for off-policy approaches, each self-generated interaction will be stored the replay buffer in a FIFO manner, and update frequency is the number of interactions sampled from the MDP after which the module is updated. Moreover, considering the different scales for the gradients of J(πθ , Qφ ) and JReg (πθ ) in Algorithm 4.1, we apply a coefficient λ for OPOLO to adjust the regularization strength when calculating the total policy loss:  θ ← θ + α JOθ (πθ , Qφ ) + λJOθ JReg (πθ ) . A.0.8 Challenges of DICE without Expert Actions In this section, we analyze the principle of offline imitation learning using DICE [121, 201, 122] and the reason that impedes its direct application to an LfO setting. 103 Hyper-parameters Value Shared Parameters for Off-Policy Approaches Buffer size 107 Batch size 100 Learning rate 3e−4 Discount factor γ 0.99 Network architecture MLP [400, 300] Q, π update frequency / gradient steps 103 /103 D update frequency / gradient steps 500/10 Shared Parameters for On-Policy Approaches Batch size 2048 mini-Batch size 256 Learning rate 3e−4 Discount factor γ 0.99 Network architecture MLP [400, 300] BCO PI pre-train gradient steps 104 PI update frequency / gradient steps 103 /100 DAC Number of extra absorbing states 1 OPOLO PI update frequency / gradient steps 500/50 PI regularization coefficient λ 0.1 Table A.4: Hyper-parameters for Different Algorithms. In a LfO setting where expert actions are unavailable, the learning objective is to minimize the discrepancy of state-only distributions induced by the agent and the expert. Without loss of generality, we consider an arbitrary f-divergence Df as the discrepancy measure: max −Df [µπ (s, s0 )||µE (s, s0 )] π = max min Eµπ (s,s0 ) [−x(s, s0 )] + EµE (s,s0 ) [f ∗ (x(s, s0 ))], (A.3) π x:S×S→R in which f ∗ (x) is the conjugate of f (x) for the f -divergence. To remove the on-policy dependence of µπ (s, s0 ), we follow the rationale of DICE and use a similar change-of-variable trick mentioned in Sec 4.3.2 to learn a value function v(s, s0 ): v(s, s0 ) := −x(s, s0 ) + γEa0 ∼π(.|s0 ),s00 ∼P (.|s0 ,a0 ) [v(s0 , s00 )] = −x(s, s0 ) + B π v(s, s0 ). 104 This value function is a fixed point solution to an variant Bellman operator B π , which, however, is problematic in a model-free setting. 
To see this, we substitute x(s, s0 ) by (B π v − v)(s, s0 ) to transform Eq (A.3) into the following: max min Eµπ (s,s0 ) [−x(s, s0 )] + EµE (s,s0 ) [f ∗ (x(s, s0 ))] π x:S×S→R = max min (1 − γ) Es0 ∼p0 ,s1 ∼P (·|s0 ,π(s0 )) [v(s0 , s1 )] + EµE (s,s0 ) [f ∗ ((B π v − v)(s, s0 ))] . π v:S×S→R | {z } | {z } term 1 term 2 where B π v(s, s0 ) = γEa0 ∼π(.|s0 ),s00 ∼P (.|s0 ,a0 ) [v(s0 , s00 )]. Optimizing this objective is trouble- some, in that the B π v(s, s0 ) in term 2 requires knowledge of P (·|s, π(s)), ∀s ∼ µE (s). In another word, for any state sampled from the expert distribution, we need to know what would be the next state if following policy π from this state. A similar issue is echoed in term 1, where s1 is sampled from P (·|s0 , π(s0 )). Consequently, directly applying DICE loses its advantage in a LfO setting, as it incurs a dependence on a forward transition model, which is costly to estimate and may counteract the efficiency brought by off-policy learning. 105 APPENDIX B APPENDIX FOR FEDGEN B.0.1 Notations and Preliminaries Let X ⊂ Rp be the input space, Z ⊂ Rd be the latent feature space, and Y ⊂ R be the output space. R : X → Z denotes a representation function that maps inputs into features. T denotes a domain (or task ), which consists of a data distribution D over X and a ground- truth labeling function c∗ : X → Y. Given a domain T := hD, c∗ i and a representation function R, we use D̃ to denote the induced image of D under R [17], s.t. given a probability event B, Ez∼D̃ [B(z)] = Ex∼D [B(R(x))]. Accordingly, c̃∗ denotes the induced labeling function under R: c̃∗ (z) := Ex∼D [c∗ (x)|R(x) = z] . Let h : Z → Y denote a hypothesis that maps features to predicted labels, and H ⊆ {h : Z → Y} denote a hypothesis class. For our analysis, we assume the FL tasks are for binary classification, i.e. Y = {0, 1}, and the loss function is 0-1 bounded, with l(ŷ, y) = |ŷ − y|. Same assumptions have been adopted by various prior art [17, 18, 16, 103, 17]. Given two distributions D and D0 , dH (D, D0 ) is defined as the H-divergence between D and D0 , i.e.: dH (D, D0 ) := 2 sup |PrD (A) − PrD0 (A)|}, A∈AH where AH is a set of measurable subsets under D and D0 for certain h ∈ H. Moreover, H4H is defined as the symmetric difference hypothesis space [18], i.e.: H4H := {h(z) ⊕ h0 (z), h, h0 ∈ H} 106 where ⊕ denotes the XOR operator, so that h(z) ⊕ h0 (z) indicates that h and h0 disagrees with each other. Accordingly, AH4H is a set of measurable subsets for ∀ h(z)⊕h0 (z) ∈ H4H. Then dH4H (·, ·) is defined as the distribution divergence induced by the symmetric difference hypothesis space [18]: dH4H (D, D0 ) := 2 sup |PrD (A) − PrD0 (A)|}. A∈AH4H Specifically, let D, D0 be two arbitrary distributions on the input space X , and let D̃, D̃0 be their induced images over R. Then based on the definition of dH4H (·, ·), one can have: dH4H (D̃, D̃0 ) = 2 sup |Ex∼D [Pr(A(R(x)))] − Ex∼D0 [Pr(A(R(x)))]| A∈AH4H =2 sup |Ez∼D̃ [Pr(A(x))] − Ez∼D̃0 [Pr(A(z))]| A∈AH4H =2 sup |PrD̃ (A) − PrD̃0 (A)|}. A∈AH4H B.0.2 Derivations of Remark 5 Remark. Let p(y) be the prior distribution of labels, and r(z|y) : Y → Z be the conditional distribution derived from generator Gw . Then regulating a user model θk using samples from r(z|y) can minimize the conditional KL-divergence between two distributions, derived from the user and from the generator, respectively: max Ey∼p(y),z∼r(z|y) [log p(y|z; θk )] ≡ min DKL [r(z|y)kp(z|y; θk )], θk θk Proof. 
Expanding the KL-divergence, we have    r(z|y) ∵ DKL [r(z|y)kp(z|y; θk )] ≡ Ey∼p(y) Ez∼r(z|y) log p(z|y; θ) = Ey∼p(y) Ez∼r(z|y) [log r(z|y)] − Ey∼p(y) Ez∼r(z|y) [log p(z|y; θ)] = − H(r(z|y)) − Ey∼p(y) Ez∼r(z|y) [log p(z|y; θ)]. constant w.r.t θk 107 where H(r(z|y)) is constant w.r.t θk . Therefore when optimizing θk we have: min DKL [r(z|y)kp(z|y; θk )] θk ≡ min − Ey∼p(y),z∼r(z|y) [log p(z|y; θk )] θk p(y|z; θk )p(z) ≡ max Ey∼p(y) Ez∼r(z|y) [log ] θk p(y) ≡ max Ey∼p(y) Ez∼r(z|y) [log p(y|z; θk ) + log p(z) − log p(y)] θk ≡ max Ey∼p(y) Ez∼r(z|y) [log p(y|z; θk )]. θk where H(r(z|y)) denotes the entropy of the probability distribution r(z|y) which is not optimizable w.r.t θk , and p(z|y; θk ) := p(y|z;θk )p(z) p(y) is defined as the probability that the input representation to the predictor is z if it yields a label y. B.0.3 Derivations of Theorem 2 Before deriving Theorem 1, we first present an upper-bound for the generalization perfor- mance from prior art [17], which analyzes the role of a feature representation function in the context of domain adaptation: Lemma 4. Generalization Bounds for Domain Adaptation [17, 18]: Let TS and TT be the source and target domains, whose data distributions are DS and DT . Let R : X → Z be a feature representation function, and D̃S , D̃T be the induced images of DS and DT over R, respectively. Let H be a set of hypotheses with VC-dimension d. Then with probability at least 1 − δ, ∀ h ∈ H: s   4 2em 4 LTT (h) ≤ L̂TS (h) + d log + log + dH4H (D̃S , D̃T ) + λ, (B.1) m d δ where e is the base of the natural logarithm, L̂TS (h) is the empirical risk of the source domain given m observable samples, and λ = minh∈H (LTT (h) + LTS (h)) is the optimal risk on the two domains. 108 One insight from Lemma 4 is that a good representation function plays a tradeoff be- tween minimizing the empirical risk (L̂TS (h)) and the induced distributional discrepancy ( dH4H (D̃S , D̃T )). Based on Lemma 4, one can establish Theorem 1 as the following: Theorem. (Generalization Bounds for FL) Consider an FL system with K users. Let Tk = hDk , c∗ i and T = hD, c∗ i be the k-th local domain and the global domain, respectively. Let R : X → Z be a feature extraction function that is simultaneously shared among users. Let hk denote the hypothesis learned on domain Tk , and h = K1 K P k=1 hk be the global ensemble of user predictors. Then with probability at least 1 − δ: s   1 X 1 X 4 2em 4K LT (h) ≤ L̂Tk (hk ) + (dH4H (D̃k , D̃) + λk ) + d log + log , K K m d δ k∈[K] k∈[K] where L̂Tk (hk ) is the empirical risk of hk , λk := minh (LTk (h) + LT (h)) denotes an oracle performance on Tk and T , and D̃k and D̃ is the induced image of Dk and D from R, respectively, s.t. Ez∼D̃k [B(z)] = Ex∼Dk [B(R(x))] given a probability event B, and so for D̃. Proof. By treating each one of the local domains k ∈ [K] as the source and the global domain as the target, one can have that, ∀ δ > 0, with probability 1 − Kδ : s   4 2em 4K LT (hk ) ≤ L̂Tk (hk ) + dH4H (D̃k , D̃) + λk + d log + log . m d δ Also, due to the convexity of risk function and Jesen inequality, one can have:   1 X 1 X LT (h) ≡ LT  hk  ≤ LT (hk ). K K k∈[K] k∈[K] Therefore,   s    1 X L̂Tk (hk ) + X 4 2em 4K  Pr LT (h) > (dH4H (D̃k , D̃) + λk ) + d log + log K m d δ k∈[K] k∈[K] s   1 X 1 X X 4 2em 4K ≤Pr[ LT (hk ) > (L̂Tk (hk ) + (dH4H (D̃k , D̃) + λk ) + d log + log )] K K m d δ k∈[K] k∈[K] k∈[K]  s    _ 4 2em 4K  ≤Pr  LT (hk ) > L̂Tk (hk ) + dH4H (D̃k , D̃) + λk + d log + log m d δ k∈[K] X δ ≤ = δ. 
K k∈[K] 109 Theorem 1 shows that the performance of the aggregated hypothesis is upper-bounded by: 1) the local performance of each user hypothesis (L̂Tk (hk )), 2) the dissimilarity between the global and local distributions over the feature space (dH4H (D̃k , D̃)), 3) the oracle per- formance (λk ), and 4) the numerical constraints regarding the number of empirical samples m and the VC-dimension d. B.0.4 Derivations of Corollary 1 Corollary. Let T , Tk , R defined as in Theorem 1. DA denotes an augmented data distri- bution, and Dk0 = 12 (Dk + DA ) is a mixture of distributions. Accordingly, D̃A and D̃k0 denote the induced image of DA and Dk0 over R, respectively. Let D̂k0 = D̂k ∪ D̂A be an empirical dataset of Dk0 , with |D̂k |=m, |D̂k0 | = |D̂k | + |D̂A | = m0 . Assume the discrepancy between D̃A and D̃ is bounded, s.t ∃  > 0, dH4H (D̃A , D̃) ≤ , then with probability 1 − δ: s  2em0  1 X 1 X 0 1 X 0 4 4K LT (h) ≤ LTk0 (hk ) + (dH4H (D̃k , D̃)) + λ + d log + log , K k K k K k k m0 d δ (B.2) where Tk0 = {Dk0 , c∗ } is the updated local domain, λ0k = minh (LTk0 (h) + LT (h)) denotes the oracle performance, and dH4H (D̃k0 , D̃) ≤ dH4H (D̃k , D̃) when  is small. Proof. Equation B.2 can be directly derived by Theorem 1. We now focus on analyzing the relation between dH4H (D̃k , D̃) and dH4H (D̃k0 , D̃), which is the data dissimilarity before and after data augmentation using samples from distribution DA , respectively. 110 Based on the definition of dH4H (·, ·), one can derive that: dH4H (D̃k0 , D̃) =2 sup Ez∼D̃0 [Pr(A(z))] − Ez∼D̃ [Pr(A(z))] k A∈AH4H =2 sup Ez∼ 1 (D̃k +D̃A ) [Pr(A(z))] − Ez∼D̃ [Pr(A(z))] 2 A∈AH4H 1 1 =2 sup Ez∼D̃K [Pr(A(z))] + Ez∼D̃A [Pr(A(z))] − Ez∼D̃ [Pr(A(z))] A∈AH4H 2 2 ≤ sup Ez∼D̃K [Pr(A(z))] − Ez∼D̃ [Pr(A(z))] A∈AH4H + sup Ez∼D̃A [Pr(A(z))] − Ez∼D̃ [Pr(A(z))] A∈AH4H 1 1 = dH4H (D̃k , D̃) + dH4H (D̃A , D̃). 2 2 It is clear that 21 dH4H (D̃A , D̃), which is bounded by , affects the dissimilarity between the in- duced image of local and the global distribution, therefore plays a key role in upper-bounding the global performance (LT (h) in Equation B.2). Next, we discuss different scenarios when FL can benefit from such augmented data, and when the quality of augmented distribution DA can limit the generalization performance of the aggregated model. DA can benefit local users when  is small: To see this, one can assume that: dH4H (D̃A , D̃) =  ≤ min dH4H (D̃k , D̃), k of which the intuition is that, after feature mapping, the discrepancy between the augmented distribution and the global distribution is smaller than the discrepancy between an individual user and the global. Based on this assumption, one can conclude that ∀ Tk ∈ T : 1 1 dH4H (D̃k0 , D̃) = dH4H (D̃k , D̃) + dH4H (D̃A , D̃) 2 2 1 ≤ dH4H (D̃k , D̃) + min dH4H (D̃j , D̃) 2 j ≤dH4H (D̃k , D̃), 111 Therefore, a small dH4H (D̃A , D̃) benefits local users w.r.t their generalization performance, by both reducing the data discrepancy and enriching the empirical samples, in that: s  2em0  4 4 0 0 LT (hk ) ≤ LTk0 (hk ) + λk + ≤ dH4H (D̃k , D̃) + d log + log ( Lemma 4). | {z } m0 d δ ≤dH4H (D̃k ,D̃) | √ 4 {z2em 4 } ≤ m (d log d +log δ ) DA has positive effects on the generalization performance when  is moderate: Instead, one might as well assume that K 1 X dH4H (D̃A , D̃) =  ≤ dH4H (D̃k , D̃), K k=1 which implies that, after feature mapping over R, the dissimilarity between DA and the global distribution D is at least as small as the average dissimilarity between local users and the global. 
Based on this assumption, one can derive that: s  s  2em0   X X 4 4 4 2em 4 dH4H (D̃k0 , D̃) ≤ dH4H (D̃k , D̃), d log + log ≤ d log + log , m0 d δ m d δ k k which can still contribute to a tighter upper-bound for the global performance in Equa- tion B.2, compared with not using the augmented data. Conversely, when  is over-large, which implies that DA is not relevant to the original FL task, it may have negative impacts on the generalization performance. B.0.5 Practical Settings of FedGen We first discuss some practical considerations for implementing our algorithm: • Weighting user models: User models vary in their ability to predict certain labels over others due to their statistical heterogeneity. Therefore, we use the number of training labels available to users to summarize a weight matrix Λ = {λck |c ∈ Y, k ∈ λci nc {1, 2, · · · , K}}, s.t. ∀c, i, j, λcj = nci indicates the ratio of training samples for label c j between two users i and j, and k λk = 1 ∀c ∈ Y. We then apply this weight matrix P c 112 to adjust the generator objective as the following:    X   y 1 K p min J(w) := Ey∼p̂(y) Ez∼Gw (z|y) λk l σ g(z; θk ) , y . w K k=1 We found that this weighted objective can further mitigate the impact of negative ensemble, especially when a teacher model is too weak to predict certain labels due to lacking training samples of that category. • Stochastic generative learning: Built upon prior arts on generative learning [83], we use an auxiliary noise vector with dimension dn to infer the desirable feature rep- resentation for a given label y, s.t. z ∼ Gw (·|y) ≡ Gw (y, | ∼ N (0, I)). To further increase the diversity of the generator output, we also leverage the idea of diversity loss from prior work [114] to train the generator model. B.0.6 Prototype Results We adopt a one-round FL setting for the prototype exper- iment, for which the dataset distributions of local users, User 1 User 2 User 3 Oracle Before 97.1 81.3 81.2 as well as their model decision boundaries before and af- After 98.6 98.3 98.2 98.4 ter knowledge distillation, are illustrated in Figure B.1. Table B.1: Accuracy (%) before Accuracy of user models on the global dataset is also and after KD. summarized in Table B.1, from which one can observe that the generalization performance of user models have been notably improved by the distilled knowledge. B.0.7 Experimental Setup We provide the network architecture for the generator and the classifier in Table B.2 and Table B.3. For the generator Gw , we adopt a two-MLP layer network. It takes a noise vector  and a one-hot label vector y as the input, which, after a hidden layer with dimension dh , outputs a feature representation with dimension d. For the classifier, we adopt a network 113 (a) Local user data distribution and total data distribution. (b) User models generate biased decision boundaries before KD, provided with incom- plete local data. (c) Decision Boundaries of user models are improved after KD. Figure B.1: Knowledge distillation process for the prototype experiment. architecture with a CNN module followed by an MLP module. Hyperparameter settings for the experiments are provided in Table B.2 and B.3. Dataset Hyperparameter Value CelebA dn , dh , d 32, 128, 32 Mnist& EMnist dn , dh , d 32, 256, 32 Table B.2: Network architecture for the generator Gw . 114 Dataset Hyperparameter Value CNN Module [16, M, 32, M, 64] CelebA MLP Module [784, 32] Mnist CNN Module [6, 16] & EMnist MLP Module [784, 32] Table B.3: Network architecture for the classification model. 
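For reference, the generator described above can be sketched as a small conditional MLP. The code below is an illustrative PyTorch implementation consistent with the description and the dimensions in Table B.2 (noise dimension d_n, hidden dimension d_h, feature dimension d); the class name, variable names, and the specific activation are our assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    """Two-MLP-layer generator G_w: (noise eps, one-hot label y) -> latent feature z."""

    def __init__(self, num_classes: int, d_noise: int = 32,
                 d_hidden: int = 256, d_feature: int = 32):
        super().__init__()
        self.d_noise = d_noise
        self.num_classes = num_classes
        self.fc1 = nn.Linear(d_noise + num_classes, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_feature)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (B,) integer labels sampled from the estimated label prior p_hat(y)
        eps = torch.randn(y.size(0), self.d_noise, device=y.device)   # auxiliary noise
        y_onehot = F.one_hot(y, num_classes=self.num_classes).float()
        h = F.relu(self.fc1(torch.cat([eps, y_onehot], dim=1)))
        return self.fc2(h)                                            # z ~ G_w(.|y)

# Example: draw synthetic features for a few labels, using the Mnist/EMnist sizes of Table B.2
gen = ConditionalGenerator(num_classes=10, d_noise=32, d_hidden=256, d_feature=32)
z = gen(torch.tensor([0, 1, 2, 3, 4]))
print(z.shape)   # torch.Size([5, 32])
```

The sampled features z can then be fed to each user's prediction layer as the augmented inputs used in the regularization term of the local objective.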
B.0.8 FedGen with Partial Parameter Sharing

Algorithm B.1 summarizes a variant of FedGen for a specific FL setting, where only the last prediction layer is shared among users, while the feature extraction layers are kept localized.

Algorithm B.1: FedGen with Partial Parameter Sharing
1: Require: Tasks T_k, k ∈ {1, · · · , K};
2: Global predictor θ^p, local parameters {θ_k = [θ_k^f ; θ_k^p]}_{k=1}^{K};
3: Generator parameter w; p̂(y) uniformly initialized;
4: Learning rate α, β, local steps T, batch size B, local label counter c_k.
5: repeat
6:   Server selects active users A uniformly at random, then broadcasts w, θ^p, p̂(y) to A.
7:   for all user k ∈ A in parallel do
8:     θ_k^p ← θ^p,
9:     for t = 1, . . . , T do
10:      {x_i, y_i}_{i=1}^{B} ∼ T_k,  {ẑ_i ∼ G_w(·|ŷ_i), ŷ_i ∼ p̂(y)}_{i=1}^{B}.
11:      Update label counter c_k.
12:      θ_k ← θ_k − β ∇_{θ_k} J(θ_k).
13:    User sends θ_k^p, c_k back to server.
14:  Server updates θ^p ← (1/|A|) Σ_{k∈A} θ_k^p, and p̂(y) based on {c_k}_{k∈A}.
15:  w ← w − α ∇_w J(w).
16: until training stop

B.0.9 Extended Experimental Results

We elaborate on the learning curves obtained on the Mnist, CelebA, and EMnist datasets in Figure B.2, Figure B.3, and Figure B.4, respectively, with their performance summarized in Table B.4.

Figure B.2: Performance curves on the Mnist dataset, where a smaller α denotes larger data heterogeneity. Panels: (a) α = 0.05, (b) α = 0.1, (c) α = 1, (d) α = 10.

Figure B.3: Performance curves on the CelebA dataset. Panels: (a) r = 9/10, (b) r = 5/10, (c) r = 5/25, (d) r = 0.4.

Figure B.4: Performance curves on the EMnist dataset, under different kinds of data heterogeneity and communication frequencies. Panels: (a)–(d) α ∈ {0.05, 0.1, 1, 10} with T = 20; (e)–(h) α ∈ {0.05, 0.1, 1, 10} with T = 40.

Top-1 Test Accuracy (%).
Dataset        | Setting   | FedAvg     | FedProx    | FedEnsemble | FedDistill | FedDistill+ | FedDFusion | FedGen
Mnist          | α = 0.05  | 87.70±2.07 | 87.49±2.05 | 88.85±0.68  | 70.56±1.24 | 86.70±2.27  | 90.02±0.96 | 91.30±0.74
Mnist          | α = 0.1   | 90.16±0.59 | 90.10±0.39 | 90.78±0.39  | 64.11±1.36 | 90.28±0.89  | 91.11±0.43 | 93.03±0.32
Mnist          | α = 1     | 93.84±0.25 | 93.83±0.29 | 93.91±0.28  | 79.88±0.66 | 94.73±0.15  | 93.37±0.40 | 95.52±0.07
Mnist          | α = 10    | 94.23±0.13 | 94.06±0.10 | 94.25±0.11  | 89.21±0.26 | 95.04±0.21  | 93.36±0.45 | 95.79±0.10
CelebA         | r = 5/10  | 87.48±0.39 | 87.67±0.39 | 88.48±0.23  | 76.68±1.23 | 86.37±0.41  | 87.01±1.00 | 89.70±0.32
CelebA         | r = 5/25  | 89.13±0.25 | 88.84±0.19 | 90.22±0.31  | 74.99±1.57 | 88.05±0.43  | 88.93±0.79 | 89.62±0.34
CelebA         | r = 10/25 | 89.12±0.20 | 89.01±0.33 | 90.08±0.24  | 75.88±1.17 | 88.14±0.37  | 89.25±0.56 | 90.29±0.47
EMnist, T = 20 | α = 0.05  | 62.25±2.82 | 61.93±2.31 | 64.99±0.35  | 60.49±1.27 | 61.56±2.15  | 70.40±0.79 | 68.53±1.17
EMnist, T = 20 | α = 0.1   | 66.21±2.43 | 65.29±2.94 | 67.53±1.19  | 50.32±1.39 | 66.06±3.18  | 70.94±0.76 | 72.15±0.21
EMnist, T = 20 | α = 1     | 74.83±0.99 | 74.12±0.88 | 75.12±1.07  | 46.19±0.70 | 75.41±1.05  | 75.43±0.37 | 78.48±1.04
EMnist, T = 20 | α = 10    | 74.83±0.69 | 74.24±0.81 | 74.90±0.80  | 54.77±0.33 | 75.55±0.94  | 74.36±0.40 | 78.43±0.74
EMnist, T = 40 | α = 0.05  | 64.51±1.13 | 63.60±0.69 | 65.74±0.45  | 60.73±1.62 | 60.73±1.06  | 70.46±1.16 | 67.64±0.75
EMnist, T = 40 | α = 0.1   | 67.71±1.31 | 66.79±0.77 | 68.96±0.66  | 49.54±1.18 | 67.01±0.38  | 71.55±0.43 | 70.90±0.49
EMnist, T = 40 | α = 1     | 77.02±1.09 | 75.93±0.95 | 77.68±0.98  | 46.72±0.73 | 78.12±0.90  | 77.58±0.37 | 78.92±0.73
EMnist, T = 40 | α = 10    | 77.52±0.66 | 76.54±0.71 | 77.92±0.62  | 54.85±0.44 | 78.37±0.76  | 77.31±0.45 | 79.29±0.53
Table B.4: Performance overview under different data heterogeneity settings. For Mnist and EMnist, user data follows the Dirichlet distribution with hyperparameter α, with a smaller α indicating higher heterogeneity. For CelebA, r denotes the ratio between active users and total users. T denotes the local training steps (communication delay). All the above experiments use batch size B = 32.

APPENDIX C
APPENDIX FOR FEDRESCUE

C.0.1 Case Study: Effects of Progressive Learning

For this prototype experiment, we apply the same network architecture used for learning the DigitsFive domains, which is presented in Section C.0.6.1. We illustrate the progressive and overwriting learning approaches in Figure C.1 and Figure C.2, respectively. For progressive learning, the parameters in θ×0.5 are updated first for learning the Mnist domain, and are then frozen while the parameters in θ×1.0 \ θ×0.5 are updated for learning the Synthetic domain. Contrarily, the overwriting approach updates the entire set of model parameters when learning on the Synthetic domain. We set the learning rate to 0.01 and train 10 epochs for each domain, with a batch size of 32. For the 10% (5%) training data setting, we apply 743 (371) training samples for learning both the Mnist and Synthetic domains. Experimental results are averaged over 6 random seeds, with the seed numbers set to 1, 3, 5, 7, 9, and 11, respectively.
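To make the two training procedures concrete, the following is a minimal, simplified PyTorch sketch of progressive learning viewed as gradient masking. The names and the slicing convention are ours: here the ×0.5 submodel is assumed to own the first half of the channels of every parameter tensor, whereas a real width-pruned model would not slice the raw input or output dimensions, and the thesis implementation may differ.

```python
import torch
import torch.nn as nn

def freeze_submodel_gradients(model: nn.Module, p: float = 0.5) -> None:
    """Zero the gradients belonging to the x`p` submodel, so that an optimizer step
    only updates the remaining (newly grown) columns. Assumes, for illustration,
    that the x`p` submodel owns the first `p` fraction of channels of each tensor."""
    for param in model.parameters():
        if param.grad is None:
            continue
        g = param.grad
        if g.dim() >= 2:                     # weights: (out_channels, in_channels, ...)
            g[: int(g.size(0) * p), : int(g.size(1) * p)].zero_()
        else:                                # biases / norm parameters: (out_channels,)
            g[: int(g.size(0) * p)].zero_()

# Usage sketch: second-stage training on the Synthetic domain.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain SGD, no momentum
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))   # a stand-in training batch
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
freeze_submodel_gradients(model, p=0.5)   # progressive: protect theta_{x0.5}
# (the overwriting baseline simply skips the call above)
optimizer.step()
```

With plain SGD, zeroing the masked gradient slices leaves the corresponding parameters untouched, which is exactly the freezing behavior of the progressive procedure; skipping the masking recovers the overwriting baseline.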
As shown in Table C.1 and Table C.2, the overwriting learning procedure leads to drastic information forgetting on the previously learned domain (Mnist). On the contrary, evalua- tions on both the ×0.5 model and ×1.0 model indicate that a progressive learning scheme can adaptively build up knowledge on the new training data without undermining the preced- ing knowledge representations. In fact, the new collateral connections that are progressively learned can also improve the overall model performance on the Mnist domain. For instance, when training using 5% of domain data, model performance on the Mnist domain has been improved from 66.46% (on the ×0.5 model) to 72.27% (on the ×1.0 model), which verifies the advantage of our progressive model learning strategy. 118 Figure C.1: Illustration of progressive learn- Figure C.2: Illustration of overwriting learn- ing. ing. Accuracy (%) Evaluated on the × 1.0 Model Accuracy (%) of Mnist on the × 0.5 Model Domain Progressive Overwriting Train data Progressive Overwriting Given 10 % training data Given 10 % training data Mnist 77.95±7.93 38.17±28.19 10% 74.99±11.26 33.51±24.95 Synthetic 69.48±1.93 42.30±32.30 Given 5 % training data 5% 66.46±14.20. 19.23±16.21 Given 5 % training data Mnist 72.27±9.43 19.60±17.53 Table C.2: The overwriting learning Synthetic 52.43±5.88 19.03±16.32 approach leads to severe forgetting Table C.1: Progressive vs. overwrit- on the previously learned knowledge. ing learning. C.0.2 Details of Evaluated Related Work One related approach to learn prunable models is the slimmable network proposed in [196], |P | which works by sampling a set of pruning ratios P = {pi |0 < pi ≤ 1}i=1 then performing a batch gradient update, as summarized in Algorithm C.1. During each learning iteration, a constant copy of the complete network θ̄ is referred to as teacher, whose prediction distri- bution p(·|x; θ̄) ∝ f (x; θ) is distilled into the submodel θ×pi , i.e. the student. The gradients for both teacher and student models are later aggregated to update the network parameter. Slimmable learning is proposed for non-FL settings, where sufficient data is accessible on a central machine. There are potential drawbacks of directly applying it to our FL setting. First is the error propagation issue: when the teacher model is underperforming e.g. given insufficient training or connection loss, its sub-optimality may be propagated back into the student. In contrast, we propose an objective (in Equation 6.1) that alleviates this issue by 119 introducing minθ×p L(f (X ; θ×p ), Y), which allows submodels to receive label supervisions. Second, the gradients for different submodels may interfere with each other when being aggregated, which weakens the model’s learning effect, as verified in Section 6.5.6.2. Another compared related work is FjORD [69], which also learns prunable networks for a heterogeneous FL system. We summarize its local model updating procedure in Algo- rithm C.2. In practice, we also apply loss backpropagation through the teacher model (i.e. line 9 in Algorithm C.2) when optimizing towards the FjORD objectives, which is suggested in [69] to further improve their model performance. Note that FjORD does not involve pro- gressive parameter updates, the effects of which will be elaborated more in Section C.0.4.1. In our experiments, we explore different choices of S and perform evaluations on FjORD and FedSlim using their optimal S accordingly. 
Algorithm C.1: Slimmable-Training [196]
1: Inputs: training dataset D ⊂ X × Y; model with parameter set θ; pruning ratios P; learning rate η; loss function L; constant S ≤ |P|.
2: repeat
3:   Sample batch x, y ∼ D.
4:   Sample widths P̂ = [p_i | p_i ∼ P]_{i=1}^{S}.
5:   θ̄ ← stop_gradient(θ).
6:   for p_i ∈ P̂ do
7:     g_i ← ∇_{θ×p_i} KL[f(x; θ̄) ‖ f(x; θ×p_i)].
8:   g_θ ← ∇_θ L(f(x; θ), y).
9:   Update parameter θ ← θ − η · (g_θ + Σ_{i=1}^{S} g_i).
10: until training stop

Algorithm C.2: Local Model Update for FjORD [69]
1: Inputs: training dataset D ⊂ X × Y; model with parameter set θ; pruning ratios P; learning rate η; loss function L; constant S ≤ |P|.
2: repeat
3:   Sample batch x, y ∼ D.
4:   Sample widths P̂ = [p_i | p_i ∼ P]_{i=1}^{S}.   ▷ (S = 1 in [69])
5:   θ̄ ← stop_gradient(θ).
6:   for p_i ∈ P̂ do
7:     g_i ← ∇_{θ×p_i} ( L(f(x; θ×p_i), y) + D_KL[f(x; θ̄) ‖ f(x; θ×p_i)] ).
8:   g_θ ← ∇_θ L(f(x; θ), y).
9:   Update parameter θ ← θ − η · (g_θ + Σ_{i=1}^{S} g_i).
10: until training stop

C.0.3 Experiments

C.0.3.1 Treatments on Batch Normalization Layers

The BatchNorm layers need to be carefully handled for effectively training prunable models. Prior arts either keep the BatchNorm layers localized on user devices to improve personalized FL, as proposed in FedBN [100], or learn individual BatchNorm modules for each possible submodel [196], which unnecessarily increases the number of learnable parameters. In our work, we apply a lightweight approach that still shares the trainable parameters in BatchNorm layers while disabling the tracking of the running average and variance of training batches, which proves to be an effective scheme across different datasets [39]. For fair comparison, we apply the same strategy to all baselines, including FedAvg.

Table C.3 summarizes the impact of different BatchNorm practices on FedAvg performance. It indicates that personalizing BN layers, as FedBN does, may not be good practice for learning i.i.d. yet complicated domains such as CelebA. On the contrary, decoupling feature learning from the tracking of training-data mean and variance proves to be effective for both FedAvg and our proposed self-distillation approach.

Effects of Different BatchNorm Layer Configurations.

Algorithm | Personalized BN Layers | Tracking Training Status | Test Accuracy (%) (100% training) | Test Accuracy (%) (20% training)
FedAvg*   | No  | No  | 81.06±0.63 | 68.03±0.50
FedAvg    | No  | Yes | 79.73±0.22 | 68.14±0.47
FedBN     | Yes | Yes | 76.12±0.74 | 60.41±0.57

Table C.3: We adopted FedAvg* as the baseline implementation in the main paper.

C.0.3.2 Overview of Communication Efficiency

We summarize the communication efficiency of the evaluated algorithms on the CelebA domain in Table C.4, where the evaluation is performed under a heterogeneous FL system. The results demonstrate that FedResCuE requires the fewest communication rounds to reach the predefined accuracy.

Communication Efficiency on CelebA Dataset, Cluster Setting (number of communication rounds to reach the target accuracy).

Accuracy | Model Size | FedHetero  | FjORD*     | FedSlim*    | FedResCuE
100% training data, 0.1 ≤ er ≤ 0.2
65%      | w×1        | -          | -          | -           | 174.7±22.2
60%      | w×0.5      | 256.7±19.3 | 218.0±5.9  | 253.3±26.5  | 124.7±11.1
55%      | w×0.25     | 222.7±6.8  | 165.3±5.2  | 190.7±23.8  | 85.3±1.9
20% training data, er = 0
60%      | w×1        | -          | 244.0±33.5 | 188.7±133.4 | 166.0±14.2
55%      | w×0.5      | 180.7±14.8 | 156.0±13.4 | 192.0±7.1   | 96.0±2.8
50%      | w×0.25     | 163.3±11.5 | 122.7±4.1  | 168.7±5.2   | 76.7±1.9

Table C.4: Communication efficiency overview, where '-' indicates that the predefined performance is not reached before training ends.
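The BatchNorm treatment of Section C.0.3.1 amounts to the layer configuration sketched below; this is a minimal illustration assuming standard PyTorch modules, not the thesis implementation.

```python
# Minimal sketch (assumed PyTorch usage): the affine weight/bias stay trainable and
# shareable like any other parameter, while running-mean/variance tracking is disabled
# so normalization always uses the statistics of the current batch.
import torch
import torch.nn as nn

def make_bn(num_channels: int) -> nn.BatchNorm2d:
    # affine=True keeps the learnable scale/shift; track_running_stats=False removes
    # the running average and variance buffers entirely.
    return nn.BatchNorm2d(num_channels, affine=True, track_running_stats=False)

bn = make_bn(64)
x = torch.randn(8, 64, 7, 7)
bn.eval()                        # even in eval mode, batch statistics are used
print(bn(x).shape)               # torch.Size([8, 64, 7, 7])
print(bn.running_mean is None)   # True: nothing is tracked or synchronized across users
```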
C.0.3.3 Performance with System Heterogeneity

In addition to the discussion in Section 6.5.2 of the main paper, we provide two more observations regarding system heterogeneity. 1) When FL is free from the risk of connection interruptions, a uniform setting, in which the same model architecture is assigned to all edge devices, is generally more beneficial than a cluster setting, where heterogeneous and smaller models are enabled. This result conforms to the intuition that larger models can capture more representative feature maps for the learning domain, while smaller models sacrifice such information gain for computation and communication efficiency. 2) Conversely, diversity in model architecture makes FL resilient to connection loss to some extent. Specifically, performing parameter-wise averaging over heterogeneous models, as FedHetero does, can make the submodels inside the global model functional as well, since their parameters receive contributions from smaller-capacity users. It can therefore outperform FedAvg under unpredictable connection losses, although this advantage is much weaker than that of FedResCuE with self-distillation learning. We provide a more comprehensive summary in Table C.5, with the accompanying learning curves illustrated in Section C.0.5.

Global Model Accuracy (%) Evaluated on CelebA.

Training Data | User Capacity | Evaluated Model | FedAvg     | FedHetero  | FjORD*     | FedSlim*   | FedResCuE
100%          | uniform       | w×1             | 81.06±0.63 | -          | 80.57±0.91 | 81.14±0.76 | 81.39±0.20
100%          | uniform       | w×0.5           | 49.46±1.10 | -          | 77.59±0.31 | 77.47±0.75 | 77.82±0.15
100%          | uniform       | w×0.25          | 18.57±0.64 | -          | 69.94±0.65 | 70.47±0.61 | 71.19±0.19
100%          | cluster       | w×1             | -          | 76.80±0.53 | 75.71±0.47 | 77.49±0.40 | 78.22±0.41
100%          | cluster       | w×0.5           | -          | 76.08±0.50 | 75.35±0.43 | 77.12±0.40 | 77.58±0.33
100%          | cluster       | w×0.25          | -          | 68.56±0.51 | 70.98±0.75 | 73.22±0.34 | 73.25±0.47
20%           | uniform       | w×1             | 68.03±0.50 | -          | 67.89±1.47 | 67.96±0.72 | 71.27±0.27
20%           | uniform       | w×0.5           | 36.56±1.74 | -          | 65.13±1.70 | 64.80±1.19 | 68.15±0.39
20%           | uniform       | w×0.25          | 16.47±2.24 | -          | 61.38±1.69 | 59.56±1.39 | 61.12±1.35
20%           | cluster       | w×1             | -          | 59.38±0.41 | 62.43±1.65 | 59.53±0.86 | 64.53±1.06
20%           | cluster       | w×0.5           | -          | 59.95±0.36 | 63.06±1.61 | 60.36±0.45 | 66.37±0.78
20%           | cluster       | w×0.25          | -          | 55.41±0.39 | 61.86±1.21 | 58.31±0.23 | 61.98±0.85

Table C.5: FL performance with i.i.d. user statistical distributions.

C.0.3.4 Performance Under Connection Loss

In our experiments, we simulate unpredictable transmission scenarios with connection loss via a random variable er, which denotes the probability that the current column transmission is interrupted. For experiments on the CelebA domain, er is first uniformly sampled within an error bound (e.g. er ∈ [0.1, 0.2]). Next, we use er to determine whether the current column transmission is interrupted, by comparing it against another uniformly sampled random variable ε, which leads to interruption iff ε < er. We perform such random sampling of er and ε for transmitting each column in a model, where a column for transmission is set to be ×0.125 of the global model width. A performance overview on the CelebA domain given different connection loss rates is presented in Table C.6. For experiments on the DigitsFive domains, we set a constant er that depends on the domain type. Performance on all five domains is provided in Table C.7. Note that we applied a small training dataset for DigitsFive to ensure the necessity of FL, so that Local learning without parameter sharing yields worse performance than FL.
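The column-wise connection-loss simulation of Section C.0.3.4 can be sketched as follows; this is a minimal illustration that assumes ε is drawn uniformly from [0, 1] and uses an illustrative column count, not the thesis implementation.

```python
# Minimal sketch (assumed setup): for every column transmission, an error rate er is
# drawn within the error bound and a second uniform sample eps decides whether that
# column is interrupted (interruption iff eps < er).
import random

def transmit_columns(num_columns: int, er_low: float = 0.1, er_high: float = 0.2,
                     seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    delivered = []
    for col in range(num_columns):
        er = rng.uniform(er_low, er_high)   # per-column error rate within the bound
        eps = rng.uniform(0.0, 1.0)         # per-column uniform sample (assumed range [0, 1])
        if eps >= er:                       # the column survives unless eps < er
            delivered.append(col)
    return delivered

# e.g. a model split into 8 columns of width x0.125 each
print(transmit_columns(num_columns=8))
```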
Global Model Accuracy (%) Evaluated on CelebA Under Connection Loss.

Connection Error | User Capacity | Evaluated Model | FedAvg     | FedHetero  | FjORD      | FedSlim    | FedResCuE
er ≤ 0.2         | uniform       | w×1             | 50.36±2.17 | -          | 61.79±1.62 | 57.31±1.27 | 70.02±0.40
er ≤ 0.2         | uniform       | w×0.25          | 12.58±0.51 | -          | 60.20±1.67 | 55.33±0.89 | 67.40±0.84
er ≤ 0.2         | cluster       | w×1             | -          | 60.92±1.33 | 64.52±0.60 | 62.35±1.76 | 69.78±0.74
er ≤ 0.2         | cluster       | w×0.25          | -          | 59.70±0.64 | 64.11±0.41 | 61.77±1.62 | 68.83±1.00
er ≤ 0.3         | uniform       | w×1             | 41.79±4.33 | -          | 54.91±1.07 | 48.46±0.73 | 64.80±0.85
er ≤ 0.3         | uniform       | w×0.25          | 12.27±0.93 | -          | 55.20±1.31 | 48.42±0.87 | 64.10±0.26
er ≤ 0.3         | cluster       | w×1             | -          | 51.70±1.14 | 57.60±2.25 | 56.28±1.61 | 65.74±0.30
er ≤ 0.3         | cluster       | w×0.25          | -          | 51.90±0.33 | 57.69±2.22 | 56.34±1.56 | 65.18±0.57

Table C.6: Performance under faulty connections, given 100% of training data.

Constant Error Rates er for the DigitsFive Domains.

Domain | SVHN | Synthetic | USPS | MNIST | MNIST-M
er     | 0.05 | 0.05      | 0.3  | 0.3   | 0.05

DigitsFive performance, trained using 5% of data.

Domain    | SVHN        | Synthetic   | USPS        | MNIST       | MNIST-M
×1.0 Model
Local     | 45.88±0.92  | 62.04±1.16  | 89.95±1.93  | 85.84±3.35  | 61.99±1.79
FedAvg    | 69.54±0.15  | 80.17±0.24  | 94.38±0.55  | 93.61±0.38  | 75.22±0.87
FedSlim   | 66.77±0.61  | 78.95±0.82  | 94.00±0.41  | 93.80±0.53  | 73.95±1.02
FjORD     | 49.72±23.42 | 57.12±30.11 | 68.60±36.34 | 68.01±37.35 | 56.29±26.48
FedResCuE | 69.32±0.78  | 80.57±0.77  | 95.17±0.47  | 95.05±0.44  | 76.89±0.76
×0.25 Model
FedAvg    | 20.00±0.55  | 19.14±5.64  | 34.03±14.65 | 32.16±18.65 | 20.47±9.28
FedSlim   | 64.00±0.96  | 77.15±2.13  | 93.44±1.26  | 93.38±0.75  | 72.16±1.51
FjORD     | 48.11±22.35 | 54.58±31.15 | 68.51±36.28 | 65.79±39.13 | 52.75±29.16
FedResCuE | 67.11±0.77  | 79.06±0.41  | 94.55±0.37  | 94.64±0.28  | 75.64±0.37

Table C.7: FedResCuE is the most robust algorithm given heterogeneous data and domain-dependent connection errors.

C.0.4 Modeling of Connection Uncertainty

In our experiments, we assume that an error rate er is associated with each model column transmission between the server and the edge device. This setting is built upon the mechanism of downstream wireless connections. Specifically, we present a fine-grained formulation of the connection error rates in a wireless communication scenario.

Lossy Wireless Connections: Following the wireless model of [141], we can leverage LTE-A [78], a representative 4G network model, for the wireless links between the edge server (ES) and the edge devices in FL, considering the orthogonal frequency division multiple access (OFDMA) scheme. The parameter d_{es,k} denotes the distance between the edge server and the k-th edge device, and the path loss between them is characterized by d_{es,k}^{−σ}, where σ is the path loss exponent; N₀ denotes the white Gaussian noise power. The wireless channel is modeled as a frequency-flat block-fading Rayleigh channel with uplink channel fading coefficient h [176]. The uplink data rate of the k-th edge device is defined as:

R_k = B_k · log₂( 1 + (P_t · d_{es,k}^{−σ} · |h|²) / (N₀ + I) ).    (C.1)

In the equation above, B_k denotes the channel bandwidth, P_t represents the transmission power of the ES, and I is the inter-cell interference. As R_k fluctuates with changing wireless network conditions, we can define a minimum uplink rate R_min, such that any rate lower than R_min leads to timeouts and packet loss. As a result, the probability that a packet is lost over the edge network can be derived as follows:

Pr{R_k < R_min} = Pr{ B_k · log₂( 1 + (P_t · d_{es,k}^{−σ} · |h|²) / (N₀ + I) ) < R_min }
                = Pr{ d_{es,k}^{σ} > P_t · |h|² / ( (N₀ + I) · 2^{R_min / B_k} − N₀ − I ) }
                = Pr{ d_{es,k} > [ P_t · |h|² / ( (N₀ + I) · 2^{R_min / B_k} − N₀ − I ) ]^{1/σ} }.    (C.2)

Suppose that in a given LTE-A network, B_k, N₀, I, P_t, σ, and h are constants. If the distances between the edge devices and the ES, i.e. d_{es,k}, follow a Poisson distribution, then the probability that a packet is lost during transmission can be derived from Equation (C.2).
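For a concrete sense of Equations (C.1) and (C.2), the following numerical sketch computes the uplink rate and the critical distance beyond which packets start to be lost; all constants are illustrative assumptions rather than values used in the thesis.

```python
# Minimal numerical sketch of Equations (C.1)-(C.2); every constant below is an assumption.
import math

B_k   = 10e6     # channel bandwidth (Hz), assumed
P_t   = 1.0      # ES transmission power (W), assumed
N0    = 1e-9     # noise power (W), assumed
I     = 1e-9     # inter-cell interference (W), assumed
sigma = 3.0      # path-loss exponent, assumed
h2    = 1.0      # |h|^2, assumed channel gain
R_min = 5e6      # minimum uplink rate (bit/s), assumed

def uplink_rate(d: float) -> float:
    """Equation (C.1): R_k = B_k * log2(1 + P_t * d^{-sigma} * |h|^2 / (N0 + I))."""
    return B_k * math.log2(1.0 + P_t * d ** (-sigma) * h2 / (N0 + I))

# Equation (C.2): R_k < R_min  <=>  d > [ P_t|h|^2 / ((N0+I)*2^{R_min/B_k} - N0 - I) ]^{1/sigma}
d_critical = (P_t * h2 / ((N0 + I) * 2 ** (R_min / B_k) - N0 - I)) ** (1.0 / sigma)

for d in (0.5 * d_critical, 2.0 * d_critical):
    lost = uplink_rate(d) < R_min
    print(f"distance {d:12.1f}: R_k = {uplink_rate(d) / 1e6:6.2f} Mbit/s, packet lost: {lost}")
```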
Macro-Level Simulation of Connection Uncertainty: When conducting experiments for unstable network connection scenarios, we use er, i.e. the per-column connection loss rate, as a macro-level model to approximate Pr{R_k < R_min} in Equation (C.2). This modeling is grounded in the fact that an upstream column and a downstream packet are logically related to each other. Depending on the size limit of a transmission packet and the granularity of our model decomposition, a column could be further decomposed into one or multiple packets during wireless transmission.

C.0.4.1 Ablation Study

Impacts of Sampling Frequency: We explored the effects of different sampling frequencies S on model learning, where S is the number of submodels sampled per batch update (e.g. |P̂| = S in Algorithm 6.1). S = 0 would reduce all algorithms to regular FedAvg without self-distillation. An overly large S, on the other hand, may bring extra computation workload to edge devices. In our experiments, the sampling granularity is set to ×0.05 of the largest model width, and the smallest model width is set to ×0.1. Hence there are 19 eligible pruning ratios, i.e. |P| = 19.

As shown in Figure C.3, FedResCuE benefits from finer-grained submodel sampling and maintains robust performance across different choices of S. In contrast, the performance of FjORD is only slightly improved by increasing S from 2 to 4, and it drops notably when S becomes too large, especially given insufficient training data. FedSlim, which learns prunable submodels purely by knowledge distillation and batch-gradient updates, is the most sensitive to the choice of S: its performance degrades drastically with increasing S and becomes highly unstable under connection interruptions. We ascribe the robustness of FedResCuE over the others to its progressive learning scheme, rather than the batch-gradient update strategy that FedSlim and FjORD apply.

Figure C.3: Impacts of the submodel sampling frequency. Panels: (a) 20% training data, ×1.0 model; (b) 20% training data, ×0.5 model; (c) 0.1 ≤ er ≤ 0.2, ×1.0 model; (d) 0.1 ≤ er ≤ 0.2, ×0.5 model.

Effects of Knowledge Distillation: We compare FedResCuE against an ablated variant named FedSeq, which does not minimize the KL-divergence between the largest model and the sampled submodel. We provide the model learning process of FedSeq in Algorithm C.3. Table C.8 summarizes their asymptotic performance under different FL settings, which demonstrates that FedResCuE is more beneficial to smaller submodels. We present the corresponding evaluation curves in Figure C.4.

Algorithm C.3: Local Model Update for FedSeq
1: Inputs: training dataset D ⊂ X × Y; model with parameter set θ ∈ Θ; pruning ratios P; learning rate η; loss function L; constant S ≤ |P|.
2: repeat
3:   Sample batch x, y ∼ D.
4:   Sample ordered ratios P̂ = [p_i | p_i ∈ P, p_i < p_{i+1} ∀ i < S, p_S = 1]_{i=1}^{S}.
5:   for p_i ∈ P̂ do
6:     g_i ← ∇_{θ×p_i \ θ×p_{i−1}} L(f(x; θ×p_i), y).
7:     θ×p_i ← θ×p_i − η · g_i.
8:   θ ← θ − η · ∇_θ L(f(x; θ), y).
9: until training stop
10: Return θ

Comparing FedResCuE and FedSeq on the CelebA dataset.

Algorithm | w×0.25     | w×0.5      | w×0.75     | w×1.0
Given 20% training data
FedResCuE | 61.12±1.35 | 68.15±0.39 | 69.98±0.41 | 71.27±0.27
FedSeq    | 58.96±1.24 | 66.39±0.60 | 69.18±0.28 | 70.13±0.16
0.1 ≤ er ≤ 0.2
FedResCuE | 67.40±0.84 | 69.91±0.42 | 70.23±0.42 | 70.02±0.40
FedSeq    | 66.47±0.36 | 69.45±0.41 | 69.70±0.37 | 69.67±0.32

Table C.8: Performance with and without knowledge distillation.
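As a concrete reference for the sampling-frequency ablation above and the ordered-ratio sampling used by FedSeq (line 4 of Algorithm C.3, and of Algorithm C.4 below), the following sketch builds the pool of 19 eligible pruning ratios and draws S ordered ratios ending at ×1.0; the helper names are illustrative assumptions, not the thesis code.

```python
# Minimal sketch (assumed helpers): with x0.05 granularity and a smallest width of x0.1,
# the pool holds 19 pruning ratios, and each batch update draws S ordered ratios whose
# largest element is always 1.0.
import random

pool = [round(0.05 * k, 2) for k in range(2, 21)]   # 0.10, 0.15, ..., 1.00
assert len(pool) == 19 and pool[-1] == 1.0          # |P| = 19 eligible pruning ratios

def sample_ordered_ratios(S: int, rng: random.Random) -> list[float]:
    # draw S-1 distinct ratios below 1.0, sort them, and append p_S = 1.0
    chosen = sorted(rng.sample(pool[:-1], S - 1))
    return chosen + [1.0]

rng = random.Random(0)
print(sample_ordered_ratios(S=4, rng=rng))          # S strictly increasing ratios, last one 1.0
```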
Effects of Progressive Learning: One contributing factor to FedResCuE's superior performance is its progressive learning strategy, which adaptively learns gradients while fixing the parameters of previously learned submodels. To validate the efficacy of progressive learning, we compared FedResCuE against a variant called FedRush. The detailed model learning process of FedRush is provided in Algorithm C.4. Learning curves of the two algorithms are illustrated in Figure C.5, which demonstrate that the progressive parameter update adopted by FedResCuE is necessary to derive reliable models, especially given insufficient training data or connection interruptions. FedRush, on the other hand, overwrites parameters during the model update and thereby undermines the knowledge learned by smaller submodels, which could otherwise serve as a good initialization for the subsequent submodel to build up representative domain knowledge.

Figure C.4: Compared with FedSeq, the knowledge-distillation strategy in FedResCuE benefits smaller submodels. Panels: (a) ×0.25 model; (b) ×0.5 model; (c) ×0.75 model; (d) ×1.0 model.

Algorithm C.4: Local Model Update for FedRush
1: Inputs: training dataset D ⊂ X × Y; model with parameter set θ; pruning ratios P; learning rate η; loss function L; constant S ≤ |P|.
2: repeat
3:   Sample batch x, y ∼ D.
4:   Sample ordered ratios P̂ = [p_i | p_i ∈ P, p_i < p_{i+1} ∀ i < S, p_S = 1]_{i=1}^{S}.
5:   θ̄ ← stop_gradient(θ); θ×p_0 ← ∅.
6:   for p_i ∈ P̂ do
7:     g_i ← ∇_{θ×p_i} ( L(f(x; θ×p_i), y) + D_KL[f(x; θ̄) ‖ f(x; θ×p_i)] ).
8:     θ×p_i ← θ×p_i − η · g_i.   ▷ (Overwrites the previously learned θ×p_{i−1}.)
9:   θ ← θ − η · ∇_θ L(f(x; θ), y).
10: until training stop

Figure C.5: Compared to FedRush, FedResCuE maintains a high performance for the ×1.0 model. Panels: (a) ×0.25 model; (b) ×0.5 model; (c) ×0.75 model; (d) ×1.0 model.

C.0.5 Overview of Model Learning Performance

Performance with Stable Network Connections:

Figure C.6: 100% training data, CelebA, uniform architecture, er = 0 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).
Figure C.7: 100% training data, CelebA, cluster architecture, er = 0 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).

Performance Given Insufficient Training Data:

Figure C.8: 20% training data, CelebA, uniform architecture, er = 0 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).
Figure C.9: 20% training data, CelebA, cluster architecture, er = 0 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).

Performance Under Connection Loss:

Figure C.10: 100% training data, CelebA, uniform architecture, 0.1 ≤ er ≤ 0.2 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).
Figure C.11: 100% training data, CelebA, cluster architecture, 0.1 ≤ er ≤ 0.2 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).

C.0.6 Experiment Configurations

C.0.6.1 Model Architecture

For the CelebA domain, we build a ResNet [65] with 4 residual blocks, where each residual block contains 1) a convolution module consisting of 3 Conv2d layers, each followed by a BatchNorm2d layer and a ReLU activation layer, and 2) a shortcut module in parallel with the convolution module. The ResNet model contains 8,036,426 trainable parameters in total, with a model size of 33.09 MB. For the DigitsFive domains, we build a neural network consisting of 3 Conv2d layers, each followed by a BatchNorm layer, and 3 Linear layers. It contains 14,214,090 trainable parameters in total, with a model size of 55.30 MB.
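A plausible PyTorch reconstruction of the DigitsFive network is sketched below; the kernel sizes, strides, padding, and activation placement are assumptions inferred from the layer shapes and parameter counts in Table C.10 that follows, and the total parameter count matches the reported 14,214,090.

```python
# Sketch of the DigitsFive model; hyper-parameters marked here are inferred assumptions.
import torch
import torch.nn as nn

class DigitsFiveNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2),              # -> 64 x 28 x 28
            nn.BatchNorm2d(64, track_running_stats=False), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2),   # -> 64 x 14 x 14
            nn.BatchNorm2d(64, track_running_stats=False), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),  # -> 128 x 7 x 7
            nn.BatchNorm2d(128, track_running_stats=False), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 2048), nn.ReLU(),
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

net = DigitsFiveNet()
print(sum(p.numel() for p in net.parameters()))   # 14214090 trainable parameters
print(net(torch.randn(2, 3, 28, 28)).shape)       # torch.Size([2, 10])
```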
Network Architecture for Learning CelebA.

Layer        | Output Shape  | # of Parameters
Conv2d-1     | 64 × 14 × 14  | 9,408
MaxPool2d-4  | 64 × 7 × 7    | 0
Conv2d-5     | 64 × 7 × 7    | 4,096
Conv2d-8     | 64 × 7 × 7    | 36,864
Conv2d-11    | 256 × 7 × 7   | 16,384
Conv2d-13    | 256 × 7 × 7   | 16,384
Conv2d-17    | 128 × 7 × 7   | 32,768
Conv2d-20    | 128 × 4 × 4   | 147,456
Conv2d-23    | 512 × 4 × 4   | 65,536
Conv2d-25    | 512 × 4 × 4   | 131,072
Conv2d-29    | 256 × 4 × 4   | 131,072
Conv2d-32    | 256 × 2 × 2   | 589,824
Conv2d-35    | 1024 × 2 × 2  | 262,144
Conv2d-37    | 1024 × 2 × 2  | 524,288
Conv2d-41    | 512 × 2 × 2   | 524,288
Conv2d-44    | 512 × 1 × 1   | 2,359,296
Conv2d-47    | 2048 × 1 × 1  | 1,048,576
Conv2d-49    | 2048 × 1 × 1  | 2,097,152
AvgPool2d-53 | 2048 × 1 × 1  | 0
Linear-54    | 10            | 20,490

Table C.9: ResNet architecture for learning CelebA, omitting BatchNorm and ReLU layers.

Network Architecture for Learning DigitsFive.

Layer         | Output Shape | # of Parameters
Conv2d-1      | 64 × 28 × 28 | 4,864
BatchNorm2d-2 | 64 × 28 × 28 | 128
Conv2d-3      | 64 × 14 × 14 | 102,464
BatchNorm2d-4 | 64 × 14 × 14 | 128
Conv2d-5      | 128 × 7 × 7  | 204,928
BatchNorm2d-6 | 128 × 7 × 7  | 256
Linear-7      | 2048         | 12,847,104
Linear-8      | 512          | 1,049,088
Linear-9      | 10           | 5,130

Table C.10: Model architecture for learning DigitsFive.

Figure C.12: 100% training data, CelebA, uniform architecture, 0.2 ≤ er ≤ 0.3 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).
Figure C.13: 100% training data, CelebA, cluster architecture, 0.2 ≤ er ≤ 0.3 (panels (a)-(d): ×0.25, ×0.5, ×0.75, and ×1.0 models).

C.0.6.2 Optimizer Implementation for Progressive Learning

In practice, we customize the default SGD optimizer implemented in PyTorch to realize progressive gradient updates. A trainable neural layer consists of parameter tensors, each of which can be treated as a weight matrix. When calculating gradients for a neural layer, we specify the columns to be updated, which correspond to a set of indices of elements in the weight matrix. When applying the gradients, we mask out the gradients of parameters that are not included in the specified columns.

C.0.6.3 Hyper-parameter Configurations

We summarize in Table C.11 the hyper-parameters used in our experiments.

Hyper-parameter Configurations.

Domain     | Hyper-parameter             | Value
Shared     | Optimizer                   | SGD
Shared     | Learning rate               | 0.1
Shared     | Momentum                    | 0.9
Shared     | Nesterov                    | True
Shared     | Weight decay                | 10^-4
Shared     | Track training in BatchNorm | False
Shared     | Share BatchNorm             | True
Shared     | Data category               | 10
Shared     | # of active users           | 5
Shared     | Random seeds for training   | 3, 5, 7
Shared     | Batch size                  | 32
CelebA     | Training epochs             | 300
CelebA     | # of total users            | 20
CelebA     | Used training data          | 100%, 20%
CelebA     | Column granularity for P    | ×0.05
DigitsFive | Training epochs             | 100
DigitsFive | # of total users            | 10
DigitsFive | # of users per domain       | 2
DigitsFive | Used training data          | 5%
DigitsFive | Column granularity for P    | ×0.125

Table C.11: Configurations of hyper-parameters.

BIBLIOGRAPHY

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.

[2] Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, and H Brendan Mcmahan. cpsgd: Communication-efficient and differentially-private distributed sgd. Advances in Neural Information Processing Systems, 2018.

[3] Divyansh Aggarwal, Jiayu Zhou, and Anil K Jain. Fedface: Collaborative learning of face recognition model. Proceedings of International Joint Conference on Biometrics, 2021.

[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in Neural Information Processing Systems, 2017.
[5] Mohammad Mohammadi Amiri, Deniz Gunduz, Sanjeev R Kulkarni, and H Vin- cent Poor. Federated learning with quantized global model updates. arXiv preprint arXiv:2006.10672, 2020. [6] Muhammad Ammad-Ud-Din, Elena Ivannikova, Suleiman A Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. Federated collaborative fil- tering for privacy-preserving personalized recommendation system. arXiv preprint arXiv:1901.09888, 2019. [7] Haitham Bou Ammar and Matthew E. Taylor. Reinforcement learning transfer via common subspaces. In Proceedings of the 11th International Conference on Adaptive and Learning Agents, ALA’11, pages 21–36, Berlin, Heidelberg, 2012. Springer-Verlag. [8] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019. [9] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative ad- versarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017. [10] Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018. [11] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in neural information processing systems, 27:2654–2662, 2014. 137 [12] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017. [13] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems, pages 1471–1479, 2016. [14] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013. [15] Richard Bellman. A markovian decision process. Journal of mathematics and mechan- ics, pages 679–684, 1957. [16] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010. [17] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144, 2007. [18] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008. [19] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with struc- tural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128, 2006. [20] Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018. [21] Daniel S Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast bayesian reward inference from preferences. ICML, 2020. [22] Daniel S Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. 
Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observa- tions. arXiv preprint arXiv:1904.06387, 2019. [23] Daniel S Brown, Wonjoon Goo, and Scott Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning, pages 330–359, 2020. [24] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020. 138 [25] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. [26] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge dis- covery and data mining, pages 535–541, 2006. [27] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. ICLR, 2019. [28] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for- all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2020. [29] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečnỳ, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018. [30] Hong-You Chen and Wei-Lun Chao. Fedbe: Making bayesian model ensemble appli- cable to federated learning. ICLR, 2021. [31] Mingzhe Chen, Zhaohui Yang, Walid Saad, Changchuan Yin, H Vincent Poor, and Shuguang Cui. Performance optimization of federated learning over wireless networks. In 2019 IEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019. [32] Mingzhe Chen, Zhaohui Yang, Walid Saad, Changchuan Yin, H Vincent Poor, and Shuguang Cui. A joint learning and communications framework for federated learning over wireless networks. IEEE Transactions on Wireless Communications, 20(1):269– 283, 2020. [33] Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with non-iid data. In 2020 IEEE International Conference on Big Data (Big Data), pages 15–24. IEEE, 2020. [34] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017. [35] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017. [36] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019. [37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 139 [38] Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Mul- tiagent Systems, pages 433–440. IFAAMAS, 2012. [39] Enmao Diao, Jie Ding, and Vahid Tarokh. 
Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv preprint arXiv:2010.01264, 2020. [40] Canh T Dinh, Nguyen H Tran, and Tuan Dung Nguyen. Personalized federated learn- ing with moreau envelopes. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020. [41] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975. [42] Zhaoyang Du, Celimuge Wu, Tsutomu Yoshinaga, Kok-Lim Alvin Yau, Yusheng Ji, and Jie Li. Federated learning for vehicular internet of things: Recent advances and open issues. IEEE Open Journal of the Computer Society, 1:45–61, 2020. [43] John C Duchi, Michael I Jordan, and Martin J Wainwright. Privacy aware learning. Journal of the ACM (JACM), 61(6):1–57, 2014. [44] Ashley D Edwards, Himanshu Sahni, Yannick Schroecker, and Charles L Isbell. Imi- tating latent policies from observation. ICML, 2019. [45] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learn- ing with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33, 2020. [46] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017. [47] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017. [48] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018. [49] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. ICLR, 2017. [50] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018. [51] Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012. 140 [52] Tanmay Gangwani and Jian Peng. State-only imitation with transition dynamics mismatch. ICLR, 2020. [53] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backprop- agation. In International conference on machine learning, pages 1180–1189. PMLR, 2015. [54] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Rein- forcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018. [55] Seyed Kamyar Seyed Ghasemipour, Shane Gu, and Richard Zemel. Understanding the relation between maximum-entropy inverse reinforcement learning and behaviour cloning. ICLR 2019 Workshop, 2019. [56] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. [57] Xinran Gu, Kaixuan Huang, Jingzhao Zhang, and Longbo Huang. Fast federated learning in the presence of arbitrary device unavailability. 35th Conference on Neural Information Processing Systems (NeurIPS), 2021. [58] Neel Guha, Ameet Talwalkar, and Virginia Smith. One-shot federated learning. 
arXiv preprint arXiv:1902.11175, 2019. [59] Yijie Guo, Junhyuk Oh, Satinder Singh, and Honglak Lee. Generative adversarial self-imitation learning. arXiv preprint arXiv:1812.00950, 2018. [60] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Learn- ing invariant feature spaces to transfer skills with reinforcement learning. International Conference on Learning Representations (ICLR), 2017. [61] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. [62] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016. [63] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beau- fays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018. [64] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge trans- fer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33, 2020. 141 [65] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [66] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [67] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. [68] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Ad- vances in neural information processing systems, pages 4565–4573, 2016. [69] Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos I Ve- nieris, and Nicholas D Lane. Fjord: Fair and accurate federated learning under het- erogeneous targets with ordered dropout. 35th Conference on Neural Information Processing Systems (NeurIPS)., 2021. [70] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016. [71] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019. [72] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Trans- actions on pattern analysis and machine intelligence, 16(5):550–554, 1994. [73] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, and Raman Arora. Communication-efficient distributed sgd with sketching. Advances in Neural Information Processing Systems 32 (NeurIPS), 2019. [74] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991. [75] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong- Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018. [76] Jing Jiang and ChengXiang Zhai. 
Instance weighting for domain adaptation in nlp. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 264–271, 2007. [77] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019. 142 [78] Sravanthi Kanchi, Shubhrika Sandilya, Deesha Bhosale, Adwait Pitkar, and Mayur Gondhalekar. Overview of lte-a technology. In 2013 IEEE global high tech congress on electronics, pages 195–200. IEEE, 2013. [79] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning, pages 2474–2483, 2018. [80] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132– 5143. PMLR, 2020. [81] Latif U Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. Feder- ated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Communications Surveys & Tutorials, 2021. [82] Rinat Khoussainov, Andreas Heß, and Nicholas Kushmerick. Ensembles of biased classifiers. In Proceedings of the 22nd international conference on Machine learning, pages 425–432, 2005. [83] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014. [84] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. arXiv preprint arXiv:2002.00444, 2020. [85] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013. [86] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. [87] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. ICLR, 2019. [88] Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. ICLR, 2020. [89] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [90] Andras Kupcsik, David Hsu, and Wee Sun Lee. Learning dynamic robot-to-human object handover from human feedback. In Robotics research, pages 161–176. Springer, 2018. [91] Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012. 143 [92] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. [93] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learn- ing applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [94] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. [95] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot net- work pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018. [96] Daliang Li and Junpu Wang. 
Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019. [97] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learn- ing: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020. [98] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020. [99] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. ICLR, 2020. [100] Xiaoxiao et al. Li. Fedbn: Federated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623, 2021. [101] Xuejun Liao, Ya Xue, and Lawrence Carin. Logistic regression with an auxiliary data source. In Proceedings of the 22nd international conference on Machine learning, pages 505–512, 2005. [102] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep rein- forcement learning. ICLR, 2016. [103] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. arXiv preprint arXiv:2006.07242, 2020. [104] Bing Liu. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167, 2012. [105] Fangchen Liu, Zhan Ling, Tongzhou Mu, and Hao Su. State alignment-based imitation learning. ICLR, 2019. [106] Yang Liu, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. Fedvision: An online visual object detection platform powered by federated learning. In AAAI, 2020. 144 [107] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018. [108] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. ICLR, 2019. [109] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. [110] Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distil- lation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017. [111] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural net- works through l_0 regularization. ICLR, 2018. [112] Xiaojian Ma, Mingxuan Jing, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, and Huaping Liu. Reinforcement learning from imperfect demonstrations under soft expert guidance. [113] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three ap- proaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020. [114] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1429–1437, 2019. [115] Roger McFarlane. A survey of exploration strategies in reinforcement learning. McGill University, http://www. cs. mcgill. ca/cs526/roger. pdf, accessed: April, 2018. 
[116] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017. [117] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversarial belief matching. In Advances in Neural Information Processing Systems, pages 9551–9561, 2019. [118] Jeff Michels, Ashutosh Saxena, and Andrew Y Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 593–600, 2005. [119] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Os- trovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. 145 [120] Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. Fixbi: Bridging domain spaces for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1094–1103, 2021. [121] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2315–2325, 2019. [122] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019. [123] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292– 6299. IEEE, 2018. [124] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. [125] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999. [126] Khoa Nguyen, Steve Drew, Changcheng Huang, and Jiayu Zhou. Edgepv: collabora- tive edge computing framework for task offloading. In ICC 2021-IEEE International Conference on Communications, pages 1–6. IEEE, 2021. [127] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010. [128] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with hetero- geneous resources in mobile edge. In ICC 2019-2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2019. [129] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016. [130] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018. [131] Malayandi Palan, Nicholas C Landolfi, Gleb Shevchuk, and Dorsa Sadigh. Learning reward functions by integrating human demonstrations and preferences. arXiv preprint arXiv:1906.08928, 2019. [132] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009. 
146 [133] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017. [134] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Mo- ment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019. [135] Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated adversarial domain adaptation. arXiv preprint arXiv:1911.02054, 2019. [136] Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. ICLR, 2019. [137] Huy X Pham, Hung M La, David Feil-Seifer, and Luan V Nguyen. Autonomous uav navigation using reinforcement learning. arXiv preprint arXiv:1801.05086, 2018. [138] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 3(1):88–97, 1991. [139] Zhaonan Qu, Kaixiang Lin, Jayant Kalagnanam, Zhaojian Li, Jiayu Zhou, and Zhengyuan Zhou. Federated learning’s blessing: Fedavg has linear speedup. arXiv preprint arXiv:2007.05690, 2020. [140] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mo- hammad Rastegari. What’s hidden in a randomly weighted neural network? In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020. [141] Salman Raza, Wei Liu, Manzoor Ahmed, Muhammad Rizwan Anwar, Muham- mad Ayzed Mirza, Qibo Sun, and Shangguang Wang. An efficient task offloading scheme in vehicular edge computing. Journal of Cloud Computing, 9:1–14, 2020. [142] Siddharth Reddy, Anca D Dragan, and Sergey Levine. Sqil: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108, 2019. [143] Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. Straggler-resilient federated learning: Leveraging the interplay be- tween statistical accuracy and system heterogeneity. arXiv preprint arXiv:2012.14453, 2020. [144] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albar- qouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3(1):1–7, 2020. [145] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011. 147 [146] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015. [147] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017. [148] Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. ICLR, 2019. [149] Fumihiro Sasaki, Tetsuya Yohira, and Atsuo Kawaguchi. Sample efficient imitation learning for continuous control. ICLR, 2019. [150] Felix et al. Sattler. 
Fedaux: Leveraging unlabeled auxiliary data in federated learning. arXiv preprint arXiv:2102.02514, 2021. [151] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999. [152] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. ICLR, 2018. [153] Hyowoon Seo, Jihong Park, Seungeun Oh, Mehdi Bennis, and Seong-Lyun Kim. Fed- erated knowledge distillation. arXiv preprint arXiv:2011.02367, 2020. [154] Micah J Sheller, Brandon Edwards, G Anthony Reina, Jason Martin, Sarthak Pati, Aikaterini Kotrotsou, Mikhail Milchenko, Weilin Xu, Daniel Marcus, Rivka R Colen, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Scientific reports, 10(1):1–12, 2020. [155] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. 2014. [156] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mas- tering the game of go without human knowledge. Nature, 550(7676):354–359, 2017. [157] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. ICLR, 2017. [158] Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. Partialfed: Cross-domain per- sonalized federated learning via partial initialization. Advances in Neural Information Processing Systems, 34, 2021. [159] Lichao Sun and Lingjuan Lyu. Federated model distillation with noise-free differential privacy. arXiv preprint arXiv:2009.05537, 2020. [160] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. ICLR, 2018. 148 [161] Wen Sun, Anirudh Vemula, Byron Boots, and J Andrew Bagnell. Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948, 2019. [162] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bag- nell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3309–3318. JMLR. org, 2017. [163] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine learning proceedings 1990, pages 216–224. Elsevier, 1990. [164] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. pages 45–47. MIT press, 2018. [165] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pages 1449–1456, 2008. [166] Akihito Taya, Takayuki Nishio, Masahiro Morikura, and Koji Yamamoto. Decentral- ized and model-free federated learning: Consensus-based distillation in function space. arXiv preprint arXiv:2104.00352, 2021. [167] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning do- mains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009. [168] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017. [169] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. 
Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), 2018. [170] Faraz Torabi, Garrett Warnell, and Peter Stone. Adversarial imitation learning from state-only demonstrations. In Proceedings of the 18th International Conference on Au- tonomous Agents and MultiAgent Systems, pages 2229–2231. International Foundation for Autonomous Agents and Multiagent Systems, 2019. [171] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. ICML Workshop on Imitation, Intent, and Interaction (I3), 2019. [172] Faraz Torabi, Garrett Warnell, and Peter Stone. Recent advances in imitation learning from observation. IJICAI, 2019. [173] Ivor W Tsang, James T Kwok, and Pak-Ming Cheung. Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research, 6(Apr):363– 392, 2005. 149 [174] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discrimina- tive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017. [175] Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Lever- aging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017. [176] Hansong Wang, Xi Li, Hong Ji, and Heli Zhang. Federated offloading scheme to min- imize latency in mec-enabled vehicular networks. In 2018 IEEE Globecom Workshops (GC Wkshps), pages 1–6. IEEE, 2018. [177] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. arXiv preprint arXiv:2007.05084, 2020. [178] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. ICLR, 2020. [179] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocom- puting, 312:135–153, 2018. [180] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distil- lation. arXiv preprint arXiv:1811.10959, 2018. [181] Paul Weng. Markov decision processes with ordinal rewards: Reference point-based preferences. In Twenty-First International Conference on Automated Planning and Scheduling, 2011. [182] Eric Wiewiora, Garrison W Cottrell, and Charles Elkan. Principled methods for advis- ing reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 792–799, 2003. [183] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017. [184] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015. [185] Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, and Masashi Sugiyama. Imitation learning from imperfect demonstration. ICML, 2019. [186] Huang Xiao, Michael Herman, Joerg Wagner, Sebastian Ziesche, Jalal Etesami, and Thai Hong Linh. Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113, 2019. 150 [187] Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. arXiv preprint arXiv:2109.04269, 2021. 
[188] Yikai Yan, Chaoyue Niu, Yucheng Ding, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shao- jie Tang, and Zhihua Wu. Distributed non-convex optimization with sublinear speedup under intermittent client availability. arXiv preprint arXiv:2002.07399, 2020. [189] Chao Yang, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu, Junzhou Huang, and Chuang Gan. Imitation learning from observations by minimizing inverse dynamics disagreement. In Advances in Neural Information Processing Systems, pages 239–249, 2019. [190] Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm- less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018. [191] Haiyan Yin and Sinno Jialin Pan. Knowledge transfer for deep reinforcement learning with hierarchical experience replay. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. [192] Xuwang Yin, Soheil Kolouri, and Gustavo K Rohde. Gat: Generative adversarial training for adversarial example detection and robust classification. In International Conference on Learning Representations, 2019. [193] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pages 2705–2714, 2019. [194] Tehrim et al. Yoon. Fedmix: Approximation of mixup under mean augmented feder- ated learning. ICLR, 2021. [195] Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1803–1811, 2019. [196] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018. [197] Xiao-Tong Yuan and Ping Li. On convergence of distributed approximate newton meth- ods: Globalization, sharper bounds and beyond. arXiv preprint arXiv:1908.02246, 2019. [198] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261. PMLR, 2019. [199] Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 151 [200] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713–3722, 2019. [201] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. Gendice: Generalized offline estimation of stationary values. International Conference on Learning Representations (ICLR), 2020. [202] Xiaoqin Zhang, Yunfei Li, Huimin Ma, and Xiong Luo. Pretrain soft q-learning with imperfect demonstrations. arXiv preprint arXiv:1905.03501, 2019. [203] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in neural information processing systems, 31:8559–8570, 2018. [204] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to- image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017. [205] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 
Data-free knowledge distillation for heterogeneous federated learning. International Conference on Machine Learning, 2021. [206] Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888, 2020. [207] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020. [208] Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. 2010. [209] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. 152