EFFICIENT AND SECURE MESSAGE PASSING FOR MACHINE LEARNING

By

Xiaorui Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

EFFICIENT AND SECURE MESSAGE PASSING FOR MACHINE LEARNING

By

Xiaorui Liu

Machine learning (ML) techniques have brought a revolutionary impact to human society, and they will continue to act as technological innovators in the future. To broaden their impact, it is urgent to solve the emerging and critical challenges in machine learning, such as efficiency and security issues.

On the one hand, ML models have become increasingly powerful thanks to big data and big models, but this also brings tremendous challenges in designing efficient optimization algorithms to train big ML models on big data. The most effective approach for large-scale ML is to parallelize the computation tasks on distributed systems composed of many computing devices. However, in practice, the scalability and efficiency of these systems are greatly limited by information synchronization since the message passing between devices dominates the total running time. In other words, the major bottleneck lies in the high communication cost between devices, especially when the scale of the system and the models grows while the communication bandwidth remains relatively limited. This communication bottleneck often limits the practical speedup of distributed ML systems.

On the other hand, recent research has revealed that many ML models suffer from security vulnerabilities. In particular, deep learning models can be easily deceived by unnoticeable perturbations in the data. Meanwhile, graphs are a prevalent data structure for many kinds of real-world data; they encode pairwise relations between entities in, for example, social networks, transportation networks, and chemical molecules. Graph neural networks (GNNs) generalize and extend the representation learning power of traditional deep neural networks (DNNs) from regular grids, such as images, videos, and text, to irregular graph-structured data through message passing frameworks. Therefore, many important applications on these data can be treated as computational tasks on graphs, such as recommender systems, social network analysis, and traffic prediction. Unfortunately, the vulnerability of deep learning models also translates to GNNs, which raises significant concerns about their applications, especially in safety-critical areas. Therefore, it is critical to design intrinsically secure ML models for graph-structured data.

The primary objective of this dissertation is to develop solutions to these challenges through innovative research and principled methods. In particular, we propose multiple distributed optimization algorithms with efficient message passing to mitigate the communication bottleneck and speed up ML model training in distributed ML systems. We also propose multiple secure message passing schemes as the building blocks of graph neural networks, aiming to significantly improve the security and robustness of ML models.

To my parents and family for their love and support.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help and support of my advisor, Dr. Jiliang Tang. I would like to express my deepest appreciation to him for his invaluable advice, inspiration, encouragement, and support during my Ph.D. research.
I have learned many skills from him that benefit my life: how to identify important and challenging research problems, how to write papers and give presentations, how to collaborate, how to mentor students, how to write proposals, how to volunteer and serve the community, and how to establish my career and build a big vision. Dr. Tang is not only an advisor in research but also a sincere friend who guided me with his experience in many aspects of my life. Dr. Tang, I cannot thank you enough.

I would like to thank my committee members, Dr. Ming Yan, Dr. Anil K. Jain, and Dr. Charu Aggarwal, for their helpful suggestions and insightful comments. I took Dr. Ming Yan's Optimization course in my first year at MSU, which gave me a solid technical background and benefited my Ph.D. research a lot. I always consider Dr. Ming Yan my secondary advisor because many of my research ideas were initialized through his insightful discussions, and a large portion of this dissertation was done in collaboration with him. Dr. Anil K. Jain provided numerous suggestions for my research and career, and his pursuit of broader impact greatly reshaped my research vision. I was fortunate to collaborate with Dr. Charu Aggarwal on multiple research papers, and I am impressed by his knowledge and insights. In addition, his dedication to book writing and online education is very inspiring to me.

I was fortunate to work as an intern at JD.com, Kwai AI Lab, and TAL AI Lab with amazing colleagues and mentors: Dr. Dawei Yin, Dr. Hongshen Chen, Dr. Ziheng Jiang, and Dr. Zhaochun Ren from JD.com; Dr. Ji Liu and Dr. Xiangru Lian from Kwai AI Lab; and Dr. Zitao Liu from TAL AI Lab. I enjoyed the productive and wonderful summers with you.

During my Ph.D. study, many friends and colleagues provided me with consistent support, encouragement, and happiness. I am thankful to my friends and colleagues from the Data Science and Engineering Lab: Dr. Tyler Derr, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Xiangyu Zhao, Dr. Hamid Karimi, Dr. Yao Ma, Chenxing Wang, Daniel K.O-Dankwa, Hansheng Zhao, Haochen Liu, Han Xu, Jamell Dacon, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Juanhui Li, Harry Shomer, Jie Ren, Jiayuan Ding, Haoyu Han, Hongzhi Wen, Yuxuan Wan, Pengfei He, Hua Liu, Dr. Xin Wang, Dr. Jiangtao Huang, Dr. Xiaoyang Wang, Dr. Meznah Almutairy, Namratha Shah, and Norah Alfadhli. In particular, thanks to my DSE collaborators: Dr. Tyler Derr, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Hamid Karimi, Dr. Xiangyu Zhao, Dr. Yao Ma, Haochen Liu, Han Xu, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Jiayuan Ding, Haoyu Han, Hua Liu, and Yuxuan Wan.

I would like to extend my sincere thanks to all my collaborators from outside the Data Science and Engineering Lab: Dr. Ming Yan, Dr. Yao Li, Dr. Rongrong Wang, Dr. Charu Aggarwal, Dr. Anil K. Jain, Dr. Dawei Yin, Dr. Hongshen Chen, Dr. Qing Li, Dr. Suhang Wang, Dr. Xianfeng Tang, Dr. Hui Liu, Dr. Zitao Liu, Dr. Kuan Yuan, and Dr. Neil Shah.

Finally, I would like to thank my family for their love and support. I also dedicate this dissertation to Xinyi Lu for supporting me all the way!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1  INTRODUCTION
  1.1 Research Challenges
  1.2 Contributions
  1.3 Organization
CHAPTER 2  A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING
  2.1 Introduction
  2.2 Related Work
  2.3 Algorithm
    2.3.1 Proposed Algorithm
    2.3.2 Discussion
  2.4 Convergence Analysis
    2.4.1 Strongly Convex Case
    2.4.2 Nonconvex Case
  2.5 Experiment
    2.5.1 Strongly Convex
    2.5.2 Nonconvex
    2.5.3 Communication Efficiency
  2.6 Conclusion
CHAPTER 3  LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION
  3.1 Introduction
  3.2 Related Work
  3.3 Algorithm
  3.4 Theoretical Analysis
  3.5 Numerical Experiment
  3.6 Conclusion
CHAPTER 4  GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL
  4.1 Introduction
  4.2 Preliminary
    4.2.1 Preliminary Study
    4.2.2 Understandings
  4.3 Algorithm
    4.3.1 Design Motivation
    4.3.2 Adaptive Message Passing
    4.3.3 Interpretation of AMP
    4.3.4 Model Architecture
  4.4 Experiment
    4.4.1 Experimental Settings
    4.4.2 Performance Comparison with Noisy Features
    4.4.3 Performance Comparison with Adversarial Features
    4.4.4 Adaptive Residual for Abnormal & Normal Nodes
    4.4.5 Performance in the Clean Setting
  4.5 Related Work
  4.6 Conclusion
CHAPTER 5  ELASTIC GRAPH NEURAL NETWORKS
  5.1 Introduction
  5.2 Preliminary
    5.2.1 GNNs as Graph Signal Denoising
    5.2.2 Graph Trend Filtering
  5.3 Algorithm
    5.3.1 Elastic Graph Signal Estimator
    5.3.2 Elastic Message Passing
    5.3.3 Elastic GNNs
  5.4 Experiment
    5.4.1 Experimental Settings
    5.4.2 Performance on Benchmark Datasets
    5.4.3 Robustness Under Adversarial Attack
    5.4.4 Ablation Study
  5.5 Related Work
  5.6 Conclusion
CHAPTER 6  CONCLUSION
  6.1 Summary
  6.2 Future Direction
APPENDICES
  APPENDIX A  A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING
  APPENDIX B  LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION
  APPENDIX C  GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: A comparison between related algorithms. DORE is able to converge linearly to the O(𝜎) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE doesn't require the assumptions of bounded compression error or bounded gradient.
Table 4.1: Data statistics on benchmark datasets.
Table 4.2: Dataset statistics for adversarially attacked datasets.
Table 4.3: Average adaptive score (𝛽) and residual weight (1 − 𝛽) in the noisy feature scenario.
Table 4.4: Average adaptive score (𝛽) and residual weight (1 − 𝛽) in the adversarial feature scenario.
Table 4.5: Comparison between AirGNN, APPNP, and Robust GCN in the clean setting.
Table 5.1: Statistics of benchmark datasets.
Table 5.2: Dataset statistics for adversarially attacked graphs.
Table 5.3: Classification accuracy (%) on benchmark datasets with 10 times random data splits.
Table 5.4: Classification accuracy (%) under different perturbation rates of adversarial graph attack.
Table 5.5: Ratio between average node differences along wrong and correct edges.
Table 5.6: Sparsity ratio (i.e., ∥(Δ̃F)_𝑖∥_2 < 0.1) in node differences Δ̃F.
Table B.1: Parameter settings for the linear regression problem.
Table B.2: Parameter settings for the logistic regression problem (full-batch gradient).
Table B.3: Parameter settings for the logistic regression problem (mini-batch gradient).
Table B.4: Parameter settings for the deep neural network (* means divergence for all options we try).
Table C.1: Comparison between APPNP and AirGNN on abnormal (noisy) nodes (Cora).
Table C.2: Comparison between APPNP and AirGNN on normal nodes (Cora).
Table C.3: Comparison between APPNP and AirGNN on all nodes (Cora).

LIST OF FIGURES

Figure 1.1: Distributed ML systems.
Figure 1.2: Graph representation learning for node-focus tasks.
Figure 2.1: An illustration of DORE.
Figure 2.2: Linear regression on synthetic data. When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.
Figure 2.3: The norm of the variable being compressed in the linear regression experiment.
Figure 2.4: LeNet trained on MNIST. DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method and has similar performance to DoubleSqueeze (topk).
Figure 2.5: Resnet18 trained on CIFAR10. DORE achieves similar convergence and accuracy to most baselines. DoubleSqueeze converges more slowly and suffers a higher loss, but it works well with topk compression.
Figure 2.6: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE. It is tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.
Figure 3.1: Linear regression problem.
Figure 3.2: Logistic regression problem in the heterogeneous case (full-batch gradient).
Figure 3.3: Logistic regression in the heterogeneous case (mini-batch gradient).
Figure 3.4: Stochastic optimization on deep neural network (∗ means divergence).
Figure 3.5: Parameter analysis on linear regression problem.
Figure 4.1: Node classification accuracy on abnormal nodes (Cora).
Figure 4.2: Node classification accuracy on normal nodes (Cora).
Figure 4.3: Diagram of Adaptive Message Passing.
Figure 4.4: Adaptive Message Passing (AMP).
Figure 4.5: Node classification accuracy on abnormal (noisy) nodes.
Figure 4.6: Node classification accuracy on normal nodes.
Figure 4.7: Node classification accuracy on adversarial nodes.
Figure 5.1: Elastic Message Passing (EMP). F^0 = X_in and Z^0 = 0_{𝑚×𝑑}.
Figure 5.2: Classification accuracy under different propagation steps.
Figure 5.3: Convergence of the objective value for the problem in Eq. (5.8) during message passing.
Figure A.1: Linear regression on synthetic data.
Figure A.2: LeNet trained on MNIST dataset.
Figure A.3: Resnet18 trained on CIFAR10 dataset.
Figure A.4: Resnet18 trained on CIFAR10 dataset with 1Gbps network bandwidth.
Figure A.5: Resnet18 trained on CIFAR10 dataset with 200Mbps network bandwidth.
Figure A.6: Training under different compression block sizes.
Figure A.7: Training under different 𝛼.
Figure A.8: Training under different 𝛽.
Figure A.9: Training under different 𝜂.
Figure B.1: Relative compression error ∥x − 𝑄(x)∥^2 / ∥x∥^2 for p-norm b-bit quantization.
Figure B.2: Comparison of compression error ∥x − 𝑄(x)∥^2 / ∥x∥^2 between different compression methods.
Figure B.3: Logistic regression in the homogeneous case (full-batch gradient).
Figure B.4: Logistic regression in the homogeneous case (mini-batch gradient).
Figure C.1: Node classification accuracy on abnormal nodes (CiteSeer).
Figure C.2: Node classification accuracy on normal nodes (CiteSeer).
Figure C.3: Node classification accuracy on abnormal nodes (PubMed).
Figure C.4: Node classification accuracy on normal nodes (PubMed).
Figure C.5: Node classification accuracy in noisy features scenario (Coauthor CS).
Figure C.6: Node classification accuracy in noisy features scenario (Coauthor Physics).
Figure C.7: Node classification accuracy in noisy features scenario (Amazon Computers).
Figure C.8: Node classification accuracy in noisy features scenario (Amazon Photo).
Figure C.9: Node classification accuracy in noisy features scenario (ogbn-arxiv).
Figure C.10: Node classification accuracy in noisy features scenario with adjustment (Coauthor CS).
Figure C.11: Node classification accuracy in noisy features scenario (Cora).
Figure C.12: Node classification accuracy in noisy features scenario (CiteSeer).
Figure C.13: Node classification accuracy in noisy features scenario (PubMed).

LIST OF ALGORITHMS

Algorithm 1: DORE
Algorithm 2: DORE with 𝑅(x) = 0
Algorithm 3: LEAD
Algorithm 4: LEAD in Agent's Perspective

CHAPTER 1
INTRODUCTION

Machine learning (ML) techniques have brought a revolutionary impact to human society, and they will continue to act as technological innovators in the future. In recent years, critical challenges in machine learning, such as efficiency and security issues, have broadly emerged. These issues greatly limit the applications of ML techniques in many scientific and application domains. To broaden the impact of ML techniques, it is urgent to solve these emerging and critical research challenges.

On the one hand, ML models have become increasingly powerful thanks to big data and big models, but this also brings tremendous challenges in designing efficient optimization algorithms to train big ML models on big data. The most effective approach for large-scale ML is to parallelize the computation tasks on distributed systems composed of many computing devices. There are two mainstream trends of distributed and parallel ML systems: (1) large-scale ML models trained by powerful distributed computing systems in data centers; and (2) on-device distributed training on resource-limited edge devices (e.g., smartphones, AR/VR headsets, drones, and billions of Internet of Things devices) using the massive amount of data they generate on a daily basis. In both cases, the scalability and efficiency of the systems are greatly limited since the slow information synchronization between the devices dominates the total running time. In other words, the major bottleneck lies in the high communication cost between devices, especially when the scale of the system and the models grows while the communication bandwidth remains relatively limited.

For instance, in centralized distributed ML systems, as shown in Figure 1.1a, every computing node needs to frequently synchronize information with other nodes through the central server by passing messages (e.g., model or gradient information) [4, 92, 114, 68]. This message passing process becomes the dominant cost when the communication network bandwidth is limited and the number of computing nodes is massive. In decentralized distributed ML systems, as shown in Figure 1.1b, every computing node only needs to synchronize information with its one-hop neighbors by passing messages (e.g., the model information). Although this is more scalable in terms of the number of computing nodes, the communication bottleneck still exists under limited network bandwidth [101, 51, 50, 75]. The communication bottleneck often prevents distributed ML systems from achieving their theoretical speedup. Therefore, how to design distributed learning algorithms and systems with efficient message passing becomes a promising and key research direction for solving the efficiency challenge in ML.

Figure 1.1: Distributed ML systems. (a) Centralized learning; (b) decentralized learning.
On the other hand, recent research has generally revealed that many ML models suffer from security vulnerabilities. In particular, deep learning models can be easily deceived by unnoticeable perturbations in the data [30, 99, 21]. Meanwhile, graphs are a prevalent data structure for many kinds of real-world data; they encode pairwise relations between entities in, for example, social networks, transportation networks, and chemical molecules. Graph neural networks (GNNs) generalize and extend the representation learning power of traditional deep neural networks (DNNs) from regular grids, such as images, videos, and text, to irregular graph-structured data, as shown in Figure 1.2. Therefore, many important applications on these data can be treated as computational tasks on graphs [74, 32]. For instance, product recommendation in e-commerce and friend recommendation in social network analysis can be formulated as link prediction tasks on graphs, and traffic prediction in transportation systems can be formulated as node classification or regression on graphs. The key building block for such generalization is the message passing framework that propagates features from neighboring nodes in the graph. The message passing layer offers node permutation invariance and support for arbitrary neighborhood sizes in graphs. Despite the promising performance of GNNs in clean data settings, unfortunately, the vulnerability of deep learning models also translates to GNNs [134, 135, 37, 41] when the data contains adversarial perturbations. For instance, the performance greatly degrades when the node features or graph structure are modified by adversaries [41]. This raises significant concerns about the applications of GNNs, especially in safety-critical areas. Therefore, it is critical to design intrinsically secure ML models for graph-structured data.

Figure 1.2: Graph representation learning for node-focus tasks.

1.1 Research Challenges

From the above research background, we can summarize the research challenges as follows:

• How to design scalable distributed ML systems and algorithms with efficient message passing between computing devices such that the communication bottleneck can be largely mitigated?
• How to maintain the convergence behavior of the optimization algorithm, both theoretically and empirically, when communication efficiency is improved?
• How to design intrinsically secure ML models that are more robust to potential threats such as feature or graph structure attacks by adversaries?
• How to bypass the tradeoff between performance under clean and adversarial settings? In other words, can we maintain good performance when the data are clean while providing strong security when the data are adversarially perturbed?

1.2 Contributions

The primary objective of this dissertation is to develop solutions to these challenges through innovative research and principled methods. In particular, we propose multiple distributed optimization algorithms with efficient message passing to mitigate the communication bottleneck and speed up ML model training in distributed ML systems. We also propose multiple secure message passing schemes as the building blocks of graph neural networks, aiming to significantly improve the security and robustness of ML models.
The contributions of this dissertation are summarized as follows:

• To fundamentally improve the efficiency of distributed ML systems, I proposed a series of innovative algorithms to break through the communication bottleneck. In particular, when the communication network is a star network as shown in Figure 1.1a, I proposed DORE [68], a double residual compression algorithm, to compress the bi-directional communication between client devices and the server such that over 95% of the communication bits can be reduced. This is the first algorithm that reduces this much communication cost while maintaining the same convergence complexity (e.g., linear convergence) as the uncompressed counterpart, both theoretically and numerically.
• When the communication network has any general topology (as long as it is connected) as shown in Figure 1.1b, I proposed LEAD [69], the first linearly convergent decentralized optimization algorithm with communication compression, which only requires point-to-point compressed communication between neighboring devices over the communication network. Theoretically, we prove that under certain compression ratios, the convergence complexity of the proposed algorithm does not depend on the compression operator. In other words, it achieves better communication efficiency for free.
• To design intrinsically secure ML models against feature attacks, I investigate how to use graph structural information to denoise hidden features in neural network layers that are corrupted by adversarial perturbations. This is achieved by the proposed AirGNN [66], in which the adaptive message passing denoises perturbed features by feature aggregations and maintains feature separability by adaptive residuals. The proposed algorithm has a clear design principle and interpretation, as well as strong performance in both the clean and adversarial data settings.
• To design intrinsically secure ML models against graph structure attacks, I investigate a new smoothness prior in the design of graph neural networks. In particular, we derive an elastic message passing scheme to model piecewise constant signals in graph data. We demonstrate its stronger resilience to adversarial structure attacks and its superior performance when the data is clean through a comprehensive empirical study on the proposed model ElasticGNN [67].

1.3 Organization

The remainder of this dissertation is organized as follows. In Chapter 2, we introduce DORE, a centralized distributed optimization algorithm with communication compression. In Chapter 3, we introduce LEAD, the first linearly convergent decentralized distributed optimization algorithm with communication compression. In these two chapters, we demonstrate how to significantly improve the efficiency and scalability of distributed ML systems by the co-design of efficient message passing and optimization algorithms. In Chapter 4, we investigate the possibility of utilizing graph structural information to defend against abnormal features caused by noise or adversarial perturbations. We derive a novel adaptive message passing scheme from a principled graph signal denoising perspective. In Chapter 5, we study a new smoothness prior, i.e., piecewise constant signals, for graph representation learning. We derive the elastic message passing scheme to model adaptive local smoothness in graph data.
In these two chapters, we demonstrate how these secure message passing algorithms can be used as fundamental building blocks in the design of graph neural networks to defend against feature and graph structure attacks, through the examples of AirGNN and ElasticGNN, respectively. We conclude the dissertation and discuss the broader impact and promising research directions in Chapter 6.

CHAPTER 2
A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the message passing cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this chapter, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over 95% of the overall communication in message passing such that this obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with state-of-the-art baselines.

2.1 Introduction

Stochastic gradient algorithms [8] are efficient at minimizing the objective function 𝑓 : R^𝑑 → R, which is usually defined as 𝑓(x) := E_{𝜉∼D}[ℓ(x, 𝜉)], where ℓ(x, 𝜉) is the objective function defined on data sample 𝜉 and model parameter x. A basic stochastic gradient descent (SGD) repeats the gradient "descent" step x^{𝑘+1} = x^𝑘 − 𝛾g(x^𝑘), where x^𝑘 is the current iterate and 𝛾 is the step size. The stochastic gradient g(x^𝑘) is computed on an i.i.d. sampled mini-batch from the distribution of the training data D and serves as an estimator of the full gradient ∇𝑓(x^𝑘).

In the context of large-scale machine learning, the number of data samples and the model size are usually very large. Distributed learning utilizes a large number of computers/cores to perform the stochastic algorithms, aiming at reducing the training time. It has attracted extensive attention due to the demand for highly efficient model training [1, 17, 54, 122]. In this work, we focus on data-parallel SGD [22, 61, 133], which provides a scalable solution to speed up the training process by distributing the whole data to multiple computing nodes. The objective can be written as

  minimize_{x∈R^𝑑}  𝑓(x) + 𝑅(x) = (1/𝑛) Σ_{𝑖=1}^{𝑛} E_{𝜉∼D_𝑖}[ℓ(x, 𝜉)] + 𝑅(x),

where each 𝑓_𝑖(x) := E_{𝜉∼D_𝑖}[ℓ(x, 𝜉)] is a local objective function of worker node 𝑖 defined on the data allocated to it under distribution D_𝑖, and 𝑅 : R^𝑑 → R is usually a closed convex regularizer.

In the well-known parameter server framework [54, 133], during each iteration, each worker node evaluates its own stochastic gradient ∇̃𝑓_𝑖(x^𝑘) and sends it to the master node, which collects all gradients and calculates their average (1/𝑛) Σ_{𝑖=1}^{𝑛} ∇̃𝑓_𝑖(x^𝑘). The master node then takes a gradient descent step with the averaged gradient and broadcasts the new model parameter x^{𝑘+1} to all worker nodes. This makes use of the computational resources of all nodes.
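To make the parameter-server workflow above concrete, the following is a minimal single-process simulation of one data-parallel SGD setup (a sketch for illustration only; the least-squares objective, the synthetic data, and all variable names are assumptions rather than the setup used later in this chapter).

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, d, m = 4, 20, 50          # workers, model dimension, samples per worker
A = [rng.standard_normal((m, d)) for _ in range(n_workers)]            # local data D_i
b = [Ai @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m) for Ai in A]

def local_grad(i, x, batch=10):
    """Stochastic gradient of the local least-squares loss f_i at x."""
    idx = rng.choice(m, size=batch, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

x = np.zeros(d)                      # model kept in sync on master and workers
gamma = 0.01                         # step size
for k in range(200):
    grads = [local_grad(i, x) for i in range(n_workers)]   # workers -> master
    g = np.mean(grads, axis=0)       # master averages the stochastic gradients
    x = x - gamma * g                # gradient descent step on the master
    # the master then broadcasts x to the workers (implicit in this single process)
```

In a real deployment, the two communication steps in this loop (sending the gradients up and broadcasting the model down) are exactly where the bandwidth cost discussed next is paid.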
In reality, the network bandwidth is often limited. Thus, the communication cost for gradient transmission and model synchronization becomes the dominating bottleneck as the number of nodes and the model size increase, which hinders the scalability and efficiency of SGD. One common way to reduce the communication cost is to compress the gradient information by either gradient sparsification or quantization [4, 92, 97, 98, 110, 112, 114, 116] such that many fewer bits of information need to be transmitted. However, little attention has been paid to how to reduce the communication cost for model synchronization and to the corresponding theoretical guarantees. Obviously, the model has the same size as the gradient, and so does its communication cost. Thus, merely compressing the gradient can reduce at most 50% of the communication cost, which suggests the importance of model compression. Notably, the compression of model parameters is much more challenging than gradient compression. One key obstacle is that its compression error cannot be well controlled by the step size 𝛾 and thus cannot diminish like the error in gradient compression [101]. In this work, we aim to bridge this gap by investigating algorithms that compress the full communication in the optimization process and by understanding their theoretical properties. Our contributions can be summarized as follows:

• We propose DORE, which can compress both the gradient and the model information such that more than 95% of the communication cost can be reduced.
• We provide theoretical analyses to guarantee the convergence of DORE under strongly convex and nonconvex assumptions without the bounded gradient assumption.
• Our experiments demonstrate the superior efficiency of DORE compared with state-of-the-art baselines without degrading the convergence speed and the model accuracy.

2.2 Related Work

Recently, many works have tried to reduce the communication cost to speed up distributed learning, especially for deep learning applications, where the size of the model is typically very large (so is the size of the gradient) while the network bandwidth is relatively limited. Below we briefly review relevant papers.

Gradient quantization and sparsification. Recent works [4, 92, 114, 77, 7] have shown that the gradient can be quantized into a lower-precision vector such that fewer bits are needed in communication without loss of accuracy. [92] proposed 1Bit SGD, which keeps only the sign of each element in the gradient. It works well empirically, and [7] provided a systematic theoretical analysis. QSGD [4] utilizes an unbiased multi-level random quantization to compress the gradient, while Terngrad [114] quantizes the gradient into ternary numbers {0, ±1}. In DIANA [77], the gradient difference is compressed and communicated, contributing to the estimator of the gradient at the master node. Another effective strategy to reduce the communication cost is sparsification. [112] proposed a convex optimization formulation to minimize the coding length of stochastic gradients. A more aggressive sparsification method is to keep only the elements with relatively larger magnitude in the gradients, such as top-k sparsification [97, 98, 3]; a small sketch of this idea is given below.
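As a concrete illustration of the sparsification idea surveyed above, a minimal top-k gradient sparsifier might look as follows (an illustrative sketch only, not the implementation of any cited method; the function name and the choice of k are assumptions).

```python
import numpy as np

def topk_sparsify(g: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k entries of g with the largest magnitude; zero out the rest."""
    out = np.zeros_like(g)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(g), -k)[-k:]   # indices of the k largest |g_i|
    out[idx] = g[idx]
    return out

g = np.random.default_rng(1).standard_normal(1000)
g_sparse = topk_sparsify(g, k=10)   # only 10 values (plus their indices) need to be sent
```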
Model synchronization. The typical way to perform model synchronization is to broadcast the model parameters to all worker nodes. Some works [110, 42] have been proposed to reduce the model size by enforcing sparsity, but these approaches cannot be applied to general optimization problems. Some alternatives, including QSGD [4] and ECQ-SGD [116], choose to broadcast all quantized gradients to all other workers such that every worker can perform the model update independently. However, all-to-all communication is not efficient since the number of transmitted bits increases dramatically in large-scale networks. DoubleSqueeze [104] applies compression on the averaged gradient with error compensation to speed up model synchronization.

Error compensation. [92] applied error compensation to 1Bit-SGD and achieved negligible loss of accuracy empirically. Recently, error compensation was further studied [116, 97, 45] to mitigate the error caused by compression. The general idea is to add the compression error to the next compression step: ĝ = 𝑄(g + e), e = (g + e) − ĝ. However, to the best of our knowledge, most of the algorithms with error compensation [116, 97, 45, 104] need to assume a bounded gradient, i.e., E∥g∥^2 ≤ 𝐵, and the convergence rate depends on this bound.

Contributions of DORE. The papers most related to DORE are DIANA [77] and DoubleSqueeze [104]. Similarly to our approach, DIANA compresses the gradient difference on the worker side and achieves a good convergence rate. However, it does not consider compression in model synchronization, so at most 50% of the communication cost can be saved. DoubleSqueeze applies compression with error compensation on both the worker and server sides, but it only considers nonconvex objective functions. Moreover, its analysis relies on a bounded gradient assumption, i.e., E∥g∥^2 ≤ 𝐵, and the convergence error depends on this gradient bound, like most existing error compensation works. In general, a uniform bound on the norm of the stochastic gradient is a strong assumption that might not hold in some cases. For example, it is violated in the strongly convex case [82, 31]. In this work, we design DORE, the first algorithm that utilizes gradient and model compression with error compensation without assuming bounded gradients. Unlike existing error compensation works, we provide a linear convergence rate to the O(𝜎) neighborhood of the optimal solution for strongly convex functions and a sublinear rate to a stationary point for nonconvex functions with linear speedup. In Table 2.1, we compare the asymptotic convergence rates of different quantized SGDs with DORE.

2.3 Algorithm

In this section, we introduce the proposed DOuble REsidual compression SGD (DORE) algorithm. Before that, we introduce a common assumption for the compression operator. In this work, we adopt an assumption from [4, 114, 77] that the compression variance is linearly proportional to the squared magnitude of the compressed vector.

Assumption 1. The stochastic compression operator 𝑄 : R^𝑑 → R^𝑑 is unbiased, i.e., E𝑄(x) = x, and satisfies

  E∥𝑄(x) − x∥^2 ≤ 𝐶∥x∥^2,  (2.1)

for a nonnegative constant 𝐶 that is independent of x. We use x̂ to denote the compressed x, i.e., x̂ ∼ 𝑄(x).

Many feasible compression operators can be applied in our algorithm since our theoretical analyses are built on this common assumption. Some examples of feasible stochastic compression operators include:

• No Compression: 𝐶 = 0 when there is no compression.
• Stochastic Quantization: A real number 𝑥 ∈ [𝑎, 𝑏] (𝑎 < 𝑏) is set to 𝑎 with probability (𝑏 − 𝑥)/(𝑏 − 𝑎) and to 𝑏 with probability (𝑥 − 𝑎)/(𝑏 − 𝑎), where 𝑎 and 𝑏 are predefined quantization levels [4]. It satisfies Assumption 1 when 𝑎𝑏 > 0 and 𝑎 < 𝑏.
• Stochastic Sparsification: A real number 𝑥 is set to 0 with probability 1 − 𝑝 and to 𝑥/𝑝 with probability 𝑝 [114]. It satisfies Assumption 1 with 𝐶 = 1/𝑝 − 1.
• 𝑝-norm Quantization: A vector x is quantized element-wise by 𝑄_𝑝(x) = ∥x∥_𝑝 sign(x) ∘ 𝜉, where ∘ is the Hadamard product and 𝜉 is a Bernoulli random vector satisfying 𝜉_𝑖 ∼ Bernoulli(|𝑥_𝑖| / ∥x∥_𝑝). It satisfies Assumption 1 with 𝐶 = max_{x∈R^𝑑} ∥x∥_1 ∥x∥_𝑝 / ∥x∥_2^2 − 1 [77]. To decrease the constant 𝐶 for higher accuracy, a vector x ∈ R^𝑑 can be further decomposed into blocks, i.e., x = (x(1)^⊤, x(2)^⊤, · · · , x(𝑚)^⊤)^⊤ with x(𝑙) ∈ R^{𝑑_𝑙} and Σ_{𝑙=1}^{𝑚} 𝑑_𝑙 = 𝑑, and the blocks can be compressed independently.

Figure 2.1: An Illustration of DORE (workers send compressed gradient residuals to the master, and the master broadcasts compressed model residuals back to the workers).

2.3.1 Proposed Algorithm

Many previous works [4, 92, 114] reduce the communication cost of P-SGD by quantizing the stochastic gradient before sending it to the master node, but there are several intrinsic issues. First, these algorithms intrinsically incur extra optimization error. Consider the case when the algorithm converges to the optimal point x^∗, where we have (1/𝑛) Σ_{𝑖=1}^{𝑛} ∇𝑓_𝑖(x^∗) = 0. However, the data distributions may differ across worker nodes in general, and thus we may have ∇𝑓_𝑖(x^∗) ≠ ∇𝑓_𝑗(x^∗) for all 𝑖, 𝑗 ∈ {1, . . . , 𝑛} with 𝑖 ≠ 𝑗. In other words, each individual ∇𝑓_𝑖(x^∗) may be far away from zero. This causes large compression variance according to Assumption 1, which indicates that the upper bound on the compression variance E∥𝑄(x) − x∥^2 is linearly proportional to the squared magnitude of x. Second, most existing algorithms [92, 4, 114, 7, 116, 77] need to broadcast the model or gradient to all worker nodes in each iteration. This is a considerable bottleneck for efficient optimization since the number of bits to transmit is the same as for the uncompressed gradient. DoubleSqueeze [104] is able to apply compression on both the worker and server sides. However, its analysis depends on a strong bounded gradient assumption. Meanwhile, no theoretical guarantees are provided for convex problems.

We propose DORE to address all the aforementioned issues. Our motivation is that the gradient should change smoothly for smooth functions, so each worker node can keep a state variable h_𝑖^𝑘 to track its previous gradient information. As a result, the residual between the new gradient and the state h_𝑖^𝑘 should decrease, and the compression variance of the residual can be well bounded. On the other hand, as the algorithm converges, the model only changes slightly. Therefore, we propose to compress the model residual such that the compression variance can be minimized and also well bounded.

Algorithm 1 DORE
1: Input: Stepsizes 𝛼, 𝛽, 𝛾, 𝜂; initialize h^0 = h_𝑖^0 = 0_𝑑, x̂_𝑖^0 = x̂^0, ∀𝑖 ∈ {1, . . . , 𝑛}.
2: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
3:   For each worker 𝑖 ∈ {1, 2, · · · , 𝑛}:
4:     Sample g_𝑖^𝑘 such that E[g_𝑖^𝑘 | x̂_𝑖^𝑘] = ∇𝑓_𝑖(x̂_𝑖^𝑘)
5:     Gradient residual: Δ_𝑖^𝑘 = g_𝑖^𝑘 − h_𝑖^𝑘
6:     Compression: Δ̂_𝑖^𝑘 = 𝑄(Δ_𝑖^𝑘)
7:     h_𝑖^{𝑘+1} = h_𝑖^𝑘 + 𝛼Δ̂_𝑖^𝑘
8:     {ĝ_𝑖^𝑘 = h_𝑖^𝑘 + Δ̂_𝑖^𝑘}
9:     Send Δ̂_𝑖^𝑘 to the master
10:    Receive q̂^𝑘 from the master
11:    x̂_𝑖^{𝑘+1} = x̂_𝑖^𝑘 + 𝛽q̂^𝑘
12:  For the master:
13:    Receive {Δ̂_𝑖^𝑘} from the workers
14:    Δ̂^𝑘 = (1/𝑛) Σ_{𝑖=1}^{𝑛} Δ̂_𝑖^𝑘
15:    ĝ^𝑘 = h^𝑘 + Δ̂^𝑘  {= (1/𝑛) Σ_{𝑖=1}^{𝑛} ĝ_𝑖^𝑘}
16:    x^{𝑘+1} = prox_{𝛾𝑅}(x̂^𝑘 − 𝛾ĝ^𝑘)
17:    h^{𝑘+1} = h^𝑘 + 𝛼Δ̂^𝑘
18:    Model residual: q^𝑘 = x^{𝑘+1} − x̂^𝑘 + 𝜂e^𝑘
19:    Compression: q̂^𝑘 = 𝑄(q^𝑘)
20:    e^{𝑘+1} = q^𝑘 − q̂^𝑘
21:    x̂^{𝑘+1} = x̂^𝑘 + 𝛽q̂^𝑘
22:    Broadcast q̂^𝑘 to the workers
23: end for
24: Output: x̂^𝐾 or any x̂_𝑖^𝐾
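To make these update rules concrete, the following is a schematic single-process sketch of running Algorithm 1 in the smooth case 𝑅 = 0, so the proximal step reduces to a plain gradient step (an illustrative sketch only; the stochastic-sparsification stand-in for 𝑄, the quadratic local losses, and the parameter values are assumptions, not the settings used in the experiments).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 10
alpha, beta, gamma, eta = 0.25, 0.5, 0.02, 1.0   # illustrative parameter choices

def Q(v, p=0.5):
    """Unbiased stochastic sparsification: keep v_i / p with probability p (Assumption 1)."""
    mask = rng.random(v.shape) < p
    return np.where(mask, v / p, 0.0)

A = [rng.standard_normal((20, d)) for _ in range(n)]   # assumed local data
bs = [Ai @ rng.standard_normal(d) for Ai in A]
grad = lambda i, x: A[i].T @ (A[i] @ x - bs[i]) / 20   # local full gradient

x_hat = np.zeros(d)                    # synchronized model estimate
h = [np.zeros(d) for _ in range(n)]    # worker gradient states h_i
h_m = np.zeros(d)                      # master's averaged state h
e = np.zeros(d)                        # master-side compression error

for k in range(100):
    # --- workers: compress the gradient residual (lines 4-9) ---
    deltas_hat = []
    for i in range(n):
        delta = grad(i, x_hat) - h[i]          # gradient residual
        delta_hat = Q(delta)                   # compressed residual, sent to the master
        h[i] = h[i] + alpha * delta_hat        # update local state
        deltas_hat.append(delta_hat)
    # --- master: recover the gradient, step, compress the model residual (lines 13-22) ---
    delta_hat_avg = np.mean(deltas_hat, axis=0)
    g_hat = h_m + delta_hat_avg                # recovered averaged gradient
    x_new = x_hat - gamma * g_hat              # prox step with R = 0
    h_m = h_m + alpha * delta_hat_avg
    q = x_new - x_hat + eta * e                # model residual with error compensation
    q_hat = Q(q)                               # compressed residual, broadcast to workers
    e = q - q_hat                              # error carried to the next iteration
    # --- all nodes update the synchronized model (lines 10-11 and 21) ---
    x_hat = x_hat + beta * q_hat
```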
We also compensate the model residual compression error into the next iteration to achieve better convergence. Due to the advantages of the proposed double residual compression scheme, we can derive the fastest convergence rate through analyses that do not require the bounded gradient assumption. Note that in Algorithm 1, the equations in curly brackets are only notation used in the proofs and do not need to be computed. Below are some key steps of our algorithm, as shown in Algorithm 1 and Figure 2.1:

[lines 4-9]: each worker node sends the compressed gradient residual (Δ̂_𝑖^𝑘) to the master node and updates its state h_𝑖^𝑘 with Δ̂_𝑖^𝑘;
[lines 13-15]: the master node gathers the compressed gradient residuals ({Δ̂_𝑖^𝑘}) from all worker nodes and recovers the averaged gradient ĝ^𝑘 based on its state h^𝑘;
[line 16]: the master node applies the gradient descent step (possibly with the proximal operator);
[lines 18-22]: the master node broadcasts the compressed model residual with error compensation (q̂^𝑘) to all worker nodes and updates the model;
[lines 10-11]: each worker node receives the compressed model residual (q̂^𝑘) and updates its model x̂_𝑖^𝑘.

In the algorithm, the state h_𝑖^𝑘 serves as an exponential moving average of the local gradient in expectation, i.e., E_𝑄 h_𝑖^{𝑘+1} = (1 − 𝛼)h_𝑖^𝑘 + 𝛼g_𝑖^𝑘, as proved in Lemma 7. Therefore, as the iterate approaches the optimum, h_𝑖^𝑘 also approaches the local gradient ∇𝑓_𝑖(x^∗) rapidly, which leads to a small gradient residual and consequently a small compression variance. Similar difference compression techniques are also proposed in DIANA and its variance-reduced variant [77, 36].

2.3.2 Discussion

In this subsection, we provide more detailed discussions of DORE, including model initialization, the model update, the special smooth case, and the communication compression rate.

Initialization. It is important to use the identical initialization x̂^0 on the master and all worker nodes. This is easily ensured by either setting the same random seed or broadcasting the model once at the beginning. In this way, although we don't need to broadcast the model parameters directly, every worker node updates the model x̂^𝑘 in the same way, so their model parameters remain identical. Otherwise, the model inconsistency would need to be considered.

Model update. It is worth noting that although we could choose the accurate model x^{𝑘+1} as the next iterate on the master node, we use x̂^{𝑘+1} instead. In this way, we ensure that the gradient descent step is applied based on the exact stochastic gradient, which is evaluated at x̂_𝑖^𝑘 on each worker node. This avoids the intricacy of dealing with an inexact gradient evaluated at x^𝑘 and thus simplifies the convergence analysis.

Smooth case. In the smooth case, i.e., 𝑅 = 0, Algorithm 1 can be simplified. The master node quantizes the recovered averaged gradient with error compensation and broadcasts it to all worker nodes. This simplified algorithm is shown in Algorithm 2.
Algorithm 2 DORE with 𝑅(x) = 0
1: Input: Stepsizes 𝛼, 𝛽, 𝛾, 𝜂; initialize h^0 = h_𝑖^0 = 0_𝑑, x̂_𝑖^0 = x̂^0, ∀𝑖 ∈ {1, . . . , 𝑛}.
2: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
3:   For each worker 𝑖 ∈ {1, 2, · · · , 𝑛}:
4:     Sample g_𝑖^𝑘 such that E[g_𝑖^𝑘 | x̂_𝑖^𝑘] = ∇𝑓_𝑖(x̂_𝑖^𝑘)
5:     Gradient residual: Δ_𝑖^𝑘 = g_𝑖^𝑘 − h_𝑖^𝑘
6:     Compression: Δ̂_𝑖^𝑘 = 𝑄(Δ_𝑖^𝑘)
7:     h_𝑖^{𝑘+1} = h_𝑖^𝑘 + 𝛼Δ̂_𝑖^𝑘
8:     {ĝ_𝑖^𝑘 = h_𝑖^𝑘 + Δ̂_𝑖^𝑘}
9:     Send Δ̂_𝑖^𝑘 to the master
10:    Receive q̂^𝑘 from the master
11:    x̂_𝑖^{𝑘+1} = x̂_𝑖^𝑘 + 𝛽q̂^𝑘
12:  For the master:
13:    Receive {Δ̂_𝑖^𝑘} from the workers
14:    Δ̂^𝑘 = (1/𝑛) Σ_{𝑖=1}^{𝑛} Δ̂_𝑖^𝑘
15:    ĝ^𝑘 = h^𝑘 + Δ̂^𝑘  {= (1/𝑛) Σ_{𝑖=1}^{𝑛} ĝ_𝑖^𝑘}
16:    h^{𝑘+1} = h^𝑘 + 𝛼Δ̂^𝑘
17:    q^𝑘 = −𝛾ĝ^𝑘 + 𝜂e^𝑘
18:    Compression: q̂^𝑘 = 𝑄(q^𝑘)
19:    e^{𝑘+1} = q^𝑘 − q̂^𝑘
20:    Broadcast q̂^𝑘 to the workers
21: end for
22: Output: any x̂_𝑖^𝐾

Compression rate. Compressing only the gradient information can reduce at most 50% of the communication cost since it applies compression only during gradient aggregation while ignoring model synchronization. DORE, however, can further cut down the remaining 50% of the communication. Taking the blockwise 𝑝-norm quantization as an example, every element of x can be represented by 3/2 bits using the simple ternary coding {0, ±1}, along with one 32-bit magnitude for each block. For example, with a uniform block size 𝑏, the number of bits needed to represent a 𝑑-dimensional vector of 32-bit floating-point numbers can be reduced from 32𝑑 bits to (32𝑑/𝑏 + (3/2)𝑑) bits. As long as the block size 𝑏 is relatively large compared with the constant 32, the cost 32𝑑/𝑏 for storing the floating-point magnitudes is relatively small, so the compression rate is close to 32𝑑 / ((3/2)𝑑) ≈ 21.3 times (for example, 19.7 times when 𝑏 = 256). Applying this quantization, QSGD, Terngrad, MEM-SGD, and DIANA need to transmit (32𝑑 + 32𝑑/𝑏 + (3/2)𝑑) bits per iteration, and thus they are able to cut down 47% of the overall 2 × 32𝑑 bits per iteration through gradient compression when 𝑏 = 256. With DORE, we only need to transmit 2(32𝑑/𝑏 + (3/2)𝑑) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and the model transmission. More efficient coding techniques such as Elias coding [26] can be applied to further reduce the number of bits per iteration.

2.4 Convergence Analysis

To show the convergence of DORE, we make the following commonly used assumptions when needed.

Assumption 2. Each worker node samples an unbiased estimator of the gradient stochastically with bounded variance, i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x ∈ R^𝑑,

  E[g_𝑖 | x] = ∇𝑓_𝑖(x),  E∥g_𝑖 − ∇𝑓_𝑖(x)∥^2 ≤ 𝜎_𝑖^2,  (2.2)

where g_𝑖 is the estimator of ∇𝑓_𝑖 at x. In addition, we define 𝜎^2 = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝜎_𝑖^2.

Assumption 3. Each 𝑓_𝑖 is 𝐿-Lipschitz differentiable, i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x, y ∈ R^𝑑,

  𝑓_𝑖(x) ≤ 𝑓_𝑖(y) + ⟨∇𝑓_𝑖(y), x − y⟩ + (𝐿/2)∥x − y∥^2.  (2.3)

Assumption 4. Each 𝑓_𝑖 is 𝜇-strongly convex (𝜇 ≥ 0), i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x, y ∈ R^𝑑,

  𝑓_𝑖(x) ≥ 𝑓_𝑖(y) + ⟨∇𝑓_𝑖(y), x − y⟩ + (𝜇/2)∥x − y∥^2.  (2.4)

For simplicity, we use the same compression operator for all worker nodes, and the master node can apply a different compression operator. We denote the constants in Assumption 1 as 𝐶_𝑞 and 𝐶_𝑞^𝑚 for the worker and master nodes, respectively. Then we set 𝛼 and 𝛽 in both algorithms to satisfy

  (1 − √(1 − 4𝐶_𝑞(𝐶_𝑞 + 1)/(𝑛𝑐))) / (2(𝐶_𝑞 + 1)) ≤ 𝛼 ≤ (1 + √(1 − 4𝐶_𝑞(𝐶_𝑞 + 1)/(𝑛𝑐))) / (2(𝐶_𝑞 + 1)),  0 < 𝛽 ≤ 1/(𝐶_𝑞^𝑚 + 1),  (2.5)

with 𝑐 ≥ 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛. We consider two scenarios in the following two subsections: 𝑓 is strongly convex with a convex regularizer 𝑅, and 𝑓 is nonconvex with 𝑅 = 0.

2.4.1 Strongly Convex Case
Theorem 1. Under Assumptions 1-4, if 𝛼 and 𝛽 in Algorithm 1 satisfy (2.5), and 𝜂 and 𝛾 satisfy

  𝜂 < min{ (−𝐶_𝑞^𝑚 + √((𝐶_𝑞^𝑚)^2 + 4(1 − (𝐶_𝑞^𝑚 + 1)𝛽))) / (2𝐶_𝑞^𝑚),  4𝜇𝐿 / ((𝜇 + 𝐿)^2(1 + 𝑐𝛼) − 4𝜇𝐿) },  (2.6)

  𝜂(𝜇 + 𝐿) / (2(1 + 𝜂)𝜇𝐿) ≤ 𝛾 ≤ 2 / ((1 + 𝑐𝛼)(𝜇 + 𝐿)),  (2.7)

then we have

  V^{𝑘+1} ≤ 𝜌^𝑘 V^1 + ((1 + 𝜂)(1 + 𝑛𝑐𝛼) / (𝑛(1 − 𝜌))) 𝛽𝛾^2 𝜎^2,  (2.8)

with

  V^𝑘 = 𝛽(1 − (𝐶_𝑞^𝑚 + 1)𝛽) E∥q^{𝑘−1}∥^2 + E∥x̂^𝑘 − x^∗∥^2 + ((1 + 𝜂)𝑐𝛽𝛾^2 / 𝑛) Σ_{𝑖=1}^{𝑛} E∥h_𝑖^𝑘 − ∇𝑓_𝑖(x^∗)∥^2,

  𝜌 = max{ (𝜂^2 + 𝜂)𝐶_𝑞^𝑚 / (1 − (𝐶_𝑞^𝑚 + 1)𝛽),  1 + 𝜂𝛽 − 2(1 + 𝜂)𝛽𝛾𝜇𝐿 / (𝜇 + 𝐿),  1 − 𝛼 } < 1.

Corollary 1. When there is no error compensation and we set 𝜂 = 0, then 𝜌 = max(1 − 2𝛽𝛾𝜇𝐿/(𝜇 + 𝐿), 1 − 𝛼). If we further set

  𝛼 = 1/(2(𝐶_𝑞 + 1)),  𝛽 = 1/(𝐶_𝑞^𝑚 + 1),  𝑐 = 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛,  (2.9)

and choose the largest step size 𝛾 = 2/((𝜇 + 𝐿)(1 + 2𝐶_𝑞/𝑛)), the convergence factor is

  (1 − 𝜌)^{−1} = max{ 2(𝐶_𝑞 + 1),  (𝐶_𝑞^𝑚 + 1) ((𝜇 + 𝐿)^2 / (2𝜇𝐿)) (1/2 + 𝐶_𝑞/𝑛) }.  (2.10)

Remark 1. In particular, suppose {Δ_𝑖}_{𝑖=1}^{𝑛} are compressed using the Bernoulli 𝑝-norm quantization with the largest block size 𝑑_max; then 𝐶_𝑞 = 1/𝛼_𝑤 − 1, with 𝛼_𝑤 = min_{0≠x∈R^{𝑑_max}} ∥x∥_2^2 / (∥x∥_1 ∥x∥_𝑝) ≤ 1. Similarly, q is compressed using the Bernoulli 𝑝-norm quantization with 𝐶_𝑞^𝑚 = 1/𝛼_𝑚 − 1. Then the linear convergence factor is

  (1 − 𝜌)^{−1} = max{ 2/𝛼_𝑤,  (1/𝛼_𝑚) ((𝜇 + 𝐿)^2 / (2𝜇𝐿)) (1/2 − 1/𝑛 + 1/(𝑛𝛼_𝑤)) }.  (2.11)

The corresponding result of DIANA in [77] is max{ 2/𝛼_𝑤, ((𝜇 + 𝐿)/𝜇)(1/2 − 1/𝑛 + 1/(𝑛𝛼_𝑤)) }, which is better than (2.11) with 𝛼_𝑚 = 1 (no compression for the model). When there is no compression for Δ_𝑖, i.e., 𝛼_𝑤 = 1, the algorithm reduces to gradient descent, and the linear convergence factor is the same as that of gradient descent for strongly convex functions.

Remark 2. Although error compensation often improves the convergence empirically, in theory, no compensation, i.e., 𝜂 = 0, provides the best convergence rate. This is because we don't have much information about the error being compensated. Filling this gap will be an interesting future direction.

2.4.2 Nonconvex Case

Theorem 2. Under Assumptions 1-3 and the additional assumption that each worker samples the gradient from the full dataset, we set 𝛼 and 𝛽 according to (2.5). By choosing

  𝛾 ≤ min{ (−1 + √(1 + 48𝐿^2𝛽^2(𝐶_𝑞^𝑚 + 1)^2 / 𝐶_𝑞^𝑚)) / (12𝐿𝛽(𝐶_𝑞^𝑚 + 1)),  1 / (6𝐿𝛽(1 + 𝑐𝛼)(𝐶_𝑞 + 1)) },

we have

  (𝛽/2 − 3(1 + 𝑐𝛼)(𝐶_𝑞^𝑚 + 1)𝐿𝛽^2𝛾) (1/𝐾) Σ_{𝑘=1}^{𝐾} E∥∇𝑓(x̂^𝑘)∥^2 ≤ (Λ^1 − Λ^{𝐾+1}) / (𝛾𝐾) + 3(𝐶_𝑞^𝑚 + 1)(1 + 𝑛𝑐𝛼)𝐿𝛽^2𝜎^2𝛾 / 𝑛,  (2.12)

where

  Λ^𝑘 = (𝐶_𝑞^𝑚 + 1)𝐿𝛽^2 ∥q^{𝑘−1}∥^2 + 𝑓(x̂^𝑘) − 𝑓^∗ + 3𝑐(𝐶_𝑞^𝑚 + 1)𝐿𝛽^2𝛾^2 (1/𝑛) Σ_{𝑖=1}^{𝑛} E∥h_𝑖^𝑘∥^2.  (2.13)

Corollary 2. Let 𝛼 = 1/(2(𝐶_𝑞 + 1)), 𝛽 = 1/(𝐶_𝑞^𝑚 + 1), and 𝑐 = 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛; then 1 + 𝑛𝑐𝛼 is a fixed constant. If 𝛾 = 1 / (12𝐿(1 + 𝑐𝛼)(1 + √(𝐾/𝑛))), then when 𝐾 is relatively large, we have

  (1/𝐾) Σ_{𝑘=1}^{𝐾} E∥∇𝑓(x̂^𝑘)∥^2 ≲ 1/𝐾 + 1/√(𝐾𝑛).  (2.14)

Remark 3. The dominant term in (2.14) is 𝑂(1/√(𝐾𝑛)), which implies that the sample complexity of each worker node is 𝑂(1/(𝑛𝜖^2)) on average to achieve an 𝜖-accurate solution. It shows that, like DoubleSqueeze in [104], DORE achieves linear speedup. Furthermore, this convergence result is the same as that of P-SGD without compression. Note that DoubleSqueeze has an extra term (1/𝐾)^{2/3}, and its convergence requires bounded variance of the compression operator.

Algorithm | Compression | Compression Assumed | Linear Rate | Nonconvex Rate
QSGD | Grad | 2-norm quantization | N/A | 1/𝐾 + 𝐵
DIANA | Grad | 𝑝-norm quantization | ✓ | 1/√(𝐾𝑛) + 1/𝐾
DoubleSqueeze | Grad + Model | Bounded variance | N/A | 1/√(𝐾𝑛) + 1/𝐾^{2/3} + 1/𝐾
DORE | Grad + Model | Assumption 1 | ✓ | 1/√(𝐾𝑛) + 1/𝐾

Table 2.1: A comparison between related algorithms. DORE is able to converge linearly to the O(𝜎) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE doesn't require the assumptions of bounded compression error or bounded gradient.
2.5 Experiment

In this section, we validate the theoretical results and demonstrate the superior performance of DORE. Our experimental results demonstrate that (1) DORE achieves a similar convergence speed as full-precision SGD and state-of-the-art quantized SGD baselines, and (2) its iteration time is much smaller than that of most existing algorithms, supporting the superior communication efficiency of DORE.

To make a fair comparison, we choose the same Bernoulli ∞-norm quantization as described in Section 2.3, with a quantization block size of 256, for all experiments unless explicitly stated otherwise, because ∞-norm quantization is unbiased and commonly used. The parameters 𝛼, 𝛽, and 𝜂 for DORE are chosen to be 0.1, 1, and 1, respectively. The baselines we compare against include SGD, QSGD [4], MEM-SGD [97], DIANA [77], DoubleSqueeze, and DoubleSqueeze (topk) [104]. SGD is the vanilla SGD without any compression, and QSGD quantizes the gradient directly. MEM-SGD is QSGD with error compensation. DIANA, which only compresses and transmits the gradient difference, is a special case of the proposed DORE. DoubleSqueeze quantizes both the gradient on the workers and the averaged gradient on the server with error compensation. Although DoubleSqueeze is claimed to work well with both biased and unbiased compression, in our experiments it converges much more slowly and suffers a loss of accuracy with unbiased compression. Thus, we also compare with DoubleSqueeze using the top-k compression as presented in [104].

2.5.1 Strongly Convex

To verify the convergence for strongly convex and smooth objective functions, we conduct an experiment on a linear regression problem: 𝑓(x) = ∥Ax − b∥^2 + 𝜆∥x∥^2. The data matrix A ∈ R^{1200×500} and the optimal solution x^∗ ∈ R^{500} are randomly synthesized. Then we generate the prediction b by sampling from a Gaussian distribution whose mean is Ax^∗. The rows of the data matrix A are allocated evenly to 20 worker nodes. To better verify the linear convergence to the O(𝜎) neighborhood around the optimal solution, we take the full gradient in each node for all algorithms to exclude the effect of the gradient variance (𝜎 = 0).

As shown in Figure 2.2, with the full gradient and a constant learning rate, DORE converges linearly, the same as SGD and DIANA, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to a neighborhood of the optimal point. This is because these algorithms assume a bounded gradient and their convergence errors depend on that bound. Although they converge to the optimal solution with a diminishing step size, their convergence rates will be much slower.

Figure 2.2: Linear regression on synthetic data (∥x^𝑘 − x^∗∥^2 versus iteration; (a) learning rate 0.05, (b) learning rate 0.025). When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.
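For reference, the blockwise ∞-norm quantization used in these experiments can be sketched as follows (a minimal NumPy illustration of the operator defined in Section 2.3, not the exact code behind the reported results; the function name and default block size are assumptions).

```python
import numpy as np

def pnorm_quantize(x: np.ndarray, block_size: int = 256, p: float = np.inf,
                   rng: np.random.Generator = None) -> np.ndarray:
    """Unbiased blockwise p-norm quantization: each block is mapped to
    ||block||_p * sign(block) * xi with xi_i ~ Bernoulli(|block_i| / ||block||_p)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty_like(x, dtype=float)
    for start in range(0, x.size, block_size):
        blk = x[start:start + block_size]
        scale = np.linalg.norm(blk, ord=p)
        if scale == 0:
            out[start:start + block_size] = 0.0
            continue
        prob = np.abs(blk) / scale                         # probabilities in [0, 1]
        xi = (rng.random(blk.shape) < prob).astype(float)  # Bernoulli draws
        out[start:start + block_size] = scale * np.sign(blk) * xi
    return out

x = np.random.default_rng(0).standard_normal(1024)
x_hat = pnorm_quantize(x)   # E[x_hat] = x; entries lie in {0, +/- block magnitude}
```

Each quantized block can then be encoded with one 32-bit magnitude plus roughly 3/2 bits per entry for the ternary pattern, which is the bit count used in the compression-rate discussion of Section 2.3.2.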
Compression error. The property of the compression operator indicates that the expected squared compression error is bounded linearly by the squared norm of the variable being compressed: E∥𝑄(x) − x∥² ≤ 𝐶∥x∥². We visualize the norm of the variables being compressed, i.e., the gradient residual (the worker side) and the model residual (the master side) for DORE, as well as the error-compensated gradient (the worker side) and the averaged gradient (the master side) for DoubleSqueeze. As shown in Figure 2.3, the gradient and model residuals of DORE decrease exponentially and the compression errors vanish. However, for DoubleSqueeze, their norms only decrease to a certain value and the compression error doesn't vanish. This explains why algorithms without residual compression cannot converge linearly to the O(𝜎) neighborhood of the optimal solution in the strongly convex case.

Figure 2.3: The norm of the variables being compressed in the linear regression experiment: (a) worker side and (b) master side.

2.5.2 Nonconvex

To verify the convergence in the nonconvex case, we test the proposed DORE with two classical deep neural networks on two representative datasets, i.e., LeNet [52] on MNIST and Resnet18 [35] on CIFAR10. In the experiment, we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU. The batch size for each worker node is 256. We use 0.1 and 0.01 as the initial learning rates for LeNet and Resnet18, and decrease them by a factor of 0.1 after every 25 and 100 epochs, respectively. All parameter settings are the same for all algorithms.

Figure 2.4: LeNet trained on MNIST: (a) training loss and (b) test loss. DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method while having similar performance to DoubleSqueeze (topk).

Figure 2.5: Resnet18 trained on CIFAR10: (a) training loss and (b) test loss. DORE achieves similar convergence and accuracy to most baselines. DoubleSqueeze converges slower and suffers from a higher loss, but it works well with topk compression.

Figures 2.4 and 2.5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on the CIFAR10 dataset. The results indicate that in the nonconvex case, even with both compressed gradient and model information, DORE can still achieve a similar convergence speed as full-precision SGD and other quantized SGD variants. DORE achieves a much better convergence speed than DoubleSqueeze using the same compression method and converges similarly to DoubleSqueeze with Top-k compression as presented in [104].

Figure 2.6: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE under varied bandwidth, tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.
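Returning to the compression property stated at the beginning of this subsection, the bound E∥𝑄(x) − x∥² ≤ 𝐶∥x∥² can be checked empirically for any candidate compressor with a short Monte-Carlo estimate; the sketch below is illustrative and takes the compressor as a generic function argument.

import numpy as np

def estimate_compression_constant(quantize, dim=1024, trials=500, seed=0):
    # Estimate E||Q(x) - x||^2 / ||x||^2 over random inputs as an empirical
    # proxy for the constant C in the compression assumption.
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        x = rng.normal(size=dim)
        err = quantize(x) - x
        ratios.append(np.sum(err ** 2) / np.sum(x ** 2))
    return float(np.mean(ratios))

Plugging in the blockwise Bernoulli ∞-norm quantizer used in our experiments would give an empirical sense of how tight the assumed bound is for the chosen block size.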
We also validate through the parameter sensitivity analysis in Appendix A.1.2 that DORE performs consistently well under different parameter settings, such as the compression block size, 𝛼, 𝛽, and 𝜂.

2.5.3 Communication Efficiency

In terms of communication cost, DORE enjoys the benefit of extremely efficient communication. As one example, under the same setting as the Resnet18 experiment described in the previous section, we test the time cost per iteration for SGD, QSGD, and DORE under varied network bandwidth. We didn't test MEM-SGD, DIANA, and DoubleSqueeze because MEM-SGD and DIANA have a similar time cost to QSGD, while DoubleSqueeze has a similar time cost to DORE. The results shown in Figure 2.6 indicate that as the bandwidth becomes worse, with both gradient and model compression, the advantage of DORE becomes more remarkable compared to the baselines that don't apply compression for model synchronization. In Appendix A.1.1, we also demonstrate the communication efficiency in terms of communication bits and running time, which clearly suggests the benefit of the proposed algorithm.

2.6 Conclusion

Message passing is the dominating bottleneck for distributed training of modern large-scale machine learning models. Extensive works have compressed the gradient information to be transferred during the training process, but model compression is rather limited due to its intrinsic difficulty. In this work, we proposed the Double Residual Compression SGD, named DORE, to compress both gradient and model communication, which mitigates this bottleneck prominently. The theoretical analyses suggest a good convergence rate for DORE under weak assumptions. Furthermore, DORE is able to reduce 95% of the communication cost in message passing while maintaining a similar convergence rate and model accuracy compared with full-precision SGD.

CHAPTER 3
LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION

Communication compression has become a key strategy to speed up the message passing in distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual algorithms, in this chapter, we propose the first LinEAr convergent Decentralized algorithm with compression, LEAD. Our theory describes the coupled dynamics of the inexact primal and dual updates as well as the compression error, and we provide the first consensus error bound in such settings without assuming bounded gradients. This is also the first work that proves that, in a certain compression regime, the message compression in message passing does not hurt the convergence, which means better communication efficiency is achieved for free. Experiments on convex problems validate our theoretical analysis, and the empirical study on deep neural nets shows that LEAD is applicable to non-convex problems as well.

3.1 Introduction

Distributed optimization solves the following optimization problem

x∗ := arg min_{x∈R^𝑑} [ 𝑓(x) := (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑓_𝑖(x) ]   (3.1)

with 𝑛 computing agents and a communication network. Each 𝑓_𝑖(x) : R^𝑑 → R is a local objective function of agent 𝑖 and is typically defined on the data D_𝑖 located at that agent. The data distributions {D_𝑖} can be heterogeneous depending on the application, such as in federated learning. The variable x ∈ R^𝑑 often represents model parameters in machine learning.
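As a concrete illustration of problem (3.1), the sketch below builds the global objective as the average of local objectives defined on per-agent data; the least-squares form of each 𝑓_𝑖 and all names are illustrative assumptions.

import numpy as np

def make_local_objective(Ai, bi):
    # Local objective f_i(x) defined on agent i's data D_i = (Ai, bi),
    # taking least squares as an illustrative choice for f_i.
    def f_i(x):
        return 0.5 * np.mean((Ai @ x - bi) ** 2)
    return f_i

def global_objective(x, local_objectives):
    # f(x) = (1/n) * sum_i f_i(x), as in Eq. (3.1).
    return np.mean([f_i(x) for f_i in local_objectives])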
A distributed optimization algorithm seeks an optimal solution that minimizes the overall objective function 𝑓 (x) collectively. According to the communication topology, existing algorithms can be conceptually categorized into centralized and decentralized ones. Specifically, centralized algorithms require global communication between 24 agents (through central agents or parameter servers). While decentralized algorithms only require local communication between connected agents and are more widely applicable than centralized ones. In both paradigms, the computation can be relatively fast with powerful computing devices; efficient communication is the key to improve algorithm efficiency and system scalability, especially when the network bandwidth is limited. In recent years, various communication compression techniques, such as quantization and sparsification, have been developed to reduce communication costs. Notably, extensive studies [92, 4, 7, 97, 45, 77, 104, 68] have utilized gradient compression to significantly boost communication efficiency for centralized optimization. They enable efficient large-scale optimization while maintaining comparable convergence rates and practical performance with their non-compressed counterparts. This great success has suggested the potential and significance of communication compression in decentralized algorithms. While extensive attention has been paid to centralized optimization, communication compression is relatively less studied in decentralized algorithms because the algorithm design and analysis are more challenging in order to cover general communication topologies. There are recent efforts trying to push this research direction. For instance, DCD-SGD and ECD-SGD [101] introduce difference compression and extrapolation compression to reduce model compression error. [88, 89] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize. DeepSqueeze [102] directly compresses the local model and compensates the compression error in the next iteration. CHOCO-SGD [51, 50] presents a novel quantized gossip algorithm that reduces compression error by difference compression and preserves the model average. Nevertheless, most existing works focus on the compression of primal-only algorithms, i.e., reduce to DGD [80, 123] or P-DSGD [62]. They are unsatisfying in terms of convergence rate, stability, and the capability to handle heterogeneous data. Part of the reason is that they inherit the drawback of DGD-type algorithms, whose convergence rate is slow in heterogeneous data scenarios where the data distributions are significantly different from agent to agent. In the literature of decentralized optimization, it has been proved that primal-dual algorithms 25 can achieve faster converge rates and better support heterogeneous data [63, 96, 59, 124]. However, it is unknown whether communication compression is feasible for primal-dual algorithms and how fast the convergence can be with compression. In this work, we attempt to bridge this gap by investigating the communication compression for primal-dual decentralized algorithms. Our major contributions can be summarized as: • We delineate two key challenges in the algorithm design for communication compression in decentralized optimization, i.e., data heterogeneity and compression error, and motivated by primal-dual algorithms, we propose a novel decentralized algorithm with compression, LEAD. 
• We prove that for LEAD, a constant stepsize in the range (0, 2/(𝜇+ 𝐿)] is sufficient to ensure linear convergence for strongly convex and smooth objective functions. To the best of our knowledge, LEAD is the first linear convergent decentralized algorithm with compression. Moreover, LEAD provably works with unbiased compression of arbitrary precision. • We further prove that if the stochastic gradient is used, LEAD converges linearly to the 𝑂 (𝜎 2 ) neighborhood of the optimum with constant stepsize. LEAD is also able to achieve exact convergence to the optimum with diminishing stepsize. • Extensive experiments on convex problems validate our theoretical analyses, and the empirical study on training deep neural nets shows that LEAD is applicable for nonconvex problems. LEAD achieves state-of-art computation and communication efficiency in all experiments and significantly outperforms the baselines on heterogeneous data. Moreover, LEAD is robust to parameter settings and needs minor effort for parameter tuning. 3.2 Related Work Decentralized optimization can be traced back to the work by [107]. DGD [80] is the most classical decentralized algorithm. It is intuitive and simple but converges slowly due to the diminishing stepsize that is needed to obtain the optimal solution [123]. Its stochastic version D-PSGD [62] has been shown effective for training nonconvex deep learning models. Algorithms based on primal-dual 26 formulations or gradient tracking are proposed to eliminate the convergence bias in DGD-type algorithms and improve the convergence rate, such as D-ADMM [78], DLM [63], EXTRA [96], NIDS [59], 𝐷 2 [103], Exact Diffusion [125], OPTRA [120], DIGing [79], GSGT [87], etc. Recently, communication compression is applied to decentralized settings by [101]. It proposes two algorithms, i.e., DCD-SGD and ECD-SGD, which require compression of high accuracy and are not stable with aggressive compression. [88, 89] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize and the convergence is slow. DeepSqueeze [102] compensates the compression error to the compression in the next iteration. Motivated by the quantized average consensus algorithms, such as [13], the quantized gossip algorithm CHOCO- Gossip [51] converges linearly to the consensual solution. Combining CHOCO-Gossip and D-PSGD leads to a decentralized algorithm with compression, CHOCO-SGD, which converges sublinearly under the strong convexity and gradient boundedness assumptions. Its nonconvex variant is further analyzed in [50]. A new compression scheme using the modulo operation is introduced in [71] for decentralized optimization. A general algorithmic framework aiming to maintain the linear convergence of distributed optimization under compressed communication is considered in [75]. It requires a contractive property that is not satisfied by many decentralized algorithms including the algorithm in this work. 3.3 Algorithm We first introduce notations and definitions used in this work. We use bold upper-case letters such as X to define matrices and bold lower-case letters such as x to define vectors. Let 1 and 0 be vectors with all ones and zeros, respectively. Their dimensions will be provided when necessary. Given two matrices X, Y ∈ R𝑛×𝑑 , we define their inner product as ⟨X, Y⟩ = tr(X⊤ Y) and the norm √︁ √︁ as ∥X∥ = ⟨X, X⟩. We further define ⟨X, Y⟩P = tr(X⊤ PY) and ∥X∥ P = ⟨X, X⟩ P for any given symmetric positive semidefinite matrix P ∈ R𝑛×𝑛 . 
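For reference, the weighted inner product and norm introduced above can be computed directly; the NumPy sketch below is a small illustrative helper, not part of the original analysis.

import numpy as np

def inner_P(X, Y, P):
    # <X, Y>_P = tr(X^T P Y) for a symmetric positive semidefinite P.
    return np.trace(X.T @ P @ Y)

def norm_P(X, P):
    # ||X||_P = sqrt(<X, X>_P); with P = I this reduces to the norm ||X||.
    return np.sqrt(inner_P(X, X, P))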
For simplicity, we mainly use matrix notation in this work. For instance, each agent 𝑖 holds an individual estimate x_𝑖 ∈ R^𝑑 of the global variable x ∈ R^𝑑. Let X^𝑘 and ∇F(X^𝑘) be the collections of {x_𝑖^𝑘}_{𝑖=1}^𝑛 and {∇𝑓_𝑖(x_𝑖^𝑘)}_{𝑖=1}^𝑛, which are defined below:

X^𝑘 = [x_1^𝑘, . . . , x_𝑛^𝑘]^⊤ ∈ R^{𝑛×𝑑},   ∇F(X^𝑘) = [∇𝑓_1(x_1^𝑘), . . . , ∇𝑓_𝑛(x_𝑛^𝑘)]^⊤ ∈ R^{𝑛×𝑑}.   (3.2)

We use ∇F(X^𝑘; 𝜉^𝑘) to denote the stochastic approximation of ∇F(X^𝑘). With these notations, the update X^{𝑘+1} = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) means that x_𝑖^{𝑘+1} = x_𝑖^𝑘 − 𝜂∇𝑓_𝑖(x_𝑖^𝑘; 𝜉_𝑖^𝑘) for all 𝑖. In this work, we need the average of all rows in X^𝑘 and ∇F(X^𝑘), so we define X̄^𝑘 = (1^⊤X^𝑘)/𝑛 and ∇F̄(X^𝑘) = (1^⊤∇F(X^𝑘))/𝑛. They are row vectors, and we will take a transpose if we need a column vector. The pseudoinverse of a matrix M is denoted as M†. The largest, 𝑖th-largest, and smallest nonzero eigenvalues of a symmetric matrix M are 𝜆_max(M), 𝜆_𝑖(M), and 𝜆_min(M).

Assumption 5 (Mixing matrix). The connected network G = {V, E} consists of a node set V = {1, 2, . . . , 𝑛} and an undirected edge set E. The primitive symmetric doubly-stochastic matrix W = [𝑤_{𝑖𝑗}] ∈ R^{𝑛×𝑛} encodes the network structure such that 𝑤_{𝑖𝑗} = 0 if nodes 𝑖 and 𝑗 are not connected and cannot exchange information.

Assumption 5 implies that −1 < 𝜆_𝑛(W) ≤ 𝜆_{𝑛−1}(W) ≤ · · · ≤ 𝜆_2(W) < 𝜆_1(W) = 1 and W1 = 1 [118, 96]. The matrix multiplication X^{𝑘+1} = WX^𝑘 describes that agent 𝑖 takes a weighted sum from its neighbors and itself, i.e., x_𝑖^{𝑘+1} = Σ_{𝑗∈N_𝑖∪{𝑖}} 𝑤_{𝑖𝑗} x_𝑗^𝑘, where N_𝑖 denotes the neighbors of agent 𝑖.

The proposed algorithm LEAD to solve problem (3.1) is shown in Alg. 3 with matrix notation for conciseness. We will refer to its line numbers in the analysis. A complete algorithm description from the agent's perspective can be found in Algorithm 4. The motivation behind Alg. 3 is to achieve two goals: (a) consensus (x_𝑖^𝑘 − (X̄^𝑘)^⊤ → 0) and (b) convergence ((X̄^𝑘)^⊤ → x∗). We first discuss how goal (a) leads to goal (b) and then explain how LEAD fulfills goal (a). In essence, LEAD runs an approximate SGD globally and reduces to exact SGD under consensus.

Algorithm 3 LEAD
Input: stepsize 𝜂, parameters (𝛼, 𝛾), X^0, H^1, D^1 = (I − W)Z for any Z
Output: X^𝐾 or (1/𝑛) Σ_{𝑖=1}^𝑛 X_𝑖^𝐾
1: H_𝑤^1 = WH^1
2: X^1 = X^0 − 𝜂∇F(X^0; 𝜉^0)
3: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
4:   Y^𝑘 = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) − 𝜂D^𝑘
5:   Ŷ^𝑘, Ŷ_𝑤^𝑘, H^{𝑘+1}, H_𝑤^{𝑘+1} = COMM(Y^𝑘, H^𝑘, H_𝑤^𝑘)
6:   D^{𝑘+1} = D^𝑘 + 𝛾/(2𝜂) (Ŷ^𝑘 − Ŷ_𝑤^𝑘)
7:   X^{𝑘+1} = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) − 𝜂D^{𝑘+1}
8: end for
9: procedure COMM(Y, H, H_𝑤):
10:   Q = Compress(Y − H)
11:   Ŷ = H + Q
12:   Ŷ_𝑤 = H_𝑤 + WQ
13:   H = (1 − 𝛼)H + 𝛼Ŷ
14:   H_𝑤 = (1 − 𝛼)H_𝑤 + 𝛼Ŷ_𝑤
15:   return: Ŷ, Ŷ_𝑤, H, H_𝑤
16: end procedure

One key property of LEAD is 1_{𝑛×1}^⊤ D^𝑘 = 0, regardless of the compression error in Ŷ^𝑘. It holds because, for the initialization, we require D^1 = (I − W)Z for some Z ∈ R^{𝑛×𝑑}, e.g., D^1 = 0_{𝑛×𝑑}, and the update of D^𝑘 ensures D^𝑘 ∈ Range(I − W) for all 𝑘 and 1_{𝑛×1}^⊤(I − W) = 0, as we will explain later. Therefore, multiplying (1/𝑛)1_{𝑛×1}^⊤ on both sides of Line 7 leads to a global average view of Alg. 3:

X̄^{𝑘+1} = X̄^𝑘 − 𝜂∇F̄(X^𝑘; 𝜉^𝑘),   (3.3)

which doesn't contain the compression error. Note that this is an approximate SGD step because, as shown in (3.2), the gradient ∇F(X^𝑘; 𝜉^𝑘) is not evaluated on a globally synchronized model X̄^𝑘. However, if the solution converges to the consensus solution, i.e., x_𝑖^𝑘 − (X̄^𝑘)^⊤ → 0, then E_{𝜉^𝑘}[∇F̄(X^𝑘; 𝜉^𝑘) − ∇𝑓(X̄^𝑘; 𝜉^𝑘)] → 0 and (3.3) gradually reduces to exact SGD.
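The following is a minimal NumPy sketch of the matrix-form updates in Alg. 3 (Lines 4–7 together with the COMM procedure). The gradient oracle and the compression operator are passed in as functions, and the initialization H^1 = 0 with D^1 = 0 (i.e., Z = 0) is one admissible choice; all names are illustrative.

import numpy as np

def lead(X0, W, grad, compress, eta, alpha, gamma, K):
    # Matrix-form sketch of Alg. 3 (LEAD).
    # X0: (n, d) initial models; W: mixing matrix; grad(X): stacked stochastic
    # gradients ∇F(X; ξ); compress: unbiased compression operator Q.
    n, d = X0.shape
    H = np.zeros((n, d))                 # H^1 (any initialization is admissible)
    Hw = W @ H                           # H_w^1 = W H^1
    D = np.zeros((n, d))                 # D^1 = (I - W) Z with Z = 0
    X = X0 - eta * grad(X0)              # X^1
    for _ in range(K - 1):
        G = grad(X)
        Y = X - eta * G - eta * D                        # Line 4
        Q = compress(Y - H)                              # COMM: compress the difference
        Y_hat, Yw_hat = H + Q, Hw + W @ Q                # recover Ŷ and mix the transmitted Q
        H = (1 - alpha) * H + alpha * Y_hat              # state updates
        Hw = (1 - alpha) * Hw + alpha * Yw_hat
        D = D + gamma / (2 * eta) * (Y_hat - Yw_hat)     # Line 6: inexact dual update
        X = X - eta * G - eta * D                        # Line 7: primal update
    return X

Only the compressed difference Q is exchanged with neighbors in each iteration, which is where the communication saving comes from.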
With the establishment of how consensus leads to convergence, the obstacle becomes how to achieve consensus under local communication and compression challenges. It requires addressing two issues, i.e., data heterogeneity and compression error. To deal with these issues, existing algorithms, such as DCD-SGD, ECD-SGD, QDGD, DeepSqueeze, Moniqua, and CHOCO-SGD, need a diminishing or constant but small stepsize depending on the total number of iterations. However, these choices unavoidably cause slower convergence and bring in the difficulty of parameter tuning. In contrast, LEAD takes a different way to solve these issues, as explained below. Data heterogeneity. It is common in distributed settings that there exists data heterogeneity among agents, especially in real-world applications where different agents collect data from different scenarios. In other words, we generally have 𝑓𝑖 (x) ≠ 𝑓 𝑗 (x) for 𝑖 ≠ 𝑗. The optimality condition of problem (3.1) gives 1⊤ ∗ ∗ ∗ ∗ 𝑛×1 ∇F(X ) = 0, where X = [x , · · · , x ] is a consensual and optimal solution. The data heterogeneity and optimality condition imply that there exist at least two agents 𝑖 and 𝑗 29 such that ∇ 𝑓𝑖 (x∗ ) ≠ 0 and ∇ 𝑓 𝑗 (x∗ ) ≠ 0. As a result, a simple D-PSGD algorithm cannot converge to the consensual and optimal solution as X∗ ≠ WX∗ − 𝜂E𝜉 ∇F(X∗ ; 𝜉) even when the stochastic gradient variance is zero. Gradient correction. Primal-dual algorithms or gradient tracking algorithms are able to convergence much faster than DGD-type algorithms by handling the data heterogeneity issue, as introduced in Section 3.2. Specifically, LEAD is motivated by the design of primal-dual algorithm NIDS [59] and the relation becomes clear if we consider the two-step reformulation of NIDS adopted in [57]: I−W 𝑘 D 𝑘+1 = D 𝑘 + (X − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘 ), (3.4) 2𝜂 X 𝑘+1 = X 𝑘 − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘+1 , (3.5) where X 𝑘 and D 𝑘 represent the primal and dual variables respectively. The dual variable D 𝑘 plays the role of gradient correction. As 𝑘 → ∞, we expect D 𝑘 → −∇F(X∗ ) and X 𝑘 will converge to X∗ via the update in (3.5) since D 𝑘+1 corrects the nonzero gradient ∇F(X 𝑘 ) asymptotically. The key design of Alg. 3 is to provide compression for the auxiliary variable defined as Y 𝑘 = X 𝑘 − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘 . Such design ensures that the dual variable D 𝑘 lies in Range(I−W), which is essential for convergence. Moreover, it achieves the implicit error compression as we will explain later. To stabilize the algorithm with inexact dual update, we introduce a parameter 𝛾 to control the stepsize in the dual update. Therefore, if we ignore the details of the compression, Alg. 3 can be concisely written as Y 𝑘 = X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 (3.6) 𝛾 D 𝑘+1 = D 𝑘 + (I − W) Ŷ 𝑘 (3.7) 2𝜂 X 𝑘+1 = X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘+1 (3.8) where Ŷ 𝑘 represents the compression of Y 𝑘 and F(X 𝑘 ; 𝜉 𝑘 ) denote the stochastic gradients. Nevertheless, how to compress the communication and how fast the convergence we can attain with compression error are unknown. In the following, we propose to carefully control the compression error by difference compression and error compensation such that the inexact 30 dual update (Line 6) and primal update (Line 7) can still guarantee the convergence as proved in Section 3.4. Compression error. 
Different from existing works, which typically compress the primal variable X 𝑘 or its difference, LEAD first construct an intermediate variable Y 𝑘 and apply compression to obtain its coarse representation Ŷ 𝑘 as shown in the procedure 𝐶𝑜𝑚𝑚Y, H, H𝑤 : • Compress the difference between Y and the state variable H as Q; • Q is encoded into the low-bit representation, which enables the efficient local communication step Ŷ𝑤 = H𝑤 + WQ. It is the only communication step in each iteration. • Each agent recovers its estimate Ŷ by Ŷ = H + Q and we have Ŷ𝑤 = WŶ. • States H and H𝑤 are updated based on Ŷ and Ŷ𝑤 , respectively. We have H𝑤 = WH. By this procedure, we expect when both Y 𝑘 and H 𝑘 converge to X∗ , the compression error vanishes asymptotically due to the assumption we make for the compression operator in Assumption 6. Remark 4. Note that difference compression is also applied in DCD-PSGD [101] and CHOCO- SGD [51], but their state update is the simple integration of the compressed difference. We find this update is usually too aggressive and cause instability as showed in our experiments. Therefore, we adopt a momentum update H = (1 − 𝛼)H + 𝛼Ŷ motivated from DIANA [77], which reduces the compression error for gradient compression in centralized optimization. Implicit error compensation. On the other hand, even if the compression error exists, LEAD essentially compensates for the error in the inexact dual update (Line 6), making the algorithm more stable and robust. To illustrate how it works, let E 𝑘 = Ŷ 𝑘 − Y 𝑘 denote the compression error and e𝑖𝑘 be its 𝑖-th row. The update of D 𝑘 gives 𝛾 𝛾 𝛾 D 𝑘+1 = D 𝑘 + ( Ŷ 𝑘 − Ŷ𝑤𝑘 ) = D 𝑘 + (I − W)Y 𝑘 + (E 𝑘 − WE 𝑘 ) 2𝜂 2𝜂 2𝜂 Í where −WE 𝑘 indicates that agent 𝑖 spreads total compression error − 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑗𝑖 e𝑖𝑘 = −e𝑖𝑘 to all agents and E 𝑘 indicates that each agent compensates this error locally by adding e𝑖𝑘 back. This error compensation also explains why the global view in (3.3) doesn’t involve compression error. 31 Remark 5. Note that in LEAD, the compression error is compensated into the model X 𝑘+1 through Line 6 and Line 7 such that the gradient computation in the next iteration is aware of the compression error. This has some subtle but important difference from the error compensation or error feedback in [92, 116, 97, 45, 104, 68, 102], where the error is stored in the memory and only compensated after gradient computation and before the compression. LEAD in agent’s perspective In Algorithm 3, we described the algorithm with matrix notations for concision. Here we further provide a complete algorithm description from the agents’ perspective. Algorithm 4 LEAD in Agent’s Perspective input: stepsize 𝜂, compression parameters (𝛼, 𝛾), initial values x𝑖0 , h𝑖1 , z𝑖 , ∀𝑖 ∈ {1, 2, . . . , 𝑛} Í𝑛 x𝐾 output: x𝑖𝐾 , ∀𝑖 ∈ {1, 2, . . . , 𝑛} or 𝑖=1𝑛 𝑖 1: for each agent 𝑖 ∈ {1, 2, . . . , 𝑛} do d𝑖1 = z𝑖 − 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 z 𝑗 Í 2: (h𝑤 )𝑖1 = 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 (h𝑤 ) 1𝑗 Í 3: 4: x𝑖1 = x𝑖0 − 𝜂∇ 𝑓𝑖 (x𝑖0 ; 𝜉𝑖0 ) 5: end for 6: for 𝑘 = 1, 2, . . . , 𝐾 − 1 (in parallel for all agents 𝑖 ∈ {1, 2, . . . 
, 𝑛}) do 7: compute ∇ 𝑓𝑖 (x𝑖𝑘 ; 𝜉𝑖𝑘 ) ▷ Gradient computation 8: y𝑖 = x𝑖 − 𝜂∇ 𝑓𝑖 (x𝑖 ; 𝜉𝑖 ) − 𝜂d𝑖 𝑘 𝑘 𝑘 𝑘 𝑘 9: q𝑖𝑘 = Compress(y𝑖𝑘 − h𝑖𝑘 ) ▷ Compression 10: ŷ𝑖𝑘 = h𝑖𝑘 + q𝑖𝑘 11: for neighbors 𝑗 ∈ N𝑖 do 12: Send q𝑖𝑘 and receive q 𝑘𝑗 ▷ Communication 13: end for Í 14: ( ŷ𝑤 ) 𝑖𝑘 = (h𝑤 ) 𝑖𝑘 + 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 q 𝑘𝑗 15: h𝑖𝑘+1 = (1 − 𝛼)h𝑖𝑘 + 𝛼ŷ𝑖𝑘 16: (h𝑤 ) 𝑖𝑘+1 = (1 − 𝛼)(h𝑤 ) 𝑖𝑘 + 𝛼( ŷ𝑤 ) 𝑖𝑘 𝛾 17: d𝑖𝑘+1 = d𝑖𝑘 + 2𝜂 ŷ𝑖𝑘 − ( ŷ𝑤 ) 𝑖𝑘 18: x𝑖𝑘+1 = x𝑖𝑘 − 𝜂∇ 𝑓𝑖 (x𝑖𝑘 ; 𝜉𝑖𝑘 ) − 𝜂d𝑖𝑘+1 ▷ Model update 19: end for Connections with exiting algorithms The non-compressed variant of LEAD in Alg. 3 recovers NIDS [59], 𝐷 2 [103] and Exact Diffusion [125] as shown in Proposition 1. In Corollary 3, we show that the convergence rate of LEAD exactly recovers the rate of NIDS when 𝐶 = 0, 𝛾 = 1 and 𝜎 = 0. 32 Proposition 1 (Connection to NIDS, 𝐷 2 and Exact Diffusion). When there is no communication compression (i.e., Ŷ 𝑘 = Y 𝑘 ) and 𝛾 = 1, Alg. 3 recovers 𝐷 2 : I+W 𝑘  X 𝑘+1 = 2X − X 𝑘−1 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) + 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 ) . (3.9) 2 Furthermore, if the stochastic estimator of the gradient ∇F(X 𝑘 ; 𝜉 𝑘 ) is replaced by the full gradient, it recovers NIDS and Exact Diffusion with specific settings. Corollary 3 (Consistency with NIDS). When 𝐶 = 0 (no communication compression), 𝛾 = 1 and 𝜎 = 0 (full gradient), LEAD has the convergence consistent with NIDS with 𝜂 ∈ (0, 2/(𝜇 + 𝐿)]:   2 1 L 𝑘+1 ≤ max 1 − 𝜇(2𝜂 − 𝜇𝜂 ), 1 − † L𝑘 . (3.10) 2𝜆 max ((I − W) ) See the proof in B.3.5. Proof of Proposition 1. Let 𝛾 = 1 and Ŷ 𝑘 = Y 𝑘 . Combing Lines 4 and 6 of Alg. 3 gives I−W 𝑘 D 𝑘+1 = D 𝑘 + (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 ). (3.11) 2𝜂 Based on Line 7, we can represent 𝜂D 𝑘 from the previous iteration as 𝜂D 𝑘 = X 𝑘−1 − X 𝑘 − 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 ). (3.12) Eliminating both D 𝑘 and D 𝑘+1 by substituting (3.11)-(3.12) into Line 7, we obtain 𝑘+1 𝑘 𝑘 𝑘  𝑘 I−W 𝑘 𝑘 𝑘 𝑘  X = X − 𝜂∇F(X ; 𝜉 ) − 𝜂D + (X − 𝜂∇F(X ; 𝜉 ) − 𝜂D ) (from (3.11)) 2 I+W 𝑘 I+W 𝑘 = (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 )) − 𝜂D 2 2 I+W 𝑘 I + W 𝑘−1 = (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 )) − (X − X 𝑘 − 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 )) (from (3.12)) 2 2 I+W = (2X 𝑘 − X 𝑘−1 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) + 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 )), (3.13) 2 I+W which is exactly 𝐷 2 . It also recovers Exact Diffusion with A = 2 and M = 𝜂I in Eq. (97) of [125]. 33 3.4 Theoretical Analysis In this section, we show the convergence rate for the proposed algorithm LEAD. Before showing the main theorem, we make some assumptions, which are commonly used for the analysis of decentralized optimization algorithms. All proofs are provided in Appendix B.3. Assumption 6 (Unbiased and 𝐶-contracted operator). The compression operator 𝑄 : R𝑑 → R𝑑 is unbiased, i.e., E𝑄(x) = x, and there exists 𝐶 ≥ 0 such that E∥x − 𝑄(x)∥ 22 ≤ 𝐶 ∥x∥ 22 for all x ∈ R𝑑 . Assumption 7 (Stochastic gradient). The stochastic gradient ∇ 𝑓𝑖 (x; 𝜉) is unbiased, i.e., E𝜉 ∇ 𝑓𝑖 (x; 𝜉) = ∇ 𝑓𝑖 (x), and the stochastic gradient variance is bounded: E𝜉 ∥∇ 𝑓𝑖 (x; 𝜉) − ∇ 𝑓𝑖 (x)∥ 22 ≤ 𝜎𝑖2 for all 𝑖 ∈ [𝑛]. Denote 𝜎 2 = 𝑛1 𝑖=1 𝜎𝑖2 . Í𝑛 Assumption 8. Each 𝑓𝑖 is 𝐿-smooth and 𝜇-strongly convex with 𝐿 ≥ 𝜇 > 0, i.e., for 𝑖 = 1, 2, . . . , 𝑛 and ∀x, y ∈ R𝑑 , we have 𝜇 𝐿 𝑓𝑖 (y) + ⟨∇ 𝑓𝑖 (y), x − y⟩ + ∥x − y∥ 2 ≤ 𝑓𝑖 (x) ≤ 𝑓𝑖 (y) + ⟨∇ 𝑓𝑖 (y), x − y⟩ + ∥x − y∥ 2 . 2 2 Theorem 3 (Constant stepsize). Let {X 𝑘 , H 𝑘 , D 𝑘 } be the sequence generated from Alg. 3 and X∗ is the optimal solution with D∗ = −∇F(X∗ ). 
Under Assumptions 5-8, for any constant stepsize 𝜂 ∈ (0, 2/(𝜇 + 𝐿)], if the compression parameters 𝛼 and 𝛾 satisfy  o n 2 2𝜇𝜂(2 − 𝜇𝜂) 𝛾 ∈ 0, min , , (3.14) (3𝐶 + 1) 𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽  n 2 − 𝛽𝛾 o 𝐶 𝛽𝛾 1 𝛼∈ , min , 𝜇𝜂(2 − 𝜇𝜂) , (3.15) 2(1 + 𝐶) 𝑎 1 4 − 𝛽𝛾 with 𝛽 B 𝜆 max (I − W). Then, in total expectation we have 1 1 EL 𝑘+1 ≤ 𝜌 EL 𝑘 + 𝜂2 𝜎 2 , (3.16) 𝑛 𝑛 where L 𝑘 B (1 − 𝑎 1 𝛼)∥X 𝑘 − X∗ ∥ 2 + (2𝜂2 /𝛾)E∥D 𝑘 − D∗ ∥ 2(I−W) † + 𝑎 1 ∥H 𝑘 − X∗ ∥ 2 , (3.17)   1 − 𝜇𝜂(2 − 𝜇𝜂) 𝛾 4(1 + 𝐶) 𝜌 B max ,1− † , 1 − 𝛼 < 1, 𝑎 1 B (3.18) 1 − 𝑎1 𝛼 2𝜆 max ((I − W) ) 𝐶 𝛽𝛾 + 2 The result holds for 𝐶 → 0. 34 Corollary 4 (Complexity bounds). Define the condition numbers of the objective function and 𝐿 𝜆max (I−W) communication graph as 𝜅 𝑓 = 𝜇 and 𝜅 𝑔 = 𝜆+min (I−W) , respectively. Under the same setting in Theorem 3, we can choose 𝜂 = 𝐿1 , 𝛾 = min{ 𝐶 𝛽𝜅 1 𝑓 , 1 (1+3𝐶) 𝛽 }, 1 and 𝛼 = O ( (1+𝐶)𝜅 𝑓 ) such that   1   1   1  𝜌 = max 1 − O ,1− O ,1− O . (1 + 𝐶)𝜅 𝑓 (1 + 𝐶)𝜅 𝑔 𝐶𝜅 𝑓 𝜅 𝑔 With full-gradient (i.e., 𝜎 = 0), we obtain the following complexity bounds: • LEAD converges to the 𝜖-accurate solution with the iteration complexity   1 O (1 + 𝐶)(𝜅 𝑓 + 𝜅 𝑔 ) + 𝐶𝜅 𝑓 𝜅 𝑔 log . 𝜖 • When 𝐶 = 0 (i.e., there is no compression), we obtain 𝜌 = max{1 − O ( 𝜅1𝑓 ), 1 − O ( 𝜅1𝑔 )}, and   the iteration complexity O (𝜅 𝑓 + 𝜅 𝑔 ) log 1𝜖 . This exactly recovers the convergence rate of NIDS [59].   𝜅 𝑓 +𝜅 𝑔 • When 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 , the asymptotical complexity is O (𝜅 𝑓 + 𝜅 𝑔 ) log 1 𝜖 , which also recovers that of NIDS [59] and indicates that the compression doesn’t harm the convergence in this case. 𝜅 𝑓 +𝜅 𝑔 11⊤ • With 𝐶 = 0 (or 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 ) and fully connected communication graph (i.e., W = 𝑛 ), we have 𝛽 = 1 and 𝜅 𝑔 = 1. Therefore, we obtain 𝜌 = 1 − O ( 𝜅1𝑓 ) and the complexity bound O (𝜅 𝑓 𝑙𝑜𝑔 1𝜖 ). This recovers the convergence rate of gradient descent [81]. Remark 6. Under the setting in Theorem 3, LEAD converges linearly to the O (𝜎 2 ) neighborhood of the optimum and converges linearly exactly to the optimum if full gradient is used, e.g., 𝜎 = 0. The linear convergence of LEAD holds when 𝜂 < 2/𝐿, but we omit the proof. Remark 7 (Arbitrary compression precision). Pick any 𝜂 ∈ (0, 2/(𝜇 + 𝐿)], based on the compression- related constant 𝐶 and the network-related constant 𝛽, we can select 𝛾 and 𝛼 in certain ranges to achieve the convergence. It suggests that LEAD supports unbiased compression with arbitrary precision, i.e., any 𝐶 > 0. 35 1 Í𝑛 Corollary 5 (Consensus error). Under the same setting in Theorem 3 , let x 𝑘 = 𝑛 𝑖=1 x𝑖 𝑘 be the averaged model and H0 = H1 , then all agents achieve consensus at the rate 𝑛 1 ∑︁ 2 2L 0 𝑘 2𝜎 2 2 E x𝑖𝑘 − x 𝑘 ≤ 𝜌 + 𝜂 . (3.19) 𝑛 𝑖=1 𝑛 1−𝜌 where 𝜌 is defined as in Corollary 4 with appropriate parameter settings. Theorem 4 (Diminishing stepsize). Let {X 𝑘 , H 𝑘 , D 𝑘 } be the sequence generated from Alg. 3 and 2𝜃 5 X∗ is the optimal solution with D∗ = −∇F(X∗ ). Under Assumptions 5-8, if 𝜂 𝑘 = 𝜃 3 𝜃 4 𝜃 5 𝑘+2 and 𝐶 𝛽𝛾 𝑘 𝛾 𝑘 = 𝜃 4 𝜂 𝑘 , by taking 𝛼𝑘 = 2(1+𝐶) , in total expectation we have 𝑛   1 ∑︁ 2 1 E x𝑖𝑘 − x∗ ≲O (3.20) 𝑛 𝑖=1 𝑘 where 𝜃 1 , 𝜃 2 , 𝜃 3 , 𝜃 4 and 𝜃 5 are constants defined in the proof. The complexity bound for arriving at the 𝜖-accurate solution is O ( 1𝜖 ). Remark 8. Compared with CHOCO-SGD, LEAD requires unbiased compression and the conver- gence under biased compression is not investigated yet. 
The analysis of CHOCO-SGD relies on the bounded gradient assumptions, i.e., ∥∇ 𝑓𝑖 (x)∥ 2 ≤ 𝐺, which is restrictive because it conflicts with the strong convexity while LEAD doesn’t need this assumption. Moreover, in the theorem of CHOCO-SGD, it requires a specific point set of 𝛾 while LEAD only requires 𝛾 to be within a rather large range. This may explain the advantages of LEAD over CHOCO-SGD in terms of robustness to parameter setting. 3.5 Numerical Experiment We consider three machine learning problems – ℓ2 -regularized linear regression, logistic regression, and deep neural network. The proposed LEAD is compared with QDGD [88], DeepSqueeze [102], CHOCO-SGD [51], and two non-compressed algorithms DGD [123] and NIDS [59]. Setup. We consider eight machines connected in a ring topology network. Each agent can only exchange information with its two 1-hop neighbors. The mixing weight is simply set as 1/3. For 36 compression, we use the unbiased 𝑏-bits quantization method with ∞-norm    2 (𝑏−1) |x|  −(𝑏−1) 𝑄 ∞ (x) := ∥x∥ ∞ 2 sign(x) · +u , (3.21) ∥x∥ ∞ where · is the Hadamard product, |x| is the elementwise absolute value of x, and u is a random vector uniformly distributed in [0, 1] 𝑑 . Only sign(x), norm ∥x∥ ∞ , and integers in the bracket need to be transmitted. Note that this quantization method is similar to the quantization used in QSGD [4] and CHOCO-SGD [51], but we use the ∞-norm scaling instead of the 2-norm. This small change brings significant improvement on compression precision as justified both theoretically and empirically in Appendix B.1. In this section, we choose 2-bit quantization and quantize the data blockwise (block size = 512). For all experiments, we tune the stepsize 𝜂 from {0.01, 0.05, 0.1, 0.5}. For QDGD, CHOCO-SGD and Deepsqueeze, 𝛾 is tuned from {0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}. Note that different notations are used in their original papers. Here we uniformly denote the stepsize as 𝜂 and the additional parameter in these algorithms as 𝛾 for simplicity. For LEAD, we simply fix 𝛼 = 0.5 and 𝛾 = 1.0 for all experiments since we find LEAD is robust to parameter settings as we validate in the parameter sensitivity analysis in the below. This indicates the minor effort needed for tuning LEAD. Detailed parameter settings for all experiments are summarized in Appendix B.2.2. (∥A𝑖 x − b𝑖 ∥ 2 + 𝜆∥x∥ 2 ). Data matrices Í𝑛 Linear regression. We consider the problem: 𝑓 (x) = 𝑖=1 A𝑖 ∈ R200×200 and the true solution x′ is randomly synthesized. The values b𝑖 are generated by adding Gaussian noise to A𝑖 x′. We let 𝜆 = 0.1 and the optimal solution of the linear regression problem be x∗ . We use full-batch gradient to exclude the impact of gradient variance. The performance is showed in Fig. 3.1. The distance to x∗ in Fig. 3.1a and the consensus error in Fig. 3.1c verify that LEAD converges exponentially to the optimal consensual solution. It significantly outperforms most baselines and matches NIDS well under the same number of iterations. Fig. 3.1b demonstrates the benefit of compression when considering the communication bits. Fig. 3.1d shows that the compression error vanishes for both LEAD and CHOCO-SGD while the compression error is pretty large for QDGD and DeepSqueeze because they directly compress the local models. 
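A minimal NumPy sketch of the unbiased 𝑏-bit ∞-norm quantizer in Eq. (3.21), applied blockwise as in our experiments, is given below; the function names and the zero-vector guard are illustrative.

import numpy as np

def quantize_inf_norm(x, bits=2, rng=None):
    # Unbiased b-bit quantization with infinity-norm scaling, following Eq. (3.21).
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return np.zeros_like(x)
    levels = 2.0 ** (bits - 1)
    u = rng.uniform(size=x.shape)                    # dithering noise in [0, 1)
    q = np.floor(levels * np.abs(x) / scale + u)     # low-bit integers to be transmitted
    return scale * np.sign(x) * q / levels

def quantize_blockwise(x, bits=2, block=512, rng=None):
    # Apply the quantizer independently to consecutive blocks of size `block`.
    rng = np.random.default_rng() if rng is None else rng
    out = [quantize_inf_norm(x[i:i + block], bits, rng) for i in range(0, len(x), block)]
    return np.concatenate(out)

Only sign(x), the scalar ∥x∥∞, and the low-bit integers q need to be transmitted, which is what keeps the per-block payload small.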
Figure 3.1: Linear regression problem, comparing DGD and NIDS (32 bits) with QDGD, DeepSqueeze, CHOCO-SGD, and LEAD (2 bits): (a) ∥X^𝑘 − X∗∥_𝐹 versus epochs; (b) ∥X^𝑘 − X∗∥_𝐹 versus bits transmitted; (c) consensus error ∥X^𝑘 − 1_{𝑛×1}X̄^𝑘∥_𝐹; (d) compression error.

Logistic regression. We further consider a logistic regression problem on the MNIST dataset. The regularization parameter is 10⁻⁴. We consider both homogeneous and heterogeneous data settings. In the homogeneous setting, the data samples are randomly shuffled before being uniformly partitioned among all agents, such that the data distribution of each agent is very similar. In the heterogeneous setting, the samples are first sorted by their labels and then partitioned among agents. Due to the space limit, we mainly present the results in the heterogeneous setting here and defer the homogeneous setting to Appendix B.2.1. The results using the full-batch gradient and the mini-batch gradient (the mini-batch size is 512 for each agent) are shown in Fig. 3.2 and Fig. 3.3, respectively, and both settings show the faster convergence and higher precision of LEAD.

Figure 3.2: Logistic regression problem in the heterogeneous case (full-batch gradient): loss 𝑓(X^𝑘) versus (a) epochs and (b) bits transmitted.

Figure 3.3: Logistic regression in the heterogeneous case (mini-batch gradient): loss 𝑓(X^𝑘) versus (a) epochs and (b) bits transmitted.

Neural network. We empirically study the performance of LEAD in optimizing deep neural networks by training AlexNet (240 MB) on the CIFAR10 dataset. The mini-batch size is 64 for each agent. Both the homogeneous and heterogeneous cases are shown in Fig. 3.4.

Figure 3.4: Stochastic optimization on a deep neural network: loss 𝑓(X^𝑘) versus bits transmitted for (a) homogeneous data and (b) heterogeneous data (∗ means divergence).

In the homogeneous
case, CHOCO-SGD, DeepSqueeze, and LEAD perform similarly and outperform the non-compressed variants in terms of communication efficiency, but CHOCO-SGD and DeepSqueeze need more effort for parameter tuning because their convergence is sensitive to the setting of 𝛾. In the heterogeneous case, LEAD achieves the fastest and most stable convergence. Note that in this setting, sufficient information exchange is more important for convergence because models from different agents are moving in significantly diverse directions. In such a case, DGD only converges with a smaller stepsize, and its communication-compressed variants, including QDGD, DeepSqueeze, and CHOCO-SGD, diverge in all parameter settings we tried.

Parameter sensitivity. In the linear regression problem, the convergence of LEAD under different parameter settings of 𝛼 and 𝛾 is tested. The results shown in Figure 3.5 indicate that LEAD performs well in most settings and is robust to the parameter setting. Therefore, in this work, we simply set 𝛼 = 0.5 and 𝛾 = 1.0 for LEAD in all experiments, which indicates the minor effort needed for parameter tuning.

Figure 3.5: Parameter analysis on the linear regression problem: ∥X^𝑘 − X∗∥²_𝐹 for 𝛼 ∈ {0.2, 0.4, 0.6, 0.8, 1} under (a) 𝛾 = 0.4, (b) 𝛾 = 0.6, (c) 𝛾 = 0.8, and (d) 𝛾 = 1.0.

In summary, our experiments verify our theoretical analysis and show that LEAD is able to handle data heterogeneity very well. Furthermore, the performance of LEAD is robust to parameter settings and needs less effort for parameter tuning, which is critical in real-world applications.

3.6 Conclusion

In this work, we investigate communication compression in message passing for decentralized optimization. Motivated by primal-dual algorithms, a novel decentralized algorithm with compression, LEAD, is proposed to achieve a faster convergence rate and to better handle heterogeneous data while enjoying the benefit of efficient communication. The nontrivial analyses on the coupled dynamics of the inexact primal and dual updates as well as the compression error establish the linear convergence of LEAD when the full gradient is used and the linear convergence to the O(𝜎²) neighborhood of the optimum when stochastic gradients are used. Extensive experiments validate the theoretical analysis and demonstrate the state-of-the-art efficiency and robustness of LEAD. LEAD is also applicable to non-convex problems, as empirically verified in the neural network experiments. In addition, we also proposed a linearly convergent decentralized algorithm with compression (ProxLEAD) for composite optimization problems [56].

CHAPTER 4
GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL

Graph neural networks (GNNs) have shown their power in graph representation learning for numerous tasks. In this chapter, we discover an interesting phenomenon: although residual connections in the message passing of GNNs help improve the performance, they immensely amplify GNNs' vulnerability against abnormal node features.
This is undesirable because in real-world applications, node features in graphs could often be abnormal such as being naturally noisy or adversarially manipulated. We analyze possible reasons to understand this phenomenon and aim to design GNNs with stronger resilience to abnormal features. Our understandings motivate us to propose and derive a simple, efficient, interpretable, and adaptive message passing scheme, leading to a novel GNN with Adaptive residual, AirGNN. Extensive experiments under various abnormal feature scenarios demonstrate the effectiveness of the proposed algorithm. The implementation is available at https://github.com/lxiaorui/AirGNN. 4.1 Introduction Recent years have witnessed the great success of graph neural networks (GNNs) in representation learning for graph structure data [74]. Essentially, GNNs generalize deep neural networks (DNNs) from regular grids, such as image, video and text, to irregular data such as social, energy, transportation, citation, and biological networks. Such data can be naturally represented as graphs with nodes and edges. The key building block for such generalization is the neural message passing framework [29]: x𝑢(𝑘+1) = UPDATE (𝑘) x𝑢(𝑘) , mN (𝑘)  (𝑢) (4.1) where x𝑢(𝑘) ∈ R𝑑 denotes the feature vector of node 𝑢 in the 𝑘-th iteration of message passing, and (𝑘) mN (𝑢) is the message aggregated from 𝑢’s neighborhood N (𝑢). The specific design of message passing scheme can be motivated from spectral domain [48, 23] or spatial domain [33, 109, 91, 29]. 42 It usually linearly smooths the features in a local neighborhood on the graph. GNNs have achieved superior performance in a large number of benchmark datasets [117] where the node features are assumed to be complete and informative. However, in real-world applications, some node features could be abnormal from various aspects. For instance, in social networks, new users might not have complete profile before they make connections with others, leading to missing user features. In transportation networks, node features can be noisy since there exist certain uncertainty and dynamics in the observation of the traffic information. What is worse, node features can be adversarially chosen by the attacker to maliciously manipulate the prediction made by GNNs. Therefore, it is greatly desired to design GNN models with stronger resilience to abnormal node features. In this work, we first perform empirical investigations on how representative GNN models behave on graphs with abnormal features. Specifically, based upon standard benchmark datasets, we simulate the abnormal features by replacing the features of randomly selected nodes with random Gaussian noise. Then the performance of node classification on abnormal features and normal features are examined separately. From our preliminary study in Section 4.2, we reveal two interesting observations: (1) Feature aggregation can boost the resilience to abnormal features, but too many aggregations could hurt the performance on both normal and abnormal features; and (2) Residual connection helps GNNs benefit from more layers for normal features, while making GNNs more fragile to abnormal features. We then provide possible explanations to understand these observed phenomena from the perspective of graph Laplacian smoothing. Our analyses imply that there might exist an intrinsic tension between feature aggregation and residual connection, which results in a performance tradeoff between normal features and abnormal features. 
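For reference, the message passing framework in Eq. (4.1) can be instantiated as in the following sketch, where mean aggregation and a convex-combination UPDATE are illustrative stand-ins for the model-specific choices; the tiny graph and all names are assumptions made only for the example.

import numpy as np

# A tiny 3-node path graph with 2-dimensional node features (illustrative data).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def message_passing_step(X, neighbors, update):
    # One round of Eq. (4.1): aggregate a message from each neighborhood N(u), then UPDATE.
    X_new = np.empty_like(X)
    for u in range(X.shape[0]):
        m_u = X[neighbors[u]].mean(axis=0)     # aggregated message (mean as an example)
        X_new[u] = update(X[u], m_u)           # model-specific UPDATE function
    return X_new

# Example UPDATE: a convex combination of a node's own feature and its message.
X_next = message_passing_step(X, neighbors, update=lambda x, m: 0.5 * x + 0.5 * m)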
Motivated by these findings and understandings, we aim to design new GNNs with stronger resilience to abnormal features while largely maintaining the performance on normal features. Our contributions can be summarized as follows: • We discover an intrinsic tension between feature aggregation and residual connection in GNNs, and the corresponding performance tradeoff between abnormal and normal features. 43 We also analyze possible reasons to explain and understand these findings. • We propose a simple, efficient, principled and adaptive message passing scheme, which leads to a novel GNN model with adaptive residual, named as AirGNN. • Extensive experiments under various abnormal feature scenarios demonstrate the superiority of the proposed algorithm. The ablation study demonstrates how the adaptive residuals mitigate the impact of abnormal features. 4.2 Preliminary Before introducing the preliminary study, we first define the notations used throughout the paper. Notations. We use bold upper-case letters such as X to denote matrices. Given a matrix X ∈ R𝑛×𝑑 , we use X𝑖 to denote its 𝑖-th row and X𝑖 𝑗 to denote its element in 𝑖-th row and 𝑗-th √︃Í column. The Frobenius norm and ℓ21 norm of a matrix X are defined as ∥X∥ 𝐹 = 2 𝑖 𝑗 X𝑖 𝑗 and Í Í √︃Í 2 ∥X∥ 21 = 𝑖 ∥X𝑖 ∥ 2 = 𝑖 𝑗 X𝑖 𝑗 , respectively. We define ∥X∥ 2 = 𝜎max (X) where 𝜎max (X) is the largest singular value of X. Let G = {V, E} be a graph with the node set V = {𝑣 1 , . . . , 𝑣 𝑛 } and the undirected edge set E = {𝑒 1 , . . . , 𝑒 𝑚 }. We use N (𝑣 𝑖 ) to denote the neighboring nodes of node 𝑣 𝑖 , including 𝑣 𝑖 itself. Suppose that each node is associated with a 𝑑-dimensional feature vector, and the features for all nodes are denoted as Xfea ∈ R𝑛×𝑑 . The graph structure G can be represented as an adjacent matrix A ∈ R𝑛×𝑛 , where A𝑖 𝑗 = 1 when there exists an edge between nodes 𝑣 𝑖 and 𝑣 𝑗 , and A𝑖 𝑗 = 0 otherwise. The graph Laplacian matrix is defined as L = D − A, where D is the diagonal degree matrix. Let 1 1 us denote the commonly used feature aggregation matrix in GNNs [48] as à = D̂− 2 ÂD̂− 2 where  = A + I is the adjacent matrix with self-loop and its degree matrix is D̂. The corresponding Laplacian matrix is defined as L̃ = I − Ã. In this work, we focus on the setting where a subset of nodes in the graph contain abnormal features, while the remaining nodes have normal features. In the remaining of this chapter, we use abnormal/normal features to denote nodes with abnormal/normal features, for simplicity. 44 4.2.1 Preliminary Study Experimental setup. To investigate how GNNs behave on abnormal and normal node features, we design semi-supervised node classification experiments on three common datasets (i.e., Cora, CiteSeer and PubMed), following the data splits in the work [48]. Moreover, we simulate the abnormal features by assigning 10% of the nodes with random features sampled from a standard Gaussian distribution. The experiments are performed on representative GNN models covering coupled and decoupled architectures, including GCN [48], GCNII [14], APPNP [49], and their variants with or without residual connections in feature aggregations, denoted as w/Res and wo/Res. All methods follow the hyperparameter settings in their original papers. We examine how these models perform when the number of layers increases. Note that for the decoupled architectures such as APPNP, we fix the 2-layer MLP and increase the number of propagation layers. 
For the coupled architectures such as GCN and GCNII, we instead increase the number of feature transformation and propagation layers simultaneously. We report the average performance over 10 random selections of the noisy node sets. The node classification accuracy (mean and standard deviation) on nodes with abnormal and normal features is illustrated in Figure 4.1 and Figure 4.2, respectively.

Figure 4.1: Node classification accuracy on abnormal nodes (Cora) versus the number of layers: (a) APPNP, (b) GCNII, (c) GCN, each with and without residual connections (w/Res and wo/Res).

Figure 4.2: Node classification accuracy on normal nodes (Cora) versus the number of layers: (a) APPNP, (b) GCNII, (c) GCN.

Observations. From Figure 4.1 and Figure 4.2, we can make the following observations: (1) Without residual connection, more layers (e.g., > 2 for GCN and GCNII, > 10 for APPNP) hurt the accuracy on nodes with normal features. However, more layers boost the accuracy on nodes with abnormal features significantly, before finally starting to decrease; (2) With residual connection, the accuracy on nodes with normal features keeps increasing with more layers¹. However, the accuracy on nodes with abnormal features only increases marginally when stacking more layers, and then starts to decrease. While we only present the experiments on Cora, we defer the results on other datasets to Appendix C.1, which provide similar observations. To conclude, we can summarize these observations into two major findings:

• Finding I: Feature aggregation can boost the resilience to abnormal features, but too many aggregations could hurt the performance on both normal and abnormal nodes;

• Finding II: Residual connection helps GNNs benefit from more layers for nodes with normal features, while making GNNs more fragile to abnormal features.

¹ GCN w/Res is an exception because its residual is not appropriate, which is consistent with the experiments in the work [48].

4.2.2 Understandings

In this subsection, we provide the understanding and explanation for the aforementioned findings from the perspective of graph Laplacian smoothing.

Understanding Finding I: Feature aggregation as Laplacian smoothing. The message passing in GCN [48], GCNII wo/ residual, and APPNP wo/ residual (as well as many popular GNN models) follows the feature aggregation

Xout = ÃXin,   (4.2)

where Xin and Xout represent the features before and after the message passing layer, respectively. It can be interpreted as one gradient descent step for the Laplacian smoothing problem [73]

arg min_{X∈R^{𝑛×𝑑}} L₁(X) := (1/2) tr(X^⊤(I − Ã)X) = (1/2) Σ_{(𝑣_𝑖,𝑣_𝑗)∈E} ∥ X_𝑖/√(𝑑_𝑖 + 1) − X_𝑗/√(𝑑_𝑗 + 1) ∥²₂,   (4.3)

where 𝑑_𝑖 is the node degree of node 𝑣_𝑖. Eq. (4.2) can be derived from Xout = Xin − (I − Ã)Xin = ÃXin, with the initialization X = Xin and stepsize 𝛾 = 1. The Laplacian smoothing problem penalizes the feature difference between neighboring nodes. To reduce this penalty, the feature aggregation in Eq.
(4.2) smooths the node features by taking the average of local neighbors, and thus can be considered as low-pass filter which gradually filters out high-frequency signals [84, 126]. Therefore, it increases the resilience to abnormal features which are likely to be high-frequency signals. In other words, the local neighboring nodes help to correct the abnormal features. Unfortunately, if applied too many times, these low-pass filters could overly smooth the features (well-known as oversmoothing [55, 85]) such that nodes are not distinguishable enough, providing an explanation to the degraded performance on both abnormal and normal features when stacking too many layers. Understanding Finding II: Residual connection maintains feature proximity To adjust the feature smoothness for better performance, APPNP [49] utilizes residual connections in message passing as follows X 𝑘+1 = (1 − 𝛼) ÃX 𝑘 + 𝛼Xin , (4.4) where X0 = Xin . It can be considered as an iterative solution for the regularized Laplacian smoothing problem [73] 𝛼 1   arg min L2 (X) B ∥X − Xin ∥ 2𝐹 + tr X⊤ (I − Ã)X , (4.5) X∈R𝑛×𝑑 2(1 − 𝛼) 2 with initialization X = Xin and stepsize 𝛾 = 1 − 𝛼 due to  𝛼  X 𝑘+1 = X 𝑘 − (1 − 𝛼) (X 𝑘 − Xin ) + (I − Ã)X 𝑘 = (1 − 𝛼) ÃX 𝑘 + 𝛼Xin . 1−𝛼 GCNII [14] adopts a similar message passing but further combines a feature transformation layer in each message passing step, which leads to a coupled architecture, as contrast to the decoupled 47 architecture of APPNP. The residual connection naturally arises when regularizing the proximity between input and output features, as showed in the first term of L2 (X). Such proximity can help avoid the trivial solution for the problem in Eq. (4.3), i.e., totally oversmoothed features only depending on node degrees, and consequently mitigates the oversmoothing issue. More intuitively, residual connections in GNNs provide direct information flows between layers that can preserve some necessary high-frequency signals for better discrimination between classes. More layers with residual provide a more accurate solution to Eq. (4.5), which explains the performance gain from deeper GNNs. Unfortunately, these residual connections also undesirably carry on abnormal features which are detrimental, leading to the inferior performance on abnormal features. 4.3 Algorithm In this section, we first motivate the proposed adaptive message passing scheme (AMP) with further discussions on our preliminary study. We then introduce more details about AMP, its interpretations, convergence guarantee and computation complexity, as well as the model architecture of AirGNN. 4.3.1 Design Motivation Our preliminary study in Section 4.2 reveals an intrinsic tension between feature aggregation and residual connection: (1) feature aggregation helps smooth out abnormal features, while it could cause inappropriate smoothing for normal features; (2) residual connection is essential for adjusting the feature smoothness, but it could be detrimental for abnormal features. Although this conflict can be partially mitigated by adjusting the residual connection such as the residual weight 𝛼 in GCNII [14] and APPNP [49], such global adjustment cannot be adaptive to a subset of the nodes, e.g., the nodes with abnormal features. This is crucial because in practice we often encounter the scenario where only a subset of nodes contain abnormal features. Therefore, how to reconcile this dilemma still desires dedicated efforts. 
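Before introducing the node-wise adaptive scheme, the following sketch spells out the residual-connected propagation of Eq. (4.4) used by APPNP; here à is assumed to be precomputed as the normalized adjacency with self-loops, and the point to note is that every node shares the same global residual weight 𝛼.

import numpy as np

def appnp_propagate(X_in, A_tilde, alpha=0.1, K=10):
    # Iterate Eq. (4.4): X^{k+1} = (1 - alpha) * A_tilde @ X^k + alpha * X_in, with X^0 = X_in.
    # A_tilde is the normalized adjacency with self-loops (assumed precomputed).
    X = X_in
    for _ in range(K):
        X = (1 - alpha) * (A_tilde @ X) + alpha * X_in   # aggregation plus a global residual
    return X

Because 𝛼 is shared by all nodes, abnormal features are carried on by the residual just as strongly as normal ones, which is exactly the limitation addressed next.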
We then naturally ask a question: Can we design a better message passing scheme with node-wise adaptive feature aggregation and residual connection?

The motivation of the proposed idea builds upon the following intuition: while it is important to maintain the proximity between input and output features as in Eq. (4.5), it could be overly aggressive to penalize their deviations by the square of the Frobenius norm, i.e., ∥X − Xin∥²_𝐹 = Σ_{𝑖=1}^𝑛 ∥X_𝑖 − (Xin)_𝑖∥²₂. The fact that this penalty does not tolerate large deviations weakens the capability to remove abnormal features through Laplacian smoothing. This motivates us to consider an alternative proximity penalty

∥X − Xin∥_21 := Σ_{𝑖=1}^𝑛 ∥X_𝑖 − (Xin)_𝑖∥₂,   (4.6)

which instead penalizes the deviations by the ℓ1 norm of the row-wise ℓ2 norms, namely the ℓ21 norm. The ℓ21 norm promotes row sparsity in X − Xin, and it also allows large deviations because the penalty on large values is less aggressive, leading to the potential removal of abnormal features. Therefore, we propose the following Laplacian smoothing problem regularized by ℓ21-norm proximity control:

arg min_{X∈R^{𝑛×𝑑}} 𝜆∥X − Xin∥_21 + (1/2) tr(X^⊤(I − Ã)X),   (4.7)

where 𝜆 ∈ [0, ∞) is a parameter to adjust the balance between proximity and Laplacian smoothing. To ease the tuning of 𝜆, we make a modification of Eq. (4.7):

arg min_{X∈R^{𝑛×𝑑}} L(X) := 𝜆∥X − Xin∥_21 + (1 − 𝜆) tr(X^⊤(I − Ã)X),   (4.8)

where 𝜆 ∈ [0, 1] controls the balance.

4.3.2 Adaptive Message Passing

Figure 4.3: Diagram of Adaptive Message Passing.

L(X) is a composite objective with non-smooth and smooth components. We optimize it by proximal gradient descent [9] and obtain the following iterations as the adaptive message passing (AMP):

Y^𝑘 = X^𝑘 − 2𝛾(1 − 𝜆)(I − Ã)X^𝑘 = (1 − 2𝛾(1 − 𝜆))X^𝑘 + 2𝛾(1 − 𝜆)ÃX^𝑘,   (4.9)

X^{𝑘+1} = arg min_X { 𝜆∥X − Xin∥_21 + (1/(2𝛾))∥X − Y^𝑘∥²_𝐹 },   (4.10)

where X^0 = Xin and 𝛾 is the stepsize to be specified later. Let Z = X − Xin, and Eq. (4.10) can be rewritten as:

Z^{𝑘+1} = arg min_Z { 𝜆∥Z∥_21 + (1/(2𝛾))∥Z − (Y^𝑘 − Xin)∥²_𝐹 } = prox_{𝛾𝜆∥·∥_21}(Y^𝑘 − Xin),   (4.11)

X^{𝑘+1} = Xin + Z^{𝑘+1}.   (4.12)

The 𝑖-th row of the proximal operator in Eq. (4.11) can be computed analytically:

(prox_{𝛾𝜆∥·∥_21}(X))_𝑖 = max(∥X_𝑖∥₂ − 𝛾𝜆, 0) X_𝑖/∥X_𝑖∥₂ = max(1 − 𝛾𝜆/∥X_𝑖∥₂, 0) · X_𝑖.   (4.13)

Note that the proximal operator returns 0 if the input vector is 0. Substituting X in Eq. (4.13) with Y^𝑘 − Xin and combining Eq. (4.11) and Eq. (4.12), Eq. (4.12) becomes

X_𝑖^{𝑘+1} = (Xin)_𝑖 + 𝛽_𝑖(Y_𝑖^𝑘 − (Xin)_𝑖) = (1 − 𝛽_𝑖)(Xin)_𝑖 + 𝛽_𝑖 Y_𝑖^𝑘,  ∀𝑖 ∈ [𝑛],   (4.14)

where 𝛽_𝑖 := max(1 − 𝛾𝜆/∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂, 0). To summarize, the proposed adaptive message passing (AMP) scheme is shown in Figure 4.4, and a diagram is shown in Figure 4.3. In detail, AMP works as follows:

• The first step takes a feature aggregation within the local neighbors with a self-loop weighted by 1 − 2𝛾(1 − 𝜆);
• The second step computes a weight 𝛽_𝑖 ∈ [0, 1] for each node 𝑣_𝑖 depending on the local deviation ∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂;
• The final step takes a linear combination of the input features Xin and the aggregated features Y^𝑘, where the node-wise residual is adaptively weighted by 1 − 𝛽_𝑖 for each node 𝑣_𝑖.

Y^𝑘 = (1 − 2𝛾(1 − 𝜆))X^𝑘 + 2𝛾(1 − 𝜆)ÃX^𝑘
𝛽_𝑖 = max(1 − 𝛾𝜆/∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂, 0)  ∀𝑖 ∈ [𝑛]
X_𝑖^{𝑘+1} = (1 − 𝛽_𝑖)(Xin)_𝑖 + 𝛽_𝑖 Y_𝑖^𝑘  ∀𝑖 ∈ [𝑛]
Figure 4.4: Adaptive Message Passing (AMP).

The convergence guarantee of AMP and the parameter setting for the stepsize 𝛾 are illustrated in Theorem 5.
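A minimal NumPy sketch of one AMP iteration (the three steps summarized in Figure 4.4) is given below; Ã is assumed precomputed, the small constant guarding the division is an implementation detail, and the stepsize 𝛾 = 1/(2(1 − 𝜆)) follows the setting discussed around Theorem 5.

import numpy as np

def amp_step(X, X_in, A_tilde, lam, gamma):
    # One AMP iteration: aggregation, node-wise adaptive scores, adaptive residual.
    Y = (1 - 2 * gamma * (1 - lam)) * X + 2 * gamma * (1 - lam) * (A_tilde @ X)      # step 1
    dev = np.linalg.norm(Y - X_in, axis=1, keepdims=True)          # local deviation per node
    beta = np.maximum(1 - gamma * lam / np.maximum(dev, 1e-12), 0.0)   # step 2: adaptive scores
    return (1 - beta) * X_in + beta * Y                            # step 3: adaptive residual

def amp(X_in, A_tilde, lam, K):
    # Run K AMP steps from X^0 = X_in with gamma = 1 / (2 * (1 - lam)),
    # under which step 1 reduces to Y^k = A_tilde @ X^k.
    gamma = 1.0 / (2.0 * (1.0 - lam))
    X = X_in
    for _ in range(K):
        X = amp_step(X, X_in, A_tilde, lam, gamma)
    return X

Nodes whose aggregated features deviate strongly from their input features get 𝛽_𝑖 close to 1 (aggregation dominates), while consistent nodes get 𝛽_𝑖 close to 0 (the residual dominates).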
The convergence guarantee of AMP and the parameter setting for the stepsize γ are given in Theorem 5. According to Theorem 5, if we set γ = 1/(4(1 − λ)) or γ = 1/(2(1 − λ)), then the first step of AMP can be simplified as Y^k = ½X^k + ½ÃX^k or Y^k = ÃX^k, respectively. The choice of stepsize only impacts the convergence speed but not the ultimate effect of AMP when it converges to the fixed-point solution. We also discuss the computation complexity per iteration of AMP in Remark 9.

Theorem 5 (Convergence of AMP). Under the stepsize setting γ < 1/((1 − λ)∥L̃∥_2), the proposed adaptive message passing scheme (AMP) in Eq. (4.9) and Eq. (4.10) converges to the optimal solution of the problem defined in Eq. (4.8). In practice, it is sufficient to choose any γ < 1/(2(1 − λ)) since ∥L̃∥_2 ≤ 2. Moreover, if the connected components of the graph G are not bipartite graphs, it is sufficient to choose γ = 1/(2(1 − λ)) since ∥L̃∥_2 < 2.

Proof. The objective that the iterations in AMP optimize is

arg min_{X∈ℝ^{n×d}} L(X) := λ∥X − Xin∥_21 + (1 − λ) tr(X⊤(I − Ã)X),    (4.15)

where the first term is denoted g(X) and the second term f(X), and both are convex functions. Moreover, g is a non-smooth function, while f is a smooth function. In particular, f is L-smooth with L = 2(1 − λ)∥L̃∥_2 = 2(1 − λ)∥I − Ã∥_2 due to

∥∇f(X_1) − ∇f(X_2)∥_F = ∥2(1 − λ)L̃(X_1 − X_2)∥_F ≤ 2(1 − λ)∥L̃∥_2 ∥X_1 − X_2∥_F.    (4.16)

AMP essentially applies a forward-backward splitting to the composite objective g(X) + f(X):

X^{k+1} = (I + γ∂g)^{-1}(X^k − γ∇f(X^k))    (4.17)
        = arg min_X ½∥X − (X^k − γ∇f(X^k))∥²_F + γ g(X),    (4.18)

which is known as the proximal gradient method. The convergence of this forward-backward splitting is ensured if the stepsize satisfies γ < 2/L according to Lemma 4.4 in [20]. Therefore, AMP provably converges to the optimal solution under the setting γ < 1/((1 − λ)∥L̃∥_2). For the symmetrically normalized Laplacian matrix, we have ∥L̃∥_2 ≤ 2 [19] and thus 1/(2(1 − λ)) ≤ 1/((1 − λ)∥L̃∥_2). Therefore, any γ < 1/(2(1 − λ)) is sufficient. Moreover, according to [19], if the connected components of the graph G are not bipartite graphs, we have ∥L̃∥_2 < 2 and thus γ = 1/(2(1 − λ)) < 1/((1 − λ)∥L̃∥_2) is sufficient.

Remark 9 (Computation complexity). AMP is as efficient as the simple feature aggregation Xout = ÃXin because the additional computation cost from the second and third steps in Figure 4.4 is in the order O(nd), where n is the number of nodes and d is the feature dimension. This is negligible compared with the computation cost O(md) of feature aggregation, where m is the number of edges, due to the fact that there are usually many more edges than nodes in real-world graphs, i.e., m ≫ n.

4.3.3 Interpretation of AMP
Interestingly, the proposed AMP has a simple and intuitive interpretation as an adaptive residual connection, which aligns well with our design motivation:
• If the feature of node v_i, i.e., (Xin)_i, is significantly inconsistent with its local neighbors, i.e., the aggregated feature Y_i^k, then the local deviation ∥Y_i^k − (Xin)_i∥_2 will be large, which leads to a β_i close to 1. Therefore, the final step will assign a small weight to the residual, i.e., (1 − β_i)(Xin)_i, and the aggregated feature Y_i^k will dominate.
• On the contrary, if (Xin)_i is already consistent with its local neighbors, ∥Y_i^k − (Xin)_i∥_2 will be small, which leads to a β_i close to 0. Thus, the residual will dominate, which is reasonable since there is less need to aggregate features in this case.
• To summarize, the local deviation ∥Y_i^k − (Xin)_i∥_2 provides a natural transition from β_i → 1 to β_i → 0, and the transition can be modulated by λ, which can be either learned or tuned as a hyperparameter through cross-validation. This transition provides a node-wise adaptive residual connection for the message passing scheme.

Adaptivity for abnormal & normal features. According to the homophily assumption on graph-structured data [76, 130, 127, 48], the feature representations of normal features should be more consistent with local neighbors than those of abnormal features. As a result, AMP will assign more residual (i.e., smaller β) to normal features but less residual (i.e., larger β) to abnormal features, providing a customized tradeoff between feature aggregation and residual connection. Consequently, it can promote both the resilience to abnormal features and the performance on normal features. The above discussion also implies a clear physical meaning for β in AMP, and we formally define it as the adaptive score.

Definition 1 (Adaptive score). The variables {β_1, · · · , β_n} in the adaptive message passing scheme (AMP) are defined as the adaptive scores for nodes {v_1, · · · , v_n}, respectively, in graph G. In particular, the larger β_i is, the more likely the feature of node v_i is abnormal.

Remark 10 (Nonlinear smoother). Different from most existing message passing schemes, which are linear smoothers, AMP is a nonlinear smoother because the weights {β_i} are computed from Y^k and Xin. This nonlinearity is the key to achieving adaptive residual connections for different nodes.

4.3.4 Model Architecture
The proposed adaptive message passing scheme (AMP) can be used as a building block in many GNN models to improve the resilience to abnormal node features. In this work, we choose the decoupled architecture, as in APPNP [49] and DAGNN [65], and propose the Adaptive residual GNN (AirGNN):

Xin = h_θ(Xfea),    (4.19)
Ypre = AMP(Xin, K, λ).    (4.20)

h_θ(·) is any machine learning model parameterized by learnable parameters θ, such as a multilayer perceptron (MLP). Xfea ∈ ℝ^{n×d} denotes the initial node features. The model h_θ(·) first transforms the initial node features as Xin = h_θ(Xfea). AMP takes h_θ(Xfea) as input and performs K steps of AMP with the hyperparameter λ. Similar to the majority of existing GNN models, the training objective is the cross-entropy classification loss on the labeled nodes, and the whole model is trained in an end-to-end way. Note that AirGNN is very efficient as explained in Remark 9, and it only requires two hyperparameters K and λ without introducing additional parameters to learn, which could reduce the risk of overfitting.

4.4 Experiment
In this section, we aim to verify the effectiveness of the proposed adaptive message passing scheme (AMP) and the AirGNN model through semi-supervised node classification tasks. Specifically, we try to answer the following questions: (1) How does AirGNN perform on abnormal and normal features? (Sections 4.4.2 and 4.4.3) and (2) How does AirGNN work by adjusting the adaptive residual? (Section 4.4.4)

4.4.1 Experimental Settings
Datasets and baselines. We conduct experiments on 8 real-world datasets including three citation graphs, i.e., Cora, Citeseer, Pubmed [93], two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics [95], two co-purchase graphs, i.e., Amazon Computers and Amazon Photo [95], and one OGB dataset, i.e., ogbn-arxiv [113].
Due to the space limit, we only present the results on Cora, Citeseer, and Pubmed in this section, but defer the results on other datasets to Appendix C.2.1. The data statistics (full graphs) used in Section 4.4.2 are summarized in Table 4.1, and the data statistics (largest connected components) used in Section 4.4.3 are summarized in Table 4.2. We use fixed data splits for the Cora, CiteSeer, PubMed and ogbn-arxiv datasets, and random data splits for the other datasets.

Table 4.1: Data statistics on benchmark datasets.
Dataset | Classes | Nodes | Edges | Features | Training Nodes | Validation Nodes | Test Nodes
Cora | 7 | 2708 | 5278 | 1433 | 20 per class | 500 | 1000
CiteSeer | 6 | 3327 | 4552 | 3703 | 20 per class | 500 | 1000
PubMed | 3 | 19717 | 44324 | 500 | 20 per class | 500 | 1000
Coauthor CS | 15 | 18333 | 81894 | 6805 | 20 per class | 30 per class | Rest nodes
Coauthor Physics | 5 | 34493 | 247962 | 8415 | 20 per class | 30 per class | Rest nodes
Amazon Computers | 10 | 13381 | 245778 | 767 | 20 per class | 30 per class | Rest nodes
Amazon Photo | 8 | 7487 | 119043 | 745 | 20 per class | 30 per class | Rest nodes
ogbn-arxiv | 40 | 169343 | 1166243 | 128 | 54% | 18% | 28%

Table 4.2: Dataset statistics for adversarially attacked datasets.
Dataset | NLCC | ELCC | Classes | Features
Cora | 2,485 | 5,069 | 7 | 1,433
CiteSeer | 2,110 | 3,668 | 6 | 3,703
PubMed | 19,717 | 44,338 | 3 | 500

The proposed AirGNN is compared with representative GNNs, including GCN [48], GAT [109], APPNP [49] and GCNII [14]. We defer the comparison with the variants of APPNP and Robust GCN [128] to Appendix C.2.3 and C.2.4, respectively.

Parameter settings. For all baselines, we follow the best hyperparameter settings in their original papers. Additionally, we tune the residual weight α for APPNP and GCNII in the range [0, 1]. For AirGNN, we use a two-layer MLP as the base model h_θ(·), following APPNP. We fix the learning rate to 0.01, dropout to 0.8, and weight decay to 0.0005. Moreover, we set γ = 1/(2(1 − λ)) as suggested by Theorem 5. We choose K = 10 and tune λ in the range [0, 1]. The Adam optimizer [47] is used in all experiments. We run all experiments 10 times and report the mean and variance.

Evaluation setting. We assess the performance of all models under two types of abnormal feature scenarios, namely noisy features and adversarial features. The abnormal features are injected into randomly selected test nodes after model training. By default, all hyperparameters are tuned according to the performance on the validation sets when the dataset is clean. If the hyperparameter λ of AirGNN is tuned according to the validation sets after injecting abnormal features, the performance is even better, as discussed in Appendix C.2.2. The performance on clean data is shown in Section 4.4.5 to demonstrate that AirGNN does not need to sacrifice accuracy for better robustness against abnormal features.

4.4.2 Performance Comparison with Noisy Features
In this subsection, we consider the abnormal features in the noisy feature scenario. Specifically, we simulate the noisy features by assigning a subset of the nodes random features sampled from a multivariate standard Gaussian distribution. Note that the selection of the noisy subset has an apparent impact on the performance since some nodes are less vulnerable to abnormal features while others are more vulnerable. To reduce such variance, we report the average performance over 10 random selections of the noisy node sets, similar to the settings in the preliminary study in Section 4.2.
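As a concrete illustration of this protocol, the sketch below injects Gaussian noise features into a random subset of test nodes. It is a hypothetical helper written for this description, not the experiment code; in particular, whether the noisy ratio is counted over all nodes or only over test nodes is an assumption made here.

```python
import numpy as np

def inject_noisy_features(X: np.ndarray, test_idx: np.ndarray,
                          noisy_ratio: float, seed: int = 0):
    """Replace the features of randomly chosen test nodes with standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    num_noisy = int(round(noisy_ratio * n))          # assumed: ratio taken over all nodes
    noisy_idx = rng.choice(test_idx, size=num_noisy, replace=False)
    X_noisy = X.copy()
    X_noisy[noisy_idx] = rng.standard_normal((num_noisy, d))
    return X_noisy, noisy_idx
```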
We report the node classification test accuracy on abnormal (noisy) features and normal features in Figure 4.5 and Figure 4.6, respectively, under varying noisy ratios. From these figures, we can observe:
• Figure 4.5 shows that AirGNN significantly outperforms all baselines on all datasets in terms of the performance on noisy nodes. This verifies that AMP is able to improve the resilience to noisy features, aligning well with the design motivation.
• Figure 4.6 shows that AirGNN promotes the performance on normal nodes when abnormal nodes exist. This is because AMP can remove some abnormal features which are detrimental to normal nodes.

[Figure 4.5: Node classification accuracy on abnormal (noisy) nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, and AirGNN under varying ratios of noisy nodes (%).]

[Figure 4.6: Node classification accuracy on normal nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, and AirGNN under varying ratios of noisy nodes (%).]

4.4.3 Performance Comparison with Adversarial Features
In this subsection, we consider the abnormal feature scenario in which the node features are maliciously attacked to manipulate the prediction of GNNs. We use Nettack [134] implemented in DeepRobust² [58], a PyTorch library for adversarial attacks and defenses, to generate the adversarial features. We randomly choose 40 test nodes as the target nodes, and assess the performance under increasing perturbation budgets {0, 5, 10, 20, 50, 80}, where the perturbation number denotes the number of feature dimensions that can be manipulated. The node classification accuracy on these attacked nodes is shown in Figure 4.7. From these figures, we can make the following observations:
• AirGNN is significantly more robust against adversarially attacked features than all baselines. MLP is the most vulnerable model, which demonstrates the usefulness of graph structure information in combating abnormal node features.
• The advantages of AirGNN over the baselines become much stronger with larger perturbation budgets. This suggests that AMP can significantly improve the resilience to abnormal features.

² https://github.com/DSE-MSU/DeepRobust

[Figure 4.7: Node classification accuracy on adversarial nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, AirGNN, and MLP under increasing perturbation numbers.]

4.4.4 Adaptive Residual for Abnormal & Normal Nodes
To further understand and verify how AMP and AirGNN work, we investigate the adaptive score β_i for each node v_i. Specifically, the average adaptive scores for abnormal nodes and normal nodes in the last layer of AMP are computed separately. In the noisy feature scenario, we fix the ratio of noisy nodes at 10%.
In the adversarial feature scenario, we choose 40 target nodes and fix the perturbation number at 80. The results in the noisy and adversarial feature scenarios are shown in Table 4.3 and Table 4.4, respectively. From these tables, we can observe:
• On the one hand, in both scenarios the average adaptive scores for abnormal nodes are significantly higher than those for normal nodes. This verifies our intuition that large adaptive scores are strongly related to abnormal features.
• On the other hand, it also implies that the residual weights (i.e., 1 − β_i) for abnormal nodes are much lower than those of normal nodes. This perfectly aligns with our motivation to remove abnormal features by reducing their residual connections.

The study on adaptive scores verifies that the adaptive residuals in AMP and AirGNN work as designed. It corroborates that AirGNN not only tremendously boosts the resilience to abnormal features but also provides interpretable information for anomaly detection, which will be useful in many security-critical scenarios since the adaptive score serves as a good indicator of abnormal nodes. Moreover, it is expected that APPNP without residual will perform well on abnormal nodes but sacrifice the performance on normal nodes. We provide a detailed comparison with APPNP w/ Res and APPNP w/o Res in Appendix C.2.3 to show the advantages of the adaptive residual of AirGNN.

Table 4.3: Average adaptive score (β) and residual weight (1 − β) in the noisy feature scenario.
Measure | Cora | CiteSeer | PubMed
Average adaptive score for abnormal nodes | 0.998 ± 0.000 | 0.988 ± 0.000 | 0.996 ± 0.000
Average adaptive score for normal nodes | 0.924 ± 0.002 | 0.807 ± 0.005 | 0.869 ± 0.006
Average residual weight for abnormal nodes | 0.002 ± 0.000 | 0.012 ± 0.000 | 0.004 ± 0.000
Average residual weight for normal nodes | 0.076 ± 0.002 | 0.193 ± 0.005 | 0.131 ± 0.006

Table 4.4: Average adaptive score (β) and residual weight (1 − β) in the adversarial feature scenario.
Measure | Cora | CiteSeer | PubMed
Average adaptive score for abnormal nodes | 0.987 ± 0.000 | 0.930 ± 0.007 | 0.959 ± 0.005
Average adaptive score for normal nodes | 0.922 ± 0.004 | 0.689 ± 0.024 | 0.826 ± 0.016
Average residual weight for abnormal nodes | 0.013 ± 0.000 | 0.070 ± 0.007 | 0.041 ± 0.005
Average residual weight for normal nodes | 0.078 ± 0.004 | 0.311 ± 0.024 | 0.174 ± 0.016

4.4.5 Performance in the Clean Setting
Table 4.5 shows the overall performance when the dataset does not contain abnormal node features. The performance of APPNP and AirGNN is comparable, which supports that AirGNN does not need to sacrifice clean performance for better robustness. AirGNN also outperforms Robust GCN in the clean data setting.

Table 4.5: Comparison between AirGNN, APPNP, and Robust GCN in the clean setting.
Model | Cora | CiteSeer | PubMed
Robust GCN | 0.817 ± 0.005 | 0.710 ± 0.005 | 0.791 ± 0.003
APPNP | 0.842 ± 0.004 | 0.719 ± 0.004 | 0.804 ± 0.003
AirGNN | 0.839 ± 0.004 | 0.726 ± 0.004 | 0.806 ± 0.003

4.5 Related Work
GNNs generalize convolutional neural networks (CNNs) to graph-structured data through the message passing framework [74, 29, 91]. The design of message passing and GNN architectures is mainly motivated in the spectral domain [48, 23] and the spatial domain [33, 109, 91, 29, 24]. Recent works have shown that the message passing in GNNs can be regarded as low-pass graph filtering [84, 126]. More generally, it has been proven that the message passing in many GNNs can be uniformly derived from graph signal denoising [73, 86, 129, 16].
Classic GNNs such as GCN [48] and GAT [109] achieve their best performance with shallow models, but their performance degrades when stacking more layers, which can be partially explained through oversmoothing analyses [55, 85]. Recent works propose to use residual connections or skip connections to mitigate the oversmoothing issues, and they demonstrate the potential benefits from more feature aggregations. Examples include but not limited to DeepGCNs [53], JKNet [121], GCNII [14], APPNP [49] and DeeperGNN [65]. These models use global residual connection that can not be adaptive for each node, which significantly differ from the proposed AirGNN. Graph-level, neighborhood-wise and pair-level smoothness are studied in the framework of graph feature gating networks [39]. Beyond oversmoothing, feature over-correlation in GNNs [38] and automated self-supervised learning for graphs [40] are studied. Recently, there are growing interests in reducing GNNs’ vulnerability to the graph structure noise, such as Robust GCN [128], GCN-SVD [27], Pro-GNN [41], IDGL [18], ElasticGNN [67], etc. Please refer to the comprehensive surveys [37, 132] for more details. However, how to design GNNs with strong resilience to abnormal node features remains to be developed. To the best of our knowledge, AirGNN is the first GNN model that is intrinsically robust to many types of abnormal node features by design. It improves the performance in various kinds of abnormal scenarios without needing to sacrifice clean accuracy in normal settings. 4.6 Conclusion In this work, we discover an intrinsic tension between feature aggregation and residual connection in the message passing scheme of GNNs, as well as the corresponding performance tradeoff between nodes with abnormal and normal features. We analyze possible reasons to explain these findings from the perspective of graph Laplacian smoothing. Our understandings further motivate us to propose a simple, efficient, interpretable and adaptive message passing scheme as well as a new GNN 60 model with adaptive residual, named AirGNN. AirGNN provides a node-wise adaptive transition between feature aggregation and residual connection, and the significant advantages of AirGNN are demonstrated through extensive experiments. In the future, it is promising to study the interaction between multiple perspectives of trustworthy AI [64] such as robustness and fairness [119]. 61 CHAPTER 5 ELASTIC GRAPH NEURAL NETWORKS In this chapter, we discuss a severe limitation of the message passing schemes in existing graph neural networks (GNNs) – they are proven to perform ℓ2 -based graph smoothing that enforces smoothness globally and such a global smoothness property might lead to the lack of robustness under adversarial graph attacks. In this work, we propose to design a more robust message passing algorithm for GNNs by enhancing the local smoothness adaptivity of GNNs via ℓ1 -based graph smoothing. To this end, we introduce a family of GNNs (Elastic GNNs) based on ℓ1 and ℓ2 -based graph smoothing. In particular, we propose a novel and general message passing scheme into GNNs. This message passing algorithm is not only friendly to back-propagation training but also achieves the desired smoothing properties with a theoretical convergence guarantee. Experiments on semi-supervised learning tasks demonstrate that the proposed Elastic GNNs obtain better adaptivity on benchmark datasets and are significantly robust to graph adversarial attacks. 
The implementation of Elastic GNNs is available at https://github.com/lxiaorui/ElasticGNN.

5.1 Introduction
Graph neural networks (GNNs) generalize traditional deep neural networks (DNNs) from regular grids, such as image, video, and text, to irregular data such as social networks, transportation networks, and biological networks, which are typically denoted as graphs [23, 48]. One popular such generalization is the neural message passing framework [29]:

x_u^{(k+1)} = UPDATE^{(k)}(x_u^{(k)}, m_{N(u)}^{(k)}),    (5.1)

where x_u^{(k)} ∈ ℝ^d denotes the feature vector of node u in the k-th iteration of message passing and m_{N(u)}^{(k)} is the message aggregated from u's neighborhood N(u). The specific architecture designs have been motivated from the spectral domain [48, 23] and the spatial domain [33, 109, 91, 29]. A recent study [73] has proven that the message passing schemes in numerous popular GNNs, such as GCN, GAT, PPNP, and APPNP, intrinsically perform ℓ2-based graph smoothing on the graph signal, and they can be considered as solving the graph signal denoising problem:

arg min_F L(F) := ∥F − Xin∥²_F + λ tr(F⊤LF),    (5.2)

where Xin ∈ ℝ^{n×d} is the input signal and L ∈ ℝ^{n×n} is the graph Laplacian matrix encoding the graph structure. The first term guides F to be close to the input signal Xin, while the second term enforces global smoothness on the filtered signal F. The resulting message passing schemes can be derived from different optimization solvers, and they typically entail the aggregation of node features from neighboring nodes, which intuitively coincides with the cluster or consistency assumption that neighboring nodes should be similar [130, 127].

While existing GNNs are prominently driven by ℓ2-based graph smoothing, ℓ2-based methods enforce smoothness globally and the level of smoothness is usually shared across the whole graph. However, the level of smoothness over different regions of the graph can be different. For instance, node features or labels can change significantly between clusters but smoothly within a cluster [131]. Therefore, it is desirable to enhance the local smoothness adaptivity of GNNs. Motivated by the idea of trend filtering [46, 106, 111], we aim to achieve this goal via ℓ1-based graph smoothing. Intuitively, compared with ℓ2-based methods, ℓ1-based methods penalize large values less and thus preserve discontinuities or non-smooth signals better. Theoretically, ℓ1-based methods tend to promote signal sparsity to trade for discontinuity [90, 105, 94]. Owing to these advantages, trend filtering [106] and graph trend filtering [111, 108] demonstrate that ℓ1-based graph smoothing can adapt to an inhomogeneous level of smoothness of signals and yield estimators that are k-th order piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator.

While ℓ1-based methods exhibit various appealing properties and have been extensively studied in different domains such as signal processing [25] and statistics and machine learning [34], they have rarely been investigated in the design of GNNs. In this work, we attempt to bridge this gap and enhance the local smoothness adaptivity of GNNs via ℓ1-based graph smoothing.

Incorporating ℓ1-based graph smoothing in the design of GNNs faces tremendous challenges. First, since the message passing schemes in GNNs can be derived from the optimization iterations of the graph signal denoising problem, a fast, efficient and scalable optimization solver is desired.
Unfortunately, solving the associated optimization problem involving the ℓ1 norm is challenging since the objective function is composed of smooth and non-smooth components and the decision variable is further coupled by the discrete graph difference operator. Second, to integrate the derived message passing scheme into GNNs, it has to be composed of simple operations that are friendly to the back-propagation training of the whole GNN. Third, it requires an appropriate normalization step to deal with diverse node degrees, which is often overlooked by existing graph total variation and graph trend filtering methods. Our attempt to address these challenges leads to a family of novel GNNs, i.e., Elastic GNNs. Our key contributions can be summarized as follows:
• We introduce ℓ1-based graph smoothing in the design of GNNs to further enhance the local smoothness adaptivity, for the first time;
• We derive a novel and general message passing scheme, i.e., Elastic Message Passing (EMP), and develop a family of GNN architectures, i.e., Elastic GNNs, by integrating the proposed message passing scheme into deep neural nets;
• Extensive experiments demonstrate that Elastic GNNs obtain better adaptivity on various real-world datasets, and they are significantly robust to graph adversarial attacks. The study on different variants of Elastic GNNs suggests that ℓ1 and ℓ2-based graph smoothing are complementary and Elastic GNNs are more versatile.

5.2 Preliminary
We use bold upper-case letters such as X to denote matrices and bold lower-case letters such as x to denote vectors. Given a matrix X ∈ ℝ^{n×d}, we use X_i to denote its i-th row and X_ij to denote its element in the i-th row and j-th column. We define the Frobenius norm, ℓ1 norm, and ℓ21 norm of a matrix X as ∥X∥_F = (Σ_ij X²_ij)^{1/2}, ∥X∥_1 = Σ_ij |X_ij|, and ∥X∥_21 = Σ_i ∥X_i∥_2 = Σ_i (Σ_j X²_ij)^{1/2}, respectively. We define ∥X∥_2 = σ_max(X), where σ_max(X) is the largest singular value of X. Given two matrices X, Y ∈ ℝ^{n×d}, we define the inner product as ⟨X, Y⟩ = tr(X⊤Y).

Let G = {V, E} be a graph with the node set V = {v_1, . . . , v_n} and the undirected edge set E = {e_1, . . . , e_m}. We use N(v_i) to denote the neighboring nodes of node v_i, including v_i itself. Suppose that each node is associated with a d-dimensional feature vector, and the features for all nodes are denoted as Xfea ∈ ℝ^{n×d}. The graph structure G can be represented as an adjacency matrix A ∈ ℝ^{n×n}, where A_ij = 1 when there exists an edge between nodes v_i and v_j. The graph Laplacian matrix is defined as L = D − A, where D is the diagonal degree matrix. Let Δ ∈ {−1, 0, 1}^{m×n} be the oriented incident matrix, which contains one row for each edge. If e_ℓ = (i, j), then the ℓ-th row of Δ is

Δ_ℓ = (0, . . . , −1, . . . , 1, . . . , 0),

with −1 in the i-th position and 1 in the j-th position, where the edge orientation can be arbitrary. Note that the incident matrix and the unnormalized Laplacian matrix satisfy the equivalence L = Δ⊤Δ.
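The oriented incident matrix and the identity L = Δ⊤Δ can be checked with a small NumPy sketch (an added illustration; the helper name and the toy path graph are arbitrary choices):

```python
import numpy as np

def incident_matrix(edges, n):
    """Oriented incident matrix: row l has -1 at node i and +1 at node j for edge e_l = (i, j)."""
    delta = np.zeros((len(edges), n))
    for ell, (i, j) in enumerate(edges):
        delta[ell, i], delta[ell, j] = -1.0, 1.0
    return delta

# Toy check on a path graph over 3 nodes: L = D - A should equal Delta^T Delta.
edges, n = [(0, 1), (1, 2)], 3
delta = incident_matrix(edges, n)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian D - A
assert np.allclose(delta.T @ delta, L)
```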
Next, we briefly introduce some necessary background on the graph signal denoising perspective of GNNs and the graph trend filtering methods.

5.2.1 GNNs as Graph Signal Denoising
It is evident from recent work [73] that many popular GNNs can be uniformly understood as graph signal denoising with Laplacian smoothing regularization. Here we briefly describe several representative examples.

GCN. The message passing scheme in Graph Convolutional Networks (GCN) [48], Xout = ÃXin, is equivalent to one gradient descent step for minimizing tr(F⊤(I − Ã)F) with the initialization F = Xin and stepsize 1/2. Here à = D̂^{-1/2}ÂD̂^{-1/2} with  = A + I being the adjacency matrix with self-loops, whose degree matrix is D̂.

PPNP & APPNP. The message passing schemes in PPNP and APPNP [49] follow the aggregation rules

Xout = α(I − (1 − α)Ã)^{-1} Xin  and  X^{(k+1)} = (1 − α)ÃX^{(k)} + αXin.

They are shown to be the exact solution and one gradient descent step with stepsize α/2, respectively, for the following problem:

min_F ∥F − Xin∥²_F + (1/α − 1) tr(F⊤(I − Ã)F).    (5.3)

For a more comprehensive illustration, please refer to [73]. We point out that all these message passing schemes adopt ℓ2-based graph smoothing, as the signal differences between neighboring nodes are penalized by the square of the ℓ2 norm, e.g., Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i+1) − F_j/√(d_j+1)∥²_2 with d_i being the node degree of node v_i. The resulting message passing schemes are usually linear smoothers which smooth the input signal by a linear transformation.

5.2.2 Graph Trend Filtering
In the univariate case, the k-th order graph trend filtering (GTF) estimator [111] is given by

arg min_{f∈ℝ^n} ½∥f − x∥²_2 + λ∥Δ^{(k+1)} f∥_1,    (5.4)

where x ∈ ℝ^n is the 1-dimensional input signal of n nodes and Δ^{(k+1)} is a k-th order graph difference operator. When k = 0, it penalizes the absolute differences across neighboring nodes in graph G:

∥Δ^{(1)} f∥_1 = Σ_{(v_i,v_j)∈E} |f_i − f_j|,

where Δ^{(1)} is equivalent to the incident matrix Δ. Generally, k-th order graph difference operators can be defined recursively:

Δ^{(k+1)} = Δ⊤Δ^{(k)} = L^{(k+1)/2} ∈ ℝ^{n×n} for odd k,  and  Δ^{(k+1)} = ΔΔ^{(k)} = ΔL^{k/2} ∈ ℝ^{m×n} for even k.

It is demonstrated that GTF can adapt to inhomogeneity in the level of smoothness of a signal and tends to provide piecewise polynomials over graphs [111]. For instance, when k = 0, the sparsity induced by the ℓ1-based penalty ∥Δ^{(1)} f∥_1 implies that many of the differences f_i − f_j are zero across edges (v_i, v_j) ∈ E in G. The piecewise property originates from the discontinuity of the signal allowed by the less aggressive ℓ1 penalty, with adaptively chosen knot nodes or knot edges. Note that the smoothers induced by GTF are not linear smoothers and cannot be simply represented by a linear transformation of the input signal.

5.3 Algorithm
In this section, we first propose a new graph signal denoising estimator. Then we develop an efficient optimization algorithm for solving the denoising problem and introduce a novel, general and efficient message passing scheme, i.e., Elastic Message Passing (EMP), for graph signal smoothing. Finally, the integration of the proposed message passing scheme and deep neural networks leads to Elastic GNNs.

5.3.1 Elastic Graph Signal Estimator
To combine the advantages of ℓ1 and ℓ2-based graph smoothing, we propose the following elastic graph signal estimator:

arg min_{F∈ℝ^{n×d}} λ_1∥ΔF∥_1 + (λ_2/2) tr(F⊤LF) + ½∥F − Xin∥²_F,    (5.5)

where the first term is denoted g_1(ΔF), the last two terms are denoted f(F), and Xin ∈ ℝ^{n×d} is the d-dimensional input signal of n nodes. The first term can be written in an edge-centric way: ∥ΔF∥_1 = Σ_{(v_i,v_j)∈E} ∥F_i − F_j∥_1, which penalizes the absolute differences across connected nodes in graph G. Similarly, the second term penalizes the differences quadratically via tr(F⊤LF) = Σ_{(v_i,v_j)∈E} ∥F_i − F_j∥²_2. The last term is the fidelity term which preserves the similarity with the input signal. The regularization coefficients λ_1 and λ_2 control the balance between ℓ1 and ℓ2-based graph smoothing.
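For illustration, the edge-centric reading of Eq. (5.5) can be evaluated directly; the sketch below is an assumption-laden example (dense Δ and L as defined in the preliminaries) rather than part of the proposed method, and simply computes the three terms of the objective.

```python
import numpy as np

def elastic_objective(delta: np.ndarray, L: np.ndarray, F: np.ndarray,
                      X_in: np.ndarray, lam1: float, lam2: float) -> float:
    """Evaluate lam1*||Delta F||_1 + (lam2/2)*tr(F^T L F) + 1/2*||F - X_in||_F^2 from Eq. (5.5)."""
    l1_term = lam1 * np.abs(delta @ F).sum()           # sum over edges of ||F_i - F_j||_1
    l2_term = 0.5 * lam2 * np.trace(F.T @ (L @ F))     # (lam2/2) * sum over edges of ||F_i - F_j||_2^2
    fidelity = 0.5 * np.linalg.norm(F - X_in) ** 2     # fidelity to the input signal
    return l1_term + l2_term + fidelity
```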
Remark 11. It is possible to consider higher-order graph differences in both the ℓ1-based and ℓ2-based smoothers. However, in this work, we focus on the 0-th order graph difference operator Δ, since we assume the piecewise constant prior for graph representation learning.

Normalization. In existing GNNs, it is beneficial to normalize the Laplacian matrix for better numerical stability, and the normalization trick is also crucial for achieving superior performance. Therefore, for the ℓ2-based graph smoothing, we follow the common normalization trick in GNNs: L̃ = I − Ã, where à = D̂^{-1/2}ÂD̂^{-1/2}, Â = A + I and D̂_ii = d_i = Σ_j Â_ij. It leads to a degree-normalized penalty

tr(F⊤L̃F) = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥²_2.

In the literature on graph total variation and graph trend filtering, the normalization step is often overlooked and the graph difference operator is used directly, as in GTF [111, 108]. To achieve better numerical stability and handle diverse node degrees in real-world graphs, we propose to normalize each column of the incident matrix by the square root of the node degrees for the ℓ1-based graph smoothing as follows¹:

Δ̃ = ΔD̂^{-1/2}.

It leads to a degree-normalized total variation penalty²

∥Δ̃F∥_1 = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥_1.

Note that this normalized incident matrix maintains the relation with the normalized Laplacian matrix as in the unnormalized case,

L̃ = Δ̃⊤Δ̃,    (5.6)

given that L̃ = D̂^{-1/2}(D̂ − Â)D̂^{-1/2} = D̂^{-1/2}LD̂^{-1/2} = D̂^{-1/2}Δ⊤ΔD̂^{-1/2}.

¹ It naturally supports real-valued edge weights if the edge weights are set in the incident matrix Δ.
² With the normalization, the piecewise constant prior is up to the degree scaling, i.e., sparsity in Δ̃F.

With the normalization, the estimator defined in (5.5) becomes:

arg min_{F∈ℝ^{n×d}} λ_1∥Δ̃F∥_1 + (λ_2/2) tr(F⊤L̃F) + ½∥F − Xin∥²_F,    (5.7)

where the first term is g_1(Δ̃F) and the last two terms are f(F).

Capture correlation among dimensions. The node features in real-world graphs are usually multi-dimensional. Although the estimator defined in (5.7) is able to handle multi-dimensional data, since the signals from different dimensions are separable under the ℓ1 and ℓ2 norms, such an estimator treats each feature dimension independently and does not exploit the potential relations between feature dimensions. However, the sparsity patterns of node differences across edges could be shared among feature dimensions. To better exploit this potential correlation, we propose to couple the multi-dimensional features by the ℓ21 norm, which penalizes the summation of the ℓ2 norms of the node differences:

∥Δ̃F∥_21 = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥_2.

This penalty promotes the row sparsity of Δ̃F and enforces similar sparsity patterns among feature dimensions. In other words, if two nodes are similar, all their feature dimensions should be similar. Therefore, we define the ℓ21-based estimator as

arg min_{F∈ℝ^{n×d}} λ_1∥Δ̃F∥_21 + (λ_2/2) tr(F⊤L̃F) + ½∥F − Xin∥²_F,    (5.8)

where the first term is g_21(Δ̃F) with g_21(·) = λ_1∥·∥_21, and the last two terms are f(F). In the following subsections, we will use g(·) to represent both g_1(·) and g_21(·), and use ℓ1 to represent both ℓ1 and ℓ21 if not specified.
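A minimal sketch of the normalization above is given below (dense matrices for clarity; the edge enumeration uses an arbitrary but fixed orientation, and the helper name is made up for this illustration).

```python
import numpy as np

def normalized_operators(A: np.ndarray):
    """Return A_tilde = D_hat^{-1/2} A_hat D_hat^{-1/2}, L_tilde = I - A_tilde,
    and the degree-normalized incident matrix Delta_tilde = Delta D_hat^{-1/2}."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                  # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    L_tilde = np.eye(n) - A_tilde
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] != 0]
    delta = np.zeros((len(edges), n))
    for ell, (i, j) in enumerate(edges):
        delta[ell, i], delta[ell, j] = -1.0, 1.0
    delta_tilde = delta * d_inv_sqrt[None, :]              # column-wise degree normalization
    return A_tilde, L_tilde, delta_tilde
```

For an undirected, unweighted graph, `np.allclose(L_tilde, delta_tilde.T @ delta_tilde)` should hold, matching Eq. (5.6).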
5.3.2 Elastic Message Passing
For the ℓ2-based graph smoothers, message passing schemes can be derived from the gradient descent iterations of the graph signal denoising problem, as in the case of GCN and APPNP [73]. However, computing the estimators defined by (5.7) and (5.8) is much more challenging because of the non-smoothness, and because the two components, i.e., f(F) and g(Δ̃F), are non-separable as they are coupled by the graph difference operator Δ̃. In the literature, researchers have developed optimization algorithms for the graph trend filtering problem (5.4), such as the Alternating Direction Method of Multipliers (ADMM) and Newton-type algorithms [111, 108]. However, these algorithms require solving a non-trivial sub-problem in each iteration, which incurs high computation complexity. Moreover, it is unclear how to make these iterations compatible with the back-propagation training of deep learning models. This motivates us to design an algorithm which is not only efficient but also friendly to back-propagation training. To this end, we propose to solve an equivalent saddle point problem using a primal-dual algorithm with efficient computations.

Saddle point reformulation. For a general convex function g(·), its conjugate function is defined as

g*(Z) := sup_X ⟨Z, X⟩ − g(X).

By using g(Δ̃F) = sup_Z ⟨Δ̃F, Z⟩ − g*(Z), the problems (5.7) and (5.8) can be equivalently written as the following saddle point problem:

min_F max_Z f(F) + ⟨Δ̃F, Z⟩ − g*(Z),    (5.9)

where Z ∈ ℝ^{m×d}. Motivated by the Proximal Alternating Predictor-Corrector (PAPC) [70, 15], we propose an efficient algorithm with low per-iteration computation complexity and a convergence guarantee:

F̄^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^k,    (5.10)
Z^{k+1} = prox_{βg*}(Z^k + βΔ̃F̄^{k+1}),    (5.11)
F^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^{k+1},    (5.12)

where prox_{βg*}(X) = arg min_Y ½∥Y − X∥²_F + βg*(Y). The stepsizes γ and β will be specified later. The first step (5.10) obtains a prediction of F^{k+1}, i.e., F̄^{k+1}, by a gradient descent step on the primal variable F^k. The second step (5.11) is a proximal dual ascent step on the dual variable Z^k based on the predicted F̄^{k+1}. Finally, another gradient descent step on the primal variable based on (F^k, Z^{k+1}) gives the next iterate F^{k+1} in (5.12). The algorithm (5.10)–(5.12) can be interpreted as a "predict-correct" algorithm for the saddle point problem (5.9). Next we demonstrate how to compute the proximal operator in Eq. (5.11).

Proximal operators. Using Moreau's decomposition principle [6], X = prox_{βg*}(X) + β prox_{(1/β)g}(X/β), we can rewrite the step (5.11) using the proximal operator of g(·), that is,

prox_{βg*}(X) = X − β prox_{(1/β)g}(X/β).    (5.13)

Figure 5.1: Elastic Message Passing (EMP), with F^0 = Xin and Z^0 = 0_{m×d}.
Y^k = γXin + (1 − γ)ÃF^k
F̄^{k+1} = Y^k − γΔ̃⊤Z^k
Z̄^{k+1} = Z^k + βΔ̃F̄^{k+1}
Z^{k+1} = min(|Z̄^{k+1}|, λ_1) · sign(Z̄^{k+1})    (Option I: ℓ1 norm)
Z_i^{k+1} = min(∥Z̄_i^{k+1}∥_2, λ_1) · Z̄_i^{k+1}/∥Z̄_i^{k+1}∥_2, ∀i ∈ [m]    (Option II: ℓ21 norm)
F^{k+1} = Y^k − γΔ̃⊤Z^{k+1}

We discuss the two options for the function g(·) corresponding to the objectives (5.7) and (5.8).
• Option I (ℓ1 norm): g_1(X) = λ_1∥X∥_1. By definition, the proximal operator of (1/β)g_1(X) is

prox_{(1/β)g_1}(X) = arg min_Y ½∥Y − X∥²_F + (λ_1/β)∥Y∥_1,

which is equivalent to the soft-thresholding operator (component-wise):

(S_{λ_1/β}(X))_ij = sign(X_ij) max(|X_ij| − λ_1/β, 0) = X_ij − sign(X_ij) min(|X_ij|, λ_1/β).

Therefore, using (5.13), we have

(prox_{βg_1*}(X))_ij = sign(X_ij) min(|X_ij|, λ_1),    (5.14)

which is a component-wise projection onto the ℓ∞ ball of radius λ_1.
• Option II (ℓ21 norm): g_21(X) = λ_1∥X∥_21. By definition, the proximal operator of (1/β)g_21(X) is

prox_{(1/β)g_21}(X) = arg min_Y ½∥Y − X∥²_F + (λ_1/β)∥Y∥_21,

with the i-th row being

(prox_{(1/β)g_21}(X))_i = X_i/∥X_i∥_2 · max(∥X_i∥_2 − λ_1/β, 0).

Similarly, using (5.13), we have the i-th row of prox_{βg_21*}(X) being

(prox_{βg_21*}(X))_i = X_i − β prox_{(1/β)g_21}(X/β)_i
= X_i − β max(∥X_i/β∥_2 − λ_1/β, 0) · (X_i/β)/∥X_i/β∥_2
= X_i − max(∥X_i∥_2 − λ_1, 0) · X_i/∥X_i∥_2
= (∥X_i∥_2 − max(∥X_i∥_2 − λ_1, 0)) · X_i/∥X_i∥_2
= min(∥X_i∥_2, λ_1) · X_i/∥X_i∥_2,    (5.15)

which is a row-wise projection onto the ℓ2 ball of radius λ_1.

Note that the proximal operator in the ℓ1 norm case treats each feature dimension independently, while in the ℓ21 norm case it couples the multi-dimensional features, which is consistent with the motivation to exploit the correlation among feature dimensions. The algorithm (5.10)–(5.12) and the proximal operators (5.14) and (5.15) enable us to derive the final message passing scheme. Note that the computation F^k − γ∇f(F^k) in steps (5.10) and (5.12) can be shared to save computation. Therefore, we decompose the step (5.10) into two steps:

Y^k = F^k − γ∇f(F^k) = ((1 − γ)I − γλ_2L̃)F^k + γXin,    (5.16)
F̄^{k+1} = Y^k − γΔ̃⊤Z^k.    (5.17)

In this work, we choose γ = 1/(1 + λ_2) and β = 1/(2γ). Therefore, with L̃ = I − Ã, Eq. (5.16) can be simplified as

Y^k = γXin + (1 − γ)ÃF^k.    (5.18)

Let Z̄^{k+1} := Z^k + βΔ̃F̄^{k+1}; then steps (5.11) and (5.12) become

Z^{k+1} = prox_{βg*}(Z̄^{k+1}),    (5.19)
F^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^{k+1} = Y^k − γΔ̃⊤Z^{k+1}.    (5.20)

Substituting the proximal operators in (5.19) with (5.14) and (5.15), we obtain the complete elastic message passing scheme (EMP) as summarized in Figure 5.1.

Interpretation of EMP. EMP can be interpreted as the standard message passing (MP) (the step producing Y in Figure 5.1) with extra operations (the following steps). The extra operations compute Δ̃⊤Z to adjust the standard MP such that sparsity in Δ̃F is promoted and some large node differences can be preserved. EMP is general and covers some existing propagation rules as special cases, as demonstrated in Remark 12.

Remark 12 (Special cases). If there is only ℓ2-based regularization, i.e., λ_1 = 0, then according to the projection operator, we have Z^k = 0_{m×d}. Therefore, with γ = 1/(1 + λ_2), the proposed message passing scheme reduces to

F^{k+1} = 1/(1 + λ_2) Xin + λ_2/(1 + λ_2) ÃF^k.

If λ_2 = 1/α − 1, it recovers the message passing in APPNP: F^{k+1} = αXin + (1 − α)ÃF^k. If λ_2 = ∞, it recovers the simple aggregation operation in many GNNs: F^{k+1} = ÃF^k.

Computation complexity. EMP is efficient and composed of simple operations. The major computation cost comes from four sparse matrix multiplications, including ÃF^k, Δ̃⊤Z^k, Δ̃F̄^{k+1} and Δ̃⊤Z^{k+1}. The computation complexity is in the order O(md), where m is the number of edges in graph G and d is the feature dimension of the input signal Xin. The other operations are simple matrix additions and projections.
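Putting the pieces together, the following is a minimal NumPy sketch of the EMP iterations in Figure 5.1 (dense matrices, the stepsizes γ = 1/(1 + λ2) and β = 1/(2γ) chosen above, and both projection options); it illustrates the scheme rather than reproducing the released implementation.

```python
import numpy as np

def emp_propagate(A_tilde, delta_tilde, X_in, lam1, lam2, K, option="l21"):
    """Run K iterations of Elastic Message Passing (Figure 5.1)."""
    gamma = 1.0 / (1.0 + lam2)
    beta = 1.0 / (2.0 * gamma)
    m, d = delta_tilde.shape[0], X_in.shape[1]
    F = X_in.copy()
    Z = np.zeros((m, d))
    for _ in range(K):
        Y = gamma * X_in + (1.0 - gamma) * (A_tilde @ F)      # standard MP step
        F_bar = Y - gamma * (delta_tilde.T @ Z)               # primal prediction
        Z_bar = Z + beta * (delta_tilde @ F_bar)              # dual ascent
        if option == "l1":
            # Option I: component-wise projection onto the l_inf ball of radius lam1.
            Z = np.sign(Z_bar) * np.minimum(np.abs(Z_bar), lam1)
        else:
            # Option II: row-wise projection onto the l2 ball of radius lam1.
            row_norm = np.linalg.norm(Z_bar, axis=1, keepdims=True)
            Z = np.minimum(row_norm, lam1) / np.maximum(row_norm, 1e-12) * Z_bar
        F = Y - gamma * (delta_tilde.T @ Z)                   # corrected primal update
    return F
```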
The convergence of EMP and the parameter settings are justified by Theorem 6.

Theorem 6 (Convergence of EMP). Under the stepsize setting γ < 2/(1 + λ_2∥L̃∥_2) and β ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2), the elastic message passing scheme (EMP) in Figure 5.1 converges to the optimal solution of the elastic graph signal estimator defined in (5.7) (Option I) or (5.8) (Option II). It is sufficient to choose any γ < 2/(1 + 2λ_2) and β ≤ 2/(3γ) since ∥L̃∥_2 = ∥Δ̃⊤Δ̃∥_2 = ∥Δ̃Δ̃⊤∥_2 ≤ 2.

Proof. We first consider the general problem

min_F f(F) + g(BF),    (5.21)

where f and g are convex functions and B is a bounded linear operator. It is proved in [70, 15] that the iterations in (5.10)–(5.12) guarantee the convergence of F^k to the optimal solution of the minimization problem (5.21) if the parameters satisfy γ < 2/L and β ≤ 1/(γλ_max(BB⊤)), where L is the Lipschitz constant of ∇f(F). These conditions are further relaxed to γ < 2/L and β ≤ 4/(3γλ_max(BB⊤)) in [60]. For the specific problems defined in (5.7) and (5.8), the two function components f and g are both convex, and the linear operator Δ̃ is bounded. The Lipschitz constant of ∇f(F) can be computed from the largest eigenvalue of the Hessian matrix of f(F):

L = λ_max(∇²f(F)) = λ_max(I + λ_2L̃) = 1 + λ_2∥L̃∥_2.

Therefore, the elastic message passing scheme derived from iterations (5.10)–(5.12) is guaranteed to converge to the optimal solution of problem (5.7) (Option I) or problem (5.8) (Option II) if the stepsizes satisfy γ < 2/(1 + λ_2∥L̃∥_2) and β ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2).

Let Δ̃ = UΣV⊤ be the singular value decomposition of Δ̃; then we derive

∥Δ̃Δ̃⊤∥_2 = ∥UΣV⊤VΣU⊤∥_2 = ∥UΣ²U⊤∥_2 = ∥VΣ²V⊤∥_2 = ∥VΣU⊤UΣV⊤∥_2 = ∥Δ̃⊤Δ̃∥_2.

The equivalence L̃ = Δ̃⊤Δ̃ in (5.6) further gives ∥L̃∥_2 = ∥Δ̃⊤Δ̃∥_2 = ∥Δ̃Δ̃⊤∥_2. Since ∥L̃∥_2 ≤ 2 [19], we have 2/(1 + 2λ_2) ≤ 2/(1 + λ_2∥L̃∥_2) and 2/(3γ) ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2). Therefore, γ < 2/(1 + 2λ_2) and β ≤ 2/(3γ) are sufficient for the convergence of EMP.

5.3.3 Elastic GNNs
Incorporating the elastic message passing scheme derived from the elastic graph signal estimators (5.7) and (5.8) into deep neural networks, we introduce a family of GNNs, namely Elastic GNNs. In this work, we follow the decoupled design proposed in APPNP [49], where we first make predictions from node features and then aggregate the predictions through the proposed EMP:

Ypre = EMP(h_θ(Xfea), K, λ_1, λ_2).    (5.22)

Xfea ∈ ℝ^{n×d} denotes the node features, h_θ(·) is any machine learning model, such as a multilayer perceptron (MLP), θ denotes the learnable parameters in the model, and K is the number of message passing steps (a small sketch of this forward pass is given after the list below). The training objective is the cross-entropy loss defined by the final prediction Ypre and the labels of the training data. Elastic GNNs also have the following nice properties:
• In addition to the backbone neural network model, Elastic GNNs only require setting three hyperparameters, i.e., the two coefficients λ_1, λ_2 and the propagation step K, and they do not introduce any learnable parameters. Therefore, they reduce the risk of overfitting.
• The hyperparameters λ_1 and λ_2 provide better smoothness adaptivity to Elastic GNNs depending on the smoothness properties of the graph data.
• The message passing scheme only entails simple and efficient operations, which makes it friendly to the efficient and end-to-end back-propagation training of the whole GNN model.
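As referenced above, here is a minimal sketch of the decoupled forward pass in Eq. (5.22), reusing the `emp_propagate` sketch given earlier; `h_theta` is assumed to be any callable feature transformation (e.g., a two-layer MLP producing class scores), and the function name is hypothetical.

```python
import numpy as np

def elastic_gnn_forward(h_theta, X_fea, A_tilde, delta_tilde, K, lam1, lam2):
    """Decoupled Elastic GNN: transform features with h_theta, then run K steps of EMP."""
    X_in = h_theta(X_fea)                 # prediction from node features
    return emp_propagate(A_tilde, delta_tilde, X_in, lam1, lam2, K, option="l21")
```

In practice every operation here is differentiable, so h_theta can be trained end-to-end with the cross-entropy loss mentioned above.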
5.4 Experiment
In this section, we conduct experiments to validate the effectiveness of the proposed Elastic GNNs. We first introduce the experimental settings. Then we assess the performance of Elastic GNNs and investigate the benefits of introducing ℓ1-based graph smoothing into GNNs with semi-supervised learning tasks under normal and adversarial settings. In the ablation study, we validate the local adaptive smoothness, sparsity pattern, and convergence of EMP.

5.4.1 Experimental Settings
Datasets. We conduct experiments on 8 real-world datasets including three citation graphs, i.e., Cora, Citeseer, Pubmed [93], two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics [95], two co-purchase graphs, i.e., Amazon Computers and Amazon Photo [95], and one blog graph, i.e., Polblogs [2]. In the Polblogs graph, node features are not available, so we set the feature matrix to be an n × n identity matrix. The data statistics for the benchmark datasets used in Section 5.4.2 are summarized in Table 5.1. The data statistics for the adversarially attacked graphs used in Section 5.4.3 are summarized in Table 5.2.

Table 5.1: Statistics of benchmark datasets.
Dataset | Classes | Nodes | Edges | Features | Training Nodes | Validation Nodes | Test Nodes
Cora | 7 | 2708 | 5278 | 1433 | 20 per class | 500 | 1000
CiteSeer | 6 | 3327 | 4552 | 3703 | 20 per class | 500 | 1000
PubMed | 3 | 19717 | 44324 | 500 | 20 per class | 500 | 1000
Coauthor CS | 15 | 18333 | 81894 | 6805 | 20 per class | 30 per class | Rest nodes
Coauthor Physics | 5 | 34493 | 247962 | 8415 | 20 per class | 30 per class | Rest nodes
Amazon Computers | 10 | 13381 | 245778 | 767 | 20 per class | 30 per class | Rest nodes
Amazon Photo | 8 | 7487 | 119043 | 745 | 20 per class | 30 per class | Rest nodes

Table 5.2: Dataset statistics for adversarially attacked graphs.
Dataset | NLCC | ELCC | Classes | Features
Cora | 2,485 | 5,069 | 7 | 1,433
CiteSeer | 2,110 | 3,668 | 6 | 3,703
Polblogs | 1,222 | 16,714 | 2 | /
PubMed | 19,717 | 44,338 | 3 | 500

Baselines. We compare the proposed Elastic GNNs with representative GNNs including GCN [48], GAT [109], ChebNet [23], GraphSAGE [33], APPNP [49] and SGC [115]. For all models, we use 2-layer neural networks with 64 hidden units.

Parameter settings. For each experiment, we report the average performance and the variance over 10 runs. For all methods, hyperparameters are tuned from the following search space: 1) learning rate: {0.05, 0.01, 0.005}; 2) weight decay: {5e-4, 5e-5, 5e-6}; 3) dropout rate: {0.5, 0.8}. For APPNP, the propagation step K is tuned from {5, 10} and the parameter α is tuned from {0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0}. For Elastic GNNs, the propagation step K is tuned from {5, 10} and the parameters λ_1 and λ_2 are tuned from {0, 3, 6, 9}. As suggested by Theorem 6, we set γ = 1/(1 + λ_2) and β = 1/(2γ) in the proposed elastic message passing scheme. The Adam optimizer [47] is used in all experiments.

5.4.2 Performance on Benchmark Datasets
On commonly used datasets including Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo, we compare the performance of the proposed Elastic GNN (ℓ21 + ℓ2) with representative GNN baselines on the semi-supervised learning task. The classification accuracies are shown in Table 5.3. From these results, we can make the following observations:
• Elastic GNN outperforms GCN, GAT, ChebNet, GraphSAGE and SGC by significant margins on all datasets. For instance, Elastic GNN improves over GCN by 3.1%, 2.0% and 1.8% on the Cora, CiteSeer and PubMed datasets. The improvement comes from the global and local smoothness adaptivity of Elastic GNN.
• Elastic GNN (ℓ21 + ℓ2) consistently achieves higher performance than APPNP on all datasets. Essentially, Elastic GNN covers APPNP as a special case when there is only ℓ2 regularization, i.e., λ_1 = 0. Beyond the ℓ2-based graph smoothing, the ℓ21-based graph smoothing further enhances the local smoothness adaptivity. This comparison verifies the benefits of introducing ℓ21-based graph smoothing in GNNs.
Table 5.3: Classification accuracy (%) on benchmark datasets with 10 random data splits.
Model | Cora | CiteSeer | PubMed | CS | Physics | Computers | Photo
ChebNet | 76.3 ± 1.5 | 67.4 ± 1.5 | 75.0 ± 2.0 | 91.8 ± 0.4 | OOM | 81.0 ± 2.0 | 90.4 ± 1.0
GCN | 79.6 ± 1.1 | 68.9 ± 1.2 | 77.6 ± 2.3 | 91.6 ± 0.6 | 93.3 ± 0.8 | 79.8 ± 1.6 | 90.3 ± 1.2
GAT | 80.1 ± 1.2 | 68.9 ± 1.8 | 77.6 ± 2.2 | 91.1 ± 0.5 | 93.3 ± 0.7 | 79.3 ± 2.4 | 89.6 ± 1.6
SGC | 80.2 ± 1.5 | 68.9 ± 1.3 | 75.5 ± 2.9 | 90.1 ± 1.3 | 93.1 ± 0.6 | 73.0 ± 2.0 | 83.5 ± 2.9
APPNP | 82.2 ± 1.3 | 70.4 ± 1.2 | 78.9 ± 2.2 | 92.5 ± 0.3 | 93.7 ± 0.7 | 80.1 ± 2.1 | 90.8 ± 1.3
GraphSAGE | 79.0 ± 1.1 | 67.5 ± 2.0 | 77.6 ± 2.0 | 91.7 ± 0.5 | 92.5 ± 0.8 | 80.7 ± 1.7 | 90.9 ± 1.0
ElasticGNN | 82.7 ± 1.0 | 70.9 ± 1.4 | 79.4 ± 1.8 | 92.5 ± 0.3 | 94.2 ± 0.5 | 80.7 ± 1.8 | 91.3 ± 1.3

Table 5.4: Classification accuracy (%) under different perturbation rates of adversarial graph attack. GCN and GAT are basic GNNs; ℓ2, ℓ1, ℓ21, ℓ1 + ℓ2 and ℓ21 + ℓ2 are variants of Elastic GNN.
Dataset | Ptb Rate | GCN | GAT | ℓ2 | ℓ1 | ℓ21 | ℓ1 + ℓ2 | ℓ21 + ℓ2
Cora | 0% | 83.5±0.4 | 84.0±0.7 | 85.8±0.4 | 85.1±0.5 | 85.3±0.4 | 85.8±0.4 | 85.8±0.4
Cora | 5% | 76.6±0.8 | 80.4±0.7 | 81.0±1.0 | 82.3±1.1 | 81.6±1.1 | 81.9±1.4 | 82.2±0.9
Cora | 10% | 70.4±1.3 | 75.6±0.6 | 76.3±1.5 | 76.2±1.4 | 77.9±0.9 | 78.2±1.6 | 78.8±1.7
Cora | 15% | 65.1±0.7 | 69.8±1.3 | 72.2±0.9 | 73.3±1.3 | 75.7±1.2 | 76.9±0.9 | 77.2±1.6
Cora | 20% | 60.0±2.7 | 59.9±0.6 | 67.7±0.7 | 63.7±0.9 | 70.3±1.1 | 67.2±5.3 | 70.5±1.3
Citeseer | 0% | 72.0±0.6 | 73.3±0.8 | 73.6±0.9 | 73.2±0.6 | 73.2±0.5 | 73.6±0.6 | 73.8±0.6
Citeseer | 5% | 70.9±0.6 | 72.9±0.8 | 72.8±0.5 | 72.8±0.5 | 72.8±0.5 | 73.3±0.6 | 72.9±0.5
Citeseer | 10% | 67.6±0.9 | 70.6±0.5 | 70.2±0.6 | 70.8±0.6 | 70.7±1.2 | 72.4±0.9 | 72.6±0.4
Citeseer | 15% | 64.5±1.1 | 69.0±1.1 | 70.2±0.6 | 68.1±1.4 | 68.2±1.1 | 71.3±1.5 | 71.9±0.7
Citeseer | 20% | 62.0±3.5 | 61.0±1.5 | 64.9±1.0 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8
Polblogs | 0% | 95.7±0.4 | 95.4±0.2 | 95.4±0.2 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3
Polblogs | 5% | 73.1±0.8 | 83.7±1.5 | 82.8±0.3 | 78.7±0.6 | 78.7±0.7 | 82.8±0.4 | 83.0±0.3
Polblogs | 10% | 70.7±1.1 | 76.3±0.9 | 73.7±0.3 | 75.2±0.4 | 75.3±0.7 | 81.5±0.2 | 81.6±0.3
Polblogs | 15% | 65.0±1.9 | 68.8±1.1 | 68.9±0.9 | 72.1±0.9 | 71.5±1.1 | 77.8±0.9 | 78.7±0.5
Polblogs | 20% | 51.3±1.2 | 51.5±1.6 | 65.5±0.7 | 68.1±0.6 | 68.7±0.7 | 77.4±0.2 | 77.5±0.2
Pubmed | 0% | 87.2±0.1 | 83.7±0.4 | 88.1±0.1 | 86.7±0.1 | 87.3±0.1 | 88.1±0.1 | 88.1±0.1
Pubmed | 5% | 83.1±0.1 | 78.0±0.4 | 87.1±0.2 | 86.2±0.1 | 87.0±0.1 | 87.1±0.2 | 87.1±0.2
Pubmed | 10% | 81.2±0.1 | 74.9±0.4 | 86.6±0.1 | 86.0±0.2 | 86.9±0.2 | 86.3±0.1 | 87.0±0.1
Pubmed | 15% | 78.7±0.1 | 71.1±0.5 | 85.7±0.2 | 85.4±0.2 | 86.4±0.2 | 85.5±0.1 | 86.4±0.2
Pubmed | 20% | 77.4±0.2 | 68.2±1.0 | 85.8±0.1 | 85.4±0.1 | 86.4±0.1 | 85.4±0.1 | 86.4±0.1

5.4.3 Robustness Under Adversarial Attack
Locally adaptive smoothness makes Elastic GNNs more robust to adversarial attacks on the graph structure. This is because the attack tends to connect nodes with different labels, which fuzzes the cluster structure in the graph. But EMP can tolerate large node differences along these wrong edges and maintain the smoothness along correct edges. To validate this, we evaluate the performance of Elastic GNNs under untargeted adversarial graph attacks, which try to degrade GNN models' overall performance by deliberately modifying the graph structure. We use MetaAttack [135] implemented in DeepRobust [58]³, a PyTorch library for adversarial attacks and defenses, to generate the adversarially attacked graphs based on four datasets including Cora, CiteSeer, Polblogs and PubMed. We randomly split 10%/10%/80% of the nodes for training, validation and test. Note that following the works [134, 135, 27, 41], we only consider the largest connected component (LCC) in the adversarial graphs. Therefore, the results in Table 5.4 are not directly comparable with the results in Table 5.3. We focus on investigating the robustness introduced by ℓ1-based graph smoothing rather than on adversarial defense, so we don't compare with defense strategies.
Existing defense strategies can be applied on top of Elastic GNNs to further improve the robustness against attacks.

Variants of Elastic GNNs. To make a deeper investigation of Elastic GNNs, we consider the following variants: (1) ℓ2 (λ_1 = 0); (2) ℓ1 (λ_2 = 0, Option I); (3) ℓ21 (λ_2 = 0, Option II); (4) ℓ1 + ℓ2 (Option I); (5) ℓ21 + ℓ2 (Option II). To save computation, we fix the learning rate as 0.01, weight decay as 0.0005, dropout rate as 0.5 and K = 10 since this setting works well for the chosen datasets and models. Only λ_1 and λ_2 are tuned. The classification accuracy under different perturbation rates ranging from 0% to 20% is summarized in Table 5.4. From the results, we can make the following observations:
• All variants of Elastic GNNs outperform GCN and GAT by significant margins under all perturbation rates. For instance, when the perturbation rate is 15%, Elastic GNN (ℓ21 + ℓ2) improves over GCN by 12.1%, 7.4%, 13.7% and 7.7% on the four datasets being considered. This is because Elastic GNN can adapt to the change of smoothness while GCN and GAT cannot adapt well when the perturbation rate increases.
• ℓ21 outperforms ℓ1 in most cases, and ℓ21 + ℓ2 outperforms ℓ1 + ℓ2 in almost all cases. This demonstrates the benefits of exploiting the correlation between feature channels by coupling multi-dimensional features via the ℓ21 norm.
• ℓ21 outperforms ℓ2 in most cases, which suggests the benefits of local smoothness adaptivity. When ℓ21 and ℓ2 are combined, Elastic GNN (ℓ21 + ℓ2) achieves significantly better performance than the ℓ2, ℓ21 or ℓ1 variant alone in almost all cases. It suggests that ℓ1 and ℓ2-based graph smoothing are complementary to each other, and combining them provides significantly better robustness against adversarial graph attacks.

5.4.4 Ablation Study
We provide an ablation study to further investigate the adaptive smoothness, sparsity pattern, and convergence of EMP in Elastic GNN, based on three datasets including Cora, CiteSeer and PubMed. In this section, we fix λ_1 = 3, λ_2 = 3 for Elastic GNN, and α = 0.1 for APPNP. We fix the learning rate as 0.01, weight decay as 0.0005 and dropout rate as 0.5 since this setting works well for both methods.

Adaptive smoothness. It is expected that ℓ1-based smoothing enhances local smoothness adaptivity by increasing the smoothness along correct edges (connecting nodes with the same labels) while lowering the smoothness along wrong edges (connecting nodes with different labels). To validate this, we compute the average adjacent node differences (based on node features in the last layer) along wrong and correct edges separately, and use the ratio between these two averages to measure the smoothness adaptivity. The results are summarized in Table 5.5. It is clearly observed that for all datasets, the ratio for ElasticGNN is significantly higher than for an ℓ2-based method such as APPNP, which validates its better local smoothness adaptivity.

Sparsity pattern. To validate the piecewise constant property enforced by EMP, we also investigate the sparsity pattern in the adjacent node differences, i.e., Δ̃F, based on node features in the last layer. The node difference along edge e_i is defined as sparse if ∥(Δ̃F)_i∥_2 < 0.1. The sparsity ratios for the ℓ2-based method (APPNP) and the ℓ1-based method (Elastic GNN) are summarized in Table 5.6. It can be observed that in Elastic GNN, a significant portion of Δ̃F is sparse for all datasets, while in APPNP this portion is much smaller.

³ https://github.com/DSE-MSU/DeepRobust
This sparsity pattern validates the 80 piecewise constant prior as designed. Table 5.5: Ratio between average node differences along wrong and correct edges. Model Cora CiteSeer PubMed ℓ2 (APPNP) 1.57 1.35 1.43 ℓ21 +ℓ2 (ElasticGNN) 2.03 1.94 1.79 Table 5.6: Sparsity ratio (i.e., ∥( Δ̃F)𝑖 ∥ 2 < 0.1) in node differences Δ̃F. Model Cora CiteSeer PubMed ℓ2 (APPNP) 2% 16% 11% ℓ21 +ℓ2 (ElasticGNN) 37% 74% 42% Convergence of EMP. We provide two additional experiments to demonstrate the impact of propagation step 𝐾 on classification performance and the convergence of message passing scheme. Figure 5.2 shows that the increase of classification accuracy when the propagation step 𝐾 increases. It verifies the effectiveness of EMP in improving graph representation learning. It also shows that a small number of propagation step can achieve very good performance, and therefore the computation cost for EMP can be small. Figure 5.3 shows the decreasing of the objective value defined in Eq. (5.8) during the forward message passing process, and it verifies the convergence of the proposed EMP as suggested by Theorem 6. 5.5 Related Work The design of GNN architectures can be majorly motivated in spectral domain [48, 23] and spatial domain [33, 109, 91, 29]. The message passing scheme [29, 74] for feature aggregation is one central component of GNNs. Recent works have proven that the message passing in GNNs can be regarded as low-pass graph filters [84, 126]. Generally, it is recently proved that message passing in many GNNs can be unified in the graph signal denosing framework [73, 86, 129, 16]. We point out that they intrinsically perform ℓ2 -based graph smoothing and typically can be represented as linear smoothers. 81 84 82 Test Accuracy (%) 80 Cora (ElasticGNN) Cora (APPNP) 78 CiteSeer (ElasticGNN) CiteSeer (APPNP) 76 PubMed (ElasticGNN) PubMed (APPNP) 74 72 70 2 4 6 8 10 12 Step K Figure 5.2: Classification accuracy under different propagation steps. 1500 Cora Value 1000 500 1000 CiteSeer Value 500 20000 PubMed Value 10000 0 2 4 6 8 10 12 14 Step K Figure 5.3: Convergence of the objective value for the problem in Eq. (5.8) during message passing. ℓ1 -based graph signal denoising has been explored in graph trend filtering [111, 108] which tends to provide estimators with 𝑘-th order piecewise polynomials over graphs. Graph total variation has also been utilized in semi-supervised learning [83, 44, 43, 5], spectral clustering [12, 11] and graph cut problems [100, 10]. However, it is unclear whether these algorithms can be used to design GNNs. To the best of our knowledge, we make first such investigation in this work. 82 5.6 Conclusion In this work, we propose to enhance the smoothness adaptivity of GNNs via ℓ1 and ℓ2 -based graph smoothing. Through the proposed elastic graph signal estimator, we derive a novel, efficient and general message passing scheme, i.e., elastic message passing (EMP). Integrating the proposed message passing scheme and deep neural networks leads to a family of GNNs, i.e., Elastic GNNs. Extensitve experiments on benchmark datasets and adversarially attacked graphs demonstrate the benefits (e.g., intrinsic robustness) of introducing ℓ1 -based graph smoothing in the design of GNNs. The empirical study suggests that ℓ1 and ℓ2 -based graph smoothing is complementary to each other, and the proposed Elastic GNNs has better smoothnesss adaptivity owning to the integration of ℓ1 and ℓ2 -based graph smoothing. 
We hope the proposed elastic message passing scheme can inspire more powerful GNN architecture designs and that more general smoothness assumptions, such as low homophily [72], can be explored in the future. In addition, we also demonstrate the significant advantages of elastic message passing (EMP) in capturing reliable user-item interactions in noisy recommendation systems through the proposed framework of Graph Trend Filtering Networks [28].

CHAPTER 6

CONCLUSION

In this chapter, we summarize the research results in this dissertation and their broader impact, and discuss promising research directions.

6.1 Summary

In this dissertation, we proposed four solutions to address the efficiency and security challenges in machine learning: (1) a centralized distributed optimization algorithm with bidirectional communication compression; (2) a decentralized distributed optimization algorithm with communication compression; (3) graph neural networks with adaptive message passing that are robust to adversarial features; and (4) graph neural networks with elastic message passing that are robust to adversarial graph structures.

To fundamentally improve the efficiency of distributed ML systems, I proposed a series of innovative algorithms to break through the communication bottleneck. In particular, when the communication network is a star network, I proposed DORE [68], a double residual compression algorithm, to compress the bidirectional communication between client devices and the server such that over 95% of the communication bits can be reduced. This is the first algorithm that reduces this much communication cost while maintaining the same superior convergence complexities (e.g., linear convergence) as the uncompressed counterpart, both theoretically and numerically. When the communication network has any general topology (as long as it is connected), I proposed LEAD [69], the first linearly convergent decentralized optimization algorithm with communication compression, which only requires point-to-point compressed communication between neighboring devices over the communication network. Theoretically, we prove that under certain compression ratios, the convergence complexity of the proposed algorithm does not depend on the compression operator. In other words, it achieves better communication efficiency for free. These algorithms significantly improve the efficiency and scalability of large-scale ML systems with solid theoretical guarantees and remarkable empirical performance. They have great potential to accelerate scientific discovery through machine learning and data science.

To design intrinsically secure ML models against feature attacks, I investigate how to denoise the hidden features in neural network layers corrupted by adversarial perturbations using the graph structural information. This is achieved by the proposed AirGNN [66], in which the adaptive message passing denoises perturbed features through feature aggregations and maintains feature separability through adaptive residuals. The proposed algorithm has a clear design principle and interpretation as well as strong performance in both the clean and adversarial data settings. This points out a promising direction for achieving adversarial robustness through feature denoising in hidden layers.

To design intrinsically secure ML models against graph structure attacks, I investigate new prior knowledge of smoothness in the design of graph neural networks.
In particular, we derive an elastic message passing scheme to model the piecewise constant signal in graph data. We demonstrate its stronger resilience to adversarial structure attacks and its superior performance when the data is clean through a comprehensive empirical study on the proposed model ElasticGNN [67]. These secure ML models immensely boost the security of ML models under potential adversarial threats. They may not only be applied in safety-critical applications but also inspire further research in this emerging direction.

6.2 Future Direction

Large-scale ML and secure ML are active areas of exploration. Below we discuss some promising research directions:

• Distributed machine learning under heterogeneous environments: Our study in centralized learning and decentralized learning suggests that the convergence properties of distributed optimization algorithms might be sensitive to heterogeneous environments: (1) data heterogeneity; (2) network heterogeneity; (3) computation heterogeneity. Due to data heterogeneity, the data distributions from multiple computation devices might differ significantly from each other, which causes conflicting update directions if synchronization is not timely. Due to network heterogeneity, the network bandwidths and conditions are often uneven in large distributed systems. Therefore, it is critical to take the potential communication degradation into consideration. Due to computation heterogeneity, different computation devices might have diverse computation power. To fully utilize the computation power, it is vital to design algorithms that support flexible computation tasks and avoid idle time.

• Secure machine learning for more data types: Our study in designing secure graph neural networks suggests a promising new direction for designing intrinsically robust ML models through feature denoising in the hidden layers of deep neural networks. Therefore, it is promising to generalize these ideas to general data types such as images, videos, and text, where the graph structure information is not explicitly available but can be constructed from the data. Moreover, it is also promising to consider more advanced and flexible smoothing assumptions beyond homophily graphs in the design of GNN models.

APPENDICES

APPENDIX A

A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING

A.1 Additional Experiments

A.1.1 Communication Efficiency

To make an explicit comparison of communication efficiency, we report the training loss convergence with respect to communication bits in Figures A.1, A.2, and A.3 for the experiments on synthetic data, the MNIST dataset, and the CIFAR10 dataset, respectively. These results are independent of the system architectures and network bandwidth. They suggest that the proposed DORE reduces the communication cost significantly while maintaining good convergence speed. Furthermore, we also test the running time of ResNet18 trained on the CIFAR10 dataset under two different network bandwidth configurations, i.e., 1 Gbps and 200 Mbps, as shown in Figures A.4 and A.5. Due to its superior communication efficiency, the proposed DORE runs faster in both configurations. Moreover, when the network bandwidth reduces from 1 Gbps to 200 Mbps, the running time of DORE only increases slightly, which indicates that DORE is more robust to network bandwidth changes and can work more efficiently under limited bandwidth. These results clearly suggest the advantages of the proposed algorithm. A rough sketch of how communication bits can be counted per iteration is given below.
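As a concrete reference point for the communication-cost axes in Figures A.1–A.5, the following is a minimal sketch of one plausible per-worker bit accounting for b-bit blockwise quantization versus uncompressed 32-bit floats. The block size, the one full-precision norm per block, and the ResNet18 parameter count are illustrative assumptions; the exact accounting used to produce the figures may differ.

```python
def bits_per_round(num_params, b=2, block_size=256, float_bits=32, compressed=True):
    # Rough per-worker, per-iteration communication estimate (an assumption for
    # illustration): each quantized entry costs b bits plus one full-precision
    # scaling value per compression block, versus float_bits per entry otherwise.
    if not compressed:
        return num_params * float_bits
    num_blocks = -(-num_params // block_size)  # ceiling division
    return num_params * b + num_blocks * float_bits

# Example: ResNet18 (~11.7M parameters), 2-bit quantization vs. 32-bit floats.
n = 11_700_000
print(bits_per_round(n, compressed=False))   # roughly 3.7e8 bits per round
print(bits_per_round(n, b=2))                # roughly 2.5e7 bits, a ~93% reduction
```

Under these assumed settings, a single direction of communication already shrinks by more than an order of magnitude, which is consistent with the large gaps along the bit axes in the figures.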
All the experiments in this section are under exactly the same setting as described in Section 2.5. The running time is tested on a High Performance Computing Cluster with NVIDIA Tesla K80 GPUs; the computing nodes are connected by Gigabit Ethernet interfaces, and we use mpi4py as the communication backend. All algorithms in this work are implemented with PyTorch.

Figure A.1: Linear regression on synthetic data.

Figure A.2: LeNet trained on MNIST dataset.

Figure A.3: ResNet18 trained on CIFAR10 dataset.

Figure A.4: ResNet18 trained on CIFAR10 dataset with 1Gbps network bandwidth.

Figure A.5: ResNet18 trained on CIFAR10 dataset with 200Mbps network bandwidth.

A.1.2 Parameter sensitivity

Continuing the MNIST experiment in Section 2.5, we further conduct a parameter analysis on DORE. The basic settings for the block size, learning rate, 𝛼, 𝛽, and 𝜂 are 256, 0.1, 0.1, 1, and 1, respectively. We change each parameter individually. Figures A.6, A.7, A.8, and A.9 demonstrate that DORE performs consistently well under different parameter settings.

Figure A.6: Training under different compression block sizes. (a) Training loss; (b) Test loss.

Figure A.7: Training under different 𝛼. (a) Training loss; (b) Test loss.

Figure A.8: Training under different 𝛽. (a) Training loss; (b) Test loss.

Figure A.9: Training under different 𝜂. (a) Training loss; (b) Test loss.

A.2 Proofs of the theorems

A.2.1 Proof of Theorem 1

We first provide two lemmas. We define E_Q, E_k, and E to be the expectation taken over the quantization, the 𝑘th iteration based on x̂^k, and the overall randomness, respectively.

Lemma 7. For every 𝑖, we can estimate the first two moments of h_i^{k+1} as

E_Q h_i^{k+1} = (1 − 𝛼) h_i^k + 𝛼 g_i^k,   (A.1)

E_Q ∥h_i^{k+1} − s_i∥² ≤ (1 − 𝛼)∥h_i^k − s_i∥² + 𝛼∥g_i^k − s_i∥² + 𝛼[(C_q + 1)𝛼 − 1]∥Δ_i^k∥².   (A.2)

Proof. The first equality follows from lines 5-7 of Algorithm 1 and Assumption 1. For the second inequality, we have the following variance decomposition

E∥X∥² = ∥E X∥² + E∥X − E X∥²   (A.3)

for any random vector X. By taking X = h_i^{k+1} − s_i, we get

E_Q ∥h_i^{k+1} − s_i∥² = ∥(1 − 𝛼)(h_i^k − s_i) + 𝛼(g_i^k − s_i)∥² + 𝛼² E_Q ∥Δ̂_i^k − Δ_i^k∥².   (A.4)

Using the basic equality

∥𝜆a + (1 − 𝜆)b∥² + 𝜆(1 − 𝜆)∥a − b∥² = 𝜆∥a∥² + (1 − 𝜆)∥b∥²   (A.5)

for all a, b ∈ R^d and 𝜆 ∈ [0, 1], as well as Assumption 1, we have

E_Q ∥h_i^{k+1} − s_i∥² ≤ (1 − 𝛼)∥h_i^k − s_i∥² + 𝛼∥g_i^k − s_i∥² − 𝛼(1 − 𝛼)∥Δ_i^k∥² + 𝛼² C_q ∥Δ_i^k∥²,   (A.6)

which is the inequality (A.2).
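To illustrate the first moment in (A.1), below is a minimal simulation of the gradient-state update implied by the lemma, namely h_i^{k+1} = h_i^k + 𝛼 Q(g_i^k − h_i^k) with an unbiased compressor Q (as in Assumption 1). The rescaled random-k sparsifier used here is only one example of such a compressor, chosen for illustration; the seed, dimensions, and sample count are arbitrary assumptions.

```python
import torch

def random_k(x, k):
    # Unbiased random-k sparsification: keep k random coordinates and rescale by
    # d/k so that E[Q(x)] = x, i.e., an unbiased compressor with bounded variance.
    d = x.numel()
    mask = torch.zeros(d)
    mask[torch.randperm(d)[:k]] = 1.0
    return (d / k) * x * mask

torch.manual_seed(0)
d, alpha = 100, 0.1
g = torch.randn(d)   # a fixed stochastic gradient g_i^k for this check
h = torch.randn(d)   # the current gradient state h_i^k

# Average the state update h^{k+1} = h^k + alpha * Q(g^k - h^k) over many draws of Q.
avg_h_next = torch.stack(
    [h + alpha * random_k(g - h, k=10) for _ in range(20000)]
).mean(0)

expected = (1 - alpha) * h + alpha * g   # the first moment predicted by (A.1)
print((avg_h_next - expected).norm() / expected.norm())  # small; shrinks with more draws
```

Averaging over the compressor's randomness recovers the convex combination (1 − 𝛼)h_i^k + 𝛼g_i^k, which is exactly the contraction of the state toward the gradient that drives the recursion in the rest of the proof.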
Next, from the variance decomposition (A.3), we also derive Lemma 8. Lemma 8. The following inequality holds 𝑛 ∗ 2 ∗ 2 𝐶𝑞 ∑︁ 𝜎2 𝑘 𝑘 E[∥ ĝ − h ∥ ] ≤ E∥∇ 𝑓 ( x̂ ) − h ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 + , (A.7) 𝑛 𝑖=1 𝑛 1 ∗ 1 where h∗ = ∇ 𝑓 (x∗ ) = and 𝜎 2 = 𝜎𝑖2 . Í𝑛 Í𝑛 𝑛 𝑖=1 h𝑖 𝑛 𝑖=1 Proof. By taking the expectation over the quantization of g, we have E∥ ĝ 𝑘 − h∗ ∥ 2 = E∥g 𝑘 − h∗ ∥ 2 + E∥ ĝ 𝑘 − g 𝑘 ∥ 2 𝑛 ∗ 2 𝐶𝑞 ∑︁ 𝑘 ≤ E∥g − h ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 , (A.8) 𝑛 𝑖=1 where the inequality is from Assumption 1. 92 For ∥g 𝑘 − h∗ ∥, we take the expectation over the sampling of gradients and derive E∥g 𝑘 − h∗ ∥ 2 = E∥∇ 𝑓 ( x̂ 𝑘 ) − h∗ ∥ 2 + E∥g 𝑘 − ∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝜎2 ≤ E∥∇ 𝑓 ( x̂ 𝑘 ) − h∗ ∥ 2 + (A.9) 𝑛 by Assumption 2. Combining (A.8) with (A.9) gives (A.7). Proof of Theorem 1. We consider x 𝑘+1 − x∗ first. Since x∗ is the solution of (2.1), it satisfies x∗ = prox𝛾𝑅 (x∗ − 𝛾h∗ ). (A.10) Hence E∥x 𝑘+1 − x∗ ∥ 2 =E∥prox𝛾𝑅 ( x̂ 𝑘 − 𝛾 ĝ 𝑘 ) − prox𝛾𝑅 (x∗ − 𝛾h∗ )∥ 2 ≤E∥ x̂ 𝑘 − x∗ − 𝛾( ĝ 𝑘 − h∗ )∥ 2 =E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ĝ 𝑘 − h∗ ⟩ + 𝛾 2 E∥ ĝ 𝑘 − h∗ ∥ 2 =E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ + 𝛾 2 E∥ ĝ 𝑘 − h∗ ∥ 2 , (A.11) where the inequality comes from the non-expansiveness of the proximal operator and the last equality is derived by taking the expectation of the stochastic gradient ĝ 𝑘 . Combining (A.7) and (A.11), we have E∥x 𝑘+1 − x∗ ∥ 2 ≤E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 𝛾 2 ∑︁ ∗ 2 𝐶𝑞 𝛾 2 ∑︁ 𝑛 𝛾2 + 𝑘 E∥∇ 𝑓𝑖 ( x̂ ) − h𝑖 ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 + 𝜎 2 . (A.12) 𝑛 𝑖=1 𝑛 𝑖=1 𝑛 Then we consider E∥ x̂ 𝑘+1 − x∗ ∥ 2 . According to Algorithm 1, we have: E𝑄 [ x̂ 𝑘+1 − x∗ ] = x̂ 𝑘 + 𝛽q 𝑘 − x∗ = (1 − 𝛽)( x̂ 𝑘 − x∗ ) + 𝛽(x 𝑘+1 − x∗ + 𝜂e 𝑘 ) (A.13) where the expectation is taken on the quantization of q 𝑘 . 93 By variance decomposition (A.3) and the basic equality (A.5), E∥ x̂ 𝑘+1 − x∗ ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + 𝛽E∥x 𝑘+1 + 𝜂e 𝑘 − x∗ ∥ 2 − 𝛽(1 − 𝛽)E∥q 𝑘 ∥ 2 + 𝛽2𝐶𝑞𝑚 E∥q 𝑘 ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + (1 + 𝜂2 𝜖) 𝛽E∥x 𝑘+1 − x∗ ∥ 2 − 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 1 + (𝜂2 + ) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 , (A.14) 𝜖 where 𝜖 is generated from Cauchy inequality of inner product. For convenience, we let 𝜖 = 𝜂1 . 1 Choose a 𝛽 such that 0 < 𝛽 ≤ 1+𝐶𝑞𝑚 . Then we have 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 + E∥ x̂ 𝑘+1 − x∗ ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + (1 + 𝜂) 𝛽E∥x 𝑘+1 − x∗ ∥ 2 + (𝜂2 + 𝜂) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 . (A.15) Letting s𝑖 = h𝑖∗ in (A.2), we have 𝑛 (1 + 𝜂)𝑐𝛽𝛾 2 ∑︁ E∥h𝑖𝑘+1 − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ 𝑘 (1 + 𝜂)𝛼𝑐𝛽𝛾 2 ∑︁ 𝑘 ≤ ∥h𝑖 − h𝑖∗ ∥ 2 + ∥g𝑖 − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑖=1 (1 + 𝜂)𝛼[(𝐶𝑞 + 1)𝛼 − 1]𝑐𝛽𝛾 2 ∑︁ 𝑛 + ∥Δ𝑖𝑘 ∥ 2 . (A.16) 𝑛 𝑖=1 Then we let R 𝑘 = 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 and define 𝑛 (1 + 𝜂)𝑐𝛽𝛾 2 ∑︁ 𝑘 V =R 𝑘−1 𝑘 + E∥ x̂ − x ∥ +∗ 2 E∥h𝑖𝑘 − h𝑖∗ ∥ 2 . 𝑛 𝑖=1 Thus, we obtain V 𝑘+1 ≤(𝜂2 + 𝜂) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 + (1 + 𝜂𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 − 2(1 + 𝜂) 𝛽𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 𝑛 𝑖=1 2 𝑛 (1 + 𝜂) 𝛽𝛾 h 2 i ∑︁ + 2 𝑛𝑐(𝐶 𝑞 + 1)𝛼 − 𝑛𝑐𝛼 + 𝐶 𝑞 E∥Δ𝑖𝑘 ∥ 2 𝑛 𝑖=1 𝑛 (1 + 𝜂)(1 + 𝑐𝛼) 2 ∑︁ (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + 𝛽𝛾 E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 . (A.17) 𝑛 𝑖=1 𝑛 94 The E∥Δ𝑖𝑘 ∥ 2 -term can be ignored if 𝑛𝑐(𝐶𝑞 + 1)𝛼2 − 𝑛𝑐𝛼 + 𝐶𝑞 ≤ 0, which can be guaranteed by 4𝐶𝑞 (𝐶𝑞 +1) 𝑐≥ 𝑛 and √︃ √︃ 4𝐶𝑞 (𝐶𝑞 +1) 4𝐶𝑞 (𝐶𝑞 +1) ©1 − 1 − 𝑛𝑐 1+ 1− 𝑛𝑐 ª 𝛼 ∈ ­­ , ®. 2(𝐶𝑞 + 1) 2(𝐶𝑞 + 1) ® « ¬ Given that each 𝑓𝑖 is 𝐿-Lipschitz differentiable and 𝜇-strongly convex, we have 𝑛 𝜇𝐿 1 1 ∑︁ 𝑘 ∗ 𝑘 ∗ E⟨∇ 𝑓 ( x̂ ) − h , x̂ − x ⟩ ≥ E∥ x̂ 𝑘 − x∗ ∥ 2 + E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 . 
(A.18) 𝜇+𝐿 𝜇 + 𝐿 𝑛 𝑖=1 Hence V 𝑘+1 ≤𝜌1 R 𝑘−1 + (1 + 𝜂𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 − 2(1 + 𝜂) 𝛽𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ (1 + 𝜂)(1 + 𝑐𝛼) 2 ∑︁ + 𝑘 ∗ 2 E∥h𝑖 − h𝑖 ∥ + 𝛽𝛾 E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑖=1 (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + 𝛽𝛾 𝜎 𝑛 𝑛 h 2(1 + 𝜂) 𝛽𝛾𝜇𝐿 i ∗ 2 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ ≤𝜌1 R 𝑘−1 + 1 + 𝜂𝛽 − 𝑘 E∥ x̂ − x ∥ + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 𝜇+𝐿 𝑛 𝑖=1 𝑛 h 2(1 + 𝜂) 𝛽𝛾 1 ∑︁ i (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + (1 + 𝜂)(1 + 𝑐𝛼) 𝛽𝛾 2 − E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 𝜇+𝐿 𝑛 𝑖=1 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 ≤𝜌1 R 𝑘−1 + 𝜌2 E∥ x̂ 𝑘 − x∗ ∥ 2 + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 𝑛 𝑖=1 𝑛 (A.19) where (𝜂2 + 𝜂)𝐶𝑞𝑚 𝜌1 = , 1 − (𝐶𝑞𝑚 + 1) 𝛽 2(1 + 𝜂) 𝛽𝛾𝜇𝐿 𝜌2 =1 + 𝜂𝛽 − . 𝜇+𝐿 2 2(1+𝜂) 𝛽𝛾 Here we let 𝛾 ≤ (1+𝑐𝛼) (𝜇+𝐿) such that (1 + 𝜂)(1 + 𝑐𝛼) 𝛽𝛾 2 − 𝜇+𝐿 ≤ 0 and the last inequality holds. In order to get max(𝜌1 , 𝜌2 , 1 − 𝛼) < 1, we have the following conditions 0 ≤ (𝜂2 + 𝜂)𝐶𝑞𝑚 ≤1 − (𝐶𝑞𝑚 + 1) 𝛽, 2(1 + 𝜂)𝛾𝜇𝐿 𝜂< . 𝜇+𝐿 95 Therefore, the condition for 𝛾 is 𝜂(𝜇 + 𝐿) 2 ≤𝛾≤ , 2(1 + 𝜂)𝜇𝐿 (1 + 𝑐𝛼)(𝜇 + 𝐿) which implies an additional condition for 𝜂. Therefore, the condition for 𝜂 is √︃  𝑚 2 © −𝐶𝑞 + (𝐶𝑞 ) + 4(1 − (𝐶𝑞 + 1) 𝛽) 𝑚 𝑚  4𝜇𝐿 ªª 𝜂 ∈ 0, min ­  ®® . ­ ,  2𝐶𝑞𝑚 (𝜇 + 𝐿) 2 (1 + 𝑐𝛼) − 4𝜇𝐿 ®®   « ¬¬ 4𝜇𝐿 𝜂(𝜇+𝐿) 2 where 𝜂 ≤ (𝜇+𝐿) 2 (1+𝑐𝛼)−4𝜇𝐿 is to ensure 2(1+𝜂)𝜇𝐿 ≤ (1+𝑐𝛼) (𝜇+𝐿) such that we don’t get an empty set for 𝛾. If we define 𝜌 = max{𝜌1 , 𝜌2 , 1 − 𝛼}, we obtain (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 V 𝑘+1 ≤ 𝜌V 𝑘 + 𝛽𝛾 𝜎 (A.20) 𝑛 and the proof is completed by applying (A.20) recurrently. A.2.2 Proof of Theorem 2 Proof. In Algorithm 2, we can show E∥ x̂ 𝑘+1 − x̂ 𝑘 ∥ 2 = 𝛽2 E∥ q̂ 𝑘 ∥ 2 = 𝛽2 E∥Eq̂ 𝑘 ∥ 2 + 𝛽2 E∥ q̂ 𝑘 − Eq̂ 𝑘 ∥ 2 = 𝛽2 E∥q 𝑘 ∥ 2 + 𝛽2 E∥ q̂ 𝑘 − q 𝑘 ∥ 2 (A.21) ≤ (1 + 𝐶𝑞𝑚 ) 𝛽2 E∥q 𝑘 ∥ 2 . and E∥q 𝑘 ∥ 2 = E∥ − 𝛾 ĝ 𝑘 + 𝜂e 𝑘 ∥ 2 ≤ 2𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 2𝜂2 E∥e 𝑘 ∥ 2 ≤ 2𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 2𝐶𝑞𝑚 𝜂2 E∥q 𝑘−1 ∥ 2 . (A.22) Using (A.21)(A.22) and the Lipschitz continuity of ∇ 𝑓 (x), we have E 𝑓 ( x̂ 𝑘+1 ) + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 𝐿 ≤E 𝑓 ( x̂ 𝑘 ) + E⟨∇ 𝑓 ( x̂ 𝑘 ), x̂ 𝑘+1 − x̂ 𝑘 ⟩ + E∥ x̂ 𝑘+1 − x̂ 𝑘 ∥ 2 + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 2 (1 + 𝐶𝑞𝑚 )𝐿 𝛽2 𝑘 𝑘 𝑘 =E 𝑓 ( x̂ ) + 𝛽E⟨∇ 𝑓 ( x̂ ), −𝛾 ĝ + 𝜂e ⟩ + 𝑘 E∥q 𝑘 ∥ 2 + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 2 96 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝑘 𝑘 𝑘 =E 𝑓 ( x̂ ) + 𝛽E⟨∇ 𝑓 ( x̂ ), −𝛾∇ 𝑓 ( x̂ ) + 𝜂e ⟩ + 𝑘 E∥q 𝑘 ∥ 2 2 𝛽𝜂 𝛽𝜂 ≤E 𝑓 ( x̂ 𝑘 ) − 𝛽𝛾E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + E∥e 𝑘 ∥ 2 2 2 h i + 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 𝐶𝑞𝑚 𝜂2 E∥q 𝑘−1 ∥ 2 h 𝛽𝜂 i ≤E 𝑓 ( x̂ 𝑘 ) − 𝛽𝛾 − − 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 2 3𝐶𝑞 (𝐶𝑞 + 1)𝐿 𝛽2 𝛾 2 ∑︁ 𝑚 𝑛 𝑘 2 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 2 + E∥Δ ∥ + 𝜎 𝑛2 𝑖=1 𝑖 𝑛 h 𝛽𝜂𝐶𝑞𝑚 i + + (3𝐶𝑞 + 1)𝐶𝑞 𝐿 𝛽 𝜂 E∥q 𝑘−1 ∥ 2 , 𝑚 𝑚 2 2 (A.23) 2 where the last inequality is from (A.7) with h∗ = 0. Letting s𝑖 = 0 in (A.2), we have E𝑄 ∥h𝑖𝑘+1 ∥ 2 ≤(1 − 𝛼)∥h𝑖𝑘 ∥ 2 + 𝛼∥g𝑖𝑘 ∥ 2 + 𝛼[(𝐶𝑞 + 1)𝛼 − 1] ∥Δ𝑖𝑘 ∥ 2 . (A.24) Due to the assumption that each worker samples the gradient from the full dataset, we have Eg𝑖𝑘 = E∇ 𝑓 ( x̂ 𝑘 ), E∥g𝑖𝑘 ∥ 2 ≤ E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + 𝜎𝑖2 . (A.25) Define Λ 𝑘 = (𝐶𝑞𝑚 +1)𝐿 𝛽2 ∥q 𝑘−1 ∥ 2 + 𝑓 ( x̂ 𝑘 )− 𝑓 ∗ +3𝑐(𝐶𝑞𝑚 +1)𝐿 𝛽2 𝛾 2 𝑛1 𝑘 2 Í𝑛 𝑖=1 E∥h𝑖 ∥ , and from (A.23), (A.24), and (A.25), we have 𝑛 2 21 ∑︁ ∗ EΛ 𝑘+1 𝑘 ≤E 𝑓 ( x̂ ) − 𝑓 + 3(1 − 𝛼)𝑐(𝐶𝑞𝑚 + 1)𝐿 𝛽 𝛾 E∥h𝑖𝑘 ∥ 2 𝑛 𝑖=1 h 𝛽𝜂 i − 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 h 2 i ∑︁𝑛 + 3𝑛𝑐(𝐶𝑞 + 1)𝛼 − 3𝑛𝑐𝛼 + 3𝐶𝑞 E∥Δ𝑖𝑘 ∥ 2 𝑛2 𝑖=1 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 + 3(1 + 𝑛𝑐𝛼) 𝑛 h 𝛽𝜂𝐶𝑞𝑚 i + + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 E∥q 𝑘−1 ∥ 2 . 
(A.26) 2 4𝐶𝑞 (𝐶𝑞 +1) If we let 𝑐 = 𝑛 , then the condition of 𝛼 in (2.5) gives 3𝑛𝑐(𝐶𝑞 + 1)𝛼2 − 3𝑛𝑐𝛼 + 3𝐶𝑞 ≤ 0 and 𝑛 2 21 ∑︁ ∗ EΛ 𝑘+1 𝑘 ≤E 𝑓 ( x̂ ) − 𝑓 + 3(1 − 𝛼)𝑐(𝐶𝑞𝑚 + 1)𝐿 𝛽 𝛾 E∥h𝑖𝑘 ∥ 2 𝑛 𝑖=1 97 h 𝛽𝜂 i − 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞 + 1)𝐿 𝛽 𝛾 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝑚 2 2 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 + 3(1 + 𝑛𝑐𝛼) 𝑛 𝛽𝜂𝐶𝑞𝑚 +[ + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 ]E∥q 𝑘−1 ∥ 2 . (A.27) 2 1 Let 𝜂 = 𝛾 and 𝛽𝛾 ≤ 6(1+𝑐𝛼) (𝐶𝑞𝑚 +1)𝐿 , we have 𝛽𝜂 𝛽𝛾 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 = − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 ≥ 0. 2 2 √︂ 48𝐿 2 𝛽 2 (𝐶𝑞𝑚 +1) 2 n −1+ 1+ 𝐶𝑞𝑚 o 1 Take 𝛾 ≤ min 12𝐿 𝛽(𝐶𝑞𝑚 +1) , 6𝐿 𝛽(1+𝑐𝛼) (𝐶𝑞 +1) will guarantee 𝑚 h 𝛽𝜂𝐶𝑞𝑚 i + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 ≤ (𝐶𝑞𝑚 + 1)𝐿 𝛽2 . 2 Hence we obtain 𝑘+1 𝑘 h 𝛽𝛾 2 2 i 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 EΛ ≤ EΛ − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 𝑘 + 1)𝐿 𝛽 𝛾 E∥∇ 𝑓 ( x̂ )∥ + 3(1 + 𝑛𝑐𝛼) . 2 𝑛 (A.28) Taking the telescoping sum and plugging the initial conditions, we derive (2.12). A.2.3 Proof of Corollary 2 1 4𝐶𝑞 (𝐶𝑞 +1) 1 Proof. With 𝛼 = 2(𝐶𝑞 +1) and 𝑐 = 𝑛 , 1 + 𝑛𝑐𝛼 = 1 + 2𝐶𝑞 is a constant. We set 𝛽 = 𝐶𝑞𝑚 +1 and √︂ 2 n −1+ 1+ 48𝐿 𝐶𝑚 o 𝑞 1 √ 𝛾 = min 12𝐿 , . In general, 𝐶𝑞𝑚 is bounded which makes the first bound 12𝐿(1+𝑐𝛼) (1+ 𝐾/𝑛) 1 √ negligible, i.e., 𝛾 = when 𝐾 is large enough. Therefore, we have 12𝐿(1+𝑐𝛼) (1+ 𝐾/𝑛) 𝛽 1 − 6(1 + 𝑐𝛼)𝐿𝛾 1 − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 = ≤ . (A.29) 2 2(𝐶𝑞 + 1) 𝑚 4(𝐶𝑞 + 1) 𝑚 From Theorem 2, we derive 𝐾 1 ∑︁ E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝐾 𝑘=1 4(𝐶𝑞𝑚 + 1)(EΛ1 − EΛ𝐾+1 ) 12(1 + 𝑛𝑐𝛼)𝐿𝜎 2 𝛾 ≤ + 𝛾𝐾 𝑛 98 1 1 (1 + 𝑛𝑐𝛼)𝜎 2 1 ≤48𝐿 (𝐶𝑞𝑚 + 1)(1 + 𝑐𝛼)(EΛ1 − EΛ𝐾+1 )( +√ )+ √ , (A.30) 𝐾 𝑛𝐾 (1 + 𝑐𝛼) 𝑛𝐾 which completes the proof. 99 APPENDIX B LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION B.1 Compression method B.1.1 p-norm b-bits quantization Theorem 9 (p-norm b-bit quantization). Let us define the quantization operator as    2𝑏−1 |x|  −(𝑏−1) 𝑄 𝑝 (x) := ∥x∥ 𝑝 sign(x)2 · +u (B.1) ∥x∥ 𝑝 where · is the Hadamard product, |x| is the elementwise absolute value and u is a random dither vector uniformly distributed in [0, 1] 𝑑 . 𝑄 𝑝 (x) is unbiased, i.e., E𝑄 𝑝 (x) = x, and the compression variance is upper bounded by 1 E∥x − 𝑄 𝑝 (x)∥ 2 ≤ ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 , (B.2) 4 which suggests that ∞-norm provides the smallest upper bound for the compression variance due to ∥x∥ 𝑝 ≤ ∥x∥ 𝑞 , ∀x if 1 ≤ 𝑞 ≤ 𝑝 ≤ ∞. Remark 13. For the compressor defined in (B.1), we have the following the compression constant ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 𝐶 = sup . x 4∥x∥ 2 j k l m 2𝑏−1 |x| 2𝑏−1 |x| 2𝑏−1 |x| Proof. Let denote v = ∥x∥ 𝑝 sign(x)2−(𝑏−1) , 𝑠 = ∥x∥ 𝑝 , 𝑠1 = ∥x∥ 𝑝 and 𝑠2 = ∥x∥ 𝑝 . We can rewrite x as x = 𝑠 · v. For any coordinate 𝑖 such that 𝑠𝑖 = (𝑠1 )𝑖 , we have 𝑄 𝑝 (x𝑖 ) = (𝑠1 )𝑖 v𝑖 with probability 1. Hence E𝑄 𝑝 (x)𝑖 = 𝑠𝑖 v𝑖 = x𝑖 and E(x𝑖 − 𝑄 𝑝 (x)𝑖 ) 2 = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 = 0. For any coordinate 𝑖 such that 𝑠𝑖 ≠ (𝑠1 )𝑖 , we have (𝑠2 )𝑖 − (𝑠1 )𝑖 = 1 and 𝑄 𝑝 (x)𝑖 satisfies   (𝑠1 )𝑖 v𝑖 , w.p. (𝑠2 )𝑖 − 𝑠𝑖 ,    𝑄 𝑝 (x)𝑖 =  (𝑠2 )𝑖 v𝑖 , w.p. 𝑠𝑖 − (𝑠1 )𝑖 .    100 Thus, we derive E𝑄 𝑝 (x)𝑖 = v𝑖 (𝑠1 )𝑖 (𝑠2 − 𝑠)𝑖 + v𝑖 (𝑠2 )𝑖 (𝑠 − 𝑠1 )𝑖 = v𝑖 𝑠𝑖 (𝑠2 − 𝑠1 )𝑖 = v𝑖 𝑠𝑖 = x𝑖 , and E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 = (x𝑖 − v𝑖 (𝑠1 )𝑖 ) 2 (𝑠2 − 𝑠)𝑖 + (x𝑖 − v𝑖 (𝑠2 )𝑖 ) 2 (𝑠 − 𝑠1 )𝑖 = (𝑠2 − 𝑠1 )𝑖 x𝑖2 + (𝑠1 )𝑖 (𝑠2 )𝑖 (𝑠1 − 𝑠2 )𝑖 + 𝑠𝑖 ((𝑠2 )𝑖2 − (𝑠1 )𝑖2 ) v𝑖2 − 2𝑠𝑖 (𝑠2 − 𝑠1 )𝑖 x𝑖 v𝑖  = x𝑖2 + − (𝑠1 )𝑖 (𝑠2 )𝑖 + 𝑠𝑖 (𝑠2 + 𝑠1 )𝑖 v𝑖2 − 2𝑠𝑖 x𝑖 v𝑖  = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 + − (𝑠1 )𝑖 (𝑠2 )𝑖 + 𝑠𝑖 (𝑠2 + 𝑠1 )𝑖 − 𝑠𝑖2 v𝑖2  = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 + (𝑠2 − 𝑠)𝑖 (𝑠 − 𝑠1 )𝑖 v𝑖2 = (𝑠2 − 𝑠)𝑖 (𝑠 − 𝑠1 )𝑖 v𝑖2 1 2 ≤ v . 
4 𝑖 Considering both cases, we have E𝑄(x) = x and ∑︁ ∑︁ E∥x − 𝑄 𝑝 (x)∥ 2 = E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 + E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 {𝑠𝑖 =(𝑠1 )𝑖 } {𝑠𝑖 ≠(𝑠1 )𝑖 } 1 ∑︁ ≤ 0+ v𝑖2 4 {𝑠𝑖 ≠(𝑠1 )𝑖 } 1 ≤ ∥v∥ 2 4 1 = ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 . 4 B.1.2 Compression error To verify Theorem 9, we compare the compression error of the quantization method defined in (B.1) with different norms (𝑝 = 1, 2, 3, . . . , 6, ∞). Specifically, we uniformly generate 100 random vectors in R10000 and compute the average compression error. The result shown in Figure B.1 verifies our proof in Theorem 9 that the compression error decreases when 𝑝 increases. This suggests that ∞-norm provides the best compression precision under the same bit constraint. 101 2 bits 3 bits 4 bits 1 10 5 bits 6 bits ||X Q(X)||2/||X||2 7 bits 0 10 1 10 2 10 1 2 3 4 5 6 inf p-norm ∥x−𝑄(x) ∥ 2 Figure B.1: Relative compression error ∥x∥ 2 for p-norm b-bit quantization. 1 10 0 10 1 10 ||X Q(X)||2/||X||2 2 10 3 10 4 10 2-norm b-bits quantization 4-norm b-bits quantization 10 5 inf-norm b-bits quantization top-k sparsification 6 random-k sparsification 10 2 bits 4 bits 8 bits 16 bits 20 bits Bits constraint ∥x−𝑄(x) ∥ 2 Figure B.2: Comparison of compression error ∥x∥ 2 between different compression methods. Under similar setting, we also compare the compression error with other popular compression methods, such as top-k and random-k sparsification. The x-axes represents the average bits needed to represent each element of the vector. The result is showed in Fig. B.2. Note that intuitively top-k methods should perform better than random-k method, but the top-k method needs extra bits to transmitted the index while random-k method can avoid this by using the same random seed. Therefore, top-k method doesn’t outperform random-k too much under the same communication budget. The result in Fig. B.2 suggests that ∞-norm b-bits quantization provides significantly better compression precision than others under the same bit constraint. 102 B.2 Experiments B.2.1 Experiments in homogeneous setting The experiments on logistic regression problem in homogeneous case are showed in Fig. B.3 and Fig. B.4. It shows that DeepSqueeze, CHOCO-SGD and LEAD converges similarly while DeepSqueeze and CHOCO-SGD require to tune a smaller 𝛾 for convergence as showed in the parameter setting in Section B.2.2. Generally, a smaller 𝛾 decreases the model propagation between agents since 𝛾 changes the effective mixing matrix and this may cause slower convergence. However, in the setting where data from different agents are very similar, the models move to close directions such that the convergence is not affected too much. DGD (32 bits) DGD (32 bits) NIDS (32 bits) NIDS (32 bits) 0 0 10 QDGD (2 bits) 10 QDGD (2 bits) DeepSqueeze (2 bits) DeepSqueeze (2 bits) CHOCO-SGD (2 bits) CHOCO-SGD (2 bits) LEAD (2 bits) LEAD (2 bits) Loss Loss −1 6 × 10 −1 6 × 10 −1 4 × 10 −1 4 × 10 −1 3 × 10 −1 3 × 10 0 200 400 600 800 1000 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Epoch Bits transmitted 1e10 (a) Loss 𝑓 (X 𝑘 ) (b) Loss 𝑓 (X 𝑘 ) Figure B.3: Logistic regression in the homogeneous case (full-batch gradient). B.2.2 Parameter settings The best parameter settings we search for all algorithms and experiments are summarized in Tables B.1– B.4. QDGD and DeepSqueeze are more sensitive to 𝛾 and CHOCO-SGD is slight more robust. LEAD is most robust to parameter settings and it works well for the setting 𝛼 = 0.5 and 𝛾 = 1.0 in all experiments in this work. 
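To make the quantizer of Theorem 9 (Section B.1.1) concrete, below is a minimal PyTorch sketch of the p-norm b-bit quantization operator in (B.1), implemented with the stochastic rounding used in its proof (floor of the scaled magnitude plus a uniform dither). The checks at the end are illustrative assumptions about how one might empirically probe unbiasedness and the error ordering over p (cf. Figures B.1 and B.2); they are not the exact scripts used to produce those figures.

```python
import torch

def qp_quantize(x, b=2, p=2.0):
    # p-norm b-bit stochastic quantization, following (B.1):
    #   Q_p(x) = ||x||_p * sign(x) * 2^{-(b-1)} * floor(2^{b-1} |x| / ||x||_p + u),
    # where u ~ Uniform[0, 1]^d makes the rounding unbiased.
    norm = x.norm(p)
    if norm == 0:
        return torch.zeros_like(x)
    u = torch.rand_like(x)
    levels = torch.floor(2 ** (b - 1) * x.abs() / norm + u)
    return norm * torch.sign(x) * 2.0 ** (-(b - 1)) * levels

# Illustrative checks (assumed setup): unbiasedness, and the effect of the norm p.
x = torch.randn(10000)
avg = torch.stack([qp_quantize(x, b=4, p=float("inf")) for _ in range(2000)]).mean(0)
print((avg - x).norm() / x.norm())   # relative bias, shrinks toward 0 with more draws
for p in (1.0, 2.0, float("inf")):
    err = (x - qp_quantize(x, b=4, p=p)).norm() ** 2 / x.norm() ** 2
    print(p, err.item())             # error decreases as p grows, as Theorem 9 suggests
```

Since the dither u is the only source of randomness, averaging many independent quantizations of the same vector recovers it, which is the unbiasedness property E Q_p(x) = x used throughout the analysis, and the ∞-norm variant gives the smallest error under the same bit budget.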
103 1.2 1.2 DGD (32 bits) DGD (32 bits) NIDS (32 bits) NIDS (32 bits) 1.0 QDGD (2 bits) 1.0 QDGD (2 bits) DeepSqueeze (2 bits) DeepSqueeze (2 bits) 0.8 CHOCO-SGD (2 bits) CHOCO-SGD (2 bits) 0.8 LEAD (2 bits) LEAD (2 bits) Loss Loss 0.6 0.6 0.4 0.4 0 10 20 30 40 50 60 70 80 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Epoch Bits transmitted 1e9 (a) Loss 𝑓 (X 𝑘 ) (b) Loss 𝑓 (X 𝑘 ) Figure B.4: Logistic regression in the homogeneous case (mini-batch gradient). Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - NIDS 0.1 - - QDGD 0.1 0.2 - DeepSqueeze 0.1 0.2 - CHOCO-SGD 0.1 0.8 - LEAD 0.1 1.0 0.5 Table B.1: Parameter settings for the linear regression problem. Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.1 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.1 0.4 - QDGD 0.1 0.2 - DeepSqueeze 0.1 0.4 - DeepSqueeze 0.1 0.6 - CHOCO-SGD 0.1 0.6 - CHOCO-SGD 0.1 0.6 - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.2: Parameter settings for the logistic regression problem (full-batch gradient). B.3 Proofs of the theorems B.3.1 Illustrative flow The following flow graph depicts the relation between iterative variables and clarifies the range of conditional expectation. {G𝑘 }∞ ∞ 𝑘=0 and {F𝑘 } 𝑘=0 are two 𝜎−algebras generated by the gradient 104 Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.1 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.05 0.2 - QDGD 0.05 0.2 - DeepSqueeze 0.1 0.6 - DeepSqueeze 0.1 0.6 - CHOCO-SGD 0.1 0.6 - CHOCO-SGD 0.1 0.6 - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.3: Parameter settings for the logistic regression problem (mini-batch gradient). Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.05 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.05 0.1 - QDGD * * - DeepSqueeze 0.1 0.2 - DeepSqueeze * * - CHOCO-SGD 0.1 0.6 - CHOCO-SGD * * - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.4: Parameter settings for the deep neural network. (* means divergence for all options we try). sampling and the stochastic compression respectively. They satisfy G0 ⊂ F0 ⊂ G1 ⊂ F1 ⊂ · · · ⊂ G𝑘 ⊂ F𝑘 ⊂ · · · (X1 , D1 , H1 ) (X2 , D2 , H2 ) (X3 , D3 , H3 ) (X 𝑘 , D 𝑘 , H 𝑘 ) ··· ∇F(X1 ;𝜉 1 )∈G0 E1 ∇F(X2 ;𝜉 2 )∈G1 E2 ··· E 𝑘−1 ∇F(X 𝑘 ;𝜉 𝑘 )∈G𝑘−1 Y1 1st round Y2 ··· Y 𝑘−1 Y𝑘 (𝑘−1)th round ⊂ ⊂ F0 F1 ··· F𝑘−2 F𝑘−1 The solid and dashed arrows in the top flow illustrate the dynamics of the algorithm, while in the bottom, the arrows stand for the relation between successive F -𝜎-algebras. The downward arrows determine the range of F -𝜎-algebras. E.g., up to E 𝑘 , all random variables are in F𝑘−1 and up to ∇F(X 𝑘 ; 𝜉 𝑘 ), all random variables are in G𝑘−1 with G𝑘−1 ⊂ F𝑘−1 . Throughout the appendix, without specification, E is the expectation conditioned on the corresponding stochastic estimators given the 105 context. B.3.2 Two central Lemmas Lemma 10 (Fundamental equality). Let X∗ be the optimal solution, D∗ B −∇F(X∗ ) and E 𝑘 denote the compression error in the 𝑘th iteration, that is E 𝑘 = Q 𝑘 − (Y𝑘 − H 𝑘 ) = Ŷ 𝑘 − Y 𝑘 . From Alg. 3, we have ∥X 𝑘+1 − X∗ ∥ 2 + (𝜂2 /𝛾)∥D 𝑘+1 − D∗ ∥ 2M =∥X 𝑘 − X∗ ∥ 2 + (𝜂2 /𝛾)∥D 𝑘 − D∗ ∥ 2M − (𝜂2 /𝛾)∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, where M B 2(I − W) † − 𝛾I and 𝛾 < 2/𝜆 max (I − W) ensures the positive definiteness of M over range(I − W). Lemma 11 (State inequality). Let the same assumptions in Lemma 10 hold. From Alg. 
3, if we take the expectation over the compression operator conditioned on the 𝑘-th iteration, we have E∥H 𝑘+1 − X∗ ∥ 2 ≤ (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝛼𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝛼2 E∥E 𝑘 ∥ 2 − 𝛼𝛾E∥E 𝑘 ∥ 2I−W − 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 . 𝛾 B.3.3 Proof of Lemma 10 Before proving Lemma 10, we let E 𝑘 = Ŷ 𝑘 − Y 𝑘 and introduce the following three Lemmas. Lemma 12. Let X∗ be the consensus solution. Then, from Line 4-7 of Alg. 3, we obtain   I − W 𝑘+1 ∗ 𝐼 I−W I−W 𝑘 (X − X ) = − (D 𝑘+1 − D 𝑘 ) − E . (B.3) 2𝜂 𝛾 2 2𝜂 Proof. From the iterations in Alg. 3, we have 𝛾 D 𝑘+1 = D 𝑘 + (I − W) Ŷ 𝑘 (from Line 6) 2𝜂 106 𝛾 = D𝑘 + (I − W)(Y 𝑘 + E 𝑘 ) 2𝜂 𝛾 = D 𝑘 + (I − W)(X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 + E 𝑘 ) (from Line 4) 2𝜂 𝛾 = D 𝑘 + (I − W)(X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘+1 − X∗ + 𝜂(D 𝑘+1 − D 𝑘 ) + E 𝑘 ) 2𝜂 𝛾 𝛾 𝛾 = D 𝑘 + (I − W)(X 𝑘+1 − X∗ ) + (I − W)(D 𝑘+1 − D 𝑘 ) + (I − W)E 𝑘 , 2𝜂 2 2𝜂 where the fourth equality holds due to (I − W)X∗ = 0 and the last equality comes from Line 7 of Alg. 3. Rewriting this equality, and we obtain (B.3). Lemma 13. Let D∗ = −∇F(X∗ ) ∈ span{I − W}, we have 𝜂 ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ = ∥D 𝑘+1 − D 𝑘 ∥ 2M − ⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩, (B.4) 𝛾 𝜂 ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ = ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − ⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, (B.5) 𝛾 where M = 2(I − W) † − 𝛾I and 𝛾 < 2/𝜆 max (I − W) ensures the positive definiteness of M over span{I − W}. Proof. Since D 𝑘+1 ∈ span{I − W} for any 𝑘, we have ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ =⟨(I − W)(X 𝑘+1 − X∗ ), (I − W) † (D 𝑘+1 − D 𝑘 )⟩   𝜂 𝑘+1 𝑘 𝑘 † 𝑘+1 𝑘 = (2I − 𝛾(I − W))(D − D ) − (I − W)E , (I − W) (D − D ) (from (B.3)) 𝛾    𝜂 † 𝑘+1 𝑘 𝑘 𝑘+1 𝑘 = (2(I − W) − 𝛾I (D − D ) − E , D − D 𝛾 𝜂 = ∥D 𝑘+1 − D 𝑘 ∥ 2M − ⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩. 𝛾 Similarly, we have ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =⟨(I − W)(X 𝑘+1 − X∗ ), (I − W) † (D 𝑘+1 − D∗ )⟩   𝜂 𝑘+1 𝑘 𝑘 † 𝑘+1 ∗ = (2I − 𝛾(I − W))(D − D ) − (I − W)E , (I − W) (D − D ) 𝛾 107   𝜂 = (2(I − W) † − I)(D 𝑘+1 − D 𝑘 ) − E 𝑘 , D 𝑘+1 − D∗ 𝛾 𝜂 = ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − ⟨E 𝑘 , D 𝑘+1 − D∗ ⟩. 𝛾 To make sure that M is positive definite over span{I − W}, we need 𝛾 < 2/𝜆 max (I − W). Lemma 14. Taking the expectation conditioned on the compression in the 𝑘th iteration, we have   𝑘 𝑘+1 ∗ 𝑘 𝑘 𝛾 𝑘 𝛾 𝑘 ∗ 2𝜂E⟨E , D − D ⟩ = 2𝜂E E , D + (I − W)Y + (I − W)E − D 2𝜂 2𝜂 = 𝛾E⟨E 𝑘 , (I − W)E 𝑘 ⟩ = 𝛾E∥E 𝑘 ∥ 2I−W ,   𝑘 𝑘+1 𝑘 𝑘 𝛾 𝑘 𝛾 𝑘 2𝜂E⟨E , D − D ⟩ = 2𝜂E E , (I − W)Y + (I − W)E 2𝜂 2𝜂 = 𝛾E⟨E 𝑘 , (I − W)E 𝑘 ⟩ = 𝛾E∥E 𝑘 ∥ 2I−W . Proof. The proof is straightforward and omitted here. Proof of Lemma 10. From Alg. 3, we have 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ =2⟨X 𝑘 − X∗ , 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂∇F(X∗ )⟩ =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 − 𝜂(D 𝑘+1 − D∗ )⟩ (from Line 7) =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘 − X 𝑘+1 , D 𝑘+1 − D∗ ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘 − X∗ − 𝜂(D 𝑘+1 − D∗ ), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘+1 − X∗ + 𝜂(∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ (from Line 7) =2⟨X 𝑘+1 − X∗ , X 𝑘 − X 𝑘+1 ⟩ + 2𝜂⟨∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩. (B.6) Then we consider the terms on the right hand side of (B.6) separately. Using 2⟨A − B, B − C⟩ = ∥A − C∥ 2 − ∥B − C∥ 2 − ∥A − B∥ 2 , we have 2⟨X 𝑘+1 − X∗ , X 𝑘 − X 𝑘+1 ⟩ =2⟨X∗ − X 𝑘+1 , X 𝑘+1 − X 𝑘 ⟩ 108 =∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 . 
(B.7) Using 2⟨A, B⟩ = ∥A∥ 2 + ∥B∥ 2 − ∥A − B∥ 2 , we have 2𝜂⟨∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ), X 𝑘 − X 𝑘+1 ⟩ =𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − ∥X 𝑘 − X 𝑘+1 − 𝜂(∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ))∥ 2 =𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 . (from Line 7) (B.8) Combining (B.6), (B.7), (B.8), and (B.4), we obtain 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ = ∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 | {z } 2⟨X 𝑘+1 −X∗ ,X 𝑘 −X 𝑘+1 ⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 | {z } 2𝜂⟨∇F(X 𝑘 ;𝜉 𝑘 )−∇F(X∗ ),X 𝑘 −X 𝑘+1 ⟩  2𝜂2  − ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩ 𝛾 | {z } 2𝜂⟨X 𝑘+1 −X∗ ,D 𝑘+1 −D∗ ⟩ =∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 𝜂2  𝑘  + ∥D − D ∥ M − ∥D − D ∥ M − ∥D − D ∥ M +2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, ∗ 2 𝑘+1 ∗ 2 𝑘+1 𝑘 2 𝛾 | {z } −2⟨D 𝑘+1 −D 𝑘 ,D 𝑘+1 −D∗ ⟩M where the last equality holds because 2⟨D 𝑘 − D 𝑘+1 , D 𝑘+1 − D∗ ⟩M =∥D 𝑘 − D∗ ∥ 2M − ∥D 𝑘+1 − D∗ ∥ 2M − ∥D 𝑘+1 − D 𝑘 ∥ 2M . Thus, we reformulate it as 𝜂2 𝑘+1 ∥X 𝑘+1 − X∗ ∥ 2 + ∥D − D∗ ∥ 2M 𝛾 𝜂2 𝜂2 =∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − ∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝛾 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, which completes the proof. 109 B.3.4 Proof of Lemma 11 Proof of Lemma 11. From Alg. 3, we take the expectation conditioned on 𝑘th compression and obtain E∥H 𝑘+1 − X∗ ∥ 2 =E∥(1 − 𝛼)(H 𝑘 − X∗ ) + 𝛼(Y 𝑘 − X∗ ) + 𝛼E 𝑘 ∥ 2 (from Line 13) =∥(1 − 𝛼)(H 𝑘 − X∗ ) + 𝛼(Y 𝑘 − X∗ )∥ 2 + 𝛼2 E∥E 𝑘 ∥ 2 =(1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼∥Y 𝑘 − X∗ ∥ 2 − 𝛼(1 − 𝛼)∥H 𝑘 − Y 𝑘 ∥ 2 + 𝛼2 E∥E 𝑘 ∥ 2 . (B.9) In the second equality, we used the unbiasedness of the compression, i.e., EE 𝑘 = 0. The last equality holds because of ∥(1 − 𝛼)A + 𝛼B∥ 2 = (1 − 𝛼)∥A∥ 2 + 𝛼∥B∥ 2 − 𝛼(1 − 𝛼)∥A − B∥ 2 . In addition, by taking the conditional expectation on the compression, we have ∥Y 𝑘 − X∗ ∥ 2 =∥X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 − X∗ ∥ 2 (from Line 4) =E∥X 𝑘+1 + 𝜂D 𝑘+1 − 𝜂D 𝑘 − X∗ ∥ 2 (from Line 7) =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 + 2𝜂E⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M − 2𝜂E⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩. (from (B.4)) 𝛾 =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝛾E∥E 𝑘 ∥ 2I−W . (from Line 6) (B.10) 𝛾 Combing the above two equations (B.9) and (B.10) together, we have E∥H 𝑘+1 − X∗ ∥ 2 2𝛼𝜂2 ≤(1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M 𝛾 − 𝛼𝛾E∥E 𝑘 ∥ 2I−W + 𝛼2 E∥E 𝑘 ∥ 2 − 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 , (B.11) which completes the proof. 110 B.3.5 Proof of Theorem 3 Proof of Theorem 3. 
Combining Lemmas 10, 11, and 14, we have the expectation conditioned on the compression satisfying 𝜂2 E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾 𝜂2 𝜂2 ≤∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − E∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝛾 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 𝛾E∥E 𝑘 ∥ 2I−W + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝑎 1 𝛼𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 − 𝑎 1 𝛼𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 𝛾 = ∥X 𝑘 − X∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 | {z } A 𝜂2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝜂2 + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 −(1 − 2𝑎 1 𝛼) E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 𝛾 | {z } B + 𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 + (1 − 𝑎 1 𝛼)𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 , (B.12) | {z } C where 𝑎 1 is a non-negative number to be determined. Then we deal with the three terms on the right hand side separately. We want the terms B and C to be nonpositive. First, we consider B. Note that D 𝑘 ∈ Range(I − W). If we want B ≤ 0, then, we need 1 − 2𝑎 1 𝛼 > 0, i.e., 𝑎 1 𝛼 < 1/2. Therefore we have 𝜂2 B = − (1 − 2𝑎 1 𝛼) E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 𝛾   (1 − 2𝑎 1 𝛼)𝜆 𝑛−1 (M) 2 ≤ 𝑎1 𝛼 − 𝜂 E∥D 𝑘+1 − D 𝑘 ∥ 2 , 𝛾 where 𝜆 𝑛−1 (M) > 0 is the second smallest eigenvalue of M. It means that we also need (2𝑎 1 𝛼 − 1)𝜆 𝑛−1 (M) 𝑎1 𝛼 + ≤ 0, 𝛾 111 which is equivalent to 𝜆 𝑛−1 (M) 𝑎1 𝛼 ≤ < 1/2. (B.13) 𝛾 + 2𝜆 𝑛−1 (M) Then we look at C. We have C =𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 + (1 − 𝑎 1 𝛼)𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 ≤((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 )E∥E 𝑘 ∥ 2 − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 ≤𝐶 ((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 )∥Y 𝑘 − H 𝑘 ∥ 2 − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 Because we have 1 − 𝑎 1 𝛼 > 1/2, so we need 𝐶 ((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 ) − 𝑎 1 𝛼(1 − 𝛼) = (1 + 𝐶)𝑎 1 𝛼2 − 𝑎 1 (𝐶 𝛽𝛾 + 1)𝛼 + 𝐶 𝛽𝛾 ≤ 0. (B.14) That is √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) − 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≥ C 𝛼0 , (B.15) 2(1 + 𝐶)𝑎 1 √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) + 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≤ C 𝛼1 . (B.16) 2(1 + 𝐶)𝑎 1 Next, we look at A. Firstly, by the bounded variance assumption, we have the expectation conditioned on the gradient sampling in 𝑘th iteration satisfying E∥X 𝑘 − X∗ ∥ 2 − 2𝜂E⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 E∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 ≤∥X 𝑘 − X∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 + 𝑛𝜂2 𝜎 2 Then with the smoothness and strong convexity from Assumptions 8, we have the co-coercivity of ∇𝑔𝑖 (x) with 𝑔𝑖 (x) := 𝑓𝑖 (x) − 𝑢2 ∥x∥ 22 , which gives 𝜇𝐿 1 ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ ≥ ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 . 𝜇+𝐿 𝜇+𝐿 When 𝜂 ≤ 2/(𝜇 + 𝐿), we have ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ 112   𝜂(𝜇 + 𝐿) 𝜂(𝜇 + 𝐿) 𝑘 = 1− ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ + ⟨X − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ 2 2   𝜂𝜇(𝜇 + 𝐿) 𝜂𝜇𝐿 𝜂 ≥ 𝜇− + ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 2 2 2  𝜂𝜇  𝜂 =𝜇 1 − ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 . 2 2 Therefore, we obtain − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ ≤ − 𝜂2 ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 − 𝜇(2𝜂 − 𝜇𝜂2 )∥X 𝑘 − X∗ ∥ 2 . (B.17) Conditioned on the 𝑘the iteration, (i.e., conditioned on the gradient sampling in 𝑘th iteration), the inequality (B.12) becomes 𝜂2 E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾   ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 𝜂2 𝑘 + ∥D − D∗ ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 , (B.18) 𝛾 2 if the step size satisfies 𝜂 ≤ 𝜇+𝐿 . 
Rewriting (B.18), we have 𝜂2 (1 − 𝑎 1 𝛼)E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾   𝜂2 ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 , (B.19) 𝛾 and thus 𝜂2 (1 − 𝑎 1 𝛼)E∥X − X ∥ + E∥D 𝑘+1 − D∗ ∥ 2M+𝛾I + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝑘+1 ∗ 2 𝛾   𝜂2 ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 . (B.20) 𝛾 With the definition of L 𝑘 in (3.17), we have EL 𝑘+1 ≤ 𝜌L 𝑘 + 𝑛𝜂2 𝜎 2 , (B.21) with 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) 𝜆 max (M)   𝜌 = max , ,1−𝛼 . 1 − 𝑎1 𝛼 𝛾 + 𝜆 max (M) 113 where 𝜆 max (M) = 2𝜆 max ((I − W) † ) − 𝛾. Recall all the conditions on the parameters 𝑎 1 , 𝛼, and 𝛾 to make sure that 𝜌 < 1: 𝜆 𝑛−1 (M) 𝑎1 𝛼 ≤ , (B.22) 𝛾 + 2𝜆 𝑛−1 (M) 𝑎 1 𝛼 ≤ 𝜇(2𝜂 − 𝜇𝜂2 ), (B.23) √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) − 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≥ C 𝛼0 , (B.24) 2(1 + 𝐶)𝑎 1 √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) + 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≤ C 𝛼1 . (B.25) 2(1 + 𝐶)𝑎 1 In the following, we show that there exist parameters that satisfy these conditions. Since we can choose any 𝑎 1 , we let 4(1 + 𝐶) 𝑎1 = , 𝐶 𝛽𝛾 + 2 such that 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 = 𝑎 21 . Then we have 𝐶 𝛽𝛾 𝛼0 = → 0, as 𝛾 → 0, 2(1 + 𝐶) 𝐶 𝛽𝛾 + 2 1 𝛼1 = → , as 𝛾 → 0. 2(1 + 𝐶) 1+𝐶 Conditions (B.24) and (B.25) show   2𝐶 𝛽𝛾 𝑎1 𝛼 ∈ , 2 → [0, 2], if 𝐶 = 0 or 𝛾 → 0. 𝐶 𝛽𝛾 + 2 Hence in order to make (B.22) and (B.23) satisfied, it’s sufficient to make (2 ) 2𝐶 𝛽𝛾  𝜆 𝑛−1 (M)  𝛽 −𝛾 ≤ min , 𝜇(2𝜂 − 𝜇𝜂2 ) = min 4 , 𝜇(2𝜂 − 𝜇𝜂2 ) . (B.26) 𝐶 𝛽𝛾 + 2 𝛾 + 2𝜆 𝑛−1 (M) 𝛽 − 𝛾 2 2 where we use 𝜆 𝑛−1 (M) = 𝜆 max (I−W) −𝛾 = 𝛽 − 𝛾. 114 When 𝐶 > 0, the condition (B.26) is equivalent to ( √︁ ) (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2𝜇𝜂(2 − 𝜇𝜂) 𝛾 ≤ min , . (B.27) 𝐶𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽 The first term can be simplified using √︁ (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 ≥ 𝐶𝛽 (3𝐶 + 1) 𝛽 √ due to 1 − 𝑥 ≤ 1 − 2𝑥 when 𝑥 ∈ (0, 1). Therefore, for a given stepsize 𝜂, if we choose   n 2 2𝜇𝜂(2 − 𝜇𝜂) o 𝛾 ∈ 0, min , (3𝐶 + 1) 𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽 and  n 𝐶 𝛽𝛾 + 2 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2  𝐶 𝛽𝛾 𝐶 𝛽𝛾 + 2 o 𝛼∈ , min , , 𝜇𝜂(2 − 𝜇𝜂) , 2(1 + 𝐶) 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 4(1 + 𝐶) then, all conditions (B.22)-(B.25) hold. 2 2 Note that 𝛾 < (3𝐶+1) 𝛽 implies 𝛾 < 𝛽, which ensures the positive definiteness of M over span{I − W} in Lemma 13. 2 Note that 𝜂 ≤ 𝜇+𝐿 ensures 𝐶 𝛽𝛾 + 2 𝐶 𝛽𝛾 + 2 𝜇𝜂(2 − 𝜇𝜂) ≤ . (B.28) 4(1 + 𝐶) 2(1 + 𝐶) So, we can simplify the bound for 𝛼 as  n 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2  𝐶 𝛽𝛾 𝐶 𝛽𝛾 + 2 o 𝛼∈ , min , 𝜇𝜂(2 − 𝜇𝜂) . 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 4(1 + 𝐶) Lastly, taking the total expectation on both sides of (B.21) and using tower property, we complete the proof for 𝐶 > 0. 𝜆 max (I−W) Proof of Corollary 4. Let’s first define 𝜅 𝑓 = 𝐿 𝜇 and 𝜅 𝑔 = 𝜆+min (I−W) = 𝜆 max (I − W)𝜆 max ((I − W) † ). 115 1 We can choose the stepsize 𝜂 = 𝐿 such that the upper bound of 𝛾 is   2 1 n 2 𝜅𝑓 2 − 𝜅𝑓 2o  2 1  𝛾upper = min ,h  i , ≥ min , , (3𝐶 + 1) 𝛽 2 − 1 2 − 1 𝐶 𝛽 𝛽 (3𝐶 + 1) 𝛽 𝜅 𝑓 𝐶 𝛽 𝜅𝑓 𝜅𝑓 𝑥(2−𝑥) 𝑥 due to 2−𝑥(2−𝑥) ≥ 2−𝑥 ≥ 𝑥 when 𝑥 ∈ (0, 1). 1 1 Hence we can take 𝛾 = min{ (3𝐶+1) 𝛽 , 𝜅 𝑓 𝐶 𝛽 }. The bound of 𝛼 is    𝐶 𝛽𝛾 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2 1 1 𝐶 𝛽𝛾 + 2 𝛼∈ , min , (2 − ) 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 𝜅 𝑓 𝜅 𝑓 4(1 + 𝐶) 1 When 𝛾 is chosen as 𝜅 𝑓 𝐶𝛽 , pick 𝐶 𝛽𝛾 1 𝛼= = . (B.29) 2(1 + 𝐶) 2(1 + 𝐶)𝜅 𝑓 1 1 When (3𝐶+1) 𝛽 ≤ 𝜅 𝑓 𝐶𝛽 , the upper bound of 𝛼 is   2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2 1 1 𝐶 𝛽𝛾 + 2 𝛼upper = min , (2 − ) 4 − 𝛽𝛾 4(1 + 𝐶) 𝜅 𝑓 𝜅 𝑓 4(1 + 𝐶)   6𝐶 + 1 1 1 7𝐶 + 2 = min , (2 − ) 12𝐶 + 3 𝜅 𝑓 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1)   6𝐶 + 1 1 7𝐶 + 2 ≥ min , . 12𝐶 + 3 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1) In this case, we pick   6𝐶 + 1 1 7𝐶 + 2 𝛼 = min , . 
(B.30) 12𝐶 + 3 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1)   1 6𝐶+1 Note 𝛼 = O (1+𝐶)𝜅 𝑓 since 12𝐶+3 is lower bounded by 13 . Hence in both cases (Eq. (B.29) and   1 Eq. (B.30)), 𝛼 = O (1+𝐶)𝜅 𝑓 , and the third term of 𝜌 is upper bounded by     1 6𝐶 + 1 1 7𝐶 + 2 1 − 𝛼 ≤ max 1 − , 1 − min , 2(1 + 𝐶)𝜅 𝑓 12𝐶 + 3 𝜅 𝑓 4(1 + 𝐶)(3𝐶 + 1) In two cases of 𝛾, the second term of 𝜌 becomes   𝛾 1 1 1− = max 1 − ,1− 2𝜆 max ((I − W) † ) 2𝐶𝜅 𝑓 𝜅 𝑔 (1 + 3𝐶)𝜅 𝑔 116 Before analysing the first term of 𝜌, we look at 𝑎 1 𝛼 in two cases of 𝛾. 1 When 𝛾 = 𝜅 𝑓 𝐶𝛽 , we have 2𝐶 𝛽𝛾 2 1 𝑎1 𝛼 = = ≤ . 𝐶 𝛽𝛾 + 2 2𝜅 𝑓 + 1 𝜅 𝑓 1 When 𝛾 = (3𝐶+1) 𝛽 , we have   6𝐶 + 1 1 1 𝑎 1 𝛼 = min , ≤ . (12𝐶 + 3) 𝜅 𝑓 𝜅𝑓 1 In both cases, 𝑎 1 𝛼 ≤ 𝜅𝑓 . Therefore, the first term of 𝜌 becomes 1 − 𝜇𝜂(2 − 𝜇𝜂) 1 − 𝜅1𝑓 (2 − 1 𝜅𝑓 ) 1− 1 𝜅𝑓 1 ≤ =1− =1− . 1 − 𝑎1 𝛼 1 − 𝜅1𝑓 𝜅𝑓 − 1 𝜅𝑓 To summarize, we have     1 1 1 1 6𝐶 + 1 1 7𝐶 + 2 𝜌 ≤ 1 − min , , , , min , 𝜅 𝑓 2𝐶𝜅 𝑓 𝜅 𝑔 (1 + 3𝐶)𝜅 𝑔 2(1 + 𝐶)𝜅 𝑓 12𝐶 + 3 𝜅 𝑓 4(1 + 𝐶)(3𝐶 + 1) and therefore   1   1   1  𝜌 = max 1 − O ,1− O ,1− O . (1 + 𝐶)𝜅 𝑓 (1 + 𝐶)𝜅 𝑔 𝐶𝜅 𝑓 𝜅 𝑔 With full-gradient (i.e., 𝜎 = 0), we get 𝜖−accuracy solution with the total number of iterations 𝑘≥O e((1 + 𝐶)(𝜅 𝑓 + 𝜅 𝑔 ) + 𝐶𝜅 𝑓 𝜅 𝑔 ). When 𝐶 = 0, i.e., there is no compression, the iteration complexity recovers that of NIDS,  O e 𝜅 𝑓 + 𝜅𝑔 . 𝜅 𝑓 +𝜅 𝑔 When 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 , the complexity is improved to that of NIDS, i.e., the compression doesn’t harm the convergence in terms of the order of the coefficients. 117 Proof of Corollary 5. Note that (x 𝑘 ) ⊤ = X 𝑘 and 1𝑛×1 X∗ = X∗ , then 𝑛 ∑︁ 2 E∥x𝑖𝑘 − x 𝑘 ∥ 2 = E X 𝑘 − 1𝑛×1 X 𝑘 𝑖=1 2 = E X 𝑘 − X∗ + X∗ − 1𝑛×1 X 𝑘 𝑘 ∗ 1𝑛×1 1⊤𝑛×1  𝑘 ∗  =E X −X − X −X 𝑛 ≤ E∥X 𝑘 − X∗ ∥ 2 𝜌EL 𝑘−1 + 𝑛𝜂2 𝜎 2 (1 − 𝜌) −1 ≤ 1 − 𝑎1 𝛼 𝑛𝜂2 𝜎 2 ≤ 2𝜌 𝑘 L 0 + 2 . (B.31) 1−𝜌 The last inequality holds because we have 𝑎 1 𝛼 ≤ 1/2. Proof of Corollary 3. From the proof of Theorem 3, when 𝐶 = 0, we can set 𝛾 = 1, 𝛼 = 1, and 𝑎 1 = 0. Plug those values into 𝜌, and we obtain the convergence rate for NIDS. B.3.6 Proof of Theorem 4 𝐶 𝛽𝛾 Proof of Theorem 4. In order to get exact convergence, we pick diminishing step-size, set 𝛼 = 2(1+𝐶) , 2𝐶 𝛽𝛾 𝑘 1 𝐶𝛽 𝑎1 𝛼 = 𝐶 𝛽𝛾 𝑘 +2 , 𝜃1 = 2𝜆max ((I−W) † ) and 𝜃 2 = 2(1+𝐶) , then   𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 𝜌 𝑘 = max 1 − , 1 − 𝜃1 𝛾𝑘 , 1 − 𝜃2 𝛾𝑘 1 − 𝑎1 𝛼 If we further pick diminishing 𝜂 𝑘 and 𝛾 𝑘 such that 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 ≥ 𝑎 1 𝛼, then 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 𝑎1 𝛼 2𝐶 𝛽𝛾 𝑘 ≥ = ≥ 𝐶 𝛽𝛾 𝑘 . 1 − 𝑎1 𝛼 1 − 𝑎 1 𝛼 2 − 𝐶 𝛽𝛾 𝑘 √︁ Notice that 𝐶 𝛽𝛾 ≤ 23 since (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 is increasing in 𝐶 > 0 with limit 2 3 at ∞. In this case we only need, √︁ ! n (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) 2o 𝛾 𝑘 ∈ 0, min , , . (B.32) 𝐶𝛽 [4 − 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 )]𝐶 𝛽 𝛽 And 𝜌 𝑘 ≤ max {1 − 𝐶 𝛽𝛾 𝑘 , 1 − 𝜃 1 𝛾 𝑘 , 1 − 𝜃 2 𝛾 𝑘 } ≤ 1 − 𝜃 3 𝛾 𝑘 118 if 𝜃 3 = min{𝜃 1 , 𝜃 2 } and note that 𝜃 2 ≤ 𝐶 𝛽. We define L 𝑘 B (1 − 𝑎 1 𝛼𝑘 )∥X 𝑘 − X∗ ∥ 2 + (2𝜂2𝑘 /𝛾 𝑘 )E∥D 𝑘+1 − D∗ ∥ 2(I−W) † + 𝑎 1 ∥H 𝑘 − X∗ ∥ 2 . Hence EL 𝑘+1 ≤ (1 − 𝜃 3 𝛾 𝑘 )EL 𝑘 + 𝑛𝜎 2 𝜂2𝑘 . 𝜇𝜂 𝑘 (2−𝜇𝜂 𝑘 ) From 𝑎 1 𝛼 ≤ 2 , we get 4𝐶 𝛽𝛾 𝑘 ≤ 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ). 𝐶 𝛽𝛾 𝑘 + 2 If we pick 𝛾 𝑘 = 𝜃 4 𝜂 𝑘 , then it’s sufficient to let 2𝐶 𝛽𝜃 4 𝜂 𝑘 ≤ 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ). 𝜇 2(𝜇−𝐶 𝛽𝜃 4 ) 𝛾𝑘 Hence if 𝜃 4 < 𝐶𝛽 and let 𝜂∗ = 𝜇2 , then 𝜂 𝑘 = 𝜃4 ∈ (0, 𝜂∗ ) guarantees the above discussion and EL 𝑘+1 ≤ (1 − 𝜃 3 𝜃 4 𝜂 𝑘 )EL 𝑘 + 𝑛𝜎 2 𝜂2𝑘 . 
So far all restrictions for 𝜂 𝑘 are   2 𝜂 𝑘 ≤ min , 𝜂∗ 𝜇+𝐿 and ( √︁ ) 1 (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 𝜂𝑘 ≤ min , 𝜃4 𝐶𝛽 𝛽  √  n o 2 (3𝐶+1)− (3𝐶+1) 2 −4𝐶 2 1 2 Let 𝜃 5 = min 𝜇+𝐿 , 𝜂∗ , 𝐶 𝛽𝜃 4 , 𝛽𝜃 4 , 𝜂 𝑘 = 𝐵𝑘+𝐴 and 𝐷 = max 𝐴L 0 , 2𝑛𝜎 𝜃3 𝜃4 , we 𝜃3 𝜃4 2 claim that if we pick 𝐵 = 2 and some 𝐴, by setting 𝜂 𝑘 = 𝜃 3 𝜃 4 𝑘+2𝐴 , we get 𝐷 EL 𝑘 ≤ . 𝐵𝑘 + 𝐴 Induction: When 𝑘 = 0, it’s obvious. Suppose previous 𝑘 inequalities hold. Then 4𝑛𝜎 2   𝑘+1 2𝜃 3 𝜃 4 2𝐷 EL ≤ 1− + . 𝜃 3 𝜃 4 𝑘 + 2𝐴 𝜃 3 𝜃 4 𝑘 + 2𝐴 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 119 Multiply 𝑀 B (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴)(𝜃 3 𝜃 4 𝑘 + 2𝐴)(2𝐷) −1 on both sides, we get 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴)   𝑘+1 2𝜃 3 𝜃 4 𝑀EL ≤ 1− (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) + 𝜃 3 𝜃 4 𝑘 + 2𝐴 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴 − 2𝜃 3 𝜃 4 )(𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) + 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) = 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 + 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 2𝐴) − 4𝐷𝜃 3 𝜃 4 (𝜃 3 𝜃 4 𝑘 + 2𝐴) + 2𝐷𝜃 3 𝜃 4 (𝜃 3 𝜃 4 𝑘 + 2𝐴) = 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 2 −4𝐷 (𝜃 3 𝜃 4 ) + 4𝑛𝜎 𝜃 3 𝜃 4 + 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) ≤𝜃 3 𝜃 4 𝑘 + 2𝐴. Hence 2𝐷 EL 𝑘+1 ≤ 𝜃 3 𝜃 4 (𝑘 + 1) + 2𝐴 This induction holds for any 𝐴 such that 𝜂 𝑘 is feasible, i.e. 1 𝜂0 = ≤ 𝜃5. 𝐴 Here we summarize the definition of constant numbers: 1 𝐶𝛽 𝜃1 = † , 𝜃2 = , (B.33) 2𝜆max ((I − W) ) 2(1 + 𝐶)   𝜇 2(𝜇 − 𝐶 𝛽𝜃 4 ) 𝜃 3 = min{𝜃 1 , 𝜃 2 }, 𝜃 4 ∈ 0, , 𝜂∗ = , (B.34) 𝐶𝛽 𝜇2 ( √︁ ) 2 (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 𝜃 5 = min , 𝜂∗ , , . (B.35) 𝜇+𝐿 𝐶 𝛽𝜃 4 𝛽𝜃 4 1 2𝜃 5 Therefore, let 𝐴 = 𝜃5 and 𝜂 𝑘 = 𝜃 3 𝜃 4 𝜃 5 𝑘+2 , we get n o 1 0 2𝜎 2 𝜃 5 1 2 max 𝑛 L , 𝜃3 𝜃4 EL 𝑘 ≤ . 𝑛 𝜃3𝜃4𝜃5 𝑘 + 2 Since 1 − 𝑎 1 𝛼 𝑘 ≥ 1/2, we complete the proof. 120 APPENDIX C GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL C.1 Additional Results for the Preliminary Study In this section, we provide additional results on CiteSeer and PubMed datasets for the preliminary study in Section 4.2. The results on these two datasets are showed in Figure C.1, C.2, C.3 and C.4. It can be observed that residual connection helps obtain better performance on normal features but it is detrimental to abnormal features, which aligns with the findings in Section 4.2. 0.35 0.25 0.40 APPNP w/Res GCNII w/Res GCN w/Res 0.30 APPNP wo/Res GCNII wo/Res 0.35 GCN wo/Res 0.20 0.30 0.25 Accuracy Accuracy Accuracy 0.25 0.20 0.15 0.20 0.15 0.15 0.100 2 4 6 8 10 12 14 16 0.100 2 4 6 8 10 12 14 16 0.100 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.1: Node classification accuracy on abnormal nodes (CiteSeer). 0.65 0.7 0.7 0.6 0.6 0.5 0.5 Accuracy Accuracy Accuracy 0.60 0.4 0.4 0.3 0.3 APPNP w/Res 0.2 GCNII w/Res 0.2 GCN w/Res APPNP wo/Res GCNII wo/Res GCN wo/Res 0.550 2 4 6 8 10 12 14 16 0.10 2 4 6 8 10 12 14 16 0.10 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.2: Node classification accuracy on normal nodes (CiteSeer). 121 0.80 0.80 0.60 0.75 APPNP w/Res 0.75 GCNII w/Res GCN w/Res 0.70 APPNP wo/Res 0.70 GCNII wo/Res 0.55 GCN wo/Res 0.65 0.65 0.50 0.60 0.60 Accuracy Accuracy Accuracy 0.55 0.55 0.45 0.50 0.50 0.40 0.45 0.45 0.40 0.40 0.35 0.35 0.35 0.30 0.30 0.30 0.250 2 4 6 8 10 12 14 16 0.250 2 4 6 8 10 12 14 16 0.250 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.3: Node classification accuracy on abnormal nodes (PubMed). 
Figure C.4: Node classification accuracy on normal nodes (PubMed). (a) APPNP; (b) GCNII; (c) GCN.

C.2 Additional Experiments for the Proposed Method

In this section, we provide more experiments and an ablation study for the proposed AirGNN.

C.2.1 Experiments on More Datasets

In this subsection, we provide additional experiments for Section 4.4. In particular, we conduct the experiments for the noisy feature scenario on the following 5 datasets: Coauthor CS [95], Coauthor Physics [95], Amazon Computers [95], Amazon Photo [95], and ogbn-arxiv [113]. The node classification accuracy is shown in Figures C.5, C.6, C.7, C.8, and C.9, respectively. Specifically, the accuracy on abnormal nodes and normal nodes is plotted separately in (a) and (b), with respect to the ratio of noisy nodes. When the ratio of noisy nodes is within a reasonable range, we can observe that (1) AirGNN obtains much better accuracy on abnormal nodes on all datasets, which verifies its stronger resilience to abnormal features; and (2) AirGNN achieves better or sometimes comparable accuracy on normal nodes in most cases, which shows its capability to maintain good performance for normal nodes. However, when the noise ratio is very high, the performance of AirGNN drops quickly. This is because the modulation hyperparameter 𝜆 is tuned based on the clean dataset, so it is far from optimal for a highly noisy dataset. But it can be significantly improved by adjusting the hyperparameter 𝜆, as discussed in the next subsection. These results suggest the significant advantages of the adaptive residual in AirGNN and confirm the conclusion in the main paper. The adversarial attack on larger graphs is computationally expensive, so we omit the results on more datasets in the adversarial feature scenario. A sketch of this noisy-feature evaluation protocol is provided below.

Figure C.5: Node classification accuracy in noisy features scenario (Coauthor CS).

Figure C.6: Node classification accuracy in noisy features scenario (Coauthor Physics).
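The following is a minimal sketch of the evaluation protocol described above: a chosen ratio of nodes is treated as abnormal by perturbing their features, and accuracy is then reported separately on the abnormal and normal test nodes. The Gaussian noise model, the choice of perturbing test nodes only, the noise scale, and the PyTorch Geometric-style `data`/`model` interfaces are assumptions for illustration, not the exact setup used for Figures C.5–C.9.

```python
import torch

def evaluate_with_noisy_features(model, data, noise_ratio=0.1, noise_scale=1.0):
    # Pick a random subset of test nodes as "abnormal", perturb their features
    # with Gaussian noise, then report accuracy separately on abnormal (noisy)
    # and normal (clean) test nodes.
    test_idx = data.test_mask.nonzero(as_tuple=False).view(-1)
    num_noisy = int(noise_ratio * test_idx.numel())
    noisy_idx = test_idx[torch.randperm(test_idx.numel())[:num_noisy]]

    x = data.x.clone()
    x[noisy_idx] += noise_scale * torch.randn_like(x[noisy_idx])

    model.eval()
    with torch.no_grad():
        pred = model(x, data.edge_index).argmax(dim=-1)

    abnormal_mask = torch.zeros_like(data.test_mask)
    abnormal_mask[noisy_idx] = True
    normal_mask = data.test_mask & ~abnormal_mask
    acc_abnormal = (pred[abnormal_mask] == data.y[abnormal_mask]).float().mean().item()
    acc_normal = (pred[normal_mask] == data.y[normal_mask]).float().mean().item()
    return acc_abnormal, acc_normal
```

Sweeping `noise_ratio` over the values on the horizontal axes of the figures and averaging over several random seeds would reproduce the style of comparison reported in this subsection.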
0.7 0.9 GAT GAT 0.6 GCN 0.8 GCN GCNII GCNII 0.5 APPNP 0.7 APPNP AirGNN AirGNN 0.6 Accuracy Accuracy 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.2 0.01 2 3 4 5 8 10 15 20 25 30 0.11 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.8: Node classification accuracy in noisy features scenario (Amazon Photo). 0.7 0.8 GAT GAT 0.6 GCN 0.7 GCN GCNII GCNII 0.5 APPNP 0.6 APPNP AirGNN AirGNN Accuracy Accuracy 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.2 0.01 2 3 4 5 8 10 15 20 25 30 0.11 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.9: Node classification accuracy in noisy features scenario (ogbn-arxiv). C.2.2 AirGNN with Adjusted 𝜆 Note that in Figures C.5, C.6, C.7, C.8, and C.9, the performance of AirGNN drops significantly when the noise ratio is very large. This is because the modulation hyperparameter 𝜆 is tuned based 124 on the clean dataset such that it is far away from being optimal for highly noisy dataset. In fact, the performance of AirGNN can be significantly improved by adjusting 𝜆 during test time according to the performance on the validation set. Taking the Coauthor CS [95] dataset as an example, we compare AirGNN with APPNP and we tune the hyperparameter 𝜆 and 𝛼 for them (denoted as AirGNN-tuned and APPNP-tuned) for a fair comparison as showed in Figure C.10. The result verifies that AirGNN-tuned gets tremendous improvement on both abnormal and normal nodes by adjusting 𝜆. However, APPNP-tuned only focuses on improving global performance and overlooks the abnormal nodes after adjusting 𝛼 based on validation performance so that the performance on abnormal node are much worse. 1.0 1.0 APPNP 0.9 APPNP-tuned 0.8 AirGNN 0.9 AirGNN-tuned 0.7 0.6 0.8 Accuracy Accuracy 0.5 0.7 0.4 0.3 0.6 APPNP 0.2 APPNP-tuned 0.1 AirGNN 0.5 AirGNN-tuned 0.01 2 3 4 5 8 10 15 20 25 30 1 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.10: Node classification accuracy in noisy features scenario with adjustment (Coauthor CS). C.2.3 Detailed Comparison with APPNP Figure 4.1 in Section 4.2 shows that APPNP without residual performs well on the noisy nodes. Therefore, in order to demonstrate the advantages of AirGNN, it is of interest to make a detailed comparison between AirGNN and the two variants of APPNP (w/Res and wo/Res). We evaluate their performance on noisy nodes, normal nodes, and overall nodes on Cora dataset, and the results under varying noise ratio are summarized in Table C.1, Table C.2, and Table C.3. We can make the following observations: • In Table C.1, both AirGNN and APPNP wo/Res significantly outperform APPNP w/Res on noisy nodes, and AirGNN achieves comparable performance with APPNP wo/Res. This verifies 125 that the residual connection in GNN amplifies the vulnarability to abnormal features, and AirGNN is able to adaptively adjust the residual connections for abnormal nodes to reduce the vulnerability. • In Table C.2 and Table C.3, AirGNN consistently outperforms APPNP wo/Res, which verifies the importance of residual connections in maintaining good performance on normal nodes. AirGNN exhibits much better performance than APPNP w/Res, which shows the benefits of removing abnormal features by adaptive residual. • APPNP wo/Res is a special case of AirGNN with 𝜆 = 0. 
Moreover, as noted in Section C.2.2, the performance of AirGNN in Table C.1, Table C.2, and Table C.3 can be further improved by adjusting the modulation hyperparameter 𝜆 for each noise ratio according to validation performance. As discussed in Section 4.3, in existing GNNs such as APPNP and GCNII, the conflict between feature aggregation and residual connection can only be partially mitigated by adjusting the residual weight 𝛼. However, such global adjustment cannot be adaptive to a subset of the nodes, which explains the advantages of AirGNN in above observations. In the adversarial feature setting, we can make similar observations but here we omit the comparison. Table C.1: Comparison between APPNP and AirGNN on abnormal (noisy) nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.167 ± 0.034 0.170 ± 0.070 0.170 ± 0.027 0.193 ± 0.031 0.187 ± 0.024 0.178 ± 0.026 APPNP wo/Res 0.469 ± 0.035 0.442 ± 0.062 0.427 ± 0.038 0.381 ± 0.043 0.383 ± 0.045 0.354 ± 0.067 AirGNN 0.474 ± 0.048 0.433 ± 0.055 0.405 ± 0.050 0.362 ± 0.039 0.353 ± 0.050 0.337 ± 0.057 Table C.2: Comparison between APPNP and AirGNN on normal nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.773 ± 0.015 0.712 ± 0.024 0.669 ± 0.019 0.622 ± 0.024 0.580 ± 0.032 0.530 ± 0.029 APPNP wo/Res 0.761 ± 0.014 0.709 ± 0.025 0.664 ± 0.015 0.599 ± 0.025 0.556 ± 0.035 0.497 ± 0.049 AirGNN 0.791 ± 0.015 0.741 ± 0.021 0.688 ± 0.024 0.625 ± 0.034 0.571 ± 0.039 0.527 ± 0.042 126 Table C.3: Comparison between APPNP and AirGNN on all nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.743 ± 0.015 0.657 ± 0.026 0.594 ± 0.017 0.536 ± 0.024 0.482 ± 0.025 0.425 ± 0.025 APPNP wo/Res 0.746 ± 0.013 0.682 ± 0.026 0.628 ± 0.015 0.556 ± 0.027 0.513 ± 0.034 0.455 ± 0.053 AirGNN 0.775 ± 0.015 0.710 ± 0.021 0.646 ± 0.025 0.572 ± 0.033 0.516 ± 0.038 0.470 ± 0.044 C.2.4 Comparison with Robust Model To further demonstrate the advantages of the proposed AirGNN, we compare it with a representative robust model, Robust GCN [128]. Tables C.11, C.12 and C.13 show the performance comparison between Robust GCN and AirGNN on Cora, Citeseer and PubMed, respectively. The accuracy on abnormal nodes and normal nodes are plotted separately in (a) and (b), with respect to the ratio of noisy nodes. These figures show that AirGNN achieves significant better performance than Robust GCN on both abnormal and normal nodes in the noisy feature scenario. 0.7 0.9 Robust GCN AirGNN 0.8 0.6 0.7 0.5 Accuracy Accuracy 0.6 0.4 0.5 0.3 0.4 0.2 0.3 Robust GCN AirGNN 0.11 2 3 4 5 8 10 15 20 25 30 0.21 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.11: Node classification accuracy in noisy features scenario (Cora). 0.4 0.7 0.3 0.6 Accuracy Accuracy 0.5 0.2 0.4 0.1 0.3 Robust GCN Robust GCN AirGNN AirGNN 0.01 2 3 4 5 8 10 15 20 25 30 0.21 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.12: Node classification accuracy in noisy features scenario (CiteSeer). 127 0.8 0.7 0.8 0.6 0.5 Accuracy 0.7 Accuracy 0.4 0.3 0.6 0.2 0.1 Robust GCN Robust GCN AirGNN AirGNN 0.01 2 3 4 5 8 10 15 20 25 30 0.51 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.13: Node classification accuracy in noisy features scenario (PubMed). 
BIBLIOGRAPHY

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI’16, pages 265–283, 2016.
[2] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43, 2005.
[3] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
[5] Angelica I Aviles-Rivero, Nicolas Papadakis, Ruoteng Li, Samar M Alsaleh, Robby T Tan, and Carola-Bibiane Schonlieb. When labelled data hurts: Deep semi-supervised classification with the graph 1-laplacian. arXiv preprint arXiv:1906.08635, 2019.
[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.
[7] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SIGNSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 559–568, 2018.
[8] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT’2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.
[9] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[10] Xavier Bresson, Thomas Laurent, David Uminsky, and James H. von Brecht. An adaptive total variation algorithm for computing the balanced cut of a graph, 2013.
[11] Xavier Bresson, Thomas Laurent, David Uminsky, and James H Von Brecht. Multiclass total variation clustering. arXiv preprint arXiv:1306.1185, 2013.
[12] Thomas Bühler and Matthias Hein. Spectral clustering based on the graph p-laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 81–88, 2009.
[13] Ruggero Carli, Fabio Fagnani, Paolo Frasca, and Sandro Zampieri. Gossip consensus algorithms via quantized communication. Automatica, 46(1):70–80, 2010.
[14] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1725–1735. PMLR, 13–18 Jul 2020.
[15] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2):025011, 2013.
[16] Siheng Chen, Yonina C. Eldar, and Lingxiao Zhao. Graph unrolling networks: Interpretable neural networks for graph signal denoising, 2020.
[17] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[18] Yu Chen, Lingfei Wu, and Mohammed Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. Advances in Neural Information Processing Systems, 33, 2020.
[19] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
[20] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013.
[21] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
[22] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1223–1231, USA, 2012.
[23] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3844–3852, 2016.
[24] Tyler Derr, Yao Ma, Wenqi Fan, Xiaorui Liu, Charu Aggarwal, and Jiliang Tang. Epidemic graph convolutional network. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 160–168, 2020.
[25] Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.
[26] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, March 1975.
[27] Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. All you need is low (rank): defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 169–177, 2020.
[28] Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. Graph trend filtering networks for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–121, 2022.
[29] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
[30] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[31] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.
[32] William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159, 2020.
[33] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.
[34] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[36] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
[37] Wei Jin, Yaxin Li, Han Xu, Yiqi Wang, Shuiwang Ji, Charu Aggarwal, and Jiliang Tang. Adversarial attacks and defenses on graphs. ACM SIGKDD Explorations Newsletter, 22(2):19–34, 2021.
[38] Wei Jin, Xiaorui Liu, Yao Ma, Charu Aggarwal, and Jiliang Tang. Towards feature overcorrelation in deeper graph neural networks. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2022.
[39] Wei Jin, Xiaorui Liu, Yao Ma, Tyler Derr, Charu Aggarwal, and Jiliang Tang. Graph feature gating networks. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM ’21, pages 813–822, New York, NY, USA, 2021. Association for Computing Machinery.
[40] Wei Jin, Xiaorui Liu, Xiangyu Zhao, Yao Ma, Neil Shah, and Jiliang Tang. Automated self-supervised learning for graphs. In International Conference on Learning Representations, 2022.
[41] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 66–74, 2020.
[42] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681, 2019.
[43] A. Jung, A. O. Hero, III, A. C. Mara, S. Jahromi, A. Heimowitz, and Y. C. Eldar. Semi-supervised learning in network-structured data via total variation minimization. IEEE Transactions on Signal Processing, 67(24):6256–6269, 2019.
[44] Alexander Jung, Alfred O Hero III, Alexandru Mara, and Saeed Jahromi. Semi-supervised learning via sparse label propagation. arXiv preprint arXiv:1612.01414, 2016.
[45] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Urban Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3252–3261. PMLR, 2019.
[46] Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM review, 51(2):339–360, 2009.
[47] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[48] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[49] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
[50] Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020.
[51] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the 36th International Conference on Machine Learning, pages 3479–3487. PMLR, 2019.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[53] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In The IEEE International Conference on Computer Vision (ICCV), 2019.
[54] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 583–598, Berkeley, CA, USA, 2014. USENIX Association.
[55] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[56] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized composite optimization with compression. arXiv preprint arXiv:2108.04448, 2021.
[57] Yao Li and Ming Yan. On linear convergence of two decentralized algorithms. arXiv preprint arXiv:1906.07225, 2019.
[58] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. DeepRobust: A PyTorch library for adversarial attacks and defenses, 2020.
[59] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
[60] Zhi Li and Ming Yan. New convergence analysis of a primal dual algorithm with large stepsizes. Advances in Computational Mathematics, 2021.
[61] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2737–2745. Curran Associates, Inc., 2015.
[62] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
[63] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(15):4051–4064, 2015.
[64] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil K. Jain, and Jiliang Tang. Trustworthy AI: A computational perspective. ACM Trans. Intell. Syst. Technol., June 2022. Just Accepted.
[65] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2020.
[66] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[67] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6837–6849. PMLR, 18–24 Jul 2021.
[68] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[69] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear convergent decentralized optimization with compression. In International Conference on Learning Representations, 2021.
[70] Ignace Loris and Caroline Verhoeven. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12):125007, 2011.
[71] Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[72] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? In International Conference on Learning Representations, 2022.
[73] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 2021.
[74] Yao Ma and Jiliang Tang. Deep Learning on Graphs. Cambridge University Press, 2020.
[75] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. IEEE Transactions on Signal Processing, 68:6101–6116, 2020.
[76] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
[77] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
[78] Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
[79] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[80] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[81] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[82] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and Martin Takac. SGD and Hogwild! Convergence without the bounded gradients assumption. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3750–3758, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[83] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Unsupervised and semi-supervised learning via ℓ1-norm graph. In 2011 International Conference on Computer Vision, pages 2268–2273. IEEE, 2011.
[84] Hoang Nt and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.
[85] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
[86] Xuran Pan, Shiji Song, and Gao Huang. A unified framework for convolution-based graph neural networks. https://openreview.net/forum?id=zUMD–Fb9Bt, 2020.
[87] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Mathematical Programming, pages 1–49, 2020.
[88] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947, 2019.
[89] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pages 8388–8399, 2019.
[90] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
[91] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
[92] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.
[93] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[94] James Sharpnack, Aarti Singh, and Alessandro Rinaldo. Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pages 1028–1036. PMLR, 2012.
[95] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
[96] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[97] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 4452–4463, USA, 2018. Curran Associates Inc.
[98] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488–1492, 2015.
[99] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[100] Arthur Szlam and Xavier Bresson. Total variation, cheeger cuts. In ICML, 2010.
[101] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7652–7662. 2018.
[102] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Decentralization meets error-compensated compression. CoRR, abs/1907.07346, 2019.
[103] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D²: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, pages 4848–4856, 2018.
[104] Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6155–6165, 2019.
[105] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[106] Ryan J Tibshirani et al. Adaptive piecewise polynomial estimation via trend filtering. Annals of statistics, 42(1):285–323, 2014.
[107] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986.
[108] Rohan Varma, Harlin Lee, Jelena Kovačević, and Yuejie Chi. Vector-valued graph trend filtering with non-convex penalties. IEEE Transactions on Signal and Information Processing over Networks, 6:48–62, 2019.
[109] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[110] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636–3645, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[111] Yu-Xiang Wang, James Sharpnack, Alexander J Smola, and Ryan J Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17:1–41, 2016.
[112] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 1306–1316, USA, 2018. Curran Associates Inc.
[113] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
[114] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
[115] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861–6871. PMLR, 2019.
[116] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, 10–15 Jul 2018.
[117] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020.
[118] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
[119] Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11492–11501. PMLR, 18–24 Jul 2021.
[120] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Accelerated primal-dual algorithms for distributed smooth convex optimization over networks. In International Conference on Artificial Intelligence and Statistics, pages 2381–2391. PMLR, 2020.
[121] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5453–5462. PMLR, 10–15 Jul 2018.
[122] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, pages 1:1–1:10, New York, NY, USA, 2018. ACM.
[123] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
[124] Kun Yuan, Wei Xu, and Qing Ling. Can primal methods outperform primal-dual methods in decentralized dynamic optimization? arXiv preprint arXiv:2003.00816, 2020.
[125] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part i: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
[126] Lingxiao Zhao and Leman Akoglu. PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations, 2019.
[127] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. 2004.
[128] Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1399–1407, 2019.
[129] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework, 2021.
[130] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation.
[131] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. 2005.
[132] Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 2021.
[133] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.
[134] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In KDD. ACM, 2018.
[135] Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412, 2019.