EFFICIENT AND SECURE MESSAGE PASSING FOR MACHINE LEARNING

By

Xiaorui Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

EFFICIENT AND SECURE MESSAGE PASSING FOR MACHINE LEARNING

By

Xiaorui Liu

Machine learning (ML) techniques have brought a revolutionary impact to human society, and they will continue to act as technological innovators in the future. To broaden their impact, it is urgent to solve the emerging and critical challenges in machine learning, such as efficiency and security issues.

On the one hand, ML models have become increasingly powerful thanks to big data and big models, but this also brings tremendous challenges in designing efficient optimization algorithms to train big ML models on big data. The most effective approach for large-scale ML is to parallelize the computation tasks on distributed systems composed of many computing devices. However, in practice, the scalability and efficiency of these systems are greatly limited by information synchronization since the message passing between devices dominates the total running time. In other words, the major bottleneck lies in the high communication cost between devices, especially when the scale of the system and the models grows while the communication bandwidth remains relatively limited. This communication bottleneck often limits the practical speedup of distributed ML systems.

On the other hand, recent research has revealed that many ML models suffer from security vulnerabilities. In particular, deep learning models can be easily deceived by unnoticeable perturbations in the data. Meanwhile, graphs are a prevalent data structure for many kinds of real-world data; they encode pairwise relations between entities in, for example, social networks, transportation networks, and chemical molecules. Graph neural networks (GNNs) generalize and extend the representation learning power of traditional deep neural networks (DNNs) from regular grids, such as images, videos, and text, to irregular graph-structured data through message passing frameworks. Therefore, many important applications on these data can be treated as computational tasks on graphs, such as recommender systems, social network analysis, and traffic prediction. Unfortunately, the vulnerability of deep learning models also translates to GNNs, which raises significant concerns about their applications, especially in safety-critical areas. Therefore, it is critical to design intrinsically secure ML models for graph-structured data.

The primary objective of this dissertation is to develop solutions to these challenges through innovative research and principled methods. In particular, we propose multiple distributed optimization algorithms with efficient message passing to mitigate the communication bottleneck and speed up ML model training in distributed ML systems. We also propose multiple secure message passing schemes as the building blocks of graph neural networks, aiming to significantly improve the security and robustness of ML models.

To my parents and family for their love and support.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help and support of my advisor, Dr. Jiliang Tang. I would like to express my deepest appreciation to him for his invaluable advice, inspiration, encouragement, and support during my Ph.D. research.
I have learned many skills from him that benefit my life: how to identify important and challenging research problems, how to write papers and give presentations, how to collaborate, how to mentor students, how to write proposals, how to volunteer and serve the community, and how to establish my career and build a big vision. Dr. Tang is not only an advisor in research but also a sincere friend who guided me with his experience in many aspects of my life. Dr. Tang, I cannot thank you enough.

I would like to thank my committee members, Dr. Ming Yan, Dr. Anil K. Jain, and Dr. Charu Aggarwal, for their helpful suggestions and insightful comments. I took Dr. Ming Yan's Optimization course in my first year at MSU, which gave me a solid technical background and benefited my Ph.D. research a lot. I always consider Dr. Ming Yan my secondary advisor because many of my research ideas were initialized through his insightful discussions, and a large portion of this dissertation was done in collaboration with him. Dr. Anil K. Jain provided numerous suggestions for my research and career, and his pursuit of broader impact greatly reshaped my research vision. I was fortunate to collaborate with Dr. Charu Aggarwal on multiple research papers, and I am impressed by his knowledge and insights. In addition, his dedication to book writing and online education is very inspiring to me.

I was fortunate to work as an intern at JD.com, Kwai AI Lab, and TAL AI Lab with amazing colleagues and mentors: Dr. Dawei Yin, Dr. Hongshen Chen, Dr. Ziheng Jiang, and Dr. Zhaochun Ren from JD.com; Dr. Ji Liu and Dr. Xiangru Lian from Kwai AI Lab; and Dr. Zitao Liu from TAL AI Lab. I enjoyed the productive and wonderful summers with you.

During my Ph.D. study, many friends and colleagues provided me with consistent support, encouragement, and happiness. I am thankful to my friends and colleagues from the Data Science and Engineering Lab: Dr. Tyler Derr, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Xiangyu Zhao, Dr. Hamid Karimi, Dr. Yao Ma, Chenxing Wang, Daniel K.O-Dankwa, Hansheng Zhao, Haochen Liu, Han Xu, Jamell Dacon, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Juanhui Li, Harry Shomer, Jie Ren, Jiayuan Ding, Haoyu Han, Hongzhi Wen, Yuxuan Wan, Pengfei He, Hua Liu, Dr. Xin Wang, Dr. Jiangtao Huang, Dr. Xiaoyang Wang, Dr. Meznah Almutairy, Namratha Shah, and Norah Alfadhli. In particular, thanks to my DSE collaborators: Dr. Tyler Derr, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Hamid Karimi, Dr. Xiangyu Zhao, Dr. Yao Ma, Haochen Liu, Han Xu, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Jiayuan Ding, Haoyu Han, Hua Liu, and Yuxuan Wan.

I would like to extend my sincere thanks to all my collaborators from outside the Data Science and Engineering Lab: Dr. Ming Yan, Dr. Yao Li, Dr. Rongrong Wang, Dr. Charu Aggarwal, Dr. Anil K. Jain, Dr. Dawei Yin, Dr. Hongshen Chen, Dr. Qing Li, Dr. Suhang Wang, Dr. Xianfeng Tang, Dr. Hui Liu, Dr. Zitao Liu, Dr. Kuan Yuan, and Dr. Neil Shah.

Finally, I would like to thank my family for their love and support. I also dedicate this dissertation to Xinyi Lu for supporting me all the way!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1  INTRODUCTION
  1.1 Research Challenges
  1.2 Contributions
  1.3 Organization
CHAPTER 2  A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING
  2.1 Introduction
  2.2 Related Work
  2.3 Algorithm
    2.3.1 Proposed Algorithm
    2.3.2 Discussion
  2.4 Convergence Analysis
    2.4.1 Strongly Convex Case
    2.4.2 Nonconvex Case
  2.5 Experiment
    2.5.1 Strongly Convex
    2.5.2 Nonconvex
    2.5.3 Communication Efficiency
  2.6 Conclusion
CHAPTER 3  LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION
  3.1 Introduction
  3.2 Related Work
  3.3 Algorithm
  3.4 Theoretical Analysis
  3.5 Numerical Experiment
  3.6 Conclusion
CHAPTER 4  GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL
  4.1 Introduction
  4.2 Preliminary
    4.2.1 Preliminary Study
    4.2.2 Understandings
  4.3 Algorithm
    4.3.1 Design Motivation
    4.3.2 Adaptive Message Passing
    4.3.3 Interpretation of AMP
    4.3.4 Model Architecture
  4.4 Experiment
    4.4.1 Experimental Settings
    4.4.2 Performance Comparison with Noisy Features
    4.4.3 Performance Comparison with Adversarial Features
    4.4.4 Adaptive Residual for Abnormal & Normal Nodes
    4.4.5 Performance in the Clean Setting
  4.5 Related Work
  4.6 Conclusion
CHAPTER 5  ELASTIC GRAPH NEURAL NETWORKS
  5.1 Introduction
  5.2 Preliminary
    5.2.1 GNNs as Graph Signal Denoising
    5.2.2 Graph Trend Filtering
  5.3 Algorithm
    5.3.1 Elastic Graph Signal Estimator
    5.3.2 Elastic Message Passing
    5.3.3 Elastic GNNs
  5.4 Experiment
    5.4.1 Experimental Settings
    5.4.2 Performance on Benchmark Datasets
    5.4.3 Robustness Under Adversarial Attack
    5.4.4 Ablation Study
  5.5 Related Work
  5.6 Conclusion
CHAPTER 6  CONCLUSION
  6.1 Summary
  6.2 Future Direction
APPENDICES
  APPENDIX A  A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING
  APPENDIX B  LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION
  APPENDIX C  GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: A comparison between related algorithms. DORE is able to converge linearly to the O(𝜎) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE doesn't require the assumptions of bounded compression error or bounded gradient.
Table 4.1: Data statistics on benchmark datasets.
Table 4.2: Dataset statistics for adversarially attacked datasets.
Table 4.3: Average adaptive score (𝛽) and residual weight (1 − 𝛽) in the noisy feature scenario.
Table 4.4: Average adaptive score (𝛽) and residual weight (1 − 𝛽) in the adversarial feature scenario.
Table 4.5: Comparison between AirGNN, APPNP, and Robust GCN in the clean setting.
Table 5.1: Statistics of benchmark datasets.
Table 5.2: Dataset statistics for adversarially attacked graphs.
Table 5.3: Classification accuracy (%) on benchmark datasets with 10 times random data splits.
Table 5.4: Classification accuracy (%) under different perturbation rates of adversarial graph attack.
Table 5.5: Ratio between average node differences along wrong and correct edges.
Table 5.6: Sparsity ratio (i.e., ∥(Δ̃F)_𝑖∥_2 < 0.1) in node differences Δ̃F.
Table B.1: Parameter settings for the linear regression problem.
Table B.2: Parameter settings for the logistic regression problem (full-batch gradient).
Table B.3: Parameter settings for the logistic regression problem (mini-batch gradient).
Table B.4: Parameter settings for the deep neural network (* means divergence for all options we try).
Table C.1: Comparison between APPNP and AirGNN on abnormal (noisy) nodes (Cora).
Table C.2: Comparison between APPNP and AirGNN on normal nodes (Cora).
Table C.3: Comparison between APPNP and AirGNN on all nodes (Cora).

LIST OF FIGURES

Figure 1.1: Distributed ML systems.
Figure 1.2: Graph representation learning for node-focus tasks.
Figure 2.1: An illustration of DORE.
Figure 2.2: Linear regression on synthetic data. When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.
Figure 2.3: The norm of the variable being compressed in the linear regression experiment.
Figure 2.4: LeNet trained on MNIST. DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method and has similar performance to DoubleSqueeze (topk).
Figure 2.5: Resnet18 trained on CIFAR10. DORE achieves similar convergence and accuracy to most baselines. DoubleSqueeze converges more slowly and suffers a higher loss, but it works well with topk compression.
Figure 2.6: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE. It is tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.
Figure 3.1: Linear regression problem.
Figure 3.2: Logistic regression problem in the heterogeneous case (full-batch gradient).
Figure 3.3: Logistic regression in the heterogeneous case (mini-batch gradient).
Figure 3.4: Stochastic optimization on deep neural network (∗ means divergence).
Figure 3.5: Parameter analysis on linear regression problem.
Figure 4.1: Node classification accuracy on abnormal nodes (Cora).
Figure 4.2: Node classification accuracy on normal nodes (Cora).
Figure 4.3: Diagram of Adaptive Message Passing.
Figure 4.4: Adaptive Message Passing (AMP).
Figure 4.5: Node classification accuracy on abnormal (noisy) nodes.
Figure 4.6: Node classification accuracy on normal nodes.
Figure 4.7: Node classification accuracy on adversarial nodes.
Figure 5.1: Elastic Message Passing (EMP). F^0 = X_in and Z^0 = 0_{𝑚×𝑑}.
Figure 5.2: Classification accuracy under different propagation steps.
Figure 5.3: Convergence of the objective value for the problem in Eq. (5.8) during message passing.
Figure A.1: Linear regression on synthetic data.
Figure A.2: LeNet trained on MNIST dataset.
Figure A.3: Resnet18 trained on CIFAR10 dataset.
Figure A.4: Resnet18 trained on CIFAR10 dataset with 1Gbps network bandwidth.
Figure A.5: Resnet18 trained on CIFAR10 dataset with 200Mbps network bandwidth.
Figure A.6: Training under different compression block sizes.
Figure A.7: Training under different 𝛼.
Figure A.8: Training under different 𝛽.
Figure A.9: Training under different 𝜂.
Figure B.1: Relative compression error ∥x − 𝑄(x)∥^2 / ∥x∥^2 for p-norm b-bit quantization.
Figure B.2: Comparison of compression error ∥x − 𝑄(x)∥^2 / ∥x∥^2 between different compression methods.
Figure B.3: Logistic regression in the homogeneous case (full-batch gradient).
Figure B.4: Logistic regression in the homogeneous case (mini-batch gradient).
Figure C.1: Node classification accuracy on abnormal nodes (CiteSeer).
Figure C.2: Node classification accuracy on normal nodes (CiteSeer).
Figure C.3: Node classification accuracy on abnormal nodes (PubMed).
Figure C.4: Node classification accuracy on normal nodes (PubMed).
Figure C.5: Node classification accuracy in noisy features scenario (Coauthor CS).
Figure C.6: Node classification accuracy in noisy features scenario (Coauthor Physics).
Figure C.7: Node classification accuracy in noisy features scenario (Amazon Computers).
Figure C.8: Node classification accuracy in noisy features scenario (Amazon Photo).
Figure C.9: Node classification accuracy in noisy features scenario (ogbn-arxiv).
Figure C.10: Node classification accuracy in noisy features scenario with adjustment (Coauthor CS).
Figure C.11: Node classification accuracy in noisy features scenario (Cora).
Figure C.12: Node classification accuracy in noisy features scenario (CiteSeer).
Figure C.13: Node classification accuracy in noisy features scenario (PubMed).

LIST OF ALGORITHMS

Algorithm 1: DORE
Algorithm 2: DORE with 𝑅(x) = 0
Algorithm 3: LEAD
Algorithm 4: LEAD in Agent's Perspective

CHAPTER 1
INTRODUCTION

Machine learning (ML) techniques have brought a revolutionary impact to human society, and they will continue to act as technological innovators in the future. In recent years, critical challenges in machine learning, such as efficiency and security issues, have broadly emerged. These issues greatly limit the applications of ML techniques in many scientific and application domains. To broaden the impact of ML techniques, it is urgent to solve these emerging and critical research challenges.

On the one hand, ML models have become increasingly powerful thanks to big data and big models, but this also brings tremendous challenges in designing efficient optimization algorithms to train big ML models on big data. The most effective approach for large-scale ML is to parallelize the computation tasks on distributed systems composed of many computing devices. There are two mainstream trends of distributed and parallel ML systems: (1) large-scale ML models trained by powerful distributed computing systems in data centers; and (2) on-device distributed training on resource-limited edge devices (e.g., smartphones, AR/VR headsets, drones, and billions of Internet of Things devices) using the massive amount of data they generate on a daily basis. In both cases, the scalability and efficiency of the systems are greatly limited since the slow information synchronization between the devices dominates the total running time. In other words, the major bottleneck lies in the high communication cost between devices, especially when the scale of the system and the models grows while the communication bandwidth remains relatively limited.

For instance, in centralized distributed ML systems, as shown in Figure 1.1a, every computing node needs to frequently synchronize information with other nodes through the central server by passing messages (e.g., model or gradient information) [4, 92, 114, 68]. This message passing process becomes the dominant cost when the communication network bandwidth is limited and the number of computing nodes is massive. In decentralized distributed ML systems, as shown in Figure 1.1b, every computing node only needs to synchronize information with its one-hop neighbors by passing messages (e.g., the model information). Although this is more scalable in terms of the number of computing nodes, the communication bottleneck still exists under limited network bandwidth [101, 51, 50, 75]. The communication bottleneck often prevents distributed ML systems from achieving their theoretical speedup. Therefore, how to design distributed learning algorithms and systems with efficient message passing becomes a promising and key research direction for solving the efficiency challenge in ML.

Figure 1.1: Distributed ML systems. (a) Centralized learning; (b) decentralized learning.
On the other hand, recent research has generally revealed that many ML models suffer from security vulnerabilities. In particular, deep learning models can be easily deceived by unnoticeable perturbations in the data [30, 99, 21]. Meanwhile, graphs are a prevalent data structure for many kinds of real-world data; they encode pairwise relations between entities in, for example, social networks, transportation networks, and chemical molecules. Graph neural networks (GNNs) generalize and extend the representation learning power of traditional deep neural networks (DNNs) from regular grids, such as images, videos, and text, to irregular graph-structured data, as shown in Figure 1.2. Therefore, many important applications on these data can be treated as computational tasks on graphs [74, 32]. For instance, product recommendation in e-commerce and friend recommendation in social network analysis can be formulated as link prediction tasks on graphs, and traffic prediction in transportation systems can be formulated as node classification or regression on graphs. The key building block for such generalization is the message passing framework that propagates features from neighboring nodes in the graph. The message passing layer offers node permutation invariance and support for arbitrary neighborhood sizes in graphs. Despite the promising performance of GNNs in clean data settings, unfortunately, the vulnerability of deep learning models also translates to GNNs [134, 135, 37, 41] when the data contains adversarial perturbations. For instance, the performance greatly degrades when the node features or graph structure are modified by adversaries [41]. This raises significant concerns about the applications of GNNs, especially in safety-critical areas. Therefore, it is critical to design intrinsically secure ML models for graph-structured data.

Figure 1.2: Graph representation learning for node-focus tasks.

1.1 Research Challenges

From the above research background, we can summarize the research challenges as follows:

• How to design scalable distributed ML systems and algorithms with efficient message passing between computing devices such that the communication bottleneck can be largely mitigated?
• How to maintain the convergence behavior of the optimization algorithm, both theoretically and empirically, when communication efficiency is improved?
• How to design intrinsically secure ML models that are more robust to potential threats such as feature or graph structure attacks by adversaries?
• How to bypass the tradeoff between performance under clean and adversarial settings? In other words, can we maintain good performance when the data are clean while providing strong security when the data are adversarially perturbed?

1.2 Contributions

The primary objective of this dissertation is to develop solutions to these challenges through innovative research and principled methods. In particular, we propose multiple distributed optimization algorithms with efficient message passing to mitigate the communication bottleneck and speed up ML model training in distributed ML systems. We also propose multiple secure message passing schemes as the building blocks of graph neural networks, aiming to significantly improve the security and robustness of ML models.
The contributions of this dissertation are summarized as follows:

• To fundamentally improve the efficiency of distributed ML systems, I proposed a series of innovative algorithms to break through the communication bottleneck. In particular, when the communication network is a star network as shown in Figure 1.1a, I proposed DORE [68], a double residual compression algorithm, to compress the bi-directional communication between client devices and the server such that over 95% of the communication bits can be reduced. This is the first algorithm that reduces this much communication cost while maintaining the same convergence complexity (e.g., linear convergence) as the uncompressed counterpart, both theoretically and numerically.
• When the communication network has any general topology (as long as it is connected) as shown in Figure 1.1b, I proposed LEAD [69], the first linearly convergent decentralized optimization algorithm with communication compression, which only requires point-to-point compressed communication between neighboring devices over the communication network. Theoretically, we prove that under certain compression ratios, the convergence complexity of the proposed algorithm does not depend on the compression operator. In other words, it achieves better communication efficiency for free.
• To design intrinsically secure ML models against feature attacks, I investigate how to use graph structural information to denoise hidden features in neural network layers that are corrupted by adversarial perturbations. This is achieved by the proposed AirGNN [66], in which the adaptive message passing denoises perturbed features by feature aggregations and maintains feature separability by adaptive residuals. The proposed algorithm has a clear design principle and interpretation, as well as strong performance in both the clean and adversarial data settings.
• To design intrinsically secure ML models against graph structure attacks, I investigate a new smoothness prior in the design of graph neural networks. In particular, we derive an elastic message passing scheme to model piecewise constant signals in graph data. We demonstrate its stronger resilience to adversarial structure attacks and its superior performance when the data is clean through a comprehensive empirical study on the proposed model ElasticGNN [67].

1.3 Organization

The remainder of this dissertation is organized as follows. In Chapter 2, we introduce DORE, a centralized distributed optimization algorithm with communication compression. In Chapter 3, we introduce LEAD, the first linearly convergent decentralized distributed optimization algorithm with communication compression. In these two chapters, we demonstrate how to significantly improve the efficiency and scalability of distributed ML systems by the co-design of efficient message passing and optimization algorithms. In Chapter 4, we investigate the possibility of utilizing graph structural information to defend against abnormal features caused by noise or adversarial perturbations. We derive a novel adaptive message passing scheme from a principled graph signal denoising perspective. In Chapter 5, we study a new smoothness prior, i.e., piecewise constant signals, for graph representation learning. We derive the elastic message passing scheme to model adaptive local smoothness in graph data.
In these two chapters, we demonstrate how these secure message passing algorithms can be used as fundamental building blocks in the design of graph neural networks to defend against feature and graph structure attacks, through the examples of AirGNN and ElasticGNN, respectively. We conclude the dissertation and discuss the broader impact and promising research directions in Chapter 6.

CHAPTER 2
A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the message passing cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this chapter, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over 95% of the overall communication in message passing such that this obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with state-of-the-art baselines.

2.1 Introduction

Stochastic gradient algorithms [8] are efficient at minimizing the objective function 𝑓 : R^𝑑 → R, which is usually defined as 𝑓(x) := E_{𝜉∼D}[ℓ(x, 𝜉)], where ℓ(x, 𝜉) is the objective function defined on data sample 𝜉 and model parameter x. A basic stochastic gradient descent (SGD) repeats the gradient "descent" step x^{𝑘+1} = x^𝑘 − 𝛾g(x^𝑘), where x^𝑘 is the current iterate and 𝛾 is the step size. The stochastic gradient g(x^𝑘) is computed on an i.i.d. sampled mini-batch from the distribution of the training data D and serves as an estimator of the full gradient ∇𝑓(x^𝑘).

In the context of large-scale machine learning, the number of data samples and the model size are usually very large. Distributed learning utilizes a large number of computers/cores to perform the stochastic algorithms, aiming at reducing the training time. It has attracted extensive attention due to the demand for highly efficient model training [1, 17, 54, 122]. In this work, we focus on data-parallel SGD [22, 61, 133], which provides a scalable solution to speed up the training process by distributing the whole data to multiple computing nodes. The objective can be written as

  minimize_{x∈R^𝑑}  𝑓(x) + 𝑅(x) = (1/𝑛) Σ_{𝑖=1}^{𝑛} E_{𝜉∼D_𝑖}[ℓ(x, 𝜉)] + 𝑅(x),

where each 𝑓_𝑖(x) := E_{𝜉∼D_𝑖}[ℓ(x, 𝜉)] is a local objective function of worker node 𝑖 defined on the data allocated to it under distribution D_𝑖, and 𝑅 : R^𝑑 → R is usually a closed convex regularizer.

In the well-known parameter server framework [54, 133], during each iteration, each worker node evaluates its own stochastic gradient ∇̃𝑓_𝑖(x^𝑘) and sends it to the master node, which collects all gradients and calculates their average (1/𝑛) Σ_{𝑖=1}^{𝑛} ∇̃𝑓_𝑖(x^𝑘). The master node then takes a gradient descent step with the averaged gradient and broadcasts the new model parameter x^{𝑘+1} to all worker nodes. This makes use of the computational resources of all nodes.
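To make the parameter-server workflow above concrete, the following is a minimal single-process simulation of one data-parallel SGD setup (a sketch for illustration only; the least-squares objective, the synthetic data, and all variable names are assumptions rather than the setup used later in this chapter).

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, d, m = 4, 20, 50          # workers, model dimension, samples per worker
A = [rng.standard_normal((m, d)) for _ in range(n_workers)]            # local data D_i
b = [Ai @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m) for Ai in A]

def local_grad(i, x, batch=10):
    """Stochastic gradient of the local least-squares loss f_i at x."""
    idx = rng.choice(m, size=batch, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

x = np.zeros(d)                      # model kept in sync on master and workers
gamma = 0.01                         # step size
for k in range(200):
    grads = [local_grad(i, x) for i in range(n_workers)]   # workers -> master
    g = np.mean(grads, axis=0)       # master averages the stochastic gradients
    x = x - gamma * g                # gradient descent step on the master
    # the master then broadcasts x to the workers (implicit in this single process)
```

In a real deployment, the two communication steps in this loop (sending the gradients up and broadcasting the model down) are exactly where the bandwidth cost discussed next is paid.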
In reality, the network bandwidth is often limited. Thus, the communication cost for gradient transmission and model synchronization becomes the dominating bottleneck as the number of nodes and the model size increase, which hinders the scalability and efficiency of SGD. One common way to reduce the communication cost is to compress the gradient information by either gradient sparsification or quantization [4, 92, 97, 98, 110, 112, 114, 116] such that many fewer bits of information need to be transmitted. However, little attention has been paid to how to reduce the communication cost for model synchronization and to the corresponding theoretical guarantees. Obviously, the model has the same size as the gradient, and so does its communication cost. Thus, merely compressing the gradient can reduce at most 50% of the communication cost, which suggests the importance of model compression. Notably, the compression of model parameters is much more challenging than gradient compression. One key obstacle is that its compression error cannot be well controlled by the step size 𝛾 and thus cannot diminish like the error in gradient compression [101]. In this work, we aim to bridge this gap by investigating algorithms that compress the full communication in the optimization process and by understanding their theoretical properties. Our contributions can be summarized as follows:

• We propose DORE, which can compress both the gradient and the model information such that more than 95% of the communication cost can be reduced.
• We provide theoretical analyses to guarantee the convergence of DORE under strongly convex and nonconvex assumptions without the bounded gradient assumption.
• Our experiments demonstrate the superior efficiency of DORE compared with state-of-the-art baselines without degrading the convergence speed and the model accuracy.

2.2 Related Work

Recently, many works have tried to reduce the communication cost to speed up distributed learning, especially for deep learning applications, where the size of the model is typically very large (so is the size of the gradient) while the network bandwidth is relatively limited. Below we briefly review relevant papers.

Gradient quantization and sparsification. Recent works [4, 92, 114, 77, 7] have shown that the gradient can be quantized into a lower-precision vector such that fewer bits are needed in communication without loss of accuracy. [92] proposed 1Bit SGD, which keeps only the sign of each element in the gradient. It works well empirically, and [7] provided a systematic theoretical analysis. QSGD [4] utilizes an unbiased multi-level random quantization to compress the gradient, while Terngrad [114] quantizes the gradient into ternary numbers {0, ±1}. In DIANA [77], the gradient difference is compressed and communicated, contributing to the estimator of the gradient at the master node. Another effective strategy to reduce the communication cost is sparsification. [112] proposed a convex optimization formulation to minimize the coding length of stochastic gradients. A more aggressive sparsification method is to keep only the elements with relatively larger magnitude in the gradients, such as top-k sparsification [97, 98, 3]; a small sketch of this idea is given below.
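As a concrete illustration of the sparsification idea surveyed above, a minimal top-k gradient sparsifier might look as follows (an illustrative sketch only, not the implementation of any cited method; the function name and the choice of k are assumptions).

```python
import numpy as np

def topk_sparsify(g: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k entries of g with the largest magnitude; zero out the rest."""
    out = np.zeros_like(g)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(g), -k)[-k:]   # indices of the k largest |g_i|
    out[idx] = g[idx]
    return out

g = np.random.default_rng(1).standard_normal(1000)
g_sparse = topk_sparsify(g, k=10)   # only 10 values (plus their indices) need to be sent
```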
Model synchronization. The typical way to perform model synchronization is to broadcast the model parameters to all worker nodes. Some works [110, 42] have been proposed to reduce the model size by enforcing sparsity, but these approaches cannot be applied to general optimization problems. Some alternatives, including QSGD [4] and ECQ-SGD [116], choose to broadcast all quantized gradients to all other workers such that every worker can perform the model update independently. However, all-to-all communication is not efficient since the number of transmitted bits increases dramatically in large-scale networks. DoubleSqueeze [104] applies compression on the averaged gradient with error compensation to speed up model synchronization.

Error compensation. [92] applied error compensation to 1Bit-SGD and achieved negligible loss of accuracy empirically. Recently, error compensation was further studied [116, 97, 45] to mitigate the error caused by compression. The general idea is to add the compression error to the next compression step: ĝ = 𝑄(g + e), e = (g + e) − ĝ. However, to the best of our knowledge, most of the algorithms with error compensation [116, 97, 45, 104] need to assume a bounded gradient, i.e., E∥g∥^2 ≤ 𝐵, and the convergence rate depends on this bound.

Contributions of DORE. The papers most related to DORE are DIANA [77] and DoubleSqueeze [104]. Similarly to our approach, DIANA compresses the gradient difference on the worker side and achieves a good convergence rate. However, it does not consider compression in model synchronization, so at most 50% of the communication cost can be saved. DoubleSqueeze applies compression with error compensation on both the worker and server sides, but it only considers nonconvex objective functions. Moreover, its analysis relies on a bounded gradient assumption, i.e., E∥g∥^2 ≤ 𝐵, and the convergence error depends on this gradient bound, like most existing error compensation works. In general, a uniform bound on the norm of the stochastic gradient is a strong assumption that might not hold in some cases. For example, it is violated in the strongly convex case [82, 31]. In this work, we design DORE, the first algorithm that utilizes gradient and model compression with error compensation without assuming bounded gradients. Unlike existing error compensation works, we provide a linear convergence rate to the O(𝜎) neighborhood of the optimal solution for strongly convex functions and a sublinear rate to a stationary point for nonconvex functions with linear speedup. In Table 2.1, we compare the asymptotic convergence rates of different quantized SGDs with DORE.

2.3 Algorithm

In this section, we introduce the proposed DOuble REsidual compression SGD (DORE) algorithm. Before that, we introduce a common assumption for the compression operator. In this work, we adopt an assumption from [4, 114, 77] that the compression variance is linearly proportional to the squared magnitude of the compressed vector.

Assumption 1. The stochastic compression operator 𝑄 : R^𝑑 → R^𝑑 is unbiased, i.e., E𝑄(x) = x, and satisfies

  E∥𝑄(x) − x∥^2 ≤ 𝐶∥x∥^2,  (2.1)

for a nonnegative constant 𝐶 that is independent of x. We use x̂ to denote the compressed x, i.e., x̂ ∼ 𝑄(x).

Many feasible compression operators can be applied in our algorithm since our theoretical analyses are built on this common assumption. Some examples of feasible stochastic compression operators include:

• No Compression: 𝐶 = 0 when there is no compression.
• Stochastic Quantization: A real number 𝑥 ∈ [𝑎, 𝑏] (𝑎 < 𝑏) is set to 𝑎 with probability (𝑏 − 𝑥)/(𝑏 − 𝑎) and to 𝑏 with probability (𝑥 − 𝑎)/(𝑏 − 𝑎), where 𝑎 and 𝑏 are predefined quantization levels [4]. It satisfies Assumption 1 when 𝑎𝑏 > 0 and 𝑎 < 𝑏.
• Stochastic Sparsification: A real number 𝑥 is set to 0 with probability 1 − 𝑝 and to 𝑥/𝑝 with probability 𝑝 [114]. It satisfies Assumption 1 with 𝐶 = 1/𝑝 − 1.
• 𝑝-norm Quantization: A vector x is quantized element-wise by 𝑄_𝑝(x) = ∥x∥_𝑝 sign(x) ∘ 𝜉, where ∘ is the Hadamard product and 𝜉 is a Bernoulli random vector satisfying 𝜉_𝑖 ∼ Bernoulli(|𝑥_𝑖| / ∥x∥_𝑝). It satisfies Assumption 1 with 𝐶 = max_{x∈R^𝑑} ∥x∥_1 ∥x∥_𝑝 / ∥x∥_2^2 − 1 [77]. To decrease the constant 𝐶 for higher accuracy, a vector x ∈ R^𝑑 can be further decomposed into blocks, i.e., x = (x(1)^⊤, x(2)^⊤, · · · , x(𝑚)^⊤)^⊤ with x(𝑙) ∈ R^{𝑑_𝑙} and Σ_{𝑙=1}^{𝑚} 𝑑_𝑙 = 𝑑, and the blocks can be compressed independently.

Figure 2.1: An Illustration of DORE (workers send compressed gradient residuals to the master, and the master broadcasts compressed model residuals back to the workers).

2.3.1 Proposed Algorithm

Many previous works [4, 92, 114] reduce the communication cost of P-SGD by quantizing the stochastic gradient before sending it to the master node, but there are several intrinsic issues. First, these algorithms intrinsically incur extra optimization error. Consider the case when the algorithm converges to the optimal point x^∗, where we have (1/𝑛) Σ_{𝑖=1}^{𝑛} ∇𝑓_𝑖(x^∗) = 0. However, the data distributions may differ across worker nodes in general, and thus we may have ∇𝑓_𝑖(x^∗) ≠ ∇𝑓_𝑗(x^∗) for all 𝑖, 𝑗 ∈ {1, . . . , 𝑛} with 𝑖 ≠ 𝑗. In other words, each individual ∇𝑓_𝑖(x^∗) may be far away from zero. This causes large compression variance according to Assumption 1, which indicates that the upper bound on the compression variance E∥𝑄(x) − x∥^2 is linearly proportional to the squared magnitude of x. Second, most existing algorithms [92, 4, 114, 7, 116, 77] need to broadcast the model or gradient to all worker nodes in each iteration. This is a considerable bottleneck for efficient optimization since the number of bits to transmit is the same as for the uncompressed gradient. DoubleSqueeze [104] is able to apply compression on both the worker and server sides. However, its analysis depends on a strong bounded gradient assumption. Meanwhile, no theoretical guarantees are provided for convex problems.

We propose DORE to address all the aforementioned issues. Our motivation is that the gradient should change smoothly for smooth functions, so each worker node can keep a state variable h_𝑖^𝑘 to track its previous gradient information. As a result, the residual between the new gradient and the state h_𝑖^𝑘 should decrease, and the compression variance of the residual can be well bounded. On the other hand, as the algorithm converges, the model only changes slightly. Therefore, we propose to compress the model residual such that the compression variance can be minimized and also well bounded.

Algorithm 1 DORE
1: Input: Stepsizes 𝛼, 𝛽, 𝛾, 𝜂; initialize h^0 = h_𝑖^0 = 0_𝑑, x̂_𝑖^0 = x̂^0, ∀𝑖 ∈ {1, . . . , 𝑛}.
2: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
3:   For each worker 𝑖 ∈ {1, 2, · · · , 𝑛}:
4:     Sample g_𝑖^𝑘 such that E[g_𝑖^𝑘 | x̂_𝑖^𝑘] = ∇𝑓_𝑖(x̂_𝑖^𝑘)
5:     Gradient residual: Δ_𝑖^𝑘 = g_𝑖^𝑘 − h_𝑖^𝑘
6:     Compression: Δ̂_𝑖^𝑘 = 𝑄(Δ_𝑖^𝑘)
7:     h_𝑖^{𝑘+1} = h_𝑖^𝑘 + 𝛼Δ̂_𝑖^𝑘
8:     {ĝ_𝑖^𝑘 = h_𝑖^𝑘 + Δ̂_𝑖^𝑘}
9:     Send Δ̂_𝑖^𝑘 to the master
10:    Receive q̂^𝑘 from the master
11:    x̂_𝑖^{𝑘+1} = x̂_𝑖^𝑘 + 𝛽q̂^𝑘
12:  For the master:
13:    Receive {Δ̂_𝑖^𝑘} from the workers
14:    Δ̂^𝑘 = (1/𝑛) Σ_{𝑖=1}^{𝑛} Δ̂_𝑖^𝑘
15:    ĝ^𝑘 = h^𝑘 + Δ̂^𝑘  {= (1/𝑛) Σ_{𝑖=1}^{𝑛} ĝ_𝑖^𝑘}
16:    x^{𝑘+1} = prox_{𝛾𝑅}(x̂^𝑘 − 𝛾ĝ^𝑘)
17:    h^{𝑘+1} = h^𝑘 + 𝛼Δ̂^𝑘
18:    Model residual: q^𝑘 = x^{𝑘+1} − x̂^𝑘 + 𝜂e^𝑘
19:    Compression: q̂^𝑘 = 𝑄(q^𝑘)
20:    e^{𝑘+1} = q^𝑘 − q̂^𝑘
21:    x̂^{𝑘+1} = x̂^𝑘 + 𝛽q̂^𝑘
22:    Broadcast q̂^𝑘 to the workers
23: end for
24: Output: x̂^𝐾 or any x̂_𝑖^𝐾
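To make these update rules concrete, the following is a schematic single-process sketch of running Algorithm 1 in the smooth case 𝑅 = 0, so the proximal step reduces to a plain gradient step (an illustrative sketch only; the stochastic-sparsification stand-in for 𝑄, the quadratic local losses, and the parameter values are assumptions, not the settings used in the experiments).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 10
alpha, beta, gamma, eta = 0.25, 0.5, 0.02, 1.0   # illustrative parameter choices

def Q(v, p=0.5):
    """Unbiased stochastic sparsification: keep v_i / p with probability p (Assumption 1)."""
    mask = rng.random(v.shape) < p
    return np.where(mask, v / p, 0.0)

A = [rng.standard_normal((20, d)) for _ in range(n)]   # assumed local data
bs = [Ai @ rng.standard_normal(d) for Ai in A]
grad = lambda i, x: A[i].T @ (A[i] @ x - bs[i]) / 20   # local full gradient

x_hat = np.zeros(d)                    # synchronized model estimate
h = [np.zeros(d) for _ in range(n)]    # worker gradient states h_i
h_m = np.zeros(d)                      # master's averaged state h
e = np.zeros(d)                        # master-side compression error

for k in range(100):
    # --- workers: compress the gradient residual (lines 4-9) ---
    deltas_hat = []
    for i in range(n):
        delta = grad(i, x_hat) - h[i]          # gradient residual
        delta_hat = Q(delta)                   # compressed residual, sent to the master
        h[i] = h[i] + alpha * delta_hat        # update local state
        deltas_hat.append(delta_hat)
    # --- master: recover the gradient, step, compress the model residual (lines 13-22) ---
    delta_hat_avg = np.mean(deltas_hat, axis=0)
    g_hat = h_m + delta_hat_avg                # recovered averaged gradient
    x_new = x_hat - gamma * g_hat              # prox step with R = 0
    h_m = h_m + alpha * delta_hat_avg
    q = x_new - x_hat + eta * e                # model residual with error compensation
    q_hat = Q(q)                               # compressed residual, broadcast to workers
    e = q - q_hat                              # error carried to the next iteration
    # --- all nodes update the synchronized model (lines 10-11 and 21) ---
    x_hat = x_hat + beta * q_hat
```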
We also compensate the model residual compression error into the next iteration to achieve better convergence. Due to the advantages of the proposed double residual compression scheme, we can derive the fastest convergence rate through analyses that do not require the bounded gradient assumption. Note that in Algorithm 1, the equations in curly brackets are only notation used in the proofs and do not need to be computed. Below are some key steps of our algorithm, as shown in Algorithm 1 and Figure 2.1:

[lines 4-9]: each worker node sends the compressed gradient residual (Δ̂_𝑖^𝑘) to the master node and updates its state h_𝑖^𝑘 with Δ̂_𝑖^𝑘;
[lines 13-15]: the master node gathers the compressed gradient residuals ({Δ̂_𝑖^𝑘}) from all worker nodes and recovers the averaged gradient ĝ^𝑘 based on its state h^𝑘;
[line 16]: the master node applies the gradient descent step (possibly with the proximal operator);
[lines 18-22]: the master node broadcasts the compressed model residual with error compensation (q̂^𝑘) to all worker nodes and updates the model;
[lines 10-11]: each worker node receives the compressed model residual (q̂^𝑘) and updates its model x̂_𝑖^𝑘.

In the algorithm, the state h_𝑖^𝑘 serves as an exponential moving average of the local gradient in expectation, i.e., E_𝑄 h_𝑖^{𝑘+1} = (1 − 𝛼)h_𝑖^𝑘 + 𝛼g_𝑖^𝑘, as proved in Lemma 7. Therefore, as the iterate approaches the optimum, h_𝑖^𝑘 also approaches the local gradient ∇𝑓_𝑖(x^∗) rapidly, which leads to a small gradient residual and consequently a small compression variance. Similar difference compression techniques are also proposed in DIANA and its variance-reduced variant [77, 36].

2.3.2 Discussion

In this subsection, we provide more detailed discussions of DORE, including model initialization, the model update, the special smooth case, and the communication compression rate.

Initialization. It is important to use the identical initialization x̂^0 on the master and all worker nodes. This is easily ensured by either setting the same random seed or broadcasting the model once at the beginning. In this way, although we don't need to broadcast the model parameters directly, every worker node updates the model x̂^𝑘 in the same way, so their model parameters remain identical. Otherwise, the model inconsistency would need to be considered.

Model update. It is worth noting that although we could choose the accurate model x^{𝑘+1} as the next iterate on the master node, we use x̂^{𝑘+1} instead. In this way, we ensure that the gradient descent step is applied based on the exact stochastic gradient, which is evaluated at x̂_𝑖^𝑘 on each worker node. This avoids the intricacy of dealing with an inexact gradient evaluated at x^𝑘 and thus simplifies the convergence analysis.

Smooth case. In the smooth case, i.e., 𝑅 = 0, Algorithm 1 can be simplified. The master node quantizes the recovered averaged gradient with error compensation and broadcasts it to all worker nodes. This simplified algorithm is shown in Algorithm 2.
Algorithm 2 DORE with 𝑅(x) = 0
1: Input: Stepsizes 𝛼, 𝛽, 𝛾, 𝜂; initialize h^0 = h_𝑖^0 = 0_𝑑, x̂_𝑖^0 = x̂^0, ∀𝑖 ∈ {1, . . . , 𝑛}.
2: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
3:   For each worker 𝑖 ∈ {1, 2, · · · , 𝑛}:
4:     Sample g_𝑖^𝑘 such that E[g_𝑖^𝑘 | x̂_𝑖^𝑘] = ∇𝑓_𝑖(x̂_𝑖^𝑘)
5:     Gradient residual: Δ_𝑖^𝑘 = g_𝑖^𝑘 − h_𝑖^𝑘
6:     Compression: Δ̂_𝑖^𝑘 = 𝑄(Δ_𝑖^𝑘)
7:     h_𝑖^{𝑘+1} = h_𝑖^𝑘 + 𝛼Δ̂_𝑖^𝑘
8:     {ĝ_𝑖^𝑘 = h_𝑖^𝑘 + Δ̂_𝑖^𝑘}
9:     Send Δ̂_𝑖^𝑘 to the master
10:    Receive q̂^𝑘 from the master
11:    x̂_𝑖^{𝑘+1} = x̂_𝑖^𝑘 + 𝛽q̂^𝑘
12:  For the master:
13:    Receive {Δ̂_𝑖^𝑘} from the workers
14:    Δ̂^𝑘 = (1/𝑛) Σ_{𝑖=1}^{𝑛} Δ̂_𝑖^𝑘
15:    ĝ^𝑘 = h^𝑘 + Δ̂^𝑘  {= (1/𝑛) Σ_{𝑖=1}^{𝑛} ĝ_𝑖^𝑘}
16:    h^{𝑘+1} = h^𝑘 + 𝛼Δ̂^𝑘
17:    q^𝑘 = −𝛾ĝ^𝑘 + 𝜂e^𝑘
18:    Compression: q̂^𝑘 = 𝑄(q^𝑘)
19:    e^{𝑘+1} = q^𝑘 − q̂^𝑘
20:    Broadcast q̂^𝑘 to the workers
21: end for
22: Output: any x̂_𝑖^𝐾

Compression rate. Compressing only the gradient information can reduce at most 50% of the communication cost since it applies compression only during gradient aggregation while ignoring model synchronization. DORE, however, can further cut down the remaining 50% of the communication. Taking the blockwise 𝑝-norm quantization as an example, every element of x can be represented by 3/2 bits using the simple ternary coding {0, ±1}, along with one 32-bit magnitude for each block. For example, with a uniform block size 𝑏, the number of bits needed to represent a 𝑑-dimensional vector of 32-bit floating-point numbers can be reduced from 32𝑑 bits to (32𝑑/𝑏 + (3/2)𝑑) bits. As long as the block size 𝑏 is relatively large compared with the constant 32, the cost 32𝑑/𝑏 for storing the floating-point magnitudes is relatively small, so the compression rate is close to 32𝑑 / ((3/2)𝑑) ≈ 21.3 times (for example, 19.7 times when 𝑏 = 256). Applying this quantization, QSGD, Terngrad, MEM-SGD, and DIANA need to transmit (32𝑑 + 32𝑑/𝑏 + (3/2)𝑑) bits per iteration, and thus they are able to cut down 47% of the overall 2 × 32𝑑 bits per iteration through gradient compression when 𝑏 = 256. With DORE, we only need to transmit 2(32𝑑/𝑏 + (3/2)𝑑) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and the model transmission. More efficient coding techniques such as Elias coding [26] can be applied to further reduce the number of bits per iteration.

2.4 Convergence Analysis

To show the convergence of DORE, we make the following commonly used assumptions when needed.

Assumption 2. Each worker node samples an unbiased estimator of the gradient stochastically with bounded variance, i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x ∈ R^𝑑,

  E[g_𝑖 | x] = ∇𝑓_𝑖(x),  E∥g_𝑖 − ∇𝑓_𝑖(x)∥^2 ≤ 𝜎_𝑖^2,  (2.2)

where g_𝑖 is the estimator of ∇𝑓_𝑖 at x. In addition, we define 𝜎^2 = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝜎_𝑖^2.

Assumption 3. Each 𝑓_𝑖 is 𝐿-Lipschitz differentiable, i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x, y ∈ R^𝑑,

  𝑓_𝑖(x) ≤ 𝑓_𝑖(y) + ⟨∇𝑓_𝑖(y), x − y⟩ + (𝐿/2)∥x − y∥^2.  (2.3)

Assumption 4. Each 𝑓_𝑖 is 𝜇-strongly convex (𝜇 ≥ 0), i.e., for 𝑖 = 1, 2, · · · , 𝑛 and ∀x, y ∈ R^𝑑,

  𝑓_𝑖(x) ≥ 𝑓_𝑖(y) + ⟨∇𝑓_𝑖(y), x − y⟩ + (𝜇/2)∥x − y∥^2.  (2.4)

For simplicity, we use the same compression operator for all worker nodes, and the master node can apply a different compression operator. We denote the constants in Assumption 1 as 𝐶_𝑞 and 𝐶_𝑞^𝑚 for the worker and master nodes, respectively. Then we set 𝛼 and 𝛽 in both algorithms to satisfy

  (1 − √(1 − 4𝐶_𝑞(𝐶_𝑞 + 1)/(𝑛𝑐))) / (2(𝐶_𝑞 + 1)) ≤ 𝛼 ≤ (1 + √(1 − 4𝐶_𝑞(𝐶_𝑞 + 1)/(𝑛𝑐))) / (2(𝐶_𝑞 + 1)),  0 < 𝛽 ≤ 1/(𝐶_𝑞^𝑚 + 1),  (2.5)

with 𝑐 ≥ 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛. We consider two scenarios in the following two subsections: 𝑓 is strongly convex with a convex regularizer 𝑅, and 𝑓 is nonconvex with 𝑅 = 0.

2.4.1 Strongly Convex Case
Theorem 1. Under Assumptions 1-4, if 𝛼 and 𝛽 in Algorithm 1 satisfy (2.5), and 𝜂 and 𝛾 satisfy

  𝜂 < min{ (−𝐶_𝑞^𝑚 + √((𝐶_𝑞^𝑚)^2 + 4(1 − (𝐶_𝑞^𝑚 + 1)𝛽))) / (2𝐶_𝑞^𝑚),  4𝜇𝐿 / ((𝜇 + 𝐿)^2(1 + 𝑐𝛼) − 4𝜇𝐿) },  (2.6)

  𝜂(𝜇 + 𝐿) / (2(1 + 𝜂)𝜇𝐿) ≤ 𝛾 ≤ 2 / ((1 + 𝑐𝛼)(𝜇 + 𝐿)),  (2.7)

then we have

  V^{𝑘+1} ≤ 𝜌^𝑘 V^1 + ((1 + 𝜂)(1 + 𝑛𝑐𝛼) / (𝑛(1 − 𝜌))) 𝛽𝛾^2 𝜎^2,  (2.8)

with

  V^𝑘 = 𝛽(1 − (𝐶_𝑞^𝑚 + 1)𝛽) E∥q^{𝑘−1}∥^2 + E∥x̂^𝑘 − x^∗∥^2 + ((1 + 𝜂)𝑐𝛽𝛾^2 / 𝑛) Σ_{𝑖=1}^{𝑛} E∥h_𝑖^𝑘 − ∇𝑓_𝑖(x^∗)∥^2,

  𝜌 = max{ (𝜂^2 + 𝜂)𝐶_𝑞^𝑚 / (1 − (𝐶_𝑞^𝑚 + 1)𝛽),  1 + 𝜂𝛽 − 2(1 + 𝜂)𝛽𝛾𝜇𝐿 / (𝜇 + 𝐿),  1 − 𝛼 } < 1.

Corollary 1. When there is no error compensation and we set 𝜂 = 0, then 𝜌 = max(1 − 2𝛽𝛾𝜇𝐿/(𝜇 + 𝐿), 1 − 𝛼). If we further set

  𝛼 = 1/(2(𝐶_𝑞 + 1)),  𝛽 = 1/(𝐶_𝑞^𝑚 + 1),  𝑐 = 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛,  (2.9)

and choose the largest step size 𝛾 = 2/((𝜇 + 𝐿)(1 + 2𝐶_𝑞/𝑛)), the convergence factor is

  (1 − 𝜌)^{−1} = max{ 2(𝐶_𝑞 + 1),  (𝐶_𝑞^𝑚 + 1) ((𝜇 + 𝐿)^2 / (2𝜇𝐿)) (1/2 + 𝐶_𝑞/𝑛) }.  (2.10)

Remark 1. In particular, suppose {Δ_𝑖}_{𝑖=1}^{𝑛} are compressed using the Bernoulli 𝑝-norm quantization with the largest block size 𝑑_max; then 𝐶_𝑞 = 1/𝛼_𝑤 − 1, with 𝛼_𝑤 = min_{0≠x∈R^{𝑑_max}} ∥x∥_2^2 / (∥x∥_1 ∥x∥_𝑝) ≤ 1. Similarly, q is compressed using the Bernoulli 𝑝-norm quantization with 𝐶_𝑞^𝑚 = 1/𝛼_𝑚 − 1. Then the linear convergence factor is

  (1 − 𝜌)^{−1} = max{ 2/𝛼_𝑤,  (1/𝛼_𝑚) ((𝜇 + 𝐿)^2 / (2𝜇𝐿)) (1/2 − 1/𝑛 + 1/(𝑛𝛼_𝑤)) }.  (2.11)

The corresponding result of DIANA in [77] is max{ 2/𝛼_𝑤, ((𝜇 + 𝐿)/𝜇)(1/2 − 1/𝑛 + 1/(𝑛𝛼_𝑤)) }, which is better than (2.11) with 𝛼_𝑚 = 1 (no compression for the model). When there is no compression for Δ_𝑖, i.e., 𝛼_𝑤 = 1, the algorithm reduces to gradient descent, and the linear convergence factor is the same as that of gradient descent for strongly convex functions.

Remark 2. Although error compensation often improves the convergence empirically, in theory, no compensation, i.e., 𝜂 = 0, provides the best convergence rate. This is because we don't have much information about the error being compensated. Filling this gap will be an interesting future direction.

2.4.2 Nonconvex Case

Theorem 2. Under Assumptions 1-3 and the additional assumption that each worker samples the gradient from the full dataset, we set 𝛼 and 𝛽 according to (2.5). By choosing

  𝛾 ≤ min{ (−1 + √(1 + 48𝐿^2𝛽^2(𝐶_𝑞^𝑚 + 1)^2 / 𝐶_𝑞^𝑚)) / (12𝐿𝛽(𝐶_𝑞^𝑚 + 1)),  1 / (6𝐿𝛽(1 + 𝑐𝛼)(𝐶_𝑞 + 1)) },

we have

  (𝛽/2 − 3(1 + 𝑐𝛼)(𝐶_𝑞^𝑚 + 1)𝐿𝛽^2𝛾) (1/𝐾) Σ_{𝑘=1}^{𝐾} E∥∇𝑓(x̂^𝑘)∥^2 ≤ (Λ^1 − Λ^{𝐾+1}) / (𝛾𝐾) + 3(𝐶_𝑞^𝑚 + 1)(1 + 𝑛𝑐𝛼)𝐿𝛽^2𝜎^2𝛾 / 𝑛,  (2.12)

where

  Λ^𝑘 = (𝐶_𝑞^𝑚 + 1)𝐿𝛽^2 ∥q^{𝑘−1}∥^2 + 𝑓(x̂^𝑘) − 𝑓^∗ + 3𝑐(𝐶_𝑞^𝑚 + 1)𝐿𝛽^2𝛾^2 (1/𝑛) Σ_{𝑖=1}^{𝑛} E∥h_𝑖^𝑘∥^2.  (2.13)

Corollary 2. Let 𝛼 = 1/(2(𝐶_𝑞 + 1)), 𝛽 = 1/(𝐶_𝑞^𝑚 + 1), and 𝑐 = 4𝐶_𝑞(𝐶_𝑞 + 1)/𝑛; then 1 + 𝑛𝑐𝛼 is a fixed constant. If 𝛾 = 1 / (12𝐿(1 + 𝑐𝛼)(1 + √(𝐾/𝑛))), then when 𝐾 is relatively large, we have

  (1/𝐾) Σ_{𝑘=1}^{𝐾} E∥∇𝑓(x̂^𝑘)∥^2 ≲ 1/𝐾 + 1/√(𝐾𝑛).  (2.14)

Remark 3. The dominant term in (2.14) is 𝑂(1/√(𝐾𝑛)), which implies that the sample complexity of each worker node is 𝑂(1/(𝑛𝜖^2)) on average to achieve an 𝜖-accurate solution. It shows that, like DoubleSqueeze in [104], DORE achieves linear speedup. Furthermore, this convergence result is the same as that of P-SGD without compression. Note that DoubleSqueeze has an extra term (1/𝐾)^{2/3}, and its convergence requires bounded variance of the compression operator.

Algorithm | Compression | Compression Assumed | Linear Rate | Nonconvex Rate
QSGD | Grad | 2-norm quantization | N/A | 1/𝐾 + 𝐵
DIANA | Grad | 𝑝-norm quantization | ✓ | 1/√(𝐾𝑛) + 1/𝐾
DoubleSqueeze | Grad + Model | Bounded variance | N/A | 1/√(𝐾𝑛) + 1/𝐾^{2/3} + 1/𝐾
DORE | Grad + Model | Assumption 1 | ✓ | 1/√(𝐾𝑛) + 1/𝐾

Table 2.1: A comparison between related algorithms. DORE is able to converge linearly to the O(𝜎) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE doesn't require the assumptions of bounded compression error or bounded gradient.
2.5 Experiment

In this section, we validate the theoretical results and demonstrate the superior performance of DORE. Our experimental results demonstrate that (1) DORE achieves a similar convergence speed as full-precision SGD and state-of-the-art quantized SGD baselines, and (2) its iteration time is much smaller than that of most existing algorithms, supporting the superior communication efficiency of DORE.

To make a fair comparison, we choose the same Bernoulli ∞-norm quantization as described in Section 2.3, with a quantization block size of 256, for all experiments unless explicitly stated otherwise, because ∞-norm quantization is unbiased and commonly used. The parameters 𝛼, 𝛽, and 𝜂 for DORE are chosen to be 0.1, 1, and 1, respectively. The baselines we compare against include SGD, QSGD [4], MEM-SGD [97], DIANA [77], DoubleSqueeze, and DoubleSqueeze (topk) [104]. SGD is the vanilla SGD without any compression, and QSGD quantizes the gradient directly. MEM-SGD is QSGD with error compensation. DIANA, which only compresses and transmits the gradient difference, is a special case of the proposed DORE. DoubleSqueeze quantizes both the gradient on the workers and the averaged gradient on the server with error compensation. Although DoubleSqueeze is claimed to work well with both biased and unbiased compression, in our experiments it converges much more slowly and suffers a loss of accuracy with unbiased compression. Thus, we also compare with DoubleSqueeze using the top-k compression as presented in [104].

2.5.1 Strongly Convex

To verify the convergence for strongly convex and smooth objective functions, we conduct an experiment on a linear regression problem: 𝑓(x) = ∥Ax − b∥^2 + 𝜆∥x∥^2. The data matrix A ∈ R^{1200×500} and the optimal solution x^∗ ∈ R^{500} are randomly synthesized. Then we generate the prediction b by sampling from a Gaussian distribution whose mean is Ax^∗. The rows of the data matrix A are allocated evenly to 20 worker nodes. To better verify the linear convergence to the O(𝜎) neighborhood around the optimal solution, we take the full gradient in each node for all algorithms to exclude the effect of the gradient variance (𝜎 = 0).

As shown in Figure 2.2, with the full gradient and a constant learning rate, DORE converges linearly, the same as SGD and DIANA, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to a neighborhood of the optimal point. This is because these algorithms assume a bounded gradient and their convergence errors depend on that bound. Although they converge to the optimal solution with a diminishing step size, their convergence rates will be much slower.

Figure 2.2: Linear regression on synthetic data (∥x^𝑘 − x^∗∥^2 versus iteration; (a) learning rate 0.05, (b) learning rate 0.025). When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.
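For reference, the blockwise ∞-norm quantization used in these experiments can be sketched as follows (a minimal NumPy illustration of the operator defined in Section 2.3, not the exact code behind the reported results; the function name and default block size are assumptions).

```python
import numpy as np

def pnorm_quantize(x: np.ndarray, block_size: int = 256, p: float = np.inf,
                   rng: np.random.Generator = None) -> np.ndarray:
    """Unbiased blockwise p-norm quantization: each block is mapped to
    ||block||_p * sign(block) * xi with xi_i ~ Bernoulli(|block_i| / ||block||_p)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty_like(x, dtype=float)
    for start in range(0, x.size, block_size):
        blk = x[start:start + block_size]
        scale = np.linalg.norm(blk, ord=p)
        if scale == 0:
            out[start:start + block_size] = 0.0
            continue
        prob = np.abs(blk) / scale                         # probabilities in [0, 1]
        xi = (rng.random(blk.shape) < prob).astype(float)  # Bernoulli draws
        out[start:start + block_size] = scale * np.sign(blk) * xi
    return out

x = np.random.default_rng(0).standard_normal(1024)
x_hat = pnorm_quantize(x)   # E[x_hat] = x; entries lie in {0, +/- block magnitude}
```

Each quantized block can then be encoded with one 32-bit magnitude plus roughly 3/2 bits per entry for the ternary pattern, which is the bit count used in the compression-rate discussion of Section 2.3.2.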
Compression error. The property of the compression operator indicates that the expected squared compression error is bounded linearly by the squared norm of the variable being compressed: E∥𝑄(x) − x∥² ≤ 𝐶∥x∥². We visualize the norm of the variables being compressed, i.e., the gradient residual (the worker side) and the model residual (the master side) for DORE, as well as the error-compensated gradient (the worker side) and the averaged gradient (the master side) for DoubleSqueeze. As shown in Figure 2.3, the gradient and model residuals of DORE decrease exponentially and the compression errors vanish. However, for DoubleSqueeze, their norms only decrease to a certain value and the compression error doesn't vanish. This explains why algorithms without residual compression cannot converge linearly to the O(𝜎) neighborhood of the optimal solution in the strongly convex case.

Figure 2.3: The norm of the variables being compressed in the linear regression experiment: (a) worker side and (b) master side.

2.5.2 Nonconvex

To verify the convergence in the nonconvex case, we test the proposed DORE with two classical deep neural networks on two representative datasets, i.e., LeNet [52] on MNIST and Resnet18 [35] on CIFAR10. In the experiment, we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU. The batch size for each worker node is 256. We use 0.1 and 0.01 as the initial learning rates for LeNet and Resnet18, and decrease them by a factor of 0.1 after every 25 and 100 epochs, respectively. All parameter settings are the same for all algorithms.

Figure 2.4: LeNet trained on MNIST: (a) training loss and (b) test loss. DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method while having similar performance to DoubleSqueeze (topk).

Figure 2.5: Resnet18 trained on CIFAR10: (a) training loss and (b) test loss. DORE achieves similar convergence and accuracy to most baselines. DoubleSqueeze converges slower and suffers from a higher loss, but it works well with topk compression.

Figures 2.4 and 2.5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on the CIFAR10 dataset. The results indicate that in the nonconvex case, even with both compressed gradient and model information, DORE can still achieve a similar convergence speed as full-precision SGD and other quantized SGD variants. DORE achieves a much better convergence speed than DoubleSqueeze using the same compression method and converges similarly to DoubleSqueeze with Top-k compression as presented in [104].

Figure 2.6: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE under varied bandwidth, tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.
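Returning to the compression property stated at the beginning of this subsection, the bound E∥𝑄(x) − x∥² ≤ 𝐶∥x∥² can be checked empirically for any candidate compressor with a short Monte-Carlo estimate; the sketch below is illustrative and takes the compressor as a generic function argument.

import numpy as np

def estimate_compression_constant(quantize, dim=1024, trials=500, seed=0):
    # Estimate E||Q(x) - x||^2 / ||x||^2 over random inputs as an empirical
    # proxy for the constant C in the compression assumption.
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        x = rng.normal(size=dim)
        err = quantize(x) - x
        ratios.append(np.sum(err ** 2) / np.sum(x ** 2))
    return float(np.mean(ratios))

Plugging in the blockwise Bernoulli ∞-norm quantizer used in our experiments would give an empirical sense of how tight the assumed bound is for the chosen block size.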
We also validate through the parameter sensitivity analysis in Appendix A.1.2 that DORE performs consistently well under different parameter settings, such as the compression block size, 𝛼, 𝛽, and 𝜂.

2.5.3 Communication Efficiency

In terms of communication cost, DORE enjoys the benefit of extremely efficient communication. As one example, under the same setting as the Resnet18 experiment described in the previous section, we test the time cost per iteration for SGD, QSGD, and DORE under varied network bandwidth. We didn't test MEM-SGD, DIANA, and DoubleSqueeze because MEM-SGD and DIANA have a similar time cost to QSGD, while DoubleSqueeze has a similar time cost to DORE. The results shown in Figure 2.6 indicate that as the bandwidth becomes worse, with both gradient and model compression, the advantage of DORE becomes more remarkable compared to the baselines that don't apply compression for model synchronization. In Appendix A.1.1, we also demonstrate the communication efficiency in terms of communication bits and running time, which clearly suggests the benefit of the proposed algorithm.

2.6 Conclusion

Message passing is the dominating bottleneck for distributed training of modern large-scale machine learning models. Extensive works have compressed the gradient information to be transferred during the training process, but model compression is rather limited due to its intrinsic difficulty. In this work, we proposed the Double Residual Compression SGD, named DORE, to compress both gradient and model communication, which mitigates this bottleneck prominently. The theoretical analyses suggest a good convergence rate for DORE under weak assumptions. Furthermore, DORE is able to reduce 95% of the communication cost in message passing while maintaining a similar convergence rate and model accuracy compared with full-precision SGD.

CHAPTER 3
LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION

Communication compression has become a key strategy to speed up the message passing in distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual algorithms, in this chapter, we propose the first LinEAr convergent Decentralized algorithm with compression, LEAD. Our theory describes the coupled dynamics of the inexact primal and dual updates as well as the compression error, and we provide the first consensus error bound in such settings without assuming bounded gradients. This is also the first work that proves that, in a certain compression regime, the message compression in message passing does not hurt the convergence, which means better communication efficiency is achieved for free. Experiments on convex problems validate our theoretical analysis, and the empirical study on deep neural nets shows that LEAD is applicable to non-convex problems as well.

3.1 Introduction

Distributed optimization solves the following optimization problem

x∗ := arg min_{x∈R^𝑑} [ 𝑓(x) := (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑓_𝑖(x) ]   (3.1)

with 𝑛 computing agents and a communication network. Each 𝑓_𝑖(x) : R^𝑑 → R is a local objective function of agent 𝑖 and is typically defined on the data D_𝑖 located at that agent. The data distributions {D_𝑖} can be heterogeneous depending on the application, such as in federated learning. The variable x ∈ R^𝑑 often represents model parameters in machine learning.
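As a concrete illustration of problem (3.1), the sketch below builds the global objective as the average of local objectives defined on per-agent data; the least-squares form of each 𝑓_𝑖 and all names are illustrative assumptions.

import numpy as np

def make_local_objective(Ai, bi):
    # Local objective f_i(x) defined on agent i's data D_i = (Ai, bi),
    # taking least squares as an illustrative choice for f_i.
    def f_i(x):
        return 0.5 * np.mean((Ai @ x - bi) ** 2)
    return f_i

def global_objective(x, local_objectives):
    # f(x) = (1/n) * sum_i f_i(x), as in Eq. (3.1).
    return np.mean([f_i(x) for f_i in local_objectives])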
A distributed optimization algorithm seeks an optimal solution that minimizes the overall objective function 𝑓 (x) collectively. According to the communication topology, existing algorithms can be conceptually categorized into centralized and decentralized ones. Specifically, centralized algorithms require global communication between 24 agents (through central agents or parameter servers). While decentralized algorithms only require local communication between connected agents and are more widely applicable than centralized ones. In both paradigms, the computation can be relatively fast with powerful computing devices; efficient communication is the key to improve algorithm efficiency and system scalability, especially when the network bandwidth is limited. In recent years, various communication compression techniques, such as quantization and sparsification, have been developed to reduce communication costs. Notably, extensive studies [92, 4, 7, 97, 45, 77, 104, 68] have utilized gradient compression to significantly boost communication efficiency for centralized optimization. They enable efficient large-scale optimization while maintaining comparable convergence rates and practical performance with their non-compressed counterparts. This great success has suggested the potential and significance of communication compression in decentralized algorithms. While extensive attention has been paid to centralized optimization, communication compression is relatively less studied in decentralized algorithms because the algorithm design and analysis are more challenging in order to cover general communication topologies. There are recent efforts trying to push this research direction. For instance, DCD-SGD and ECD-SGD [101] introduce difference compression and extrapolation compression to reduce model compression error. [88, 89] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize. DeepSqueeze [102] directly compresses the local model and compensates the compression error in the next iteration. CHOCO-SGD [51, 50] presents a novel quantized gossip algorithm that reduces compression error by difference compression and preserves the model average. Nevertheless, most existing works focus on the compression of primal-only algorithms, i.e., reduce to DGD [80, 123] or P-DSGD [62]. They are unsatisfying in terms of convergence rate, stability, and the capability to handle heterogeneous data. Part of the reason is that they inherit the drawback of DGD-type algorithms, whose convergence rate is slow in heterogeneous data scenarios where the data distributions are significantly different from agent to agent. In the literature of decentralized optimization, it has been proved that primal-dual algorithms 25 can achieve faster converge rates and better support heterogeneous data [63, 96, 59, 124]. However, it is unknown whether communication compression is feasible for primal-dual algorithms and how fast the convergence can be with compression. In this work, we attempt to bridge this gap by investigating the communication compression for primal-dual decentralized algorithms. Our major contributions can be summarized as: • We delineate two key challenges in the algorithm design for communication compression in decentralized optimization, i.e., data heterogeneity and compression error, and motivated by primal-dual algorithms, we propose a novel decentralized algorithm with compression, LEAD. 
• We prove that for LEAD, a constant stepsize in the range (0, 2/(𝜇+ 𝐿)] is sufficient to ensure linear convergence for strongly convex and smooth objective functions. To the best of our knowledge, LEAD is the first linear convergent decentralized algorithm with compression. Moreover, LEAD provably works with unbiased compression of arbitrary precision. • We further prove that if the stochastic gradient is used, LEAD converges linearly to the 𝑂 (𝜎 2 ) neighborhood of the optimum with constant stepsize. LEAD is also able to achieve exact convergence to the optimum with diminishing stepsize. • Extensive experiments on convex problems validate our theoretical analyses, and the empirical study on training deep neural nets shows that LEAD is applicable for nonconvex problems. LEAD achieves state-of-art computation and communication efficiency in all experiments and significantly outperforms the baselines on heterogeneous data. Moreover, LEAD is robust to parameter settings and needs minor effort for parameter tuning. 3.2 Related Work Decentralized optimization can be traced back to the work by [107]. DGD [80] is the most classical decentralized algorithm. It is intuitive and simple but converges slowly due to the diminishing stepsize that is needed to obtain the optimal solution [123]. Its stochastic version D-PSGD [62] has been shown effective for training nonconvex deep learning models. Algorithms based on primal-dual 26 formulations or gradient tracking are proposed to eliminate the convergence bias in DGD-type algorithms and improve the convergence rate, such as D-ADMM [78], DLM [63], EXTRA [96], NIDS [59], 𝐷 2 [103], Exact Diffusion [125], OPTRA [120], DIGing [79], GSGT [87], etc. Recently, communication compression is applied to decentralized settings by [101]. It proposes two algorithms, i.e., DCD-SGD and ECD-SGD, which require compression of high accuracy and are not stable with aggressive compression. [88, 89] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize and the convergence is slow. DeepSqueeze [102] compensates the compression error to the compression in the next iteration. Motivated by the quantized average consensus algorithms, such as [13], the quantized gossip algorithm CHOCO- Gossip [51] converges linearly to the consensual solution. Combining CHOCO-Gossip and D-PSGD leads to a decentralized algorithm with compression, CHOCO-SGD, which converges sublinearly under the strong convexity and gradient boundedness assumptions. Its nonconvex variant is further analyzed in [50]. A new compression scheme using the modulo operation is introduced in [71] for decentralized optimization. A general algorithmic framework aiming to maintain the linear convergence of distributed optimization under compressed communication is considered in [75]. It requires a contractive property that is not satisfied by many decentralized algorithms including the algorithm in this work. 3.3 Algorithm We first introduce notations and definitions used in this work. We use bold upper-case letters such as X to define matrices and bold lower-case letters such as x to define vectors. Let 1 and 0 be vectors with all ones and zeros, respectively. Their dimensions will be provided when necessary. Given two matrices X, Y ∈ R𝑛×𝑑 , we define their inner product as ⟨X, Y⟩ = tr(X⊤ Y) and the norm √︁ √︁ as ∥X∥ = ⟨X, X⟩. We further define ⟨X, Y⟩P = tr(X⊤ PY) and ∥X∥ P = ⟨X, X⟩ P for any given symmetric positive semidefinite matrix P ∈ R𝑛×𝑛 . 
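For reference, the weighted inner product and norm introduced above can be computed directly; the NumPy sketch below is a small illustrative helper, not part of the original analysis.

import numpy as np

def inner_P(X, Y, P):
    # <X, Y>_P = tr(X^T P Y) for a symmetric positive semidefinite P.
    return np.trace(X.T @ P @ Y)

def norm_P(X, P):
    # ||X||_P = sqrt(<X, X>_P); with P = I this reduces to the norm ||X||.
    return np.sqrt(inner_P(X, X, P))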
For simplicity, we mainly use matrix notation in this work. For instance, each agent 𝑖 holds an individual estimate x_𝑖 ∈ R^𝑑 of the global variable x ∈ R^𝑑. Let X^𝑘 and ∇F(X^𝑘) be the collections of {x_𝑖^𝑘}_{𝑖=1}^𝑛 and {∇𝑓_𝑖(x_𝑖^𝑘)}_{𝑖=1}^𝑛, which are defined below:

X^𝑘 = [x_1^𝑘, . . . , x_𝑛^𝑘]^⊤ ∈ R^{𝑛×𝑑},   ∇F(X^𝑘) = [∇𝑓_1(x_1^𝑘), . . . , ∇𝑓_𝑛(x_𝑛^𝑘)]^⊤ ∈ R^{𝑛×𝑑}.   (3.2)

We use ∇F(X^𝑘; 𝜉^𝑘) to denote the stochastic approximation of ∇F(X^𝑘). With these notations, the update X^{𝑘+1} = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) means that x_𝑖^{𝑘+1} = x_𝑖^𝑘 − 𝜂∇𝑓_𝑖(x_𝑖^𝑘; 𝜉_𝑖^𝑘) for all 𝑖. In this work, we need the average of all rows in X^𝑘 and ∇F(X^𝑘), so we define X̄^𝑘 = (1^⊤X^𝑘)/𝑛 and ∇F̄(X^𝑘) = (1^⊤∇F(X^𝑘))/𝑛. They are row vectors, and we will take a transpose if we need a column vector. The pseudoinverse of a matrix M is denoted as M†. The largest, 𝑖th-largest, and smallest nonzero eigenvalues of a symmetric matrix M are 𝜆_max(M), 𝜆_𝑖(M), and 𝜆_min(M).

Assumption 5 (Mixing matrix). The connected network G = {V, E} consists of a node set V = {1, 2, . . . , 𝑛} and an undirected edge set E. The primitive symmetric doubly-stochastic matrix W = [𝑤_{𝑖𝑗}] ∈ R^{𝑛×𝑛} encodes the network structure such that 𝑤_{𝑖𝑗} = 0 if nodes 𝑖 and 𝑗 are not connected and cannot exchange information.

Assumption 5 implies that −1 < 𝜆_𝑛(W) ≤ 𝜆_{𝑛−1}(W) ≤ · · · ≤ 𝜆_2(W) < 𝜆_1(W) = 1 and W1 = 1 [118, 96]. The matrix multiplication X^{𝑘+1} = WX^𝑘 describes that agent 𝑖 takes a weighted sum from its neighbors and itself, i.e., x_𝑖^{𝑘+1} = Σ_{𝑗∈N_𝑖∪{𝑖}} 𝑤_{𝑖𝑗} x_𝑗^𝑘, where N_𝑖 denotes the neighbors of agent 𝑖.

The proposed algorithm LEAD to solve problem (3.1) is shown in Alg. 3 with matrix notation for conciseness. We will refer to its line numbers in the analysis. A complete algorithm description from the agent's perspective can be found in Algorithm 4. The motivation behind Alg. 3 is to achieve two goals: (a) consensus (x_𝑖^𝑘 − (X̄^𝑘)^⊤ → 0) and (b) convergence ((X̄^𝑘)^⊤ → x∗). We first discuss how goal (a) leads to goal (b) and then explain how LEAD fulfills goal (a). In essence, LEAD runs an approximate SGD globally and reduces to exact SGD under consensus.

Algorithm 3 LEAD
Input: stepsize 𝜂, parameters (𝛼, 𝛾), X^0, H^1, D^1 = (I − W)Z for any Z
Output: X^𝐾 or (1/𝑛) Σ_{𝑖=1}^𝑛 X_𝑖^𝐾
1: H_𝑤^1 = WH^1
2: X^1 = X^0 − 𝜂∇F(X^0; 𝜉^0)
3: for 𝑘 = 1, 2, · · · , 𝐾 − 1 do
4:   Y^𝑘 = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) − 𝜂D^𝑘
5:   Ŷ^𝑘, Ŷ_𝑤^𝑘, H^{𝑘+1}, H_𝑤^{𝑘+1} = COMM(Y^𝑘, H^𝑘, H_𝑤^𝑘)
6:   D^{𝑘+1} = D^𝑘 + 𝛾/(2𝜂) (Ŷ^𝑘 − Ŷ_𝑤^𝑘)
7:   X^{𝑘+1} = X^𝑘 − 𝜂∇F(X^𝑘; 𝜉^𝑘) − 𝜂D^{𝑘+1}
8: end for
9: procedure COMM(Y, H, H_𝑤):
10:   Q = Compress(Y − H)
11:   Ŷ = H + Q
12:   Ŷ_𝑤 = H_𝑤 + WQ
13:   H = (1 − 𝛼)H + 𝛼Ŷ
14:   H_𝑤 = (1 − 𝛼)H_𝑤 + 𝛼Ŷ_𝑤
15:   return: Ŷ, Ŷ_𝑤, H, H_𝑤
16: end procedure

One key property of LEAD is 1_{𝑛×1}^⊤ D^𝑘 = 0, regardless of the compression error in Ŷ^𝑘. It holds because, for the initialization, we require D^1 = (I − W)Z for some Z ∈ R^{𝑛×𝑑}, e.g., D^1 = 0_{𝑛×𝑑}, and the update of D^𝑘 ensures D^𝑘 ∈ Range(I − W) for all 𝑘 and 1_{𝑛×1}^⊤(I − W) = 0, as we will explain later. Therefore, multiplying (1/𝑛)1_{𝑛×1}^⊤ on both sides of Line 7 leads to a global average view of Alg. 3:

X̄^{𝑘+1} = X̄^𝑘 − 𝜂∇F̄(X^𝑘; 𝜉^𝑘),   (3.3)

which doesn't contain the compression error. Note that this is an approximate SGD step because, as shown in (3.2), the gradient ∇F(X^𝑘; 𝜉^𝑘) is not evaluated on a globally synchronized model X̄^𝑘. However, if the solution converges to the consensus solution, i.e., x_𝑖^𝑘 − (X̄^𝑘)^⊤ → 0, then E_{𝜉^𝑘}[∇F̄(X^𝑘; 𝜉^𝑘) − ∇𝑓(X̄^𝑘; 𝜉^𝑘)] → 0 and (3.3) gradually reduces to exact SGD.
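The following is a minimal NumPy sketch of the matrix-form updates in Alg. 3 (Lines 4–7 together with the COMM procedure). The gradient oracle and the compression operator are passed in as functions, and the initialization H^1 = 0 with D^1 = 0 (i.e., Z = 0) is one admissible choice; all names are illustrative.

import numpy as np

def lead(X0, W, grad, compress, eta, alpha, gamma, K):
    # Matrix-form sketch of Alg. 3 (LEAD).
    # X0: (n, d) initial models; W: mixing matrix; grad(X): stacked stochastic
    # gradients ∇F(X; ξ); compress: unbiased compression operator Q.
    n, d = X0.shape
    H = np.zeros((n, d))                 # H^1 (any initialization is admissible)
    Hw = W @ H                           # H_w^1 = W H^1
    D = np.zeros((n, d))                 # D^1 = (I - W) Z with Z = 0
    X = X0 - eta * grad(X0)              # X^1
    for _ in range(K - 1):
        G = grad(X)
        Y = X - eta * G - eta * D                        # Line 4
        Q = compress(Y - H)                              # COMM: compress the difference
        Y_hat, Yw_hat = H + Q, Hw + W @ Q                # recover Ŷ and mix the transmitted Q
        H = (1 - alpha) * H + alpha * Y_hat              # state updates
        Hw = (1 - alpha) * Hw + alpha * Yw_hat
        D = D + gamma / (2 * eta) * (Y_hat - Yw_hat)     # Line 6: inexact dual update
        X = X - eta * G - eta * D                        # Line 7: primal update
    return X

Only the compressed difference Q is exchanged with neighbors in each iteration, which is where the communication saving comes from.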
With the establishment of how consensus leads to convergence, the obstacle becomes how to achieve consensus under local communication and compression challenges. It requires addressing two issues, i.e., data heterogeneity and compression error. To deal with these issues, existing algorithms, such as DCD-SGD, ECD-SGD, QDGD, DeepSqueeze, Moniqua, and CHOCO-SGD, need a diminishing or constant but small stepsize depending on the total number of iterations. However, these choices unavoidably cause slower convergence and bring in the difficulty of parameter tuning. In contrast, LEAD takes a different way to solve these issues, as explained below. Data heterogeneity. It is common in distributed settings that there exists data heterogeneity among agents, especially in real-world applications where different agents collect data from different scenarios. In other words, we generally have 𝑓𝑖 (x) ≠ 𝑓 𝑗 (x) for 𝑖 ≠ 𝑗. The optimality condition of problem (3.1) gives 1⊤ ∗ ∗ ∗ ∗ 𝑛×1 ∇F(X ) = 0, where X = [x , · · · , x ] is a consensual and optimal solution. The data heterogeneity and optimality condition imply that there exist at least two agents 𝑖 and 𝑗 29 such that ∇ 𝑓𝑖 (x∗ ) ≠ 0 and ∇ 𝑓 𝑗 (x∗ ) ≠ 0. As a result, a simple D-PSGD algorithm cannot converge to the consensual and optimal solution as X∗ ≠ WX∗ − 𝜂E𝜉 ∇F(X∗ ; 𝜉) even when the stochastic gradient variance is zero. Gradient correction. Primal-dual algorithms or gradient tracking algorithms are able to convergence much faster than DGD-type algorithms by handling the data heterogeneity issue, as introduced in Section 3.2. Specifically, LEAD is motivated by the design of primal-dual algorithm NIDS [59] and the relation becomes clear if we consider the two-step reformulation of NIDS adopted in [57]: I−W 𝑘 D 𝑘+1 = D 𝑘 + (X − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘 ), (3.4) 2𝜂 X 𝑘+1 = X 𝑘 − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘+1 , (3.5) where X 𝑘 and D 𝑘 represent the primal and dual variables respectively. The dual variable D 𝑘 plays the role of gradient correction. As 𝑘 → ∞, we expect D 𝑘 → −∇F(X∗ ) and X 𝑘 will converge to X∗ via the update in (3.5) since D 𝑘+1 corrects the nonzero gradient ∇F(X 𝑘 ) asymptotically. The key design of Alg. 3 is to provide compression for the auxiliary variable defined as Y 𝑘 = X 𝑘 − 𝜂∇F(X 𝑘 ) − 𝜂D 𝑘 . Such design ensures that the dual variable D 𝑘 lies in Range(I−W), which is essential for convergence. Moreover, it achieves the implicit error compression as we will explain later. To stabilize the algorithm with inexact dual update, we introduce a parameter 𝛾 to control the stepsize in the dual update. Therefore, if we ignore the details of the compression, Alg. 3 can be concisely written as Y 𝑘 = X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 (3.6) 𝛾 D 𝑘+1 = D 𝑘 + (I − W) Ŷ 𝑘 (3.7) 2𝜂 X 𝑘+1 = X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘+1 (3.8) where Ŷ 𝑘 represents the compression of Y 𝑘 and F(X 𝑘 ; 𝜉 𝑘 ) denote the stochastic gradients. Nevertheless, how to compress the communication and how fast the convergence we can attain with compression error are unknown. In the following, we propose to carefully control the compression error by difference compression and error compensation such that the inexact 30 dual update (Line 6) and primal update (Line 7) can still guarantee the convergence as proved in Section 3.4. Compression error. 
Different from existing works, which typically compress the primal variable X 𝑘 or its difference, LEAD first construct an intermediate variable Y 𝑘 and apply compression to obtain its coarse representation Ŷ 𝑘 as shown in the procedure 𝐶𝑜𝑚𝑚Y, H, H𝑤 : • Compress the difference between Y and the state variable H as Q; • Q is encoded into the low-bit representation, which enables the efficient local communication step Ŷ𝑤 = H𝑤 + WQ. It is the only communication step in each iteration. • Each agent recovers its estimate Ŷ by Ŷ = H + Q and we have Ŷ𝑤 = WŶ. • States H and H𝑤 are updated based on Ŷ and Ŷ𝑤 , respectively. We have H𝑤 = WH. By this procedure, we expect when both Y 𝑘 and H 𝑘 converge to X∗ , the compression error vanishes asymptotically due to the assumption we make for the compression operator in Assumption 6. Remark 4. Note that difference compression is also applied in DCD-PSGD [101] and CHOCO- SGD [51], but their state update is the simple integration of the compressed difference. We find this update is usually too aggressive and cause instability as showed in our experiments. Therefore, we adopt a momentum update H = (1 − 𝛼)H + 𝛼Ŷ motivated from DIANA [77], which reduces the compression error for gradient compression in centralized optimization. Implicit error compensation. On the other hand, even if the compression error exists, LEAD essentially compensates for the error in the inexact dual update (Line 6), making the algorithm more stable and robust. To illustrate how it works, let E 𝑘 = Ŷ 𝑘 − Y 𝑘 denote the compression error and e𝑖𝑘 be its 𝑖-th row. The update of D 𝑘 gives 𝛾 𝛾 𝛾 D 𝑘+1 = D 𝑘 + ( Ŷ 𝑘 − Ŷ𝑤𝑘 ) = D 𝑘 + (I − W)Y 𝑘 + (E 𝑘 − WE 𝑘 ) 2𝜂 2𝜂 2𝜂 Í where −WE 𝑘 indicates that agent 𝑖 spreads total compression error − 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑗𝑖 e𝑖𝑘 = −e𝑖𝑘 to all agents and E 𝑘 indicates that each agent compensates this error locally by adding e𝑖𝑘 back. This error compensation also explains why the global view in (3.3) doesn’t involve compression error. 31 Remark 5. Note that in LEAD, the compression error is compensated into the model X 𝑘+1 through Line 6 and Line 7 such that the gradient computation in the next iteration is aware of the compression error. This has some subtle but important difference from the error compensation or error feedback in [92, 116, 97, 45, 104, 68, 102], where the error is stored in the memory and only compensated after gradient computation and before the compression. LEAD in agent’s perspective In Algorithm 3, we described the algorithm with matrix notations for concision. Here we further provide a complete algorithm description from the agents’ perspective. Algorithm 4 LEAD in Agent’s Perspective input: stepsize 𝜂, compression parameters (𝛼, 𝛾), initial values x𝑖0 , h𝑖1 , z𝑖 , ∀𝑖 ∈ {1, 2, . . . , 𝑛} Í𝑛 x𝐾 output: x𝑖𝐾 , ∀𝑖 ∈ {1, 2, . . . , 𝑛} or 𝑖=1𝑛 𝑖 1: for each agent 𝑖 ∈ {1, 2, . . . , 𝑛} do d𝑖1 = z𝑖 − 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 z 𝑗 Í 2: (h𝑤 )𝑖1 = 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 (h𝑤 ) 1𝑗 Í 3: 4: x𝑖1 = x𝑖0 − 𝜂∇ 𝑓𝑖 (x𝑖0 ; 𝜉𝑖0 ) 5: end for 6: for 𝑘 = 1, 2, . . . , 𝐾 − 1 (in parallel for all agents 𝑖 ∈ {1, 2, . . . 
, 𝑛}) do 7: compute ∇ 𝑓𝑖 (x𝑖𝑘 ; 𝜉𝑖𝑘 ) ▷ Gradient computation 8: y𝑖 = x𝑖 − 𝜂∇ 𝑓𝑖 (x𝑖 ; 𝜉𝑖 ) − 𝜂d𝑖 𝑘 𝑘 𝑘 𝑘 𝑘 9: q𝑖𝑘 = Compress(y𝑖𝑘 − h𝑖𝑘 ) ▷ Compression 10: ŷ𝑖𝑘 = h𝑖𝑘 + q𝑖𝑘 11: for neighbors 𝑗 ∈ N𝑖 do 12: Send q𝑖𝑘 and receive q 𝑘𝑗 ▷ Communication 13: end for Í 14: ( ŷ𝑤 ) 𝑖𝑘 = (h𝑤 ) 𝑖𝑘 + 𝑗 ∈N𝑖 ∪{𝑖} 𝑤 𝑖 𝑗 q 𝑘𝑗 15: h𝑖𝑘+1 = (1 − 𝛼)h𝑖𝑘 + 𝛼ŷ𝑖𝑘 16: (h𝑤 ) 𝑖𝑘+1 = (1 − 𝛼)(h𝑤 ) 𝑖𝑘 + 𝛼( ŷ𝑤 ) 𝑖𝑘 𝛾 17: d𝑖𝑘+1 = d𝑖𝑘 + 2𝜂 ŷ𝑖𝑘 − ( ŷ𝑤 ) 𝑖𝑘 18: x𝑖𝑘+1 = x𝑖𝑘 − 𝜂∇ 𝑓𝑖 (x𝑖𝑘 ; 𝜉𝑖𝑘 ) − 𝜂d𝑖𝑘+1 ▷ Model update 19: end for Connections with exiting algorithms The non-compressed variant of LEAD in Alg. 3 recovers NIDS [59], 𝐷 2 [103] and Exact Diffusion [125] as shown in Proposition 1. In Corollary 3, we show that the convergence rate of LEAD exactly recovers the rate of NIDS when 𝐶 = 0, 𝛾 = 1 and 𝜎 = 0. 32 Proposition 1 (Connection to NIDS, 𝐷 2 and Exact Diffusion). When there is no communication compression (i.e., Ŷ 𝑘 = Y 𝑘 ) and 𝛾 = 1, Alg. 3 recovers 𝐷 2 : I+W 𝑘  X 𝑘+1 = 2X − X 𝑘−1 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) + 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 ) . (3.9) 2 Furthermore, if the stochastic estimator of the gradient ∇F(X 𝑘 ; 𝜉 𝑘 ) is replaced by the full gradient, it recovers NIDS and Exact Diffusion with specific settings. Corollary 3 (Consistency with NIDS). When 𝐶 = 0 (no communication compression), 𝛾 = 1 and 𝜎 = 0 (full gradient), LEAD has the convergence consistent with NIDS with 𝜂 ∈ (0, 2/(𝜇 + 𝐿)]:   2 1 L 𝑘+1 ≤ max 1 − 𝜇(2𝜂 − 𝜇𝜂 ), 1 − † L𝑘 . (3.10) 2𝜆 max ((I − W) ) See the proof in B.3.5. Proof of Proposition 1. Let 𝛾 = 1 and Ŷ 𝑘 = Y 𝑘 . Combing Lines 4 and 6 of Alg. 3 gives I−W 𝑘 D 𝑘+1 = D 𝑘 + (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 ). (3.11) 2𝜂 Based on Line 7, we can represent 𝜂D 𝑘 from the previous iteration as 𝜂D 𝑘 = X 𝑘−1 − X 𝑘 − 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 ). (3.12) Eliminating both D 𝑘 and D 𝑘+1 by substituting (3.11)-(3.12) into Line 7, we obtain 𝑘+1 𝑘 𝑘 𝑘  𝑘 I−W 𝑘 𝑘 𝑘 𝑘  X = X − 𝜂∇F(X ; 𝜉 ) − 𝜂D + (X − 𝜂∇F(X ; 𝜉 ) − 𝜂D ) (from (3.11)) 2 I+W 𝑘 I+W 𝑘 = (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 )) − 𝜂D 2 2 I+W 𝑘 I + W 𝑘−1 = (X − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 )) − (X − X 𝑘 − 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 )) (from (3.12)) 2 2 I+W = (2X 𝑘 − X 𝑘−1 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) + 𝜂∇F(X 𝑘−1 ; 𝜉 𝑘−1 )), (3.13) 2 I+W which is exactly 𝐷 2 . It also recovers Exact Diffusion with A = 2 and M = 𝜂I in Eq. (97) of [125]. 33 3.4 Theoretical Analysis In this section, we show the convergence rate for the proposed algorithm LEAD. Before showing the main theorem, we make some assumptions, which are commonly used for the analysis of decentralized optimization algorithms. All proofs are provided in Appendix B.3. Assumption 6 (Unbiased and 𝐶-contracted operator). The compression operator 𝑄 : R𝑑 → R𝑑 is unbiased, i.e., E𝑄(x) = x, and there exists 𝐶 ≥ 0 such that E∥x − 𝑄(x)∥ 22 ≤ 𝐶 ∥x∥ 22 for all x ∈ R𝑑 . Assumption 7 (Stochastic gradient). The stochastic gradient ∇ 𝑓𝑖 (x; 𝜉) is unbiased, i.e., E𝜉 ∇ 𝑓𝑖 (x; 𝜉) = ∇ 𝑓𝑖 (x), and the stochastic gradient variance is bounded: E𝜉 ∥∇ 𝑓𝑖 (x; 𝜉) − ∇ 𝑓𝑖 (x)∥ 22 ≤ 𝜎𝑖2 for all 𝑖 ∈ [𝑛]. Denote 𝜎 2 = 𝑛1 𝑖=1 𝜎𝑖2 . Í𝑛 Assumption 8. Each 𝑓𝑖 is 𝐿-smooth and 𝜇-strongly convex with 𝐿 ≥ 𝜇 > 0, i.e., for 𝑖 = 1, 2, . . . , 𝑛 and ∀x, y ∈ R𝑑 , we have 𝜇 𝐿 𝑓𝑖 (y) + ⟨∇ 𝑓𝑖 (y), x − y⟩ + ∥x − y∥ 2 ≤ 𝑓𝑖 (x) ≤ 𝑓𝑖 (y) + ⟨∇ 𝑓𝑖 (y), x − y⟩ + ∥x − y∥ 2 . 2 2 Theorem 3 (Constant stepsize). Let {X 𝑘 , H 𝑘 , D 𝑘 } be the sequence generated from Alg. 3 and X∗ is the optimal solution with D∗ = −∇F(X∗ ). 
Under Assumptions 5-8, for any constant stepsize 𝜂 ∈ (0, 2/(𝜇 + 𝐿)], if the compression parameters 𝛼 and 𝛾 satisfy  o n 2 2𝜇𝜂(2 − 𝜇𝜂) 𝛾 ∈ 0, min , , (3.14) (3𝐶 + 1) 𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽  n 2 − 𝛽𝛾 o 𝐶 𝛽𝛾 1 𝛼∈ , min , 𝜇𝜂(2 − 𝜇𝜂) , (3.15) 2(1 + 𝐶) 𝑎 1 4 − 𝛽𝛾 with 𝛽 B 𝜆 max (I − W). Then, in total expectation we have 1 1 EL 𝑘+1 ≤ 𝜌 EL 𝑘 + 𝜂2 𝜎 2 , (3.16) 𝑛 𝑛 where L 𝑘 B (1 − 𝑎 1 𝛼)∥X 𝑘 − X∗ ∥ 2 + (2𝜂2 /𝛾)E∥D 𝑘 − D∗ ∥ 2(I−W) † + 𝑎 1 ∥H 𝑘 − X∗ ∥ 2 , (3.17)   1 − 𝜇𝜂(2 − 𝜇𝜂) 𝛾 4(1 + 𝐶) 𝜌 B max ,1− † , 1 − 𝛼 < 1, 𝑎 1 B (3.18) 1 − 𝑎1 𝛼 2𝜆 max ((I − W) ) 𝐶 𝛽𝛾 + 2 The result holds for 𝐶 → 0. 34 Corollary 4 (Complexity bounds). Define the condition numbers of the objective function and 𝐿 𝜆max (I−W) communication graph as 𝜅 𝑓 = 𝜇 and 𝜅 𝑔 = 𝜆+min (I−W) , respectively. Under the same setting in Theorem 3, we can choose 𝜂 = 𝐿1 , 𝛾 = min{ 𝐶 𝛽𝜅 1 𝑓 , 1 (1+3𝐶) 𝛽 }, 1 and 𝛼 = O ( (1+𝐶)𝜅 𝑓 ) such that   1   1   1  𝜌 = max 1 − O ,1− O ,1− O . (1 + 𝐶)𝜅 𝑓 (1 + 𝐶)𝜅 𝑔 𝐶𝜅 𝑓 𝜅 𝑔 With full-gradient (i.e., 𝜎 = 0), we obtain the following complexity bounds: • LEAD converges to the 𝜖-accurate solution with the iteration complexity   1 O (1 + 𝐶)(𝜅 𝑓 + 𝜅 𝑔 ) + 𝐶𝜅 𝑓 𝜅 𝑔 log . 𝜖 • When 𝐶 = 0 (i.e., there is no compression), we obtain 𝜌 = max{1 − O ( 𝜅1𝑓 ), 1 − O ( 𝜅1𝑔 )}, and   the iteration complexity O (𝜅 𝑓 + 𝜅 𝑔 ) log 1𝜖 . This exactly recovers the convergence rate of NIDS [59].   𝜅 𝑓 +𝜅 𝑔 • When 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 , the asymptotical complexity is O (𝜅 𝑓 + 𝜅 𝑔 ) log 1 𝜖 , which also recovers that of NIDS [59] and indicates that the compression doesn’t harm the convergence in this case. 𝜅 𝑓 +𝜅 𝑔 11⊤ • With 𝐶 = 0 (or 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 ) and fully connected communication graph (i.e., W = 𝑛 ), we have 𝛽 = 1 and 𝜅 𝑔 = 1. Therefore, we obtain 𝜌 = 1 − O ( 𝜅1𝑓 ) and the complexity bound O (𝜅 𝑓 𝑙𝑜𝑔 1𝜖 ). This recovers the convergence rate of gradient descent [81]. Remark 6. Under the setting in Theorem 3, LEAD converges linearly to the O (𝜎 2 ) neighborhood of the optimum and converges linearly exactly to the optimum if full gradient is used, e.g., 𝜎 = 0. The linear convergence of LEAD holds when 𝜂 < 2/𝐿, but we omit the proof. Remark 7 (Arbitrary compression precision). Pick any 𝜂 ∈ (0, 2/(𝜇 + 𝐿)], based on the compression- related constant 𝐶 and the network-related constant 𝛽, we can select 𝛾 and 𝛼 in certain ranges to achieve the convergence. It suggests that LEAD supports unbiased compression with arbitrary precision, i.e., any 𝐶 > 0. 35 1 Í𝑛 Corollary 5 (Consensus error). Under the same setting in Theorem 3 , let x 𝑘 = 𝑛 𝑖=1 x𝑖 𝑘 be the averaged model and H0 = H1 , then all agents achieve consensus at the rate 𝑛 1 ∑︁ 2 2L 0 𝑘 2𝜎 2 2 E x𝑖𝑘 − x 𝑘 ≤ 𝜌 + 𝜂 . (3.19) 𝑛 𝑖=1 𝑛 1−𝜌 where 𝜌 is defined as in Corollary 4 with appropriate parameter settings. Theorem 4 (Diminishing stepsize). Let {X 𝑘 , H 𝑘 , D 𝑘 } be the sequence generated from Alg. 3 and 2𝜃 5 X∗ is the optimal solution with D∗ = −∇F(X∗ ). Under Assumptions 5-8, if 𝜂 𝑘 = 𝜃 3 𝜃 4 𝜃 5 𝑘+2 and 𝐶 𝛽𝛾 𝑘 𝛾 𝑘 = 𝜃 4 𝜂 𝑘 , by taking 𝛼𝑘 = 2(1+𝐶) , in total expectation we have 𝑛   1 ∑︁ 2 1 E x𝑖𝑘 − x∗ ≲O (3.20) 𝑛 𝑖=1 𝑘 where 𝜃 1 , 𝜃 2 , 𝜃 3 , 𝜃 4 and 𝜃 5 are constants defined in the proof. The complexity bound for arriving at the 𝜖-accurate solution is O ( 1𝜖 ). Remark 8. Compared with CHOCO-SGD, LEAD requires unbiased compression and the conver- gence under biased compression is not investigated yet. 
The analysis of CHOCO-SGD relies on the bounded gradient assumptions, i.e., ∥∇ 𝑓𝑖 (x)∥ 2 ≤ 𝐺, which is restrictive because it conflicts with the strong convexity while LEAD doesn’t need this assumption. Moreover, in the theorem of CHOCO-SGD, it requires a specific point set of 𝛾 while LEAD only requires 𝛾 to be within a rather large range. This may explain the advantages of LEAD over CHOCO-SGD in terms of robustness to parameter setting. 3.5 Numerical Experiment We consider three machine learning problems – ℓ2 -regularized linear regression, logistic regression, and deep neural network. The proposed LEAD is compared with QDGD [88], DeepSqueeze [102], CHOCO-SGD [51], and two non-compressed algorithms DGD [123] and NIDS [59]. Setup. We consider eight machines connected in a ring topology network. Each agent can only exchange information with its two 1-hop neighbors. The mixing weight is simply set as 1/3. For 36 compression, we use the unbiased 𝑏-bits quantization method with ∞-norm    2 (𝑏−1) |x|  −(𝑏−1) 𝑄 ∞ (x) := ∥x∥ ∞ 2 sign(x) · +u , (3.21) ∥x∥ ∞ where · is the Hadamard product, |x| is the elementwise absolute value of x, and u is a random vector uniformly distributed in [0, 1] 𝑑 . Only sign(x), norm ∥x∥ ∞ , and integers in the bracket need to be transmitted. Note that this quantization method is similar to the quantization used in QSGD [4] and CHOCO-SGD [51], but we use the ∞-norm scaling instead of the 2-norm. This small change brings significant improvement on compression precision as justified both theoretically and empirically in Appendix B.1. In this section, we choose 2-bit quantization and quantize the data blockwise (block size = 512). For all experiments, we tune the stepsize 𝜂 from {0.01, 0.05, 0.1, 0.5}. For QDGD, CHOCO-SGD and Deepsqueeze, 𝛾 is tuned from {0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}. Note that different notations are used in their original papers. Here we uniformly denote the stepsize as 𝜂 and the additional parameter in these algorithms as 𝛾 for simplicity. For LEAD, we simply fix 𝛼 = 0.5 and 𝛾 = 1.0 for all experiments since we find LEAD is robust to parameter settings as we validate in the parameter sensitivity analysis in the below. This indicates the minor effort needed for tuning LEAD. Detailed parameter settings for all experiments are summarized in Appendix B.2.2. (∥A𝑖 x − b𝑖 ∥ 2 + 𝜆∥x∥ 2 ). Data matrices Í𝑛 Linear regression. We consider the problem: 𝑓 (x) = 𝑖=1 A𝑖 ∈ R200×200 and the true solution x′ is randomly synthesized. The values b𝑖 are generated by adding Gaussian noise to A𝑖 x′. We let 𝜆 = 0.1 and the optimal solution of the linear regression problem be x∗ . We use full-batch gradient to exclude the impact of gradient variance. The performance is showed in Fig. 3.1. The distance to x∗ in Fig. 3.1a and the consensus error in Fig. 3.1c verify that LEAD converges exponentially to the optimal consensual solution. It significantly outperforms most baselines and matches NIDS well under the same number of iterations. Fig. 3.1b demonstrates the benefit of compression when considering the communication bits. Fig. 3.1d shows that the compression error vanishes for both LEAD and CHOCO-SGD while the compression error is pretty large for QDGD and DeepSqueeze because they directly compress the local models. 
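A minimal NumPy sketch of the unbiased 𝑏-bit ∞-norm quantizer in Eq. (3.21), applied blockwise as in our experiments, is given below; the function names and the zero-vector guard are illustrative.

import numpy as np

def quantize_inf_norm(x, bits=2, rng=None):
    # Unbiased b-bit quantization with infinity-norm scaling, following Eq. (3.21).
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(x))
    if scale == 0.0:
        return np.zeros_like(x)
    levels = 2.0 ** (bits - 1)
    u = rng.uniform(size=x.shape)                    # dithering noise in [0, 1)
    q = np.floor(levels * np.abs(x) / scale + u)     # low-bit integers to be transmitted
    return scale * np.sign(x) * q / levels

def quantize_blockwise(x, bits=2, block=512, rng=None):
    # Apply the quantizer independently to consecutive blocks of size `block`.
    rng = np.random.default_rng() if rng is None else rng
    out = [quantize_inf_norm(x[i:i + block], bits, rng) for i in range(0, len(x), block)]
    return np.concatenate(out)

Only sign(x), the scalar ∥x∥∞, and the low-bit integers q need to be transmitted, which is what keeps the per-block payload small.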
Figure 3.1: Linear regression problem, comparing DGD and NIDS (32 bits) with QDGD, DeepSqueeze, CHOCO-SGD, and LEAD (2 bits): (a) ∥X^𝑘 − X∗∥_𝐹 versus epochs; (b) ∥X^𝑘 − X∗∥_𝐹 versus bits transmitted; (c) consensus error ∥X^𝑘 − 1_{𝑛×1}X̄^𝑘∥_𝐹; (d) compression error.

Logistic regression. We further consider a logistic regression problem on the MNIST dataset. The regularization parameter is 10⁻⁴. We consider both homogeneous and heterogeneous data settings. In the homogeneous setting, the data samples are randomly shuffled before being uniformly partitioned among all agents, such that the data distribution of each agent is very similar. In the heterogeneous setting, the samples are first sorted by their labels and then partitioned among agents. Due to the space limit, we mainly present the results in the heterogeneous setting here and defer the homogeneous setting to Appendix B.2.1. The results using the full-batch gradient and the mini-batch gradient (the mini-batch size is 512 for each agent) are shown in Fig. 3.2 and Fig. 3.3, respectively, and both settings show the faster convergence and higher precision of LEAD.

Figure 3.2: Logistic regression problem in the heterogeneous case (full-batch gradient): loss 𝑓(X^𝑘) versus (a) epochs and (b) bits transmitted.

Figure 3.3: Logistic regression in the heterogeneous case (mini-batch gradient): loss 𝑓(X^𝑘) versus (a) epochs and (b) bits transmitted.

Neural network. We empirically study the performance of LEAD in optimizing deep neural networks by training AlexNet (240 MB) on the CIFAR10 dataset. The mini-batch size is 64 for each agent. Both the homogeneous and heterogeneous cases are shown in Fig. 3.4.

Figure 3.4: Stochastic optimization on a deep neural network: loss 𝑓(X^𝑘) versus bits transmitted for (a) homogeneous data and (b) heterogeneous data (∗ means divergence).

In the homogeneous
case, CHOCO-SGD, DeepSqueeze, and LEAD perform similarly and outperform the non-compressed variants in terms of communication efficiency, but CHOCO-SGD and DeepSqueeze need more effort for parameter tuning because their convergence is sensitive to the setting of 𝛾. In the heterogeneous case, LEAD achieves the fastest and most stable convergence. Note that in this setting, sufficient information exchange is more important for convergence because models from different agents are moving in significantly diverse directions. In such a case, DGD only converges with a smaller stepsize, and its communication-compressed variants, including QDGD, DeepSqueeze, and CHOCO-SGD, diverge in all parameter settings we tried.

Parameter sensitivity. In the linear regression problem, the convergence of LEAD under different parameter settings of 𝛼 and 𝛾 is tested. The results shown in Figure 3.5 indicate that LEAD performs well in most settings and is robust to the parameter setting. Therefore, in this work, we simply set 𝛼 = 0.5 and 𝛾 = 1.0 for LEAD in all experiments, which indicates the minor effort needed for parameter tuning.

Figure 3.5: Parameter analysis on the linear regression problem: ∥X^𝑘 − X∗∥²_𝐹 for 𝛼 ∈ {0.2, 0.4, 0.6, 0.8, 1} under (a) 𝛾 = 0.4, (b) 𝛾 = 0.6, (c) 𝛾 = 0.8, and (d) 𝛾 = 1.0.

In summary, our experiments verify our theoretical analysis and show that LEAD is able to handle data heterogeneity very well. Furthermore, the performance of LEAD is robust to parameter settings and needs less effort for parameter tuning, which is critical in real-world applications.

3.6 Conclusion

In this work, we investigate communication compression in message passing for decentralized optimization. Motivated by primal-dual algorithms, a novel decentralized algorithm with compression, LEAD, is proposed to achieve a faster convergence rate and to better handle heterogeneous data while enjoying the benefit of efficient communication. The nontrivial analyses on the coupled dynamics of the inexact primal and dual updates as well as the compression error establish the linear convergence of LEAD when the full gradient is used and the linear convergence to the O(𝜎²) neighborhood of the optimum when stochastic gradients are used. Extensive experiments validate the theoretical analysis and demonstrate the state-of-the-art efficiency and robustness of LEAD. LEAD is also applicable to non-convex problems, as empirically verified in the neural network experiments. In addition, we also proposed a linearly convergent decentralized algorithm with compression (ProxLEAD) for composite optimization problems [56].

CHAPTER 4
GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL

Graph neural networks (GNNs) have shown their power in graph representation learning for numerous tasks. In this chapter, we discover an interesting phenomenon: although residual connections in the message passing of GNNs help improve the performance, they immensely amplify GNNs' vulnerability against abnormal node features.
This is undesirable because in real-world applications, node features in graphs could often be abnormal such as being naturally noisy or adversarially manipulated. We analyze possible reasons to understand this phenomenon and aim to design GNNs with stronger resilience to abnormal features. Our understandings motivate us to propose and derive a simple, efficient, interpretable, and adaptive message passing scheme, leading to a novel GNN with Adaptive residual, AirGNN. Extensive experiments under various abnormal feature scenarios demonstrate the effectiveness of the proposed algorithm. The implementation is available at https://github.com/lxiaorui/AirGNN. 4.1 Introduction Recent years have witnessed the great success of graph neural networks (GNNs) in representation learning for graph structure data [74]. Essentially, GNNs generalize deep neural networks (DNNs) from regular grids, such as image, video and text, to irregular data such as social, energy, transportation, citation, and biological networks. Such data can be naturally represented as graphs with nodes and edges. The key building block for such generalization is the neural message passing framework [29]: x𝑢(𝑘+1) = UPDATE (𝑘) x𝑢(𝑘) , mN (𝑘)  (𝑢) (4.1) where x𝑢(𝑘) ∈ R𝑑 denotes the feature vector of node 𝑢 in the 𝑘-th iteration of message passing, and (𝑘) mN (𝑢) is the message aggregated from 𝑢’s neighborhood N (𝑢). The specific design of message passing scheme can be motivated from spectral domain [48, 23] or spatial domain [33, 109, 91, 29]. 42 It usually linearly smooths the features in a local neighborhood on the graph. GNNs have achieved superior performance in a large number of benchmark datasets [117] where the node features are assumed to be complete and informative. However, in real-world applications, some node features could be abnormal from various aspects. For instance, in social networks, new users might not have complete profile before they make connections with others, leading to missing user features. In transportation networks, node features can be noisy since there exist certain uncertainty and dynamics in the observation of the traffic information. What is worse, node features can be adversarially chosen by the attacker to maliciously manipulate the prediction made by GNNs. Therefore, it is greatly desired to design GNN models with stronger resilience to abnormal node features. In this work, we first perform empirical investigations on how representative GNN models behave on graphs with abnormal features. Specifically, based upon standard benchmark datasets, we simulate the abnormal features by replacing the features of randomly selected nodes with random Gaussian noise. Then the performance of node classification on abnormal features and normal features are examined separately. From our preliminary study in Section 4.2, we reveal two interesting observations: (1) Feature aggregation can boost the resilience to abnormal features, but too many aggregations could hurt the performance on both normal and abnormal features; and (2) Residual connection helps GNNs benefit from more layers for normal features, while making GNNs more fragile to abnormal features. We then provide possible explanations to understand these observed phenomena from the perspective of graph Laplacian smoothing. Our analyses imply that there might exist an intrinsic tension between feature aggregation and residual connection, which results in a performance tradeoff between normal features and abnormal features. 
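For reference, the message passing framework in Eq. (4.1) can be instantiated as in the following sketch, where mean aggregation and a convex-combination UPDATE are illustrative stand-ins for the model-specific choices; the tiny graph and all names are assumptions made only for the example.

import numpy as np

# A tiny 3-node path graph with 2-dimensional node features (illustrative data).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def message_passing_step(X, neighbors, update):
    # One round of Eq. (4.1): aggregate a message from each neighborhood N(u), then UPDATE.
    X_new = np.empty_like(X)
    for u in range(X.shape[0]):
        m_u = X[neighbors[u]].mean(axis=0)     # aggregated message (mean as an example)
        X_new[u] = update(X[u], m_u)           # model-specific UPDATE function
    return X_new

# Example UPDATE: a convex combination of a node's own feature and its message.
X_next = message_passing_step(X, neighbors, update=lambda x, m: 0.5 * x + 0.5 * m)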
Motivated by these findings and understandings, we aim to design new GNNs with stronger resilience to abnormal features while largely maintaining the performance on normal features. Our contributions can be summarized as follows: • We discover an intrinsic tension between feature aggregation and residual connection in GNNs, and the corresponding performance tradeoff between abnormal and normal features. 43 We also analyze possible reasons to explain and understand these findings. • We propose a simple, efficient, principled and adaptive message passing scheme, which leads to a novel GNN model with adaptive residual, named as AirGNN. • Extensive experiments under various abnormal feature scenarios demonstrate the superiority of the proposed algorithm. The ablation study demonstrates how the adaptive residuals mitigate the impact of abnormal features. 4.2 Preliminary Before introducing the preliminary study, we first define the notations used throughout the paper. Notations. We use bold upper-case letters such as X to denote matrices. Given a matrix X ∈ R𝑛×𝑑 , we use X𝑖 to denote its 𝑖-th row and X𝑖 𝑗 to denote its element in 𝑖-th row and 𝑗-th √︃Í column. The Frobenius norm and ℓ21 norm of a matrix X are defined as ∥X∥ 𝐹 = 2 𝑖 𝑗 X𝑖 𝑗 and Í Í √︃Í 2 ∥X∥ 21 = 𝑖 ∥X𝑖 ∥ 2 = 𝑖 𝑗 X𝑖 𝑗 , respectively. We define ∥X∥ 2 = 𝜎max (X) where 𝜎max (X) is the largest singular value of X. Let G = {V, E} be a graph with the node set V = {𝑣 1 , . . . , 𝑣 𝑛 } and the undirected edge set E = {𝑒 1 , . . . , 𝑒 𝑚 }. We use N (𝑣 𝑖 ) to denote the neighboring nodes of node 𝑣 𝑖 , including 𝑣 𝑖 itself. Suppose that each node is associated with a 𝑑-dimensional feature vector, and the features for all nodes are denoted as Xfea ∈ R𝑛×𝑑 . The graph structure G can be represented as an adjacent matrix A ∈ R𝑛×𝑛 , where A𝑖 𝑗 = 1 when there exists an edge between nodes 𝑣 𝑖 and 𝑣 𝑗 , and A𝑖 𝑗 = 0 otherwise. The graph Laplacian matrix is defined as L = D − A, where D is the diagonal degree matrix. Let 1 1 us denote the commonly used feature aggregation matrix in GNNs [48] as à = D̂− 2 ÂD̂− 2 where  = A + I is the adjacent matrix with self-loop and its degree matrix is D̂. The corresponding Laplacian matrix is defined as L̃ = I − Ã. In this work, we focus on the setting where a subset of nodes in the graph contain abnormal features, while the remaining nodes have normal features. In the remaining of this chapter, we use abnormal/normal features to denote nodes with abnormal/normal features, for simplicity. 44 4.2.1 Preliminary Study Experimental setup. To investigate how GNNs behave on abnormal and normal node features, we design semi-supervised node classification experiments on three common datasets (i.e., Cora, CiteSeer and PubMed), following the data splits in the work [48]. Moreover, we simulate the abnormal features by assigning 10% of the nodes with random features sampled from a standard Gaussian distribution. The experiments are performed on representative GNN models covering coupled and decoupled architectures, including GCN [48], GCNII [14], APPNP [49], and their variants with or without residual connections in feature aggregations, denoted as w/Res and wo/Res. All methods follow the hyperparameter settings in their original papers. We examine how these models perform when the number of layers increases. Note that for the decoupled architectures such as APPNP, we fix the 2-layer MLP and increase the number of propagation layers. 
For the coupled architectures such as GCN and GCNII, we instead increase the number of feature transformation and propagation layers simultaneously. We report the average performance over 10 random selections of the noisy node sets. The node classification accuracy (mean and standard deviation) on nodes with abnormal and normal features is illustrated in Figure 4.1 and Figure 4.2, respectively.

Figure 4.1: Node classification accuracy on abnormal nodes (Cora) versus the number of layers: (a) APPNP, (b) GCNII, (c) GCN, each with and without residual connections (w/Res and wo/Res).

Figure 4.2: Node classification accuracy on normal nodes (Cora) versus the number of layers: (a) APPNP, (b) GCNII, (c) GCN.

Observations. From Figure 4.1 and Figure 4.2, we can make the following observations: (1) Without residual connection, more layers (e.g., > 2 for GCN and GCNII, > 10 for APPNP) hurt the accuracy on nodes with normal features. However, more layers boost the accuracy on nodes with abnormal features significantly, before finally starting to decrease; (2) With residual connection, the accuracy on nodes with normal features keeps increasing with more layers¹. However, the accuracy on nodes with abnormal features only increases marginally when stacking more layers, and then starts to decrease. While we only present the experiments on Cora, we defer the results on other datasets to Appendix C.1, which provide similar observations. To conclude, we can summarize these observations into two major findings:

• Finding I: Feature aggregation can boost the resilience to abnormal features, but too many aggregations could hurt the performance on both normal and abnormal nodes;

• Finding II: Residual connection helps GNNs benefit from more layers for nodes with normal features, while making GNNs more fragile to abnormal features.

¹ GCN w/Res is an exception because its residual is not appropriate, which is consistent with the experiments in the work [48].

4.2.2 Understandings

In this subsection, we provide the understanding and explanation for the aforementioned findings from the perspective of graph Laplacian smoothing.

Understanding Finding I: Feature aggregation as Laplacian smoothing. The message passing in GCN [48], GCNII wo/ residual, and APPNP wo/ residual (as well as many popular GNN models) follows the feature aggregation

Xout = ÃXin,   (4.2)

where Xin and Xout represent the features before and after the message passing layer, respectively. It can be interpreted as one gradient descent step for the Laplacian smoothing problem [73]

arg min_{X∈R^{𝑛×𝑑}} L₁(X) := (1/2) tr(X^⊤(I − Ã)X) = (1/2) Σ_{(𝑣_𝑖,𝑣_𝑗)∈E} ∥ X_𝑖/√(𝑑_𝑖 + 1) − X_𝑗/√(𝑑_𝑗 + 1) ∥²₂,   (4.3)

where 𝑑_𝑖 is the node degree of node 𝑣_𝑖. Eq. (4.2) can be derived from Xout = Xin − (I − Ã)Xin = ÃXin, with the initialization X = Xin and stepsize 𝛾 = 1. The Laplacian smoothing problem penalizes the feature difference between neighboring nodes. To reduce this penalty, the feature aggregation in Eq.
(4.2) smooths the node features by taking the average of local neighbors, and thus can be considered as low-pass filter which gradually filters out high-frequency signals [84, 126]. Therefore, it increases the resilience to abnormal features which are likely to be high-frequency signals. In other words, the local neighboring nodes help to correct the abnormal features. Unfortunately, if applied too many times, these low-pass filters could overly smooth the features (well-known as oversmoothing [55, 85]) such that nodes are not distinguishable enough, providing an explanation to the degraded performance on both abnormal and normal features when stacking too many layers. Understanding Finding II: Residual connection maintains feature proximity To adjust the feature smoothness for better performance, APPNP [49] utilizes residual connections in message passing as follows X 𝑘+1 = (1 − 𝛼) ÃX 𝑘 + 𝛼Xin , (4.4) where X0 = Xin . It can be considered as an iterative solution for the regularized Laplacian smoothing problem [73] 𝛼 1   arg min L2 (X) B ∥X − Xin ∥ 2𝐹 + tr X⊤ (I − Ã)X , (4.5) X∈R𝑛×𝑑 2(1 − 𝛼) 2 with initialization X = Xin and stepsize 𝛾 = 1 − 𝛼 due to  𝛼  X 𝑘+1 = X 𝑘 − (1 − 𝛼) (X 𝑘 − Xin ) + (I − Ã)X 𝑘 = (1 − 𝛼) ÃX 𝑘 + 𝛼Xin . 1−𝛼 GCNII [14] adopts a similar message passing but further combines a feature transformation layer in each message passing step, which leads to a coupled architecture, as contrast to the decoupled 47 architecture of APPNP. The residual connection naturally arises when regularizing the proximity between input and output features, as showed in the first term of L2 (X). Such proximity can help avoid the trivial solution for the problem in Eq. (4.3), i.e., totally oversmoothed features only depending on node degrees, and consequently mitigates the oversmoothing issue. More intuitively, residual connections in GNNs provide direct information flows between layers that can preserve some necessary high-frequency signals for better discrimination between classes. More layers with residual provide a more accurate solution to Eq. (4.5), which explains the performance gain from deeper GNNs. Unfortunately, these residual connections also undesirably carry on abnormal features which are detrimental, leading to the inferior performance on abnormal features. 4.3 Algorithm In this section, we first motivate the proposed adaptive message passing scheme (AMP) with further discussions on our preliminary study. We then introduce more details about AMP, its interpretations, convergence guarantee and computation complexity, as well as the model architecture of AirGNN. 4.3.1 Design Motivation Our preliminary study in Section 4.2 reveals an intrinsic tension between feature aggregation and residual connection: (1) feature aggregation helps smooth out abnormal features, while it could cause inappropriate smoothing for normal features; (2) residual connection is essential for adjusting the feature smoothness, but it could be detrimental for abnormal features. Although this conflict can be partially mitigated by adjusting the residual connection such as the residual weight 𝛼 in GCNII [14] and APPNP [49], such global adjustment cannot be adaptive to a subset of the nodes, e.g., the nodes with abnormal features. This is crucial because in practice we often encounter the scenario where only a subset of nodes contain abnormal features. Therefore, how to reconcile this dilemma still desires dedicated efforts. 
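Before introducing the node-wise adaptive scheme, the following sketch spells out the residual-connected propagation of Eq. (4.4) used by APPNP; here à is assumed to be precomputed as the normalized adjacency with self-loops, and the point to note is that every node shares the same global residual weight 𝛼.

import numpy as np

def appnp_propagate(X_in, A_tilde, alpha=0.1, K=10):
    # Iterate Eq. (4.4): X^{k+1} = (1 - alpha) * A_tilde @ X^k + alpha * X_in, with X^0 = X_in.
    # A_tilde is the normalized adjacency with self-loops (assumed precomputed).
    X = X_in
    for _ in range(K):
        X = (1 - alpha) * (A_tilde @ X) + alpha * X_in   # aggregation plus a global residual
    return X

Because 𝛼 is shared by all nodes, abnormal features are carried on by the residual just as strongly as normal ones, which is exactly the limitation addressed next.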
We then naturally ask a question: Can we design a better message passing scheme with node-wise adaptive feature aggregation and residual connection?

The motivation of the proposed idea builds upon the following intuition: while it is important to maintain the proximity between input and output features as in Eq. (4.5), it could be overly aggressive to penalize their deviations by the square of the Frobenius norm, i.e., ∥X − Xin∥²_𝐹 = Σ_{𝑖=1}^𝑛 ∥X_𝑖 − (Xin)_𝑖∥²₂. The fact that this penalty does not tolerate large deviations weakens the capability to remove abnormal features through Laplacian smoothing. This motivates us to consider an alternative proximity penalty

∥X − Xin∥_21 := Σ_{𝑖=1}^𝑛 ∥X_𝑖 − (Xin)_𝑖∥₂,   (4.6)

which instead penalizes the deviations by the ℓ1 norm of the row-wise ℓ2 norms, namely the ℓ21 norm. The ℓ21 norm promotes row sparsity in X − Xin, and it also allows large deviations because the penalty on large values is less aggressive, leading to the potential removal of abnormal features. Therefore, we propose the following Laplacian smoothing problem regularized by ℓ21-norm proximity control:

arg min_{X∈R^{𝑛×𝑑}} 𝜆∥X − Xin∥_21 + (1/2) tr(X^⊤(I − Ã)X),   (4.7)

where 𝜆 ∈ [0, ∞) is a parameter to adjust the balance between proximity and Laplacian smoothing. To ease the tuning of 𝜆, we make a modification of Eq. (4.7):

arg min_{X∈R^{𝑛×𝑑}} L(X) := 𝜆∥X − Xin∥_21 + (1 − 𝜆) tr(X^⊤(I − Ã)X),   (4.8)

where 𝜆 ∈ [0, 1] controls the balance.

4.3.2 Adaptive Message Passing

Figure 4.3: Diagram of Adaptive Message Passing.

L(X) is a composite objective with non-smooth and smooth components. We optimize it by proximal gradient descent [9] and obtain the following iterations as the adaptive message passing (AMP):

Y^𝑘 = X^𝑘 − 2𝛾(1 − 𝜆)(I − Ã)X^𝑘 = (1 − 2𝛾(1 − 𝜆))X^𝑘 + 2𝛾(1 − 𝜆)ÃX^𝑘,   (4.9)

X^{𝑘+1} = arg min_X { 𝜆∥X − Xin∥_21 + (1/(2𝛾))∥X − Y^𝑘∥²_𝐹 },   (4.10)

where X^0 = Xin and 𝛾 is the stepsize to be specified later. Let Z = X − Xin, and Eq. (4.10) can be rewritten as:

Z^{𝑘+1} = arg min_Z { 𝜆∥Z∥_21 + (1/(2𝛾))∥Z − (Y^𝑘 − Xin)∥²_𝐹 } = prox_{𝛾𝜆∥·∥_21}(Y^𝑘 − Xin),   (4.11)

X^{𝑘+1} = Xin + Z^{𝑘+1}.   (4.12)

The 𝑖-th row of the proximal operator in Eq. (4.11) can be computed analytically:

(prox_{𝛾𝜆∥·∥_21}(X))_𝑖 = max(∥X_𝑖∥₂ − 𝛾𝜆, 0) X_𝑖/∥X_𝑖∥₂ = max(1 − 𝛾𝜆/∥X_𝑖∥₂, 0) · X_𝑖.   (4.13)

Note that the proximal operator returns 0 if the input vector is 0. Substituting X in Eq. (4.13) with Y^𝑘 − Xin and combining Eq. (4.11) and Eq. (4.12), Eq. (4.12) becomes

X_𝑖^{𝑘+1} = (Xin)_𝑖 + 𝛽_𝑖(Y_𝑖^𝑘 − (Xin)_𝑖) = (1 − 𝛽_𝑖)(Xin)_𝑖 + 𝛽_𝑖 Y_𝑖^𝑘,  ∀𝑖 ∈ [𝑛],   (4.14)

where 𝛽_𝑖 := max(1 − 𝛾𝜆/∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂, 0). To summarize, the proposed adaptive message passing (AMP) scheme is shown in Figure 4.4, and a diagram is shown in Figure 4.3. In detail, AMP works as follows:

• The first step takes a feature aggregation within the local neighbors with a self-loop weighted by 1 − 2𝛾(1 − 𝜆);
• The second step computes a weight 𝛽_𝑖 ∈ [0, 1] for each node 𝑣_𝑖 depending on the local deviation ∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂;
• The final step takes a linear combination of the input features Xin and the aggregated features Y^𝑘, where the node-wise residual is adaptively weighted by 1 − 𝛽_𝑖 for each node 𝑣_𝑖.

Y^𝑘 = (1 − 2𝛾(1 − 𝜆))X^𝑘 + 2𝛾(1 − 𝜆)ÃX^𝑘
𝛽_𝑖 = max(1 − 𝛾𝜆/∥Y_𝑖^𝑘 − (Xin)_𝑖∥₂, 0)  ∀𝑖 ∈ [𝑛]
X_𝑖^{𝑘+1} = (1 − 𝛽_𝑖)(Xin)_𝑖 + 𝛽_𝑖 Y_𝑖^𝑘  ∀𝑖 ∈ [𝑛]
Figure 4.4: Adaptive Message Passing (AMP).

The convergence guarantee of AMP and the parameter setting for the stepsize 𝛾 are illustrated in Theorem 5.
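A minimal NumPy sketch of one AMP iteration (the three steps summarized in Figure 4.4) is given below; Ã is assumed precomputed, the small constant guarding the division is an implementation detail, and the stepsize 𝛾 = 1/(2(1 − 𝜆)) follows the setting discussed around Theorem 5.

import numpy as np

def amp_step(X, X_in, A_tilde, lam, gamma):
    # One AMP iteration: aggregation, node-wise adaptive scores, adaptive residual.
    Y = (1 - 2 * gamma * (1 - lam)) * X + 2 * gamma * (1 - lam) * (A_tilde @ X)      # step 1
    dev = np.linalg.norm(Y - X_in, axis=1, keepdims=True)          # local deviation per node
    beta = np.maximum(1 - gamma * lam / np.maximum(dev, 1e-12), 0.0)   # step 2: adaptive scores
    return (1 - beta) * X_in + beta * Y                            # step 3: adaptive residual

def amp(X_in, A_tilde, lam, K):
    # Run K AMP steps from X^0 = X_in with gamma = 1 / (2 * (1 - lam)),
    # under which step 1 reduces to Y^k = A_tilde @ X^k.
    gamma = 1.0 / (2.0 * (1.0 - lam))
    X = X_in
    for _ in range(K):
        X = amp_step(X, X_in, A_tilde, lam, gamma)
    return X

Nodes whose aggregated features deviate strongly from their input features get 𝛽_𝑖 close to 1 (aggregation dominates), while consistent nodes get 𝛽_𝑖 close to 0 (the residual dominates).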
The convergence guarantee of AMP and the parameter setting for the stepsize γ are given in Theorem 5. According to Theorem 5, if we set γ = 1/(4(1 − λ)) or γ = 1/(2(1 − λ)), then the first step of AMP can be simplified as Y^k = ½X^k + ½ÃX^k or Y^k = ÃX^k, respectively. The choice of stepsize only impacts the convergence speed but not the ultimate effect of AMP when it converges to the fixed-point solution. We also discuss the computation complexity per iteration of AMP in Remark 9.

Theorem 5 (Convergence of AMP). Under the stepsize setting γ < 1/((1 − λ)∥L̃∥_2), the proposed adaptive message passing scheme (AMP) in Eq. (4.9) and Eq. (4.10) converges to the optimal solution of the problem defined in Eq. (4.8). In practice, it is sufficient to choose any γ < 1/(2(1 − λ)) since ∥L̃∥_2 ≤ 2. Moreover, if the connected components of the graph G are not bipartite graphs, it is sufficient to choose γ = 1/(2(1 − λ)) since ∥L̃∥_2 < 2.

Proof. The objective that the iterations in AMP optimize is

arg min_{X∈ℝ^{n×d}} L(X) := λ∥X − Xin∥_21 + (1 − λ) tr(X⊤(I − Ã)X),    (4.15)

where the first term is denoted g(X) and the second term f(X), and both are convex functions. Moreover, g is a non-smooth function, while f is a smooth function. In particular, f is L-smooth with L = 2(1 − λ)∥L̃∥_2 = 2(1 − λ)∥I − Ã∥_2 due to

∥∇f(X_1) − ∇f(X_2)∥_F = ∥2(1 − λ)L̃(X_1 − X_2)∥_F ≤ 2(1 − λ)∥L̃∥_2 ∥X_1 − X_2∥_F.    (4.16)

AMP essentially applies a forward-backward splitting to the composite objective g(X) + f(X):

X^{k+1} = (I + γ∂g)^{-1}(X^k − γ∇f(X^k))    (4.17)
        = arg min_X ½∥X − (X^k − γ∇f(X^k))∥²_F + γ g(X),    (4.18)

which is known as the proximal gradient method. The convergence of this forward-backward splitting is ensured if the stepsize satisfies γ < 2/L according to Lemma 4.4 in [20]. Therefore, AMP provably converges to the optimal solution under the setting γ < 1/((1 − λ)∥L̃∥_2). For the symmetrically normalized Laplacian matrix, we have ∥L̃∥_2 ≤ 2 [19] and thus 1/(2(1 − λ)) ≤ 1/((1 − λ)∥L̃∥_2). Therefore, any γ < 1/(2(1 − λ)) is sufficient. Moreover, according to [19], if the connected components of the graph G are not bipartite graphs, we have ∥L̃∥_2 < 2 and thus γ = 1/(2(1 − λ)) < 1/((1 − λ)∥L̃∥_2) is sufficient.

Remark 9 (Computation complexity). AMP is as efficient as the simple feature aggregation Xout = ÃXin because the additional computation cost from the second and third steps in Figure 4.4 is in the order O(nd), where n is the number of nodes and d is the feature dimension. This is negligible compared with the computation cost O(md) of feature aggregation, where m is the number of edges, due to the fact that there are usually many more edges than nodes in real-world graphs, i.e., m ≫ n.

4.3.3 Interpretation of AMP
Interestingly, the proposed AMP has a simple and intuitive interpretation as an adaptive residual connection, which aligns well with our design motivation:
• If the feature of node v_i, i.e., (Xin)_i, is significantly inconsistent with its local neighbors, i.e., the aggregated feature Y_i^k, then the local deviation ∥Y_i^k − (Xin)_i∥_2 will be large, which leads to a β_i close to 1. Therefore, the final step will assign a small weight to the residual, i.e., (1 − β_i)(Xin)_i, and the aggregated feature Y_i^k will dominate.
• On the contrary, if (Xin)_i is already consistent with its local neighbors, ∥Y_i^k − (Xin)_i∥_2 will be small, which leads to a β_i close to 0. Thus, the residual will dominate, which is reasonable since there is less need to aggregate features in this case.
• To summarize, the local deviation ∥Y_i^k − (Xin)_i∥_2 provides a natural transition from β_i → 1 to β_i → 0, and the transition can be modulated by λ, which can be either learned or tuned as a hyperparameter through cross-validation. This transition provides a node-wise adaptive residual connection for the message passing scheme.

Adaptivity for abnormal & normal features. According to the homophily assumption on graph-structured data [76, 130, 127, 48], the feature representations of normal features should be more consistent with local neighbors than those of abnormal features. As a result, AMP will assign more residual (i.e., smaller β) to normal features but less residual (i.e., larger β) to abnormal features, providing a customized tradeoff between feature aggregation and residual connection. Consequently, it can promote both the resilience to abnormal features and the performance on normal features. The above discussion also implies a clear physical meaning for β in AMP, and we formally define it as the adaptive score.

Definition 1 (Adaptive score). The variables {β_1, · · · , β_n} in the adaptive message passing scheme (AMP) are defined as the adaptive scores for nodes {v_1, · · · , v_n}, respectively, in graph G. In particular, the larger β_i is, the more likely the feature of node v_i is abnormal.

Remark 10 (Nonlinear smoother). Different from most existing message passing schemes, which are linear smoothers, AMP is a nonlinear smoother because the weights {β_i} are computed from Y^k and Xin. This nonlinearity is the key to achieving adaptive residual connections for different nodes.

4.3.4 Model Architecture
The proposed adaptive message passing scheme (AMP) can be used as a building block in many GNN models to improve the resilience to abnormal node features. In this work, we choose the decoupled architecture, as in APPNP [49] and DAGNN [65], and propose the Adaptive residual GNN (AirGNN):

Xin = h_θ(Xfea),    (4.19)
Ypre = AMP(Xin, K, λ).    (4.20)

h_θ(·) is any machine learning model parameterized by learnable parameters θ, such as a multilayer perceptron (MLP). Xfea ∈ ℝ^{n×d} denotes the initial node features. The model h_θ(·) first transforms the initial node features as Xin = h_θ(Xfea). AMP takes h_θ(Xfea) as input and performs K steps of AMP with the hyperparameter λ. Similar to the majority of existing GNN models, the training objective is the cross-entropy classification loss on the labeled nodes, and the whole model is trained in an end-to-end way. Note that AirGNN is very efficient as explained in Remark 9, and it only requires two hyperparameters K and λ without introducing additional parameters to learn, which could reduce the risk of overfitting.

4.4 Experiment
In this section, we aim to verify the effectiveness of the proposed adaptive message passing scheme (AMP) and the AirGNN model through semi-supervised node classification tasks. Specifically, we try to answer the following questions: (1) How does AirGNN perform on abnormal and normal features? (Sections 4.4.2 and 4.4.3) and (2) How does AirGNN work by adjusting the adaptive residual? (Section 4.4.4)

4.4.1 Experimental Settings
Datasets and baselines. We conduct experiments on 8 real-world datasets including three citation graphs, i.e., Cora, Citeseer, Pubmed [93], two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics [95], two co-purchase graphs, i.e., Amazon Computers and Amazon Photo [95], and one OGB dataset, i.e., ogbn-arxiv [113].
Due to the space limit, we only present the results on Cora, Citeseer, and Pubmed in this section, but defer the results on other datasets to Appendix C.2.1. The data statistics (full graphs) used in Section 4.4.2 are summarized in Table 4.1, and the data statistics (largest connected components) used in Section 4.4.3 are summarized in Table 4.2. We use fixed data splits for the Cora, CiteSeer, PubMed and ogbn-arxiv datasets, and random data splits for the other datasets.

Table 4.1: Data statistics on benchmark datasets.
Dataset | Classes | Nodes | Edges | Features | Training Nodes | Validation Nodes | Test Nodes
Cora | 7 | 2708 | 5278 | 1433 | 20 per class | 500 | 1000
CiteSeer | 6 | 3327 | 4552 | 3703 | 20 per class | 500 | 1000
PubMed | 3 | 19717 | 44324 | 500 | 20 per class | 500 | 1000
Coauthor CS | 15 | 18333 | 81894 | 6805 | 20 per class | 30 per class | Rest nodes
Coauthor Physics | 5 | 34493 | 247962 | 8415 | 20 per class | 30 per class | Rest nodes
Amazon Computers | 10 | 13381 | 245778 | 767 | 20 per class | 30 per class | Rest nodes
Amazon Photo | 8 | 7487 | 119043 | 745 | 20 per class | 30 per class | Rest nodes
ogbn-arxiv | 40 | 169343 | 1166243 | 128 | 54% | 18% | 28%

Table 4.2: Dataset statistics for adversarially attacked datasets.
Dataset | NLCC | ELCC | Classes | Features
Cora | 2,485 | 5,069 | 7 | 1,433
CiteSeer | 2,110 | 3,668 | 6 | 3,703
PubMed | 19,717 | 44,338 | 3 | 500

The proposed AirGNN is compared with representative GNNs, including GCN [48], GAT [109], APPNP [49] and GCNII [14]. We defer the comparison with the variants of APPNP and Robust GCN [128] to Appendix C.2.3 and C.2.4, respectively.

Parameter settings. For all baselines, we follow the best hyperparameter settings in their original papers. Additionally, we tune the residual weight α for APPNP and GCNII in the range [0, 1]. For AirGNN, we use a two-layer MLP as the base model h_θ(·), following APPNP. We fix the learning rate to 0.01, dropout to 0.8, and weight decay to 0.0005. Moreover, we set γ = 1/(2(1 − λ)) as suggested by Theorem 5. We choose K = 10 and tune λ in the range [0, 1]. The Adam optimizer [47] is used in all experiments. We run all experiments 10 times and report the mean and variance.

Evaluation setting. We assess the performance of all models under two types of abnormal feature scenarios, namely noisy features and adversarial features. The abnormal features are injected into randomly selected test nodes after model training. By default, all hyperparameters are tuned according to the performance on the validation sets when the dataset is clean. If the hyperparameter λ of AirGNN is tuned according to the validation sets after injecting abnormal features, the performance is even better, as discussed in Appendix C.2.2. The performance on clean data is shown in Section 4.4.5 to demonstrate that AirGNN does not need to sacrifice accuracy for better robustness against abnormal features.

4.4.2 Performance Comparison with Noisy Features
In this subsection, we consider the abnormal features in the noisy feature scenario. Specifically, we simulate the noisy features by assigning a subset of the nodes random features sampled from a multivariate standard Gaussian distribution. Note that the selection of the noisy subset has an apparent impact on the performance since some nodes are less vulnerable to abnormal features while others are more vulnerable. To reduce such variance, we report the average performance over 10 random selections of the noisy node sets, similar to the settings in the preliminary study in Section 4.2.
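As a concrete illustration of this protocol, the sketch below injects Gaussian noise features into a random subset of test nodes. It is a hypothetical helper written for this description, not the experiment code; in particular, whether the noisy ratio is counted over all nodes or only over test nodes is an assumption made here.

```python
import numpy as np

def inject_noisy_features(X: np.ndarray, test_idx: np.ndarray,
                          noisy_ratio: float, seed: int = 0):
    """Replace the features of randomly chosen test nodes with standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    num_noisy = int(round(noisy_ratio * n))          # assumed: ratio taken over all nodes
    noisy_idx = rng.choice(test_idx, size=num_noisy, replace=False)
    X_noisy = X.copy()
    X_noisy[noisy_idx] = rng.standard_normal((num_noisy, d))
    return X_noisy, noisy_idx
```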
We report the node classification test accuracy on abnormal (noisy) features and normal features in Figure 4.5 and Figure 4.6, respectively, under varying noisy ratios. From these figures, we can observe:
• Figure 4.5 shows that AirGNN significantly outperforms all baselines on all datasets in terms of the performance on noisy nodes. This verifies that AMP is able to improve the resilience to noisy features, aligning well with the design motivation.
• Figure 4.6 shows that AirGNN promotes the performance on normal nodes when abnormal nodes exist. This is because AMP can remove some abnormal features which are detrimental to normal nodes.

[Figure 4.5: Node classification accuracy on abnormal (noisy) nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, and AirGNN under varying ratios of noisy nodes (%).]

[Figure 4.6: Node classification accuracy on normal nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, and AirGNN under varying ratios of noisy nodes (%).]

4.4.3 Performance Comparison with Adversarial Features
In this subsection, we consider the abnormal feature scenario in which the node features are maliciously attacked to manipulate the prediction of GNNs. We use Nettack [134] implemented in DeepRobust² [58], a PyTorch library for adversarial attacks and defenses, to generate the adversarial features. We randomly choose 40 test nodes as the target nodes, and assess the performance under increasing perturbation budgets {0, 5, 10, 20, 50, 80}, where the perturbation number denotes the number of feature dimensions that can be manipulated. The node classification accuracy on these attacked nodes is shown in Figure 4.7. From these figures, we can make the following observations:
• AirGNN is significantly more robust against adversarially attacked features than all baselines. MLP is the most vulnerable model, which demonstrates the usefulness of graph structure information in combating abnormal node features.
• The advantages of AirGNN over the baselines become much stronger with larger perturbation budgets. This suggests that AMP can significantly improve the resilience to abnormal features.

² https://github.com/DSE-MSU/DeepRobust

[Figure 4.7: Node classification accuracy on adversarial nodes for (a) Cora, (b) CiteSeer, and (c) PubMed, comparing GAT, GCN, GCNII, APPNP, AirGNN, and MLP under increasing perturbation numbers.]

4.4.4 Adaptive Residual for Abnormal & Normal Nodes
To further understand and verify how AMP and AirGNN work, we investigate the adaptive score β_i for each node v_i. Specifically, the average adaptive scores for abnormal nodes and normal nodes in the last layer of AMP are computed separately. In the noisy feature scenario, we fix the ratio of noisy nodes at 10%.
In the adversarial feature scenario, we choose 40 target nodes and fix the perturbation number at 80. The results in the noisy and adversarial feature scenarios are shown in Table 4.3 and Table 4.4, respectively. From these tables, we can observe:
• On the one hand, in both scenarios the average adaptive scores for abnormal nodes are significantly higher than those for normal nodes. This verifies our intuition that large adaptive scores are strongly related to abnormal features.
• On the other hand, it also implies that the residual weights (i.e., 1 − β_i) for abnormal nodes are much lower than those of normal nodes. This perfectly aligns with our motivation to remove abnormal features by reducing their residual connections.

The study on adaptive scores verifies that the adaptive residuals in AMP and AirGNN work as designed. It corroborates that AirGNN not only tremendously boosts the resilience to abnormal features but also provides interpretable information for anomaly detection, which will be useful in many security-critical scenarios since the adaptive score serves as a good indicator of abnormal nodes. Moreover, it is expected that APPNP without residual will perform well on abnormal nodes but sacrifice the performance on normal nodes. We provide a detailed comparison with APPNP w/ Res and APPNP w/o Res in Appendix C.2.3 to show the advantages of the adaptive residual of AirGNN.

Table 4.3: Average adaptive score (β) and residual weight (1 − β) in the noisy feature scenario.
Measure | Cora | CiteSeer | PubMed
Average adaptive score for abnormal nodes | 0.998 ± 0.000 | 0.988 ± 0.000 | 0.996 ± 0.000
Average adaptive score for normal nodes | 0.924 ± 0.002 | 0.807 ± 0.005 | 0.869 ± 0.006
Average residual weight for abnormal nodes | 0.002 ± 0.000 | 0.012 ± 0.000 | 0.004 ± 0.000
Average residual weight for normal nodes | 0.076 ± 0.002 | 0.193 ± 0.005 | 0.131 ± 0.006

Table 4.4: Average adaptive score (β) and residual weight (1 − β) in the adversarial feature scenario.
Measure | Cora | CiteSeer | PubMed
Average adaptive score for abnormal nodes | 0.987 ± 0.000 | 0.930 ± 0.007 | 0.959 ± 0.005
Average adaptive score for normal nodes | 0.922 ± 0.004 | 0.689 ± 0.024 | 0.826 ± 0.016
Average residual weight for abnormal nodes | 0.013 ± 0.000 | 0.070 ± 0.007 | 0.041 ± 0.005
Average residual weight for normal nodes | 0.078 ± 0.004 | 0.311 ± 0.024 | 0.174 ± 0.016

4.4.5 Performance in the Clean Setting
Table 4.5 shows the overall performance when the dataset does not contain abnormal node features. The performance of APPNP and AirGNN is comparable, which supports that AirGNN does not need to sacrifice clean performance for better robustness. AirGNN also outperforms Robust GCN in the clean data setting.

Table 4.5: Comparison between AirGNN, APPNP, and Robust GCN in the clean setting.
Model | Cora | CiteSeer | PubMed
Robust GCN | 0.817 ± 0.005 | 0.710 ± 0.005 | 0.791 ± 0.003
APPNP | 0.842 ± 0.004 | 0.719 ± 0.004 | 0.804 ± 0.003
AirGNN | 0.839 ± 0.004 | 0.726 ± 0.004 | 0.806 ± 0.003

4.5 Related Work
GNNs generalize convolutional neural networks (CNNs) to graph-structured data through the message passing framework [74, 29, 91]. The design of message passing and GNN architectures is mainly motivated in the spectral domain [48, 23] and the spatial domain [33, 109, 91, 29, 24]. Recent works have shown that the message passing in GNNs can be regarded as low-pass graph filtering [84, 126]. More generally, it has been proven that the message passing in many GNNs can be uniformly derived from graph signal denoising [73, 86, 129, 16].
Classic GNNs such as GCN [48] and GAT [109] achieve their best performance with shallow models, but their performance degrades when stacking more layers, which can be partially explained through oversmoothing analyses [55, 85]. Recent works propose to use residual connections or skip connections to mitigate the oversmoothing issues, and they demonstrate the potential benefits from more feature aggregations. Examples include but not limited to DeepGCNs [53], JKNet [121], GCNII [14], APPNP [49] and DeeperGNN [65]. These models use global residual connection that can not be adaptive for each node, which significantly differ from the proposed AirGNN. Graph-level, neighborhood-wise and pair-level smoothness are studied in the framework of graph feature gating networks [39]. Beyond oversmoothing, feature over-correlation in GNNs [38] and automated self-supervised learning for graphs [40] are studied. Recently, there are growing interests in reducing GNNs’ vulnerability to the graph structure noise, such as Robust GCN [128], GCN-SVD [27], Pro-GNN [41], IDGL [18], ElasticGNN [67], etc. Please refer to the comprehensive surveys [37, 132] for more details. However, how to design GNNs with strong resilience to abnormal node features remains to be developed. To the best of our knowledge, AirGNN is the first GNN model that is intrinsically robust to many types of abnormal node features by design. It improves the performance in various kinds of abnormal scenarios without needing to sacrifice clean accuracy in normal settings. 4.6 Conclusion In this work, we discover an intrinsic tension between feature aggregation and residual connection in the message passing scheme of GNNs, as well as the corresponding performance tradeoff between nodes with abnormal and normal features. We analyze possible reasons to explain these findings from the perspective of graph Laplacian smoothing. Our understandings further motivate us to propose a simple, efficient, interpretable and adaptive message passing scheme as well as a new GNN 60 model with adaptive residual, named AirGNN. AirGNN provides a node-wise adaptive transition between feature aggregation and residual connection, and the significant advantages of AirGNN are demonstrated through extensive experiments. In the future, it is promising to study the interaction between multiple perspectives of trustworthy AI [64] such as robustness and fairness [119]. 61 CHAPTER 5 ELASTIC GRAPH NEURAL NETWORKS In this chapter, we discuss a severe limitation of the message passing schemes in existing graph neural networks (GNNs) – they are proven to perform ℓ2 -based graph smoothing that enforces smoothness globally and such a global smoothness property might lead to the lack of robustness under adversarial graph attacks. In this work, we propose to design a more robust message passing algorithm for GNNs by enhancing the local smoothness adaptivity of GNNs via ℓ1 -based graph smoothing. To this end, we introduce a family of GNNs (Elastic GNNs) based on ℓ1 and ℓ2 -based graph smoothing. In particular, we propose a novel and general message passing scheme into GNNs. This message passing algorithm is not only friendly to back-propagation training but also achieves the desired smoothing properties with a theoretical convergence guarantee. Experiments on semi-supervised learning tasks demonstrate that the proposed Elastic GNNs obtain better adaptivity on benchmark datasets and are significantly robust to graph adversarial attacks. 
The implementation of Elastic GNNs is available at https://github.com/lxiaorui/ElasticGNN.

5.1 Introduction
Graph neural networks (GNNs) generalize traditional deep neural networks (DNNs) from regular grids, such as image, video, and text, to irregular data such as social networks, transportation networks, and biological networks, which are typically denoted as graphs [23, 48]. One popular such generalization is the neural message passing framework [29]:

x_u^{(k+1)} = UPDATE^{(k)}(x_u^{(k)}, m_{N(u)}^{(k)}),    (5.1)

where x_u^{(k)} ∈ ℝ^d denotes the feature vector of node u in the k-th iteration of message passing and m_{N(u)}^{(k)} is the message aggregated from u's neighborhood N(u). The specific architecture designs have been motivated from the spectral domain [48, 23] and the spatial domain [33, 109, 91, 29]. A recent study [73] has proven that the message passing schemes in numerous popular GNNs, such as GCN, GAT, PPNP, and APPNP, intrinsically perform ℓ2-based graph smoothing on the graph signal, and they can be considered as solving the graph signal denoising problem:

arg min_F L(F) := ∥F − Xin∥²_F + λ tr(F⊤LF),    (5.2)

where Xin ∈ ℝ^{n×d} is the input signal and L ∈ ℝ^{n×n} is the graph Laplacian matrix encoding the graph structure. The first term guides F to be close to the input signal Xin, while the second term enforces global smoothness on the filtered signal F. The resulting message passing schemes can be derived from different optimization solvers, and they typically entail the aggregation of node features from neighboring nodes, which intuitively coincides with the cluster or consistency assumption that neighboring nodes should be similar [130, 127].

While existing GNNs are prominently driven by ℓ2-based graph smoothing, ℓ2-based methods enforce smoothness globally and the level of smoothness is usually shared across the whole graph. However, the level of smoothness over different regions of the graph can be different. For instance, node features or labels can change significantly between clusters but smoothly within a cluster [131]. Therefore, it is desirable to enhance the local smoothness adaptivity of GNNs. Motivated by the idea of trend filtering [46, 106, 111], we aim to achieve this goal via ℓ1-based graph smoothing. Intuitively, compared with ℓ2-based methods, ℓ1-based methods penalize large values less and thus preserve discontinuities or non-smooth signals better. Theoretically, ℓ1-based methods tend to promote signal sparsity to trade for discontinuity [90, 105, 94]. Owing to these advantages, trend filtering [106] and graph trend filtering [111, 108] demonstrate that ℓ1-based graph smoothing can adapt to an inhomogeneous level of smoothness of signals and yield estimators that are k-th order piecewise polynomial functions, such as piecewise constant, linear and quadratic functions, depending on the order of the graph difference operator.

While ℓ1-based methods exhibit various appealing properties and have been extensively studied in different domains such as signal processing [25] and statistics and machine learning [34], they have rarely been investigated in the design of GNNs. In this work, we attempt to bridge this gap and enhance the local smoothness adaptivity of GNNs via ℓ1-based graph smoothing.

Incorporating ℓ1-based graph smoothing in the design of GNNs faces tremendous challenges. First, since the message passing schemes in GNNs can be derived from the optimization iterations of the graph signal denoising problem, a fast, efficient and scalable optimization solver is desired.
Unfortunately, solving the associated optimization problem involving the ℓ1 norm is challenging since the objective function is composed of smooth and non-smooth components and the decision variable is further coupled by the discrete graph difference operator. Second, to integrate the derived message passing scheme into GNNs, it has to be composed of simple operations that are friendly to the back-propagation training of the whole GNN. Third, it requires an appropriate normalization step to deal with diverse node degrees, which is often overlooked by existing graph total variation and graph trend filtering methods. Our attempt to address these challenges leads to a family of novel GNNs, i.e., Elastic GNNs. Our key contributions can be summarized as follows:
• We introduce ℓ1-based graph smoothing in the design of GNNs to further enhance the local smoothness adaptivity, for the first time;
• We derive a novel and general message passing scheme, i.e., Elastic Message Passing (EMP), and develop a family of GNN architectures, i.e., Elastic GNNs, by integrating the proposed message passing scheme into deep neural nets;
• Extensive experiments demonstrate that Elastic GNNs obtain better adaptivity on various real-world datasets, and they are significantly robust to graph adversarial attacks. The study on different variants of Elastic GNNs suggests that ℓ1 and ℓ2-based graph smoothing are complementary and Elastic GNNs are more versatile.

5.2 Preliminary
We use bold upper-case letters such as X to denote matrices and bold lower-case letters such as x to denote vectors. Given a matrix X ∈ ℝ^{n×d}, we use X_i to denote its i-th row and X_ij to denote its element in the i-th row and j-th column. We define the Frobenius norm, ℓ1 norm, and ℓ21 norm of a matrix X as ∥X∥_F = (Σ_ij X²_ij)^{1/2}, ∥X∥_1 = Σ_ij |X_ij|, and ∥X∥_21 = Σ_i ∥X_i∥_2 = Σ_i (Σ_j X²_ij)^{1/2}, respectively. We define ∥X∥_2 = σ_max(X), where σ_max(X) is the largest singular value of X. Given two matrices X, Y ∈ ℝ^{n×d}, we define the inner product as ⟨X, Y⟩ = tr(X⊤Y).

Let G = {V, E} be a graph with the node set V = {v_1, . . . , v_n} and the undirected edge set E = {e_1, . . . , e_m}. We use N(v_i) to denote the neighboring nodes of node v_i, including v_i itself. Suppose that each node is associated with a d-dimensional feature vector, and the features for all nodes are denoted as Xfea ∈ ℝ^{n×d}. The graph structure G can be represented as an adjacency matrix A ∈ ℝ^{n×n}, where A_ij = 1 when there exists an edge between nodes v_i and v_j. The graph Laplacian matrix is defined as L = D − A, where D is the diagonal degree matrix. Let Δ ∈ {−1, 0, 1}^{m×n} be the oriented incident matrix, which contains one row for each edge. If e_ℓ = (i, j), then the ℓ-th row of Δ is

Δ_ℓ = (0, . . . , −1, . . . , 1, . . . , 0),

with −1 in the i-th position and 1 in the j-th position, where the edge orientation can be arbitrary. Note that the incident matrix and the unnormalized Laplacian matrix satisfy the equivalence L = Δ⊤Δ.
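The oriented incident matrix and the identity L = Δ⊤Δ can be checked with a small NumPy sketch (an added illustration; the helper name and the toy path graph are arbitrary choices):

```python
import numpy as np

def incident_matrix(edges, n):
    """Oriented incident matrix: row l has -1 at node i and +1 at node j for edge e_l = (i, j)."""
    delta = np.zeros((len(edges), n))
    for ell, (i, j) in enumerate(edges):
        delta[ell, i], delta[ell, j] = -1.0, 1.0
    return delta

# Toy check on a path graph over 3 nodes: L = D - A should equal Delta^T Delta.
edges, n = [(0, 1), (1, 2)], 3
delta = incident_matrix(edges, n)
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian D - A
assert np.allclose(delta.T @ delta, L)
```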
Next, we briefly introduce some necessary background on the graph signal denoising perspective of GNNs and the graph trend filtering methods.

5.2.1 GNNs as Graph Signal Denoising
It is evident from recent work [73] that many popular GNNs can be uniformly understood as graph signal denoising with Laplacian smoothing regularization. Here we briefly describe several representative examples.

GCN. The message passing scheme in Graph Convolutional Networks (GCN) [48], Xout = ÃXin, is equivalent to one gradient descent step for minimizing tr(F⊤(I − Ã)F) with the initialization F = Xin and stepsize 1/2. Here à = D̂^{-1/2}ÂD̂^{-1/2} with  = A + I being the adjacency matrix with self-loops, whose degree matrix is D̂.

PPNP & APPNP. The message passing schemes in PPNP and APPNP [49] follow the aggregation rules

Xout = α(I − (1 − α)Ã)^{-1} Xin  and  X^{(k+1)} = (1 − α)ÃX^{(k)} + αXin.

They are shown to be the exact solution and one gradient descent step with stepsize α/2, respectively, for the following problem:

min_F ∥F − Xin∥²_F + (1/α − 1) tr(F⊤(I − Ã)F).    (5.3)

For a more comprehensive illustration, please refer to [73]. We point out that all these message passing schemes adopt ℓ2-based graph smoothing, as the signal differences between neighboring nodes are penalized by the square of the ℓ2 norm, e.g., Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i+1) − F_j/√(d_j+1)∥²_2 with d_i being the node degree of node v_i. The resulting message passing schemes are usually linear smoothers which smooth the input signal by a linear transformation.

5.2.2 Graph Trend Filtering
In the univariate case, the k-th order graph trend filtering (GTF) estimator [111] is given by

arg min_{f∈ℝ^n} ½∥f − x∥²_2 + λ∥Δ^{(k+1)} f∥_1,    (5.4)

where x ∈ ℝ^n is the 1-dimensional input signal of n nodes and Δ^{(k+1)} is a k-th order graph difference operator. When k = 0, it penalizes the absolute differences across neighboring nodes in graph G:

∥Δ^{(1)} f∥_1 = Σ_{(v_i,v_j)∈E} |f_i − f_j|,

where Δ^{(1)} is equivalent to the incident matrix Δ. Generally, k-th order graph difference operators can be defined recursively:

Δ^{(k+1)} = Δ⊤Δ^{(k)} = L^{(k+1)/2} ∈ ℝ^{n×n} for odd k,  and  Δ^{(k+1)} = ΔΔ^{(k)} = ΔL^{k/2} ∈ ℝ^{m×n} for even k.

It is demonstrated that GTF can adapt to inhomogeneity in the level of smoothness of a signal and tends to provide piecewise polynomials over graphs [111]. For instance, when k = 0, the sparsity induced by the ℓ1-based penalty ∥Δ^{(1)} f∥_1 implies that many of the differences f_i − f_j are zero across edges (v_i, v_j) ∈ E in G. The piecewise property originates from the discontinuity of the signal allowed by the less aggressive ℓ1 penalty, with adaptively chosen knot nodes or knot edges. Note that the smoothers induced by GTF are not linear smoothers and cannot be simply represented by a linear transformation of the input signal.

5.3 Algorithm
In this section, we first propose a new graph signal denoising estimator. Then we develop an efficient optimization algorithm for solving the denoising problem and introduce a novel, general and efficient message passing scheme, i.e., Elastic Message Passing (EMP), for graph signal smoothing. Finally, the integration of the proposed message passing scheme and deep neural networks leads to Elastic GNNs.

5.3.1 Elastic Graph Signal Estimator
To combine the advantages of ℓ1 and ℓ2-based graph smoothing, we propose the following elastic graph signal estimator:

arg min_{F∈ℝ^{n×d}} λ_1∥ΔF∥_1 + (λ_2/2) tr(F⊤LF) + ½∥F − Xin∥²_F,    (5.5)

where the first term is denoted g_1(ΔF), the last two terms are denoted f(F), and Xin ∈ ℝ^{n×d} is the d-dimensional input signal of n nodes. The first term can be written in an edge-centric way: ∥ΔF∥_1 = Σ_{(v_i,v_j)∈E} ∥F_i − F_j∥_1, which penalizes the absolute differences across connected nodes in graph G. Similarly, the second term penalizes the differences quadratically via tr(F⊤LF) = Σ_{(v_i,v_j)∈E} ∥F_i − F_j∥²_2. The last term is the fidelity term which preserves the similarity with the input signal. The regularization coefficients λ_1 and λ_2 control the balance between ℓ1 and ℓ2-based graph smoothing.
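For illustration, the edge-centric reading of Eq. (5.5) can be evaluated directly; the sketch below is an assumption-laden example (dense Δ and L as defined in the preliminaries) rather than part of the proposed method, and simply computes the three terms of the objective.

```python
import numpy as np

def elastic_objective(delta: np.ndarray, L: np.ndarray, F: np.ndarray,
                      X_in: np.ndarray, lam1: float, lam2: float) -> float:
    """Evaluate lam1*||Delta F||_1 + (lam2/2)*tr(F^T L F) + 1/2*||F - X_in||_F^2 from Eq. (5.5)."""
    l1_term = lam1 * np.abs(delta @ F).sum()           # sum over edges of ||F_i - F_j||_1
    l2_term = 0.5 * lam2 * np.trace(F.T @ (L @ F))     # (lam2/2) * sum over edges of ||F_i - F_j||_2^2
    fidelity = 0.5 * np.linalg.norm(F - X_in) ** 2     # fidelity to the input signal
    return l1_term + l2_term + fidelity
```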
Remark 11. It is possible to consider higher-order graph differences in both the ℓ1-based and ℓ2-based smoothers. However, in this work, we focus on the 0-th order graph difference operator Δ, since we assume the piecewise constant prior for graph representation learning.

Normalization. In existing GNNs, it is beneficial to normalize the Laplacian matrix for better numerical stability, and the normalization trick is also crucial for achieving superior performance. Therefore, for the ℓ2-based graph smoothing, we follow the common normalization trick in GNNs: L̃ = I − Ã, where à = D̂^{-1/2}ÂD̂^{-1/2}, Â = A + I and D̂_ii = d_i = Σ_j Â_ij. It leads to a degree-normalized penalty

tr(F⊤L̃F) = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥²_2.

In the literature on graph total variation and graph trend filtering, the normalization step is often overlooked and the graph difference operator is used directly, as in GTF [111, 108]. To achieve better numerical stability and handle diverse node degrees in real-world graphs, we propose to normalize each column of the incident matrix by the square root of the node degrees for the ℓ1-based graph smoothing as follows¹:

Δ̃ = ΔD̂^{-1/2}.

It leads to a degree-normalized total variation penalty²

∥Δ̃F∥_1 = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥_1.

Note that this normalized incident matrix maintains the relation with the normalized Laplacian matrix as in the unnormalized case,

L̃ = Δ̃⊤Δ̃,    (5.6)

given that L̃ = D̂^{-1/2}(D̂ − Â)D̂^{-1/2} = D̂^{-1/2}LD̂^{-1/2} = D̂^{-1/2}Δ⊤ΔD̂^{-1/2}.

¹ It naturally supports real-valued edge weights if the edge weights are set in the incident matrix Δ.
² With the normalization, the piecewise constant prior is up to the degree scaling, i.e., sparsity in Δ̃F.

With the normalization, the estimator defined in (5.5) becomes:

arg min_{F∈ℝ^{n×d}} λ_1∥Δ̃F∥_1 + (λ_2/2) tr(F⊤L̃F) + ½∥F − Xin∥²_F,    (5.7)

where the first term is g_1(Δ̃F) and the last two terms are f(F).

Capture correlation among dimensions. The node features in real-world graphs are usually multi-dimensional. Although the estimator defined in (5.7) is able to handle multi-dimensional data, since the signals from different dimensions are separable under the ℓ1 and ℓ2 norms, such an estimator treats each feature dimension independently and does not exploit the potential relations between feature dimensions. However, the sparsity patterns of node differences across edges could be shared among feature dimensions. To better exploit this potential correlation, we propose to couple the multi-dimensional features by the ℓ21 norm, which penalizes the summation of the ℓ2 norms of the node differences:

∥Δ̃F∥_21 = Σ_{(v_i,v_j)∈E} ∥F_i/√(d_i + 1) − F_j/√(d_j + 1)∥_2.

This penalty promotes the row sparsity of Δ̃F and enforces similar sparsity patterns among feature dimensions. In other words, if two nodes are similar, all their feature dimensions should be similar. Therefore, we define the ℓ21-based estimator as

arg min_{F∈ℝ^{n×d}} λ_1∥Δ̃F∥_21 + (λ_2/2) tr(F⊤L̃F) + ½∥F − Xin∥²_F,    (5.8)

where the first term is g_21(Δ̃F) with g_21(·) = λ_1∥·∥_21, and the last two terms are f(F). In the following subsections, we will use g(·) to represent both g_1(·) and g_21(·), and use ℓ1 to represent both ℓ1 and ℓ21 if not specified.
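A minimal sketch of the normalization above is given below (dense matrices for clarity; the edge enumeration uses an arbitrary but fixed orientation, and the helper name is made up for this illustration).

```python
import numpy as np

def normalized_operators(A: np.ndarray):
    """Return A_tilde = D_hat^{-1/2} A_hat D_hat^{-1/2}, L_tilde = I - A_tilde,
    and the degree-normalized incident matrix Delta_tilde = Delta D_hat^{-1/2}."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                  # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    L_tilde = np.eye(n) - A_tilde
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] != 0]
    delta = np.zeros((len(edges), n))
    for ell, (i, j) in enumerate(edges):
        delta[ell, i], delta[ell, j] = -1.0, 1.0
    delta_tilde = delta * d_inv_sqrt[None, :]              # column-wise degree normalization
    return A_tilde, L_tilde, delta_tilde
```

For an undirected, unweighted graph, `np.allclose(L_tilde, delta_tilde.T @ delta_tilde)` should hold, matching Eq. (5.6).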
5.3.2 Elastic Message Passing
For the ℓ2-based graph smoothers, message passing schemes can be derived from the gradient descent iterations of the graph signal denoising problem, as in the case of GCN and APPNP [73]. However, computing the estimators defined by (5.7) and (5.8) is much more challenging because of the non-smoothness, and because the two components, i.e., f(F) and g(Δ̃F), are non-separable as they are coupled by the graph difference operator Δ̃. In the literature, researchers have developed optimization algorithms for the graph trend filtering problem (5.4), such as the Alternating Direction Method of Multipliers (ADMM) and Newton-type algorithms [111, 108]. However, these algorithms require solving a non-trivial sub-problem in each iteration, which incurs high computation complexity. Moreover, it is unclear how to make these iterations compatible with the back-propagation training of deep learning models. This motivates us to design an algorithm which is not only efficient but also friendly to back-propagation training. To this end, we propose to solve an equivalent saddle point problem using a primal-dual algorithm with efficient computations.

Saddle point reformulation. For a general convex function g(·), its conjugate function is defined as

g*(Z) := sup_X ⟨Z, X⟩ − g(X).

By using g(Δ̃F) = sup_Z ⟨Δ̃F, Z⟩ − g*(Z), the problems (5.7) and (5.8) can be equivalently written as the following saddle point problem:

min_F max_Z f(F) + ⟨Δ̃F, Z⟩ − g*(Z),    (5.9)

where Z ∈ ℝ^{m×d}. Motivated by the Proximal Alternating Predictor-Corrector (PAPC) [70, 15], we propose an efficient algorithm with low per-iteration computation complexity and a convergence guarantee:

F̄^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^k,    (5.10)
Z^{k+1} = prox_{βg*}(Z^k + βΔ̃F̄^{k+1}),    (5.11)
F^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^{k+1},    (5.12)

where prox_{βg*}(X) = arg min_Y ½∥Y − X∥²_F + βg*(Y). The stepsizes γ and β will be specified later. The first step (5.10) obtains a prediction of F^{k+1}, i.e., F̄^{k+1}, by a gradient descent step on the primal variable F^k. The second step (5.11) is a proximal dual ascent step on the dual variable Z^k based on the predicted F̄^{k+1}. Finally, another gradient descent step on the primal variable based on (F^k, Z^{k+1}) gives the next iterate F^{k+1} in (5.12). The algorithm (5.10)–(5.12) can be interpreted as a "predict-correct" algorithm for the saddle point problem (5.9). Next we demonstrate how to compute the proximal operator in Eq. (5.11).

Proximal operators. Using Moreau's decomposition principle [6], X = prox_{βg*}(X) + β prox_{(1/β)g}(X/β), we can rewrite the step (5.11) using the proximal operator of g(·), that is,

prox_{βg*}(X) = X − β prox_{(1/β)g}(X/β).    (5.13)

Figure 5.1: Elastic Message Passing (EMP), with F^0 = Xin and Z^0 = 0_{m×d}.
Y^k = γXin + (1 − γ)ÃF^k
F̄^{k+1} = Y^k − γΔ̃⊤Z^k
Z̄^{k+1} = Z^k + βΔ̃F̄^{k+1}
Z^{k+1} = min(|Z̄^{k+1}|, λ_1) · sign(Z̄^{k+1})    (Option I: ℓ1 norm)
Z_i^{k+1} = min(∥Z̄_i^{k+1}∥_2, λ_1) · Z̄_i^{k+1}/∥Z̄_i^{k+1}∥_2, ∀i ∈ [m]    (Option II: ℓ21 norm)
F^{k+1} = Y^k − γΔ̃⊤Z^{k+1}

We discuss the two options for the function g(·) corresponding to the objectives (5.7) and (5.8).
• Option I (ℓ1 norm): g_1(X) = λ_1∥X∥_1. By definition, the proximal operator of (1/β)g_1(X) is

prox_{(1/β)g_1}(X) = arg min_Y ½∥Y − X∥²_F + (λ_1/β)∥Y∥_1,

which is equivalent to the soft-thresholding operator (component-wise):

(S_{λ_1/β}(X))_ij = sign(X_ij) max(|X_ij| − λ_1/β, 0) = X_ij − sign(X_ij) min(|X_ij|, λ_1/β).

Therefore, using (5.13), we have

(prox_{βg_1*}(X))_ij = sign(X_ij) min(|X_ij|, λ_1),    (5.14)

which is a component-wise projection onto the ℓ∞ ball of radius λ_1.
• Option II (ℓ21 norm): g_21(X) = λ_1∥X∥_21. By definition, the proximal operator of (1/β)g_21(X) is

prox_{(1/β)g_21}(X) = arg min_Y ½∥Y − X∥²_F + (λ_1/β)∥Y∥_21,

with the i-th row being

(prox_{(1/β)g_21}(X))_i = X_i/∥X_i∥_2 · max(∥X_i∥_2 − λ_1/β, 0).

Similarly, using (5.13), we have the i-th row of prox_{βg_21*}(X) being

(prox_{βg_21*}(X))_i = X_i − β prox_{(1/β)g_21}(X/β)_i
= X_i − β max(∥X_i/β∥_2 − λ_1/β, 0) · (X_i/β)/∥X_i/β∥_2
= X_i − max(∥X_i∥_2 − λ_1, 0) · X_i/∥X_i∥_2
= (∥X_i∥_2 − max(∥X_i∥_2 − λ_1, 0)) · X_i/∥X_i∥_2
= min(∥X_i∥_2, λ_1) · X_i/∥X_i∥_2,    (5.15)

which is a row-wise projection onto the ℓ2 ball of radius λ_1.

Note that the proximal operator in the ℓ1 norm case treats each feature dimension independently, while in the ℓ21 norm case it couples the multi-dimensional features, which is consistent with the motivation to exploit the correlation among feature dimensions. The algorithm (5.10)–(5.12) and the proximal operators (5.14) and (5.15) enable us to derive the final message passing scheme. Note that the computation F^k − γ∇f(F^k) in steps (5.10) and (5.12) can be shared to save computation. Therefore, we decompose the step (5.10) into two steps:

Y^k = F^k − γ∇f(F^k) = ((1 − γ)I − γλ_2L̃)F^k + γXin,    (5.16)
F̄^{k+1} = Y^k − γΔ̃⊤Z^k.    (5.17)

In this work, we choose γ = 1/(1 + λ_2) and β = 1/(2γ). Therefore, with L̃ = I − Ã, Eq. (5.16) can be simplified as

Y^k = γXin + (1 − γ)ÃF^k.    (5.18)

Let Z̄^{k+1} := Z^k + βΔ̃F̄^{k+1}; then steps (5.11) and (5.12) become

Z^{k+1} = prox_{βg*}(Z̄^{k+1}),    (5.19)
F^{k+1} = F^k − γ∇f(F^k) − γΔ̃⊤Z^{k+1} = Y^k − γΔ̃⊤Z^{k+1}.    (5.20)

Substituting the proximal operators in (5.19) with (5.14) and (5.15), we obtain the complete elastic message passing scheme (EMP) as summarized in Figure 5.1.

Interpretation of EMP. EMP can be interpreted as the standard message passing (MP) (the step producing Y in Figure 5.1) with extra operations (the following steps). The extra operations compute Δ̃⊤Z to adjust the standard MP such that sparsity in Δ̃F is promoted and some large node differences can be preserved. EMP is general and covers some existing propagation rules as special cases, as demonstrated in Remark 12.

Remark 12 (Special cases). If there is only ℓ2-based regularization, i.e., λ_1 = 0, then according to the projection operator, we have Z^k = 0_{m×d}. Therefore, with γ = 1/(1 + λ_2), the proposed message passing scheme reduces to

F^{k+1} = 1/(1 + λ_2) Xin + λ_2/(1 + λ_2) ÃF^k.

If λ_2 = 1/α − 1, it recovers the message passing in APPNP: F^{k+1} = αXin + (1 − α)ÃF^k. If λ_2 = ∞, it recovers the simple aggregation operation in many GNNs: F^{k+1} = ÃF^k.

Computation complexity. EMP is efficient and composed of simple operations. The major computation cost comes from four sparse matrix multiplications, including ÃF^k, Δ̃⊤Z^k, Δ̃F̄^{k+1} and Δ̃⊤Z^{k+1}. The computation complexity is in the order O(md), where m is the number of edges in graph G and d is the feature dimension of the input signal Xin. The other operations are simple matrix additions and projections.
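Putting the pieces together, the following is a minimal NumPy sketch of the EMP iterations in Figure 5.1 (dense matrices, the stepsizes γ = 1/(1 + λ2) and β = 1/(2γ) chosen above, and both projection options); it illustrates the scheme rather than reproducing the released implementation.

```python
import numpy as np

def emp_propagate(A_tilde, delta_tilde, X_in, lam1, lam2, K, option="l21"):
    """Run K iterations of Elastic Message Passing (Figure 5.1)."""
    gamma = 1.0 / (1.0 + lam2)
    beta = 1.0 / (2.0 * gamma)
    m, d = delta_tilde.shape[0], X_in.shape[1]
    F = X_in.copy()
    Z = np.zeros((m, d))
    for _ in range(K):
        Y = gamma * X_in + (1.0 - gamma) * (A_tilde @ F)      # standard MP step
        F_bar = Y - gamma * (delta_tilde.T @ Z)               # primal prediction
        Z_bar = Z + beta * (delta_tilde @ F_bar)              # dual ascent
        if option == "l1":
            # Option I: component-wise projection onto the l_inf ball of radius lam1.
            Z = np.sign(Z_bar) * np.minimum(np.abs(Z_bar), lam1)
        else:
            # Option II: row-wise projection onto the l2 ball of radius lam1.
            row_norm = np.linalg.norm(Z_bar, axis=1, keepdims=True)
            Z = np.minimum(row_norm, lam1) / np.maximum(row_norm, 1e-12) * Z_bar
        F = Y - gamma * (delta_tilde.T @ Z)                   # corrected primal update
    return F
```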
The convergence of EMP and the parameter settings are justified by Theorem 6.

Theorem 6 (Convergence of EMP). Under the stepsize setting γ < 2/(1 + λ_2∥L̃∥_2) and β ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2), the elastic message passing scheme (EMP) in Figure 5.1 converges to the optimal solution of the elastic graph signal estimator defined in (5.7) (Option I) or (5.8) (Option II). It is sufficient to choose any γ < 2/(1 + 2λ_2) and β ≤ 2/(3γ) since ∥L̃∥_2 = ∥Δ̃⊤Δ̃∥_2 = ∥Δ̃Δ̃⊤∥_2 ≤ 2.

Proof. We first consider the general problem

min_F f(F) + g(BF),    (5.21)

where f and g are convex functions and B is a bounded linear operator. It is proved in [70, 15] that the iterations in (5.10)–(5.12) guarantee the convergence of F^k to the optimal solution of the minimization problem (5.21) if the parameters satisfy γ < 2/L and β ≤ 1/(γλ_max(BB⊤)), where L is the Lipschitz constant of ∇f(F). These conditions are further relaxed to γ < 2/L and β ≤ 4/(3γλ_max(BB⊤)) in [60]. For the specific problems defined in (5.7) and (5.8), the two function components f and g are both convex, and the linear operator Δ̃ is bounded. The Lipschitz constant of ∇f(F) can be computed from the largest eigenvalue of the Hessian matrix of f(F):

L = λ_max(∇²f(F)) = λ_max(I + λ_2L̃) = 1 + λ_2∥L̃∥_2.

Therefore, the elastic message passing scheme derived from iterations (5.10)–(5.12) is guaranteed to converge to the optimal solution of problem (5.7) (Option I) or problem (5.8) (Option II) if the stepsizes satisfy γ < 2/(1 + λ_2∥L̃∥_2) and β ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2).

Let Δ̃ = UΣV⊤ be the singular value decomposition of Δ̃; then we derive

∥Δ̃Δ̃⊤∥_2 = ∥UΣV⊤VΣU⊤∥_2 = ∥UΣ²U⊤∥_2 = ∥VΣ²V⊤∥_2 = ∥VΣU⊤UΣV⊤∥_2 = ∥Δ̃⊤Δ̃∥_2.

The equivalence L̃ = Δ̃⊤Δ̃ in (5.6) further gives ∥L̃∥_2 = ∥Δ̃⊤Δ̃∥_2 = ∥Δ̃Δ̃⊤∥_2. Since ∥L̃∥_2 ≤ 2 [19], we have 2/(1 + 2λ_2) ≤ 2/(1 + λ_2∥L̃∥_2) and 2/(3γ) ≤ 4/(3γ∥Δ̃Δ̃⊤∥_2). Therefore, γ < 2/(1 + 2λ_2) and β ≤ 2/(3γ) are sufficient for the convergence of EMP.

5.3.3 Elastic GNNs
Incorporating the elastic message passing scheme derived from the elastic graph signal estimators (5.7) and (5.8) into deep neural networks, we introduce a family of GNNs, namely Elastic GNNs. In this work, we follow the decoupled design proposed in APPNP [49], where we first make predictions from node features and then aggregate the predictions through the proposed EMP:

Ypre = EMP(h_θ(Xfea), K, λ_1, λ_2).    (5.22)

Xfea ∈ ℝ^{n×d} denotes the node features, h_θ(·) is any machine learning model, such as a multilayer perceptron (MLP), θ denotes the learnable parameters in the model, and K is the number of message passing steps (a small sketch of this forward pass is given after the list below). The training objective is the cross-entropy loss defined by the final prediction Ypre and the labels of the training data. Elastic GNNs also have the following nice properties:
• In addition to the backbone neural network model, Elastic GNNs only require setting three hyperparameters, i.e., the two coefficients λ_1, λ_2 and the propagation step K, and they do not introduce any learnable parameters. Therefore, they reduce the risk of overfitting.
• The hyperparameters λ_1 and λ_2 provide better smoothness adaptivity to Elastic GNNs depending on the smoothness properties of the graph data.
• The message passing scheme only entails simple and efficient operations, which makes it friendly to the efficient and end-to-end back-propagation training of the whole GNN model.
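As referenced above, here is a minimal sketch of the decoupled forward pass in Eq. (5.22), reusing the `emp_propagate` sketch given earlier; `h_theta` is assumed to be any callable feature transformation (e.g., a two-layer MLP producing class scores), and the function name is hypothetical.

```python
import numpy as np

def elastic_gnn_forward(h_theta, X_fea, A_tilde, delta_tilde, K, lam1, lam2):
    """Decoupled Elastic GNN: transform features with h_theta, then run K steps of EMP."""
    X_in = h_theta(X_fea)                 # prediction from node features
    return emp_propagate(A_tilde, delta_tilde, X_in, lam1, lam2, K, option="l21")
```

In practice every operation here is differentiable, so h_theta can be trained end-to-end with the cross-entropy loss mentioned above.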
5.4 Experiment
In this section, we conduct experiments to validate the effectiveness of the proposed Elastic GNNs. We first introduce the experimental settings. Then we assess the performance of Elastic GNNs and investigate the benefits of introducing ℓ1-based graph smoothing into GNNs with semi-supervised learning tasks under normal and adversarial settings. In the ablation study, we validate the local adaptive smoothness, sparsity pattern, and convergence of EMP.

5.4.1 Experimental Settings
Datasets. We conduct experiments on 8 real-world datasets including three citation graphs, i.e., Cora, Citeseer, Pubmed [93], two co-authorship graphs, i.e., Coauthor CS and Coauthor Physics [95], two co-purchase graphs, i.e., Amazon Computers and Amazon Photo [95], and one blog graph, i.e., Polblogs [2]. In the Polblogs graph, node features are not available, so we set the feature matrix to be an n × n identity matrix. The data statistics for the benchmark datasets used in Section 5.4.2 are summarized in Table 5.1. The data statistics for the adversarially attacked graphs used in Section 5.4.3 are summarized in Table 5.2.

Table 5.1: Statistics of benchmark datasets.
Dataset | Classes | Nodes | Edges | Features | Training Nodes | Validation Nodes | Test Nodes
Cora | 7 | 2708 | 5278 | 1433 | 20 per class | 500 | 1000
CiteSeer | 6 | 3327 | 4552 | 3703 | 20 per class | 500 | 1000
PubMed | 3 | 19717 | 44324 | 500 | 20 per class | 500 | 1000
Coauthor CS | 15 | 18333 | 81894 | 6805 | 20 per class | 30 per class | Rest nodes
Coauthor Physics | 5 | 34493 | 247962 | 8415 | 20 per class | 30 per class | Rest nodes
Amazon Computers | 10 | 13381 | 245778 | 767 | 20 per class | 30 per class | Rest nodes
Amazon Photo | 8 | 7487 | 119043 | 745 | 20 per class | 30 per class | Rest nodes

Table 5.2: Dataset statistics for adversarially attacked graphs.
Dataset | NLCC | ELCC | Classes | Features
Cora | 2,485 | 5,069 | 7 | 1,433
CiteSeer | 2,110 | 3,668 | 6 | 3,703
Polblogs | 1,222 | 16,714 | 2 | /
PubMed | 19,717 | 44,338 | 3 | 500

Baselines. We compare the proposed Elastic GNNs with representative GNNs including GCN [48], GAT [109], ChebNet [23], GraphSAGE [33], APPNP [49] and SGC [115]. For all models, we use 2-layer neural networks with 64 hidden units.

Parameter settings. For each experiment, we report the average performance and the variance over 10 runs. For all methods, hyperparameters are tuned from the following search space: 1) learning rate: {0.05, 0.01, 0.005}; 2) weight decay: {5e-4, 5e-5, 5e-6}; 3) dropout rate: {0.5, 0.8}. For APPNP, the propagation step K is tuned from {5, 10} and the parameter α is tuned from {0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0}. For Elastic GNNs, the propagation step K is tuned from {5, 10} and the parameters λ_1 and λ_2 are tuned from {0, 3, 6, 9}. As suggested by Theorem 6, we set γ = 1/(1 + λ_2) and β = 1/(2γ) in the proposed elastic message passing scheme. The Adam optimizer [47] is used in all experiments.

5.4.2 Performance on Benchmark Datasets
On commonly used datasets including Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo, we compare the performance of the proposed Elastic GNN (ℓ21 + ℓ2) with representative GNN baselines on the semi-supervised learning task. The classification accuracies are shown in Table 5.3. From these results, we can make the following observations:
• Elastic GNN outperforms GCN, GAT, ChebNet, GraphSAGE and SGC by significant margins on all datasets. For instance, Elastic GNN improves over GCN by 3.1%, 2.0% and 1.8% on the Cora, CiteSeer and PubMed datasets. The improvement comes from the global and local smoothness adaptivity of Elastic GNN.
• Elastic GNN (ℓ21 + ℓ2) consistently achieves higher performance than APPNP on all datasets. Essentially, Elastic GNN covers APPNP as a special case when there is only ℓ2 regularization, i.e., λ_1 = 0. Beyond the ℓ2-based graph smoothing, the ℓ21-based graph smoothing further enhances the local smoothness adaptivity. This comparison verifies the benefits of introducing ℓ21-based graph smoothing in GNNs.
Table 5.3: Classification accuracy (%) on benchmark datasets with 10 random data splits.
Model | Cora | CiteSeer | PubMed | CS | Physics | Computers | Photo
ChebNet | 76.3 ± 1.5 | 67.4 ± 1.5 | 75.0 ± 2.0 | 91.8 ± 0.4 | OOM | 81.0 ± 2.0 | 90.4 ± 1.0
GCN | 79.6 ± 1.1 | 68.9 ± 1.2 | 77.6 ± 2.3 | 91.6 ± 0.6 | 93.3 ± 0.8 | 79.8 ± 1.6 | 90.3 ± 1.2
GAT | 80.1 ± 1.2 | 68.9 ± 1.8 | 77.6 ± 2.2 | 91.1 ± 0.5 | 93.3 ± 0.7 | 79.3 ± 2.4 | 89.6 ± 1.6
SGC | 80.2 ± 1.5 | 68.9 ± 1.3 | 75.5 ± 2.9 | 90.1 ± 1.3 | 93.1 ± 0.6 | 73.0 ± 2.0 | 83.5 ± 2.9
APPNP | 82.2 ± 1.3 | 70.4 ± 1.2 | 78.9 ± 2.2 | 92.5 ± 0.3 | 93.7 ± 0.7 | 80.1 ± 2.1 | 90.8 ± 1.3
GraphSAGE | 79.0 ± 1.1 | 67.5 ± 2.0 | 77.6 ± 2.0 | 91.7 ± 0.5 | 92.5 ± 0.8 | 80.7 ± 1.7 | 90.9 ± 1.0
ElasticGNN | 82.7 ± 1.0 | 70.9 ± 1.4 | 79.4 ± 1.8 | 92.5 ± 0.3 | 94.2 ± 0.5 | 80.7 ± 1.8 | 91.3 ± 1.3

Table 5.4: Classification accuracy (%) under different perturbation rates of adversarial graph attack. GCN and GAT are basic GNNs; ℓ2, ℓ1, ℓ21, ℓ1 + ℓ2 and ℓ21 + ℓ2 are variants of Elastic GNN.
Dataset | Ptb Rate | GCN | GAT | ℓ2 | ℓ1 | ℓ21 | ℓ1 + ℓ2 | ℓ21 + ℓ2
Cora | 0% | 83.5±0.4 | 84.0±0.7 | 85.8±0.4 | 85.1±0.5 | 85.3±0.4 | 85.8±0.4 | 85.8±0.4
Cora | 5% | 76.6±0.8 | 80.4±0.7 | 81.0±1.0 | 82.3±1.1 | 81.6±1.1 | 81.9±1.4 | 82.2±0.9
Cora | 10% | 70.4±1.3 | 75.6±0.6 | 76.3±1.5 | 76.2±1.4 | 77.9±0.9 | 78.2±1.6 | 78.8±1.7
Cora | 15% | 65.1±0.7 | 69.8±1.3 | 72.2±0.9 | 73.3±1.3 | 75.7±1.2 | 76.9±0.9 | 77.2±1.6
Cora | 20% | 60.0±2.7 | 59.9±0.6 | 67.7±0.7 | 63.7±0.9 | 70.3±1.1 | 67.2±5.3 | 70.5±1.3
Citeseer | 0% | 72.0±0.6 | 73.3±0.8 | 73.6±0.9 | 73.2±0.6 | 73.2±0.5 | 73.6±0.6 | 73.8±0.6
Citeseer | 5% | 70.9±0.6 | 72.9±0.8 | 72.8±0.5 | 72.8±0.5 | 72.8±0.5 | 73.3±0.6 | 72.9±0.5
Citeseer | 10% | 67.6±0.9 | 70.6±0.5 | 70.2±0.6 | 70.8±0.6 | 70.7±1.2 | 72.4±0.9 | 72.6±0.4
Citeseer | 15% | 64.5±1.1 | 69.0±1.1 | 70.2±0.6 | 68.1±1.4 | 68.2±1.1 | 71.3±1.5 | 71.9±0.7
Citeseer | 20% | 62.0±3.5 | 61.0±1.5 | 64.9±1.0 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8 | 64.7±0.8
Polblogs | 0% | 95.7±0.4 | 95.4±0.2 | 95.4±0.2 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3 | 95.8±0.3
Polblogs | 5% | 73.1±0.8 | 83.7±1.5 | 82.8±0.3 | 78.7±0.6 | 78.7±0.7 | 82.8±0.4 | 83.0±0.3
Polblogs | 10% | 70.7±1.1 | 76.3±0.9 | 73.7±0.3 | 75.2±0.4 | 75.3±0.7 | 81.5±0.2 | 81.6±0.3
Polblogs | 15% | 65.0±1.9 | 68.8±1.1 | 68.9±0.9 | 72.1±0.9 | 71.5±1.1 | 77.8±0.9 | 78.7±0.5
Polblogs | 20% | 51.3±1.2 | 51.5±1.6 | 65.5±0.7 | 68.1±0.6 | 68.7±0.7 | 77.4±0.2 | 77.5±0.2
Pubmed | 0% | 87.2±0.1 | 83.7±0.4 | 88.1±0.1 | 86.7±0.1 | 87.3±0.1 | 88.1±0.1 | 88.1±0.1
Pubmed | 5% | 83.1±0.1 | 78.0±0.4 | 87.1±0.2 | 86.2±0.1 | 87.0±0.1 | 87.1±0.2 | 87.1±0.2
Pubmed | 10% | 81.2±0.1 | 74.9±0.4 | 86.6±0.1 | 86.0±0.2 | 86.9±0.2 | 86.3±0.1 | 87.0±0.1
Pubmed | 15% | 78.7±0.1 | 71.1±0.5 | 85.7±0.2 | 85.4±0.2 | 86.4±0.2 | 85.5±0.1 | 86.4±0.2
Pubmed | 20% | 77.4±0.2 | 68.2±1.0 | 85.8±0.1 | 85.4±0.1 | 86.4±0.1 | 85.4±0.1 | 86.4±0.1

5.4.3 Robustness Under Adversarial Attack
Locally adaptive smoothness makes Elastic GNNs more robust to adversarial attacks on the graph structure. This is because the attack tends to connect nodes with different labels, which fuzzes the cluster structure in the graph. But EMP can tolerate large node differences along these wrong edges and maintain the smoothness along correct edges. To validate this, we evaluate the performance of Elastic GNNs under untargeted adversarial graph attacks, which try to degrade GNN models' overall performance by deliberately modifying the graph structure. We use MetaAttack [135] implemented in DeepRobust [58]³, a PyTorch library for adversarial attacks and defenses, to generate the adversarially attacked graphs based on four datasets including Cora, CiteSeer, Polblogs and PubMed. We randomly split 10%/10%/80% of the nodes for training, validation and test. Note that following the works [134, 135, 27, 41], we only consider the largest connected component (LCC) in the adversarial graphs. Therefore, the results in Table 5.4 are not directly comparable with the results in Table 5.3. We focus on investigating the robustness introduced by ℓ1-based graph smoothing rather than on adversarial defense, so we don't compare with defense strategies.
Existing defense strategies can be applied on top of Elastic GNNs to further improve the robustness against attacks.

Variants of Elastic GNNs. To make a deeper investigation of Elastic GNNs, we consider the following variants: (1) ℓ2 (λ_1 = 0); (2) ℓ1 (λ_2 = 0, Option I); (3) ℓ21 (λ_2 = 0, Option II); (4) ℓ1 + ℓ2 (Option I); (5) ℓ21 + ℓ2 (Option II). To save computation, we fix the learning rate as 0.01, weight decay as 0.0005, dropout rate as 0.5 and K = 10 since this setting works well for the chosen datasets and models. Only λ_1 and λ_2 are tuned. The classification accuracy under different perturbation rates ranging from 0% to 20% is summarized in Table 5.4. From the results, we can make the following observations:
• All variants of Elastic GNNs outperform GCN and GAT by significant margins under all perturbation rates. For instance, when the perturbation rate is 15%, Elastic GNN (ℓ21 + ℓ2) improves over GCN by 12.1%, 7.4%, 13.7% and 7.7% on the four datasets being considered. This is because Elastic GNN can adapt to the change of smoothness while GCN and GAT cannot adapt well when the perturbation rate increases.
• ℓ21 outperforms ℓ1 in most cases, and ℓ21 + ℓ2 outperforms ℓ1 + ℓ2 in almost all cases. This demonstrates the benefits of exploiting the correlation between feature channels by coupling multi-dimensional features via the ℓ21 norm.
• ℓ21 outperforms ℓ2 in most cases, which suggests the benefits of local smoothness adaptivity. When ℓ21 and ℓ2 are combined, Elastic GNN (ℓ21 + ℓ2) achieves significantly better performance than the ℓ2, ℓ21 or ℓ1 variant alone in almost all cases. It suggests that ℓ1 and ℓ2-based graph smoothing are complementary to each other, and combining them provides significantly better robustness against adversarial graph attacks.

5.4.4 Ablation Study
We provide an ablation study to further investigate the adaptive smoothness, sparsity pattern, and convergence of EMP in Elastic GNN, based on three datasets including Cora, CiteSeer and PubMed. In this section, we fix λ_1 = 3, λ_2 = 3 for Elastic GNN, and α = 0.1 for APPNP. We fix the learning rate as 0.01, weight decay as 0.0005 and dropout rate as 0.5 since this setting works well for both methods.

Adaptive smoothness. It is expected that ℓ1-based smoothing enhances local smoothness adaptivity by increasing the smoothness along correct edges (connecting nodes with the same labels) while lowering the smoothness along wrong edges (connecting nodes with different labels). To validate this, we compute the average adjacent node differences (based on node features in the last layer) along wrong and correct edges separately, and use the ratio between these two averages to measure the smoothness adaptivity. The results are summarized in Table 5.5. It is clearly observed that for all datasets, the ratio for ElasticGNN is significantly higher than for an ℓ2-based method such as APPNP, which validates its better local smoothness adaptivity.

Sparsity pattern. To validate the piecewise constant property enforced by EMP, we also investigate the sparsity pattern in the adjacent node differences, i.e., Δ̃F, based on node features in the last layer. The node difference along edge e_i is defined as sparse if ∥(Δ̃F)_i∥_2 < 0.1. The sparsity ratios for the ℓ2-based method (APPNP) and the ℓ1-based method (Elastic GNN) are summarized in Table 5.6. It can be observed that in Elastic GNN, a significant portion of Δ̃F is sparse for all datasets, while in APPNP this portion is much smaller.

³ https://github.com/DSE-MSU/DeepRobust
This sparsity pattern validates the 80 piecewise constant prior as designed. Table 5.5: Ratio between average node differences along wrong and correct edges. Model Cora CiteSeer PubMed ℓ2 (APPNP) 1.57 1.35 1.43 ℓ21 +ℓ2 (ElasticGNN) 2.03 1.94 1.79 Table 5.6: Sparsity ratio (i.e., ∥( Δ̃F)𝑖 ∥ 2 < 0.1) in node differences Δ̃F. Model Cora CiteSeer PubMed ℓ2 (APPNP) 2% 16% 11% ℓ21 +ℓ2 (ElasticGNN) 37% 74% 42% Convergence of EMP. We provide two additional experiments to demonstrate the impact of propagation step 𝐾 on classification performance and the convergence of message passing scheme. Figure 5.2 shows that the increase of classification accuracy when the propagation step 𝐾 increases. It verifies the effectiveness of EMP in improving graph representation learning. It also shows that a small number of propagation step can achieve very good performance, and therefore the computation cost for EMP can be small. Figure 5.3 shows the decreasing of the objective value defined in Eq. (5.8) during the forward message passing process, and it verifies the convergence of the proposed EMP as suggested by Theorem 6. 5.5 Related Work The design of GNN architectures can be majorly motivated in spectral domain [48, 23] and spatial domain [33, 109, 91, 29]. The message passing scheme [29, 74] for feature aggregation is one central component of GNNs. Recent works have proven that the message passing in GNNs can be regarded as low-pass graph filters [84, 126]. Generally, it is recently proved that message passing in many GNNs can be unified in the graph signal denosing framework [73, 86, 129, 16]. We point out that they intrinsically perform ℓ2 -based graph smoothing and typically can be represented as linear smoothers. 81 84 82 Test Accuracy (%) 80 Cora (ElasticGNN) Cora (APPNP) 78 CiteSeer (ElasticGNN) CiteSeer (APPNP) 76 PubMed (ElasticGNN) PubMed (APPNP) 74 72 70 2 4 6 8 10 12 Step K Figure 5.2: Classification accuracy under different propagation steps. 1500 Cora Value 1000 500 1000 CiteSeer Value 500 20000 PubMed Value 10000 0 2 4 6 8 10 12 14 Step K Figure 5.3: Convergence of the objective value for the problem in Eq. (5.8) during message passing. ℓ1 -based graph signal denoising has been explored in graph trend filtering [111, 108] which tends to provide estimators with 𝑘-th order piecewise polynomials over graphs. Graph total variation has also been utilized in semi-supervised learning [83, 44, 43, 5], spectral clustering [12, 11] and graph cut problems [100, 10]. However, it is unclear whether these algorithms can be used to design GNNs. To the best of our knowledge, we make first such investigation in this work. 82 5.6 Conclusion In this work, we propose to enhance the smoothness adaptivity of GNNs via ℓ1 and ℓ2 -based graph smoothing. Through the proposed elastic graph signal estimator, we derive a novel, efficient and general message passing scheme, i.e., elastic message passing (EMP). Integrating the proposed message passing scheme and deep neural networks leads to a family of GNNs, i.e., Elastic GNNs. Extensitve experiments on benchmark datasets and adversarially attacked graphs demonstrate the benefits (e.g., intrinsic robustness) of introducing ℓ1 -based graph smoothing in the design of GNNs. The empirical study suggests that ℓ1 and ℓ2 -based graph smoothing is complementary to each other, and the proposed Elastic GNNs has better smoothnesss adaptivity owning to the integration of ℓ1 and ℓ2 -based graph smoothing. 
We hope the proposed elastic message passing scheme can inspire more powerful GNN architecture designs and that more general smoothness assumptions, such as low homophily [72], can be explored in the future. In addition, we also demonstrate the significant advantages of elastic message passing (EMP) in capturing reliable user-item interactions in noisy recommendation systems through the proposed framework of Graph Trend Filtering Networks [28].

CHAPTER 6

CONCLUSION

In this chapter, we summarize the research results in this dissertation and their broader impact, and discuss promising research directions.

6.1 Summary

In this dissertation, we proposed four solutions to address the efficiency and security challenges in machine learning: (1) a centralized distributed optimization algorithm with bidirectional communication compression; (2) a decentralized distributed optimization algorithm with communication compression; (3) graph neural networks with adaptive message passing that are robust to adversarial features; and (4) graph neural networks with elastic message passing that are robust to adversarial graph structures.

To fundamentally improve the efficiency of distributed ML systems, I proposed a series of innovative algorithms to break through the communication bottleneck. In particular, when the communication network is a star network, I proposed DORE [68], a double residual compression algorithm, to compress the bidirectional communication between client devices and the server such that over 95% of the communication bits can be reduced. This is the first algorithm that reduces this much communication cost while maintaining the same superior convergence complexities (e.g., linear convergence) as the uncompressed counterpart, both theoretically and numerically. When the communication network has any general topology (as long as it is connected), I proposed LEAD [69], the first linearly convergent decentralized optimization algorithm with communication compression, which only requires point-to-point compressed communication between neighboring devices over the communication network. Theoretically, we prove that under certain compression ratios, the convergence complexity of the proposed algorithm does not depend on the compression operator. In other words, it achieves better communication efficiency for free. These algorithms significantly improve the efficiency and scalability of large-scale ML systems with solid theoretical guarantees and remarkable empirical performance. They have great potential to accelerate scientific discovery through machine learning and data science.

To design intrinsically secure ML models against feature attacks, I investigate how to denoise the hidden features in neural network layers corrupted by adversarial perturbations using the graph structural information. This is achieved by the proposed AirGNN [66], in which the adaptive message passing denoises perturbed features through feature aggregations and maintains feature separability through adaptive residuals. The proposed algorithm has a clear design principle and interpretation as well as strong performance in both the clean and adversarial data settings. This points out a promising direction for achieving adversarial robustness through feature denoising in hidden layers.

To design intrinsically secure ML models against graph structure attacks, I investigate new prior knowledge of smoothness in the design of graph neural networks.
In particular, we derive an elastic message passing scheme to model the piecewise constant signal in graph data. We demonstrate its stronger resilience to adversarial structure attacks and its superior performance when the data is clean through a comprehensive empirical study on the proposed model ElasticGNN [67]. These secure ML models immensely boost the security of ML models under potential adversarial threats. They may not only be applied in safety-critical applications but also inspire further research in this emerging direction.

6.2 Future Direction

Large-scale ML and secure ML are active areas of exploration. Below we discuss some promising research directions:

• Distributed machine learning under heterogeneous environments: Our study in centralized learning and decentralized learning suggests that the convergence properties of distributed optimization algorithms might be sensitive to heterogeneous environments: (1) data heterogeneity; (2) network heterogeneity; (3) computation heterogeneity. Due to data heterogeneity, the data distributions from multiple computation devices might differ significantly from each other, which causes conflicting update directions if synchronization is not timely. Due to network heterogeneity, the network bandwidths and conditions are often uneven in large distributed systems. Therefore, it is critical to take the potential communication degradation into consideration. Due to computation heterogeneity, different computation devices might have diverse computation power. To fully utilize the computation power, it is vital to design algorithms that support flexible computation tasks and avoid idle time.

• Secure machine learning for more data types: Our study in designing secure graph neural networks suggests a promising new direction for designing intrinsically robust ML models through feature denoising in the hidden layers of deep neural networks. Therefore, it is promising to generalize these ideas to general data types such as images, videos, and text, where the graph structure information is not explicitly available but can be constructed from the data. Moreover, it is also promising to consider more advanced and flexible smoothing assumptions beyond homophily graphs in the design of GNN models.

APPENDICES

APPENDIX A

A DOUBLE RESIDUAL COMPRESSION ALGORITHM FOR DISTRIBUTED LEARNING

A.1 Additional Experiments

A.1.1 Communication Efficiency

To make an explicit comparison of communication efficiency, we report the training loss convergence with respect to communication bits in Figures A.1, A.2, and A.3 for the experiments on synthetic data, the MNIST dataset, and the CIFAR10 dataset, respectively. These results are independent of the system architectures and network bandwidth. They suggest that the proposed DORE reduces the communication cost significantly while maintaining good convergence speed. Furthermore, we also test the running time of ResNet18 trained on the CIFAR10 dataset under two different network bandwidth configurations, i.e., 1 Gbps and 200 Mbps, as shown in Figures A.4 and A.5. Due to its superior communication efficiency, the proposed DORE runs faster in both configurations. Moreover, when the network bandwidth reduces from 1 Gbps to 200 Mbps, the running time of DORE only increases slightly, which indicates that DORE is more robust to network bandwidth changes and can work more efficiently under limited bandwidth. These results clearly suggest the advantages of the proposed algorithm. A rough sketch of how communication bits can be counted per iteration is given below.
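As a concrete reference point for the communication-cost axes in Figures A.1–A.5, the following is a minimal sketch of one plausible per-worker bit accounting for b-bit blockwise quantization versus uncompressed 32-bit floats. The block size, the one full-precision norm per block, and the ResNet18 parameter count are illustrative assumptions; the exact accounting used to produce the figures may differ.

```python
def bits_per_round(num_params, b=2, block_size=256, float_bits=32, compressed=True):
    # Rough per-worker, per-iteration communication estimate (an assumption for
    # illustration): each quantized entry costs b bits plus one full-precision
    # scaling value per compression block, versus float_bits per entry otherwise.
    if not compressed:
        return num_params * float_bits
    num_blocks = -(-num_params // block_size)  # ceiling division
    return num_params * b + num_blocks * float_bits

# Example: ResNet18 (~11.7M parameters), 2-bit quantization vs. 32-bit floats.
n = 11_700_000
print(bits_per_round(n, compressed=False))   # roughly 3.7e8 bits per round
print(bits_per_round(n, b=2))                # roughly 2.5e7 bits, a ~93% reduction
```

Under these assumed settings, a single direction of communication already shrinks by more than an order of magnitude, which is consistent with the large gaps along the bit axes in the figures.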
All the experiments in this section are under exactly the same setting as described in Section 2.5. The running time is tested on a High Performance Computing Cluster with NVIDIA Tesla K80 GPUs; the computing nodes are connected by Gigabit Ethernet interfaces, and we use mpi4py as the communication backend. All algorithms in this work are implemented with PyTorch.

Figure A.1: Linear regression on synthetic data.

Figure A.2: LeNet trained on MNIST dataset.

Figure A.3: ResNet18 trained on CIFAR10 dataset.

Figure A.4: ResNet18 trained on CIFAR10 dataset with 1Gbps network bandwidth.

Figure A.5: ResNet18 trained on CIFAR10 dataset with 200Mbps network bandwidth.

A.1.2 Parameter sensitivity

Continuing the MNIST experiment in Section 2.5, we further conduct a parameter analysis on DORE. The basic settings for the block size, learning rate, 𝛼, 𝛽, and 𝜂 are 256, 0.1, 0.1, 1, and 1, respectively. We change each parameter individually. Figures A.6, A.7, A.8, and A.9 demonstrate that DORE performs consistently well under different parameter settings.

Figure A.6: Training under different compression block sizes. (a) Training loss; (b) Test loss.

Figure A.7: Training under different 𝛼. (a) Training loss; (b) Test loss.

Figure A.8: Training under different 𝛽. (a) Training loss; (b) Test loss.

Figure A.9: Training under different 𝜂. (a) Training loss; (b) Test loss.

A.2 Proofs of the theorems

A.2.1 Proof of Theorem 1

We first provide two lemmas. We define E_Q, E_k, and E to be the expectation taken over the quantization, the 𝑘th iteration based on x̂^k, and the overall randomness, respectively.

Lemma 7. For every 𝑖, we can estimate the first two moments of h_i^{k+1} as

E_Q h_i^{k+1} = (1 − 𝛼) h_i^k + 𝛼 g_i^k,   (A.1)

E_Q ∥h_i^{k+1} − s_i∥² ≤ (1 − 𝛼)∥h_i^k − s_i∥² + 𝛼∥g_i^k − s_i∥² + 𝛼[(C_q + 1)𝛼 − 1]∥Δ_i^k∥².   (A.2)

Proof. The first equality follows from lines 5-7 of Algorithm 1 and Assumption 1. For the second inequality, we have the following variance decomposition

E∥X∥² = ∥E X∥² + E∥X − E X∥²   (A.3)

for any random vector X. By taking X = h_i^{k+1} − s_i, we get

E_Q ∥h_i^{k+1} − s_i∥² = ∥(1 − 𝛼)(h_i^k − s_i) + 𝛼(g_i^k − s_i)∥² + 𝛼² E_Q ∥Δ̂_i^k − Δ_i^k∥².   (A.4)

Using the basic equality

∥𝜆a + (1 − 𝜆)b∥² + 𝜆(1 − 𝜆)∥a − b∥² = 𝜆∥a∥² + (1 − 𝜆)∥b∥²   (A.5)

for all a, b ∈ R^d and 𝜆 ∈ [0, 1], as well as Assumption 1, we have

E_Q ∥h_i^{k+1} − s_i∥² ≤ (1 − 𝛼)∥h_i^k − s_i∥² + 𝛼∥g_i^k − s_i∥² − 𝛼(1 − 𝛼)∥Δ_i^k∥² + 𝛼² C_q ∥Δ_i^k∥²,   (A.6)

which is the inequality (A.2).
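To illustrate the first moment in (A.1), below is a minimal simulation of the gradient-state update implied by the lemma, namely h_i^{k+1} = h_i^k + 𝛼 Q(g_i^k − h_i^k) with an unbiased compressor Q (as in Assumption 1). The rescaled random-k sparsifier used here is only one example of such a compressor, chosen for illustration; the seed, dimensions, and sample count are arbitrary assumptions.

```python
import torch

def random_k(x, k):
    # Unbiased random-k sparsification: keep k random coordinates and rescale by
    # d/k so that E[Q(x)] = x, i.e., an unbiased compressor with bounded variance.
    d = x.numel()
    mask = torch.zeros(d)
    mask[torch.randperm(d)[:k]] = 1.0
    return (d / k) * x * mask

torch.manual_seed(0)
d, alpha = 100, 0.1
g = torch.randn(d)   # a fixed stochastic gradient g_i^k for this check
h = torch.randn(d)   # the current gradient state h_i^k

# Average the state update h^{k+1} = h^k + alpha * Q(g^k - h^k) over many draws of Q.
avg_h_next = torch.stack(
    [h + alpha * random_k(g - h, k=10) for _ in range(20000)]
).mean(0)

expected = (1 - alpha) * h + alpha * g   # the first moment predicted by (A.1)
print((avg_h_next - expected).norm() / expected.norm())  # small; shrinks with more draws
```

Averaging over the compressor's randomness recovers the convex combination (1 − 𝛼)h_i^k + 𝛼g_i^k, which is exactly the contraction of the state toward the gradient that drives the recursion in the rest of the proof.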
Next, from the variance decomposition (A.3), we also derive Lemma 8. Lemma 8. The following inequality holds 𝑛 ∗ 2 ∗ 2 𝐶𝑞 ∑︁ 𝜎2 𝑘 𝑘 E[∥ ĝ − h ∥ ] ≤ E∥∇ 𝑓 ( x̂ ) − h ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 + , (A.7) 𝑛 𝑖=1 𝑛 1 ∗ 1 where h∗ = ∇ 𝑓 (x∗ ) = and 𝜎 2 = 𝜎𝑖2 . Í𝑛 Í𝑛 𝑛 𝑖=1 h𝑖 𝑛 𝑖=1 Proof. By taking the expectation over the quantization of g, we have E∥ ĝ 𝑘 − h∗ ∥ 2 = E∥g 𝑘 − h∗ ∥ 2 + E∥ ĝ 𝑘 − g 𝑘 ∥ 2 𝑛 ∗ 2 𝐶𝑞 ∑︁ 𝑘 ≤ E∥g − h ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 , (A.8) 𝑛 𝑖=1 where the inequality is from Assumption 1. 92 For ∥g 𝑘 − h∗ ∥, we take the expectation over the sampling of gradients and derive E∥g 𝑘 − h∗ ∥ 2 = E∥∇ 𝑓 ( x̂ 𝑘 ) − h∗ ∥ 2 + E∥g 𝑘 − ∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝜎2 ≤ E∥∇ 𝑓 ( x̂ 𝑘 ) − h∗ ∥ 2 + (A.9) 𝑛 by Assumption 2. Combining (A.8) with (A.9) gives (A.7). Proof of Theorem 1. We consider x 𝑘+1 − x∗ first. Since x∗ is the solution of (2.1), it satisfies x∗ = prox𝛾𝑅 (x∗ − 𝛾h∗ ). (A.10) Hence E∥x 𝑘+1 − x∗ ∥ 2 =E∥prox𝛾𝑅 ( x̂ 𝑘 − 𝛾 ĝ 𝑘 ) − prox𝛾𝑅 (x∗ − 𝛾h∗ )∥ 2 ≤E∥ x̂ 𝑘 − x∗ − 𝛾( ĝ 𝑘 − h∗ )∥ 2 =E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ĝ 𝑘 − h∗ ⟩ + 𝛾 2 E∥ ĝ 𝑘 − h∗ ∥ 2 =E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ + 𝛾 2 E∥ ĝ 𝑘 − h∗ ∥ 2 , (A.11) where the inequality comes from the non-expansiveness of the proximal operator and the last equality is derived by taking the expectation of the stochastic gradient ĝ 𝑘 . Combining (A.7) and (A.11), we have E∥x 𝑘+1 − x∗ ∥ 2 ≤E∥ x̂ 𝑘 − x∗ ∥ 2 − 2𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 𝛾 2 ∑︁ ∗ 2 𝐶𝑞 𝛾 2 ∑︁ 𝑛 𝛾2 + 𝑘 E∥∇ 𝑓𝑖 ( x̂ ) − h𝑖 ∥ + 2 E∥Δ𝑖𝑘 ∥ 2 + 𝜎 2 . (A.12) 𝑛 𝑖=1 𝑛 𝑖=1 𝑛 Then we consider E∥ x̂ 𝑘+1 − x∗ ∥ 2 . According to Algorithm 1, we have: E𝑄 [ x̂ 𝑘+1 − x∗ ] = x̂ 𝑘 + 𝛽q 𝑘 − x∗ = (1 − 𝛽)( x̂ 𝑘 − x∗ ) + 𝛽(x 𝑘+1 − x∗ + 𝜂e 𝑘 ) (A.13) where the expectation is taken on the quantization of q 𝑘 . 93 By variance decomposition (A.3) and the basic equality (A.5), E∥ x̂ 𝑘+1 − x∗ ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + 𝛽E∥x 𝑘+1 + 𝜂e 𝑘 − x∗ ∥ 2 − 𝛽(1 − 𝛽)E∥q 𝑘 ∥ 2 + 𝛽2𝐶𝑞𝑚 E∥q 𝑘 ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + (1 + 𝜂2 𝜖) 𝛽E∥x 𝑘+1 − x∗ ∥ 2 − 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 1 + (𝜂2 + ) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 , (A.14) 𝜖 where 𝜖 is generated from Cauchy inequality of inner product. For convenience, we let 𝜖 = 𝜂1 . 1 Choose a 𝛽 such that 0 < 𝛽 ≤ 1+𝐶𝑞𝑚 . Then we have 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 + E∥ x̂ 𝑘+1 − x∗ ∥ 2 ≤(1 − 𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 + (1 + 𝜂) 𝛽E∥x 𝑘+1 − x∗ ∥ 2 + (𝜂2 + 𝜂) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 . (A.15) Letting s𝑖 = h𝑖∗ in (A.2), we have 𝑛 (1 + 𝜂)𝑐𝛽𝛾 2 ∑︁ E∥h𝑖𝑘+1 − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ 𝑘 (1 + 𝜂)𝛼𝑐𝛽𝛾 2 ∑︁ 𝑘 ≤ ∥h𝑖 − h𝑖∗ ∥ 2 + ∥g𝑖 − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑖=1 (1 + 𝜂)𝛼[(𝐶𝑞 + 1)𝛼 − 1]𝑐𝛽𝛾 2 ∑︁ 𝑛 + ∥Δ𝑖𝑘 ∥ 2 . (A.16) 𝑛 𝑖=1 Then we let R 𝑘 = 𝛽(1 − (𝐶𝑞𝑚 + 1) 𝛽)E∥q 𝑘 ∥ 2 and define 𝑛 (1 + 𝜂)𝑐𝛽𝛾 2 ∑︁ 𝑘 V =R 𝑘−1 𝑘 + E∥ x̂ − x ∥ +∗ 2 E∥h𝑖𝑘 − h𝑖∗ ∥ 2 . 𝑛 𝑖=1 Thus, we obtain V 𝑘+1 ≤(𝜂2 + 𝜂) 𝛽𝐶𝑞𝑚 E∥q 𝑘−1 ∥ 2 + (1 + 𝜂𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 − 2(1 + 𝜂) 𝛽𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 𝑛 𝑖=1 2 𝑛 (1 + 𝜂) 𝛽𝛾 h 2 i ∑︁ + 2 𝑛𝑐(𝐶 𝑞 + 1)𝛼 − 𝑛𝑐𝛼 + 𝐶 𝑞 E∥Δ𝑖𝑘 ∥ 2 𝑛 𝑖=1 𝑛 (1 + 𝜂)(1 + 𝑐𝛼) 2 ∑︁ (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + 𝛽𝛾 E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 . (A.17) 𝑛 𝑖=1 𝑛 94 The E∥Δ𝑖𝑘 ∥ 2 -term can be ignored if 𝑛𝑐(𝐶𝑞 + 1)𝛼2 − 𝑛𝑐𝛼 + 𝐶𝑞 ≤ 0, which can be guaranteed by 4𝐶𝑞 (𝐶𝑞 +1) 𝑐≥ 𝑛 and √︃ √︃ 4𝐶𝑞 (𝐶𝑞 +1) 4𝐶𝑞 (𝐶𝑞 +1) ©1 − 1 − 𝑛𝑐 1+ 1− 𝑛𝑐 ª 𝛼 ∈ ­­ , ®. 2(𝐶𝑞 + 1) 2(𝐶𝑞 + 1) ® « ¬ Given that each 𝑓𝑖 is 𝐿-Lipschitz differentiable and 𝜇-strongly convex, we have 𝑛 𝜇𝐿 1 1 ∑︁ 𝑘 ∗ 𝑘 ∗ E⟨∇ 𝑓 ( x̂ ) − h , x̂ − x ⟩ ≥ E∥ x̂ 𝑘 − x∗ ∥ 2 + E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 . 
(A.18) 𝜇+𝐿 𝜇 + 𝐿 𝑛 𝑖=1 Hence V 𝑘+1 ≤𝜌1 R 𝑘−1 + (1 + 𝜂𝛽)E∥ x̂ 𝑘 − x∗ ∥ 2 − 2(1 + 𝜂) 𝛽𝛾E⟨x̂ 𝑘 − x∗ , ∇ 𝑓 ( x̂ 𝑘 ) − h∗ ⟩ 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ (1 + 𝜂)(1 + 𝑐𝛼) 2 ∑︁ + 𝑘 ∗ 2 E∥h𝑖 − h𝑖 ∥ + 𝛽𝛾 E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 𝑛 𝑖=1 𝑛 𝑖=1 (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + 𝛽𝛾 𝜎 𝑛 𝑛 h 2(1 + 𝜂) 𝛽𝛾𝜇𝐿 i ∗ 2 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ ≤𝜌1 R 𝑘−1 + 1 + 𝜂𝛽 − 𝑘 E∥ x̂ − x ∥ + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 𝜇+𝐿 𝑛 𝑖=1 𝑛 h 2(1 + 𝜂) 𝛽𝛾 1 ∑︁ i (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 + (1 + 𝜂)(1 + 𝑐𝛼) 𝛽𝛾 2 − E∥∇ 𝑓𝑖 ( x̂ 𝑘 ) − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 𝜇+𝐿 𝑛 𝑖=1 𝑛 𝑛 (1 + 𝜂)(1 − 𝛼)𝑐𝛽𝛾 2 ∑︁ (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 ≤𝜌1 R 𝑘−1 + 𝜌2 E∥ x̂ 𝑘 − x∗ ∥ 2 + E∥h𝑖𝑘 − h𝑖∗ ∥ 2 + 𝛽𝛾 𝜎 𝑛 𝑖=1 𝑛 (A.19) where (𝜂2 + 𝜂)𝐶𝑞𝑚 𝜌1 = , 1 − (𝐶𝑞𝑚 + 1) 𝛽 2(1 + 𝜂) 𝛽𝛾𝜇𝐿 𝜌2 =1 + 𝜂𝛽 − . 𝜇+𝐿 2 2(1+𝜂) 𝛽𝛾 Here we let 𝛾 ≤ (1+𝑐𝛼) (𝜇+𝐿) such that (1 + 𝜂)(1 + 𝑐𝛼) 𝛽𝛾 2 − 𝜇+𝐿 ≤ 0 and the last inequality holds. In order to get max(𝜌1 , 𝜌2 , 1 − 𝛼) < 1, we have the following conditions 0 ≤ (𝜂2 + 𝜂)𝐶𝑞𝑚 ≤1 − (𝐶𝑞𝑚 + 1) 𝛽, 2(1 + 𝜂)𝛾𝜇𝐿 𝜂< . 𝜇+𝐿 95 Therefore, the condition for 𝛾 is 𝜂(𝜇 + 𝐿) 2 ≤𝛾≤ , 2(1 + 𝜂)𝜇𝐿 (1 + 𝑐𝛼)(𝜇 + 𝐿) which implies an additional condition for 𝜂. Therefore, the condition for 𝜂 is √︃  𝑚 2 © −𝐶𝑞 + (𝐶𝑞 ) + 4(1 − (𝐶𝑞 + 1) 𝛽) 𝑚 𝑚  4𝜇𝐿 ªª 𝜂 ∈ 0, min ­  ®® . ­ ,  2𝐶𝑞𝑚 (𝜇 + 𝐿) 2 (1 + 𝑐𝛼) − 4𝜇𝐿 ®®   « ¬¬ 4𝜇𝐿 𝜂(𝜇+𝐿) 2 where 𝜂 ≤ (𝜇+𝐿) 2 (1+𝑐𝛼)−4𝜇𝐿 is to ensure 2(1+𝜂)𝜇𝐿 ≤ (1+𝑐𝛼) (𝜇+𝐿) such that we don’t get an empty set for 𝛾. If we define 𝜌 = max{𝜌1 , 𝜌2 , 1 − 𝛼}, we obtain (1 + 𝜂)(1 + 𝑛𝑐𝛼) 2 2 V 𝑘+1 ≤ 𝜌V 𝑘 + 𝛽𝛾 𝜎 (A.20) 𝑛 and the proof is completed by applying (A.20) recurrently. A.2.2 Proof of Theorem 2 Proof. In Algorithm 2, we can show E∥ x̂ 𝑘+1 − x̂ 𝑘 ∥ 2 = 𝛽2 E∥ q̂ 𝑘 ∥ 2 = 𝛽2 E∥Eq̂ 𝑘 ∥ 2 + 𝛽2 E∥ q̂ 𝑘 − Eq̂ 𝑘 ∥ 2 = 𝛽2 E∥q 𝑘 ∥ 2 + 𝛽2 E∥ q̂ 𝑘 − q 𝑘 ∥ 2 (A.21) ≤ (1 + 𝐶𝑞𝑚 ) 𝛽2 E∥q 𝑘 ∥ 2 . and E∥q 𝑘 ∥ 2 = E∥ − 𝛾 ĝ 𝑘 + 𝜂e 𝑘 ∥ 2 ≤ 2𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 2𝜂2 E∥e 𝑘 ∥ 2 ≤ 2𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 2𝐶𝑞𝑚 𝜂2 E∥q 𝑘−1 ∥ 2 . (A.22) Using (A.21)(A.22) and the Lipschitz continuity of ∇ 𝑓 (x), we have E 𝑓 ( x̂ 𝑘+1 ) + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 𝐿 ≤E 𝑓 ( x̂ 𝑘 ) + E⟨∇ 𝑓 ( x̂ 𝑘 ), x̂ 𝑘+1 − x̂ 𝑘 ⟩ + E∥ x̂ 𝑘+1 − x̂ 𝑘 ∥ 2 + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 2 (1 + 𝐶𝑞𝑚 )𝐿 𝛽2 𝑘 𝑘 𝑘 =E 𝑓 ( x̂ ) + 𝛽E⟨∇ 𝑓 ( x̂ ), −𝛾 ĝ + 𝜂e ⟩ + 𝑘 E∥q 𝑘 ∥ 2 + (𝐶𝑞𝑚 + 1)𝐿 𝛽2 E∥q 𝑘 ∥ 2 2 96 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝑘 𝑘 𝑘 =E 𝑓 ( x̂ ) + 𝛽E⟨∇ 𝑓 ( x̂ ), −𝛾∇ 𝑓 ( x̂ ) + 𝜂e ⟩ + 𝑘 E∥q 𝑘 ∥ 2 2 𝛽𝜂 𝛽𝜂 ≤E 𝑓 ( x̂ 𝑘 ) − 𝛽𝛾E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + E∥e 𝑘 ∥ 2 2 2 h i + 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥ ĝ 𝑘 ∥ 2 + 𝐶𝑞𝑚 𝜂2 E∥q 𝑘−1 ∥ 2 h 𝛽𝜂 i ≤E 𝑓 ( x̂ 𝑘 ) − 𝛽𝛾 − − 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 2 3𝐶𝑞 (𝐶𝑞 + 1)𝐿 𝛽2 𝛾 2 ∑︁ 𝑚 𝑛 𝑘 2 3(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 2 + E∥Δ ∥ + 𝜎 𝑛2 𝑖=1 𝑖 𝑛 h 𝛽𝜂𝐶𝑞𝑚 i + + (3𝐶𝑞 + 1)𝐶𝑞 𝐿 𝛽 𝜂 E∥q 𝑘−1 ∥ 2 , 𝑚 𝑚 2 2 (A.23) 2 where the last inequality is from (A.7) with h∗ = 0. Letting s𝑖 = 0 in (A.2), we have E𝑄 ∥h𝑖𝑘+1 ∥ 2 ≤(1 − 𝛼)∥h𝑖𝑘 ∥ 2 + 𝛼∥g𝑖𝑘 ∥ 2 + 𝛼[(𝐶𝑞 + 1)𝛼 − 1] ∥Δ𝑖𝑘 ∥ 2 . (A.24) Due to the assumption that each worker samples the gradient from the full dataset, we have Eg𝑖𝑘 = E∇ 𝑓 ( x̂ 𝑘 ), E∥g𝑖𝑘 ∥ 2 ≤ E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 + 𝜎𝑖2 . (A.25) Define Λ 𝑘 = (𝐶𝑞𝑚 +1)𝐿 𝛽2 ∥q 𝑘−1 ∥ 2 + 𝑓 ( x̂ 𝑘 )− 𝑓 ∗ +3𝑐(𝐶𝑞𝑚 +1)𝐿 𝛽2 𝛾 2 𝑛1 𝑘 2 Í𝑛 𝑖=1 E∥h𝑖 ∥ , and from (A.23), (A.24), and (A.25), we have 𝑛 2 21 ∑︁ ∗ EΛ 𝑘+1 𝑘 ≤E 𝑓 ( x̂ ) − 𝑓 + 3(1 − 𝛼)𝑐(𝐶𝑞𝑚 + 1)𝐿 𝛽 𝛾 E∥h𝑖𝑘 ∥ 2 𝑛 𝑖=1 h 𝛽𝜂 i − 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 h 2 i ∑︁𝑛 + 3𝑛𝑐(𝐶𝑞 + 1)𝛼 − 3𝑛𝑐𝛼 + 3𝐶𝑞 E∥Δ𝑖𝑘 ∥ 2 𝑛2 𝑖=1 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 + 3(1 + 𝑛𝑐𝛼) 𝑛 h 𝛽𝜂𝐶𝑞𝑚 i + + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 E∥q 𝑘−1 ∥ 2 . 
(A.26) 2 4𝐶𝑞 (𝐶𝑞 +1) If we let 𝑐 = 𝑛 , then the condition of 𝛼 in (2.5) gives 3𝑛𝑐(𝐶𝑞 + 1)𝛼2 − 3𝑛𝑐𝛼 + 3𝐶𝑞 ≤ 0 and 𝑛 2 21 ∑︁ ∗ EΛ 𝑘+1 𝑘 ≤E 𝑓 ( x̂ ) − 𝑓 + 3(1 − 𝛼)𝑐(𝐶𝑞𝑚 + 1)𝐿 𝛽 𝛾 E∥h𝑖𝑘 ∥ 2 𝑛 𝑖=1 97 h 𝛽𝜂 i − 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞 + 1)𝐿 𝛽 𝛾 E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝑚 2 2 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 + 3(1 + 𝑛𝑐𝛼) 𝑛 𝛽𝜂𝐶𝑞𝑚 +[ + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 ]E∥q 𝑘−1 ∥ 2 . (A.27) 2 1 Let 𝜂 = 𝛾 and 𝛽𝛾 ≤ 6(1+𝑐𝛼) (𝐶𝑞𝑚 +1)𝐿 , we have 𝛽𝜂 𝛽𝛾 𝛽𝛾 − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 = − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 ≥ 0. 2 2 √︂ 48𝐿 2 𝛽 2 (𝐶𝑞𝑚 +1) 2 n −1+ 1+ 𝐶𝑞𝑚 o 1 Take 𝛾 ≤ min 12𝐿 𝛽(𝐶𝑞𝑚 +1) , 6𝐿 𝛽(1+𝑐𝛼) (𝐶𝑞 +1) will guarantee 𝑚 h 𝛽𝜂𝐶𝑞𝑚 i + 3(𝐶𝑞𝑚 + 1)𝐶𝑞𝑚 𝐿 𝛽2 𝜂2 ≤ (𝐶𝑞𝑚 + 1)𝐿 𝛽2 . 2 Hence we obtain 𝑘+1 𝑘 h 𝛽𝛾 2 2 i 2 (𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 2 𝜎 2 EΛ ≤ EΛ − − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 𝑘 + 1)𝐿 𝛽 𝛾 E∥∇ 𝑓 ( x̂ )∥ + 3(1 + 𝑛𝑐𝛼) . 2 𝑛 (A.28) Taking the telescoping sum and plugging the initial conditions, we derive (2.12). A.2.3 Proof of Corollary 2 1 4𝐶𝑞 (𝐶𝑞 +1) 1 Proof. With 𝛼 = 2(𝐶𝑞 +1) and 𝑐 = 𝑛 , 1 + 𝑛𝑐𝛼 = 1 + 2𝐶𝑞 is a constant. We set 𝛽 = 𝐶𝑞𝑚 +1 and √︂ 2 n −1+ 1+ 48𝐿 𝐶𝑚 o 𝑞 1 √ 𝛾 = min 12𝐿 , . In general, 𝐶𝑞𝑚 is bounded which makes the first bound 12𝐿(1+𝑐𝛼) (1+ 𝐾/𝑛) 1 √ negligible, i.e., 𝛾 = when 𝐾 is large enough. Therefore, we have 12𝐿(1+𝑐𝛼) (1+ 𝐾/𝑛) 𝛽 1 − 6(1 + 𝑐𝛼)𝐿𝛾 1 − 3(1 + 𝑐𝛼)(𝐶𝑞𝑚 + 1)𝐿 𝛽2 𝛾 = ≤ . (A.29) 2 2(𝐶𝑞 + 1) 𝑚 4(𝐶𝑞 + 1) 𝑚 From Theorem 2, we derive 𝐾 1 ∑︁ E∥∇ 𝑓 ( x̂ 𝑘 )∥ 2 𝐾 𝑘=1 4(𝐶𝑞𝑚 + 1)(EΛ1 − EΛ𝐾+1 ) 12(1 + 𝑛𝑐𝛼)𝐿𝜎 2 𝛾 ≤ + 𝛾𝐾 𝑛 98 1 1 (1 + 𝑛𝑐𝛼)𝜎 2 1 ≤48𝐿 (𝐶𝑞𝑚 + 1)(1 + 𝑐𝛼)(EΛ1 − EΛ𝐾+1 )( +√ )+ √ , (A.30) 𝐾 𝑛𝐾 (1 + 𝑐𝛼) 𝑛𝐾 which completes the proof. 99 APPENDIX B LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION B.1 Compression method B.1.1 p-norm b-bits quantization Theorem 9 (p-norm b-bit quantization). Let us define the quantization operator as    2𝑏−1 |x|  −(𝑏−1) 𝑄 𝑝 (x) := ∥x∥ 𝑝 sign(x)2 · +u (B.1) ∥x∥ 𝑝 where · is the Hadamard product, |x| is the elementwise absolute value and u is a random dither vector uniformly distributed in [0, 1] 𝑑 . 𝑄 𝑝 (x) is unbiased, i.e., E𝑄 𝑝 (x) = x, and the compression variance is upper bounded by 1 E∥x − 𝑄 𝑝 (x)∥ 2 ≤ ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 , (B.2) 4 which suggests that ∞-norm provides the smallest upper bound for the compression variance due to ∥x∥ 𝑝 ≤ ∥x∥ 𝑞 , ∀x if 1 ≤ 𝑞 ≤ 𝑝 ≤ ∞. Remark 13. For the compressor defined in (B.1), we have the following the compression constant ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 𝐶 = sup . x 4∥x∥ 2 j k l m 2𝑏−1 |x| 2𝑏−1 |x| 2𝑏−1 |x| Proof. Let denote v = ∥x∥ 𝑝 sign(x)2−(𝑏−1) , 𝑠 = ∥x∥ 𝑝 , 𝑠1 = ∥x∥ 𝑝 and 𝑠2 = ∥x∥ 𝑝 . We can rewrite x as x = 𝑠 · v. For any coordinate 𝑖 such that 𝑠𝑖 = (𝑠1 )𝑖 , we have 𝑄 𝑝 (x𝑖 ) = (𝑠1 )𝑖 v𝑖 with probability 1. Hence E𝑄 𝑝 (x)𝑖 = 𝑠𝑖 v𝑖 = x𝑖 and E(x𝑖 − 𝑄 𝑝 (x)𝑖 ) 2 = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 = 0. For any coordinate 𝑖 such that 𝑠𝑖 ≠ (𝑠1 )𝑖 , we have (𝑠2 )𝑖 − (𝑠1 )𝑖 = 1 and 𝑄 𝑝 (x)𝑖 satisfies   (𝑠1 )𝑖 v𝑖 , w.p. (𝑠2 )𝑖 − 𝑠𝑖 ,    𝑄 𝑝 (x)𝑖 =  (𝑠2 )𝑖 v𝑖 , w.p. 𝑠𝑖 − (𝑠1 )𝑖 .    100 Thus, we derive E𝑄 𝑝 (x)𝑖 = v𝑖 (𝑠1 )𝑖 (𝑠2 − 𝑠)𝑖 + v𝑖 (𝑠2 )𝑖 (𝑠 − 𝑠1 )𝑖 = v𝑖 𝑠𝑖 (𝑠2 − 𝑠1 )𝑖 = v𝑖 𝑠𝑖 = x𝑖 , and E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 = (x𝑖 − v𝑖 (𝑠1 )𝑖 ) 2 (𝑠2 − 𝑠)𝑖 + (x𝑖 − v𝑖 (𝑠2 )𝑖 ) 2 (𝑠 − 𝑠1 )𝑖 = (𝑠2 − 𝑠1 )𝑖 x𝑖2 + (𝑠1 )𝑖 (𝑠2 )𝑖 (𝑠1 − 𝑠2 )𝑖 + 𝑠𝑖 ((𝑠2 )𝑖2 − (𝑠1 )𝑖2 ) v𝑖2 − 2𝑠𝑖 (𝑠2 − 𝑠1 )𝑖 x𝑖 v𝑖  = x𝑖2 + − (𝑠1 )𝑖 (𝑠2 )𝑖 + 𝑠𝑖 (𝑠2 + 𝑠1 )𝑖 v𝑖2 − 2𝑠𝑖 x𝑖 v𝑖  = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 + − (𝑠1 )𝑖 (𝑠2 )𝑖 + 𝑠𝑖 (𝑠2 + 𝑠1 )𝑖 − 𝑠𝑖2 v𝑖2  = (x𝑖 − 𝑠𝑖 v𝑖 ) 2 + (𝑠2 − 𝑠)𝑖 (𝑠 − 𝑠1 )𝑖 v𝑖2 = (𝑠2 − 𝑠)𝑖 (𝑠 − 𝑠1 )𝑖 v𝑖2 1 2 ≤ v . 
4 𝑖 Considering both cases, we have E𝑄(x) = x and ∑︁ ∑︁ E∥x − 𝑄 𝑝 (x)∥ 2 = E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 + E[x𝑖 − 𝑄 𝑝 (x)𝑖 ] 2 {𝑠𝑖 =(𝑠1 )𝑖 } {𝑠𝑖 ≠(𝑠1 )𝑖 } 1 ∑︁ ≤ 0+ v𝑖2 4 {𝑠𝑖 ≠(𝑠1 )𝑖 } 1 ≤ ∥v∥ 2 4 1 = ∥sign(x)2−(𝑏−1) ∥ 2 ∥x∥ 2𝑝 . 4 B.1.2 Compression error To verify Theorem 9, we compare the compression error of the quantization method defined in (B.1) with different norms (𝑝 = 1, 2, 3, . . . , 6, ∞). Specifically, we uniformly generate 100 random vectors in R10000 and compute the average compression error. The result shown in Figure B.1 verifies our proof in Theorem 9 that the compression error decreases when 𝑝 increases. This suggests that ∞-norm provides the best compression precision under the same bit constraint. 101 2 bits 3 bits 4 bits 1 10 5 bits 6 bits ||X Q(X)||2/||X||2 7 bits 0 10 1 10 2 10 1 2 3 4 5 6 inf p-norm ∥x−𝑄(x) ∥ 2 Figure B.1: Relative compression error ∥x∥ 2 for p-norm b-bit quantization. 1 10 0 10 1 10 ||X Q(X)||2/||X||2 2 10 3 10 4 10 2-norm b-bits quantization 4-norm b-bits quantization 10 5 inf-norm b-bits quantization top-k sparsification 6 random-k sparsification 10 2 bits 4 bits 8 bits 16 bits 20 bits Bits constraint ∥x−𝑄(x) ∥ 2 Figure B.2: Comparison of compression error ∥x∥ 2 between different compression methods. Under similar setting, we also compare the compression error with other popular compression methods, such as top-k and random-k sparsification. The x-axes represents the average bits needed to represent each element of the vector. The result is showed in Fig. B.2. Note that intuitively top-k methods should perform better than random-k method, but the top-k method needs extra bits to transmitted the index while random-k method can avoid this by using the same random seed. Therefore, top-k method doesn’t outperform random-k too much under the same communication budget. The result in Fig. B.2 suggests that ∞-norm b-bits quantization provides significantly better compression precision than others under the same bit constraint. 102 B.2 Experiments B.2.1 Experiments in homogeneous setting The experiments on logistic regression problem in homogeneous case are showed in Fig. B.3 and Fig. B.4. It shows that DeepSqueeze, CHOCO-SGD and LEAD converges similarly while DeepSqueeze and CHOCO-SGD require to tune a smaller 𝛾 for convergence as showed in the parameter setting in Section B.2.2. Generally, a smaller 𝛾 decreases the model propagation between agents since 𝛾 changes the effective mixing matrix and this may cause slower convergence. However, in the setting where data from different agents are very similar, the models move to close directions such that the convergence is not affected too much. DGD (32 bits) DGD (32 bits) NIDS (32 bits) NIDS (32 bits) 0 0 10 QDGD (2 bits) 10 QDGD (2 bits) DeepSqueeze (2 bits) DeepSqueeze (2 bits) CHOCO-SGD (2 bits) CHOCO-SGD (2 bits) LEAD (2 bits) LEAD (2 bits) Loss Loss −1 6 × 10 −1 6 × 10 −1 4 × 10 −1 4 × 10 −1 3 × 10 −1 3 × 10 0 200 400 600 800 1000 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Epoch Bits transmitted 1e10 (a) Loss 𝑓 (X 𝑘 ) (b) Loss 𝑓 (X 𝑘 ) Figure B.3: Logistic regression in the homogeneous case (full-batch gradient). B.2.2 Parameter settings The best parameter settings we search for all algorithms and experiments are summarized in Tables B.1– B.4. QDGD and DeepSqueeze are more sensitive to 𝛾 and CHOCO-SGD is slight more robust. LEAD is most robust to parameter settings and it works well for the setting 𝛼 = 0.5 and 𝛾 = 1.0 in all experiments in this work. 
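To make the quantizer of Theorem 9 (Section B.1.1) concrete, below is a minimal PyTorch sketch of the p-norm b-bit quantization operator in (B.1), implemented with the stochastic rounding used in its proof (floor of the scaled magnitude plus a uniform dither). The checks at the end are illustrative assumptions about how one might empirically probe unbiasedness and the error ordering over p (cf. Figures B.1 and B.2); they are not the exact scripts used to produce those figures.

```python
import torch

def qp_quantize(x, b=2, p=2.0):
    # p-norm b-bit stochastic quantization, following (B.1):
    #   Q_p(x) = ||x||_p * sign(x) * 2^{-(b-1)} * floor(2^{b-1} |x| / ||x||_p + u),
    # where u ~ Uniform[0, 1]^d makes the rounding unbiased.
    norm = x.norm(p)
    if norm == 0:
        return torch.zeros_like(x)
    u = torch.rand_like(x)
    levels = torch.floor(2 ** (b - 1) * x.abs() / norm + u)
    return norm * torch.sign(x) * 2.0 ** (-(b - 1)) * levels

# Illustrative checks (assumed setup): unbiasedness, and the effect of the norm p.
x = torch.randn(10000)
avg = torch.stack([qp_quantize(x, b=4, p=float("inf")) for _ in range(2000)]).mean(0)
print((avg - x).norm() / x.norm())   # relative bias, shrinks toward 0 with more draws
for p in (1.0, 2.0, float("inf")):
    err = (x - qp_quantize(x, b=4, p=p)).norm() ** 2 / x.norm() ** 2
    print(p, err.item())             # error decreases as p grows, as Theorem 9 suggests
```

Since the dither u is the only source of randomness, averaging many independent quantizations of the same vector recovers it, which is the unbiasedness property E Q_p(x) = x used throughout the analysis, and the ∞-norm variant gives the smallest error under the same bit budget.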
103 1.2 1.2 DGD (32 bits) DGD (32 bits) NIDS (32 bits) NIDS (32 bits) 1.0 QDGD (2 bits) 1.0 QDGD (2 bits) DeepSqueeze (2 bits) DeepSqueeze (2 bits) 0.8 CHOCO-SGD (2 bits) CHOCO-SGD (2 bits) 0.8 LEAD (2 bits) LEAD (2 bits) Loss Loss 0.6 0.6 0.4 0.4 0 10 20 30 40 50 60 70 80 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Epoch Bits transmitted 1e9 (a) Loss 𝑓 (X 𝑘 ) (b) Loss 𝑓 (X 𝑘 ) Figure B.4: Logistic regression in the homogeneous case (mini-batch gradient). Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - NIDS 0.1 - - QDGD 0.1 0.2 - DeepSqueeze 0.1 0.2 - CHOCO-SGD 0.1 0.8 - LEAD 0.1 1.0 0.5 Table B.1: Parameter settings for the linear regression problem. Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.1 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.1 0.4 - QDGD 0.1 0.2 - DeepSqueeze 0.1 0.4 - DeepSqueeze 0.1 0.6 - CHOCO-SGD 0.1 0.6 - CHOCO-SGD 0.1 0.6 - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.2: Parameter settings for the logistic regression problem (full-batch gradient). B.3 Proofs of the theorems B.3.1 Illustrative flow The following flow graph depicts the relation between iterative variables and clarifies the range of conditional expectation. {G𝑘 }∞ ∞ 𝑘=0 and {F𝑘 } 𝑘=0 are two 𝜎−algebras generated by the gradient 104 Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.1 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.05 0.2 - QDGD 0.05 0.2 - DeepSqueeze 0.1 0.6 - DeepSqueeze 0.1 0.6 - CHOCO-SGD 0.1 0.6 - CHOCO-SGD 0.1 0.6 - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.3: Parameter settings for the logistic regression problem (mini-batch gradient). Algorithm 𝜂 𝛾 𝛼 Algorithm 𝜂 𝛾 𝛼 DGD 0.1 - - DGD 0.05 - - NIDS 0.1 - - NIDS 0.1 - - QDGD 0.05 0.1 - QDGD * * - DeepSqueeze 0.1 0.2 - DeepSqueeze * * - CHOCO-SGD 0.1 0.6 - CHOCO-SGD * * - LEAD 0.1 1.0 0.5 LEAD 0.1 1.0 0.5 Homogeneous case Heterogeneous case Table B.4: Parameter settings for the deep neural network. (* means divergence for all options we try). sampling and the stochastic compression respectively. They satisfy G0 ⊂ F0 ⊂ G1 ⊂ F1 ⊂ · · · ⊂ G𝑘 ⊂ F𝑘 ⊂ · · · (X1 , D1 , H1 ) (X2 , D2 , H2 ) (X3 , D3 , H3 ) (X 𝑘 , D 𝑘 , H 𝑘 ) ··· ∇F(X1 ;𝜉 1 )∈G0 E1 ∇F(X2 ;𝜉 2 )∈G1 E2 ··· E 𝑘−1 ∇F(X 𝑘 ;𝜉 𝑘 )∈G𝑘−1 Y1 1st round Y2 ··· Y 𝑘−1 Y𝑘 (𝑘−1)th round ⊂ ⊂ F0 F1 ··· F𝑘−2 F𝑘−1 The solid and dashed arrows in the top flow illustrate the dynamics of the algorithm, while in the bottom, the arrows stand for the relation between successive F -𝜎-algebras. The downward arrows determine the range of F -𝜎-algebras. E.g., up to E 𝑘 , all random variables are in F𝑘−1 and up to ∇F(X 𝑘 ; 𝜉 𝑘 ), all random variables are in G𝑘−1 with G𝑘−1 ⊂ F𝑘−1 . Throughout the appendix, without specification, E is the expectation conditioned on the corresponding stochastic estimators given the 105 context. B.3.2 Two central Lemmas Lemma 10 (Fundamental equality). Let X∗ be the optimal solution, D∗ B −∇F(X∗ ) and E 𝑘 denote the compression error in the 𝑘th iteration, that is E 𝑘 = Q 𝑘 − (Y𝑘 − H 𝑘 ) = Ŷ 𝑘 − Y 𝑘 . From Alg. 3, we have ∥X 𝑘+1 − X∗ ∥ 2 + (𝜂2 /𝛾)∥D 𝑘+1 − D∗ ∥ 2M =∥X 𝑘 − X∗ ∥ 2 + (𝜂2 /𝛾)∥D 𝑘 − D∗ ∥ 2M − (𝜂2 /𝛾)∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, where M B 2(I − W) † − 𝛾I and 𝛾 < 2/𝜆 max (I − W) ensures the positive definiteness of M over range(I − W). Lemma 11 (State inequality). Let the same assumptions in Lemma 10 hold. From Alg. 
3, if we take the expectation over the compression operator conditioned on the 𝑘-th iteration, we have E∥H 𝑘+1 − X∗ ∥ 2 ≤ (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝛼𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝛼2 E∥E 𝑘 ∥ 2 − 𝛼𝛾E∥E 𝑘 ∥ 2I−W − 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 . 𝛾 B.3.3 Proof of Lemma 10 Before proving Lemma 10, we let E 𝑘 = Ŷ 𝑘 − Y 𝑘 and introduce the following three Lemmas. Lemma 12. Let X∗ be the consensus solution. Then, from Line 4-7 of Alg. 3, we obtain   I − W 𝑘+1 ∗ 𝐼 I−W I−W 𝑘 (X − X ) = − (D 𝑘+1 − D 𝑘 ) − E . (B.3) 2𝜂 𝛾 2 2𝜂 Proof. From the iterations in Alg. 3, we have 𝛾 D 𝑘+1 = D 𝑘 + (I − W) Ŷ 𝑘 (from Line 6) 2𝜂 106 𝛾 = D𝑘 + (I − W)(Y 𝑘 + E 𝑘 ) 2𝜂 𝛾 = D 𝑘 + (I − W)(X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 + E 𝑘 ) (from Line 4) 2𝜂 𝛾 = D 𝑘 + (I − W)(X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘+1 − X∗ + 𝜂(D 𝑘+1 − D 𝑘 ) + E 𝑘 ) 2𝜂 𝛾 𝛾 𝛾 = D 𝑘 + (I − W)(X 𝑘+1 − X∗ ) + (I − W)(D 𝑘+1 − D 𝑘 ) + (I − W)E 𝑘 , 2𝜂 2 2𝜂 where the fourth equality holds due to (I − W)X∗ = 0 and the last equality comes from Line 7 of Alg. 3. Rewriting this equality, and we obtain (B.3). Lemma 13. Let D∗ = −∇F(X∗ ) ∈ span{I − W}, we have 𝜂 ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ = ∥D 𝑘+1 − D 𝑘 ∥ 2M − ⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩, (B.4) 𝛾 𝜂 ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ = ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − ⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, (B.5) 𝛾 where M = 2(I − W) † − 𝛾I and 𝛾 < 2/𝜆 max (I − W) ensures the positive definiteness of M over span{I − W}. Proof. Since D 𝑘+1 ∈ span{I − W} for any 𝑘, we have ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ =⟨(I − W)(X 𝑘+1 − X∗ ), (I − W) † (D 𝑘+1 − D 𝑘 )⟩   𝜂 𝑘+1 𝑘 𝑘 † 𝑘+1 𝑘 = (2I − 𝛾(I − W))(D − D ) − (I − W)E , (I − W) (D − D ) (from (B.3)) 𝛾    𝜂 † 𝑘+1 𝑘 𝑘 𝑘+1 𝑘 = (2(I − W) − 𝛾I (D − D ) − E , D − D 𝛾 𝜂 = ∥D 𝑘+1 − D 𝑘 ∥ 2M − ⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩. 𝛾 Similarly, we have ⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =⟨(I − W)(X 𝑘+1 − X∗ ), (I − W) † (D 𝑘+1 − D∗ )⟩   𝜂 𝑘+1 𝑘 𝑘 † 𝑘+1 ∗ = (2I − 𝛾(I − W))(D − D ) − (I − W)E , (I − W) (D − D ) 𝛾 107   𝜂 = (2(I − W) † − I)(D 𝑘+1 − D 𝑘 ) − E 𝑘 , D 𝑘+1 − D∗ 𝛾 𝜂 = ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − ⟨E 𝑘 , D 𝑘+1 − D∗ ⟩. 𝛾 To make sure that M is positive definite over span{I − W}, we need 𝛾 < 2/𝜆 max (I − W). Lemma 14. Taking the expectation conditioned on the compression in the 𝑘th iteration, we have   𝑘 𝑘+1 ∗ 𝑘 𝑘 𝛾 𝑘 𝛾 𝑘 ∗ 2𝜂E⟨E , D − D ⟩ = 2𝜂E E , D + (I − W)Y + (I − W)E − D 2𝜂 2𝜂 = 𝛾E⟨E 𝑘 , (I − W)E 𝑘 ⟩ = 𝛾E∥E 𝑘 ∥ 2I−W ,   𝑘 𝑘+1 𝑘 𝑘 𝛾 𝑘 𝛾 𝑘 2𝜂E⟨E , D − D ⟩ = 2𝜂E E , (I − W)Y + (I − W)E 2𝜂 2𝜂 = 𝛾E⟨E 𝑘 , (I − W)E 𝑘 ⟩ = 𝛾E∥E 𝑘 ∥ 2I−W . Proof. The proof is straightforward and omitted here. Proof of Lemma 10. From Alg. 3, we have 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ =2⟨X 𝑘 − X∗ , 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂∇F(X∗ )⟩ =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 − 𝜂(D 𝑘+1 − D∗ )⟩ (from Line 7) =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘 − X∗ , X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘 − X 𝑘+1 , D 𝑘+1 − D∗ ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘 − X∗ − 𝜂(D 𝑘+1 − D∗ ), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ =2⟨X 𝑘+1 − X∗ + 𝜂(∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩ (from Line 7) =2⟨X 𝑘+1 − X∗ , X 𝑘 − X 𝑘+1 ⟩ + 2𝜂⟨∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ), X 𝑘 − X 𝑘+1 ⟩ − 2𝜂⟨X 𝑘+1 − X∗ , D 𝑘+1 − D∗ ⟩. (B.6) Then we consider the terms on the right hand side of (B.6) separately. Using 2⟨A − B, B − C⟩ = ∥A − C∥ 2 − ∥B − C∥ 2 − ∥A − B∥ 2 , we have 2⟨X 𝑘+1 − X∗ , X 𝑘 − X 𝑘+1 ⟩ =2⟨X∗ − X 𝑘+1 , X 𝑘+1 − X 𝑘 ⟩ 108 =∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 . 
(B.7) Using 2⟨A, B⟩ = ∥A∥ 2 + ∥B∥ 2 − ∥A − B∥ 2 , we have 2𝜂⟨∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ), X 𝑘 − X 𝑘+1 ⟩ =𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − ∥X 𝑘 − X 𝑘+1 − 𝜂(∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ ))∥ 2 =𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 . (from Line 7) (B.8) Combining (B.6), (B.7), (B.8), and (B.4), we obtain 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ = ∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 | {z } 2⟨X 𝑘+1 −X∗ ,X 𝑘 −X 𝑘+1 ⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 | {z } 2𝜂⟨∇F(X 𝑘 ;𝜉 𝑘 )−∇F(X∗ ),X 𝑘 −X 𝑘+1 ⟩  2𝜂2  − ⟨D 𝑘+1 − D 𝑘 , D 𝑘+1 − D∗ ⟩M − 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩ 𝛾 | {z } 2𝜂⟨X 𝑘+1 −X∗ ,D 𝑘+1 −D∗ ⟩ =∥X 𝑘 − X∗ ∥ 2 − ∥X 𝑘+1 − X 𝑘 ∥ 2 − ∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + ∥X 𝑘 − X 𝑘+1 ∥ 2 − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 𝜂2  𝑘  + ∥D − D ∥ M − ∥D − D ∥ M − ∥D − D ∥ M +2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, ∗ 2 𝑘+1 ∗ 2 𝑘+1 𝑘 2 𝛾 | {z } −2⟨D 𝑘+1 −D 𝑘 ,D 𝑘+1 −D∗ ⟩M where the last equality holds because 2⟨D 𝑘 − D 𝑘+1 , D 𝑘+1 − D∗ ⟩M =∥D 𝑘 − D∗ ∥ 2M − ∥D 𝑘+1 − D∗ ∥ 2M − ∥D 𝑘+1 − D 𝑘 ∥ 2M . Thus, we reformulate it as 𝜂2 𝑘+1 ∥X 𝑘+1 − X∗ ∥ 2 + ∥D − D∗ ∥ 2M 𝛾 𝜂2 𝜂2 =∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − ∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 ∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝛾 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 2𝜂⟨E 𝑘 , D 𝑘+1 − D∗ ⟩, which completes the proof. 109 B.3.4 Proof of Lemma 11 Proof of Lemma 11. From Alg. 3, we take the expectation conditioned on 𝑘th compression and obtain E∥H 𝑘+1 − X∗ ∥ 2 =E∥(1 − 𝛼)(H 𝑘 − X∗ ) + 𝛼(Y 𝑘 − X∗ ) + 𝛼E 𝑘 ∥ 2 (from Line 13) =∥(1 − 𝛼)(H 𝑘 − X∗ ) + 𝛼(Y 𝑘 − X∗ )∥ 2 + 𝛼2 E∥E 𝑘 ∥ 2 =(1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼∥Y 𝑘 − X∗ ∥ 2 − 𝛼(1 − 𝛼)∥H 𝑘 − Y 𝑘 ∥ 2 + 𝛼2 E∥E 𝑘 ∥ 2 . (B.9) In the second equality, we used the unbiasedness of the compression, i.e., EE 𝑘 = 0. The last equality holds because of ∥(1 − 𝛼)A + 𝛼B∥ 2 = (1 − 𝛼)∥A∥ 2 + 𝛼∥B∥ 2 − 𝛼(1 − 𝛼)∥A − B∥ 2 . In addition, by taking the conditional expectation on the compression, we have ∥Y 𝑘 − X∗ ∥ 2 =∥X 𝑘 − 𝜂∇F(X 𝑘 ; 𝜉 𝑘 ) − 𝜂D 𝑘 − X∗ ∥ 2 (from Line 4) =E∥X 𝑘+1 + 𝜂D 𝑘+1 − 𝜂D 𝑘 − X∗ ∥ 2 (from Line 7) =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 + 2𝜂E⟨X 𝑘+1 − X∗ , D 𝑘+1 − D 𝑘 ⟩ =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M − 2𝜂E⟨E 𝑘 , D 𝑘+1 − D 𝑘 ⟩. (from (B.4)) 𝛾 =E∥X 𝑘+1 − X∗ ∥ 2 + 𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝛾E∥E 𝑘 ∥ 2I−W . (from Line 6) (B.10) 𝛾 Combing the above two equations (B.9) and (B.10) together, we have E∥H 𝑘+1 − X∗ ∥ 2 2𝛼𝜂2 ≤(1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M 𝛾 − 𝛼𝛾E∥E 𝑘 ∥ 2I−W + 𝛼2 E∥E 𝑘 ∥ 2 − 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 , (B.11) which completes the proof. 110 B.3.5 Proof of Theorem 3 Proof of Theorem 3. 
Combining Lemmas 10, 11, and 14, we have the expectation conditioned on the compression satisfying 𝜂2 E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾 𝜂2 𝜂2 ≤∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − E∥D 𝑘+1 − D 𝑘 ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝛾 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 + 𝛾E∥E 𝑘 ∥ 2I−W + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 2𝑎 1 𝛼𝜂2 + E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 − 𝑎 1 𝛼𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 𝛾 = ∥X 𝑘 − X∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 | {z } A 𝜂2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 𝛾 𝜂2 + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 −(1 − 2𝑎 1 𝛼) E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 𝛾 | {z } B + 𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 + (1 − 𝑎 1 𝛼)𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 , (B.12) | {z } C where 𝑎 1 is a non-negative number to be determined. Then we deal with the three terms on the right hand side separately. We want the terms B and C to be nonpositive. First, we consider B. Note that D 𝑘 ∈ Range(I − W). If we want B ≤ 0, then, we need 1 − 2𝑎 1 𝛼 > 0, i.e., 𝑎 1 𝛼 < 1/2. Therefore we have 𝜂2 B = − (1 − 2𝑎 1 𝛼) E∥D 𝑘+1 − D 𝑘 ∥ 2M + 𝑎 1 𝛼𝜂2 E∥D 𝑘+1 − D 𝑘 ∥ 2 𝛾   (1 − 2𝑎 1 𝛼)𝜆 𝑛−1 (M) 2 ≤ 𝑎1 𝛼 − 𝜂 E∥D 𝑘+1 − D 𝑘 ∥ 2 , 𝛾 where 𝜆 𝑛−1 (M) > 0 is the second smallest eigenvalue of M. It means that we also need (2𝑎 1 𝛼 − 1)𝜆 𝑛−1 (M) 𝑎1 𝛼 + ≤ 0, 𝛾 111 which is equivalent to 𝜆 𝑛−1 (M) 𝑎1 𝛼 ≤ < 1/2. (B.13) 𝛾 + 2𝜆 𝑛−1 (M) Then we look at C. We have C =𝑎 1 𝛼2 E∥E 𝑘 ∥ 2 + (1 − 𝑎 1 𝛼)𝛾E∥E 𝑘 ∥ 2I−W − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 ≤((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 )E∥E 𝑘 ∥ 2 − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 ≤𝐶 ((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 )∥Y 𝑘 − H 𝑘 ∥ 2 − 𝑎 1 𝛼(1 − 𝛼)∥Y 𝑘 − H 𝑘 ∥ 2 Because we have 1 − 𝑎 1 𝛼 > 1/2, so we need 𝐶 ((1 − 𝑎 1 𝛼) 𝛽𝛾 + 𝑎 1 𝛼2 ) − 𝑎 1 𝛼(1 − 𝛼) = (1 + 𝐶)𝑎 1 𝛼2 − 𝑎 1 (𝐶 𝛽𝛾 + 1)𝛼 + 𝐶 𝛽𝛾 ≤ 0. (B.14) That is √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) − 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≥ C 𝛼0 , (B.15) 2(1 + 𝐶)𝑎 1 √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) + 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≤ C 𝛼1 . (B.16) 2(1 + 𝐶)𝑎 1 Next, we look at A. Firstly, by the bounded variance assumption, we have the expectation conditioned on the gradient sampling in 𝑘th iteration satisfying E∥X 𝑘 − X∗ ∥ 2 − 2𝜂E⟨X 𝑘 − X∗ , ∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 E∥∇F(X 𝑘 ; 𝜉 𝑘 ) − ∇F(X∗ )∥ 2 ≤∥X 𝑘 − X∗ ∥ 2 − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ + 𝜂2 ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 + 𝑛𝜂2 𝜎 2 Then with the smoothness and strong convexity from Assumptions 8, we have the co-coercivity of ∇𝑔𝑖 (x) with 𝑔𝑖 (x) := 𝑓𝑖 (x) − 𝑢2 ∥x∥ 22 , which gives 𝜇𝐿 1 ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ ≥ ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 . 𝜇+𝐿 𝜇+𝐿 When 𝜂 ≤ 2/(𝜇 + 𝐿), we have ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ 112   𝜂(𝜇 + 𝐿) 𝜂(𝜇 + 𝐿) 𝑘 = 1− ⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ + ⟨X − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ 2 2   𝜂𝜇(𝜇 + 𝐿) 𝜂𝜇𝐿 𝜂 ≥ 𝜇− + ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 2 2 2  𝜂𝜇  𝜂 =𝜇 1 − ∥X 𝑘 − X∗ ∥ 2 + ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 . 2 2 Therefore, we obtain − 2𝜂⟨X 𝑘 − X∗ , ∇F(X 𝑘 ) − ∇F(X∗ )⟩ ≤ − 𝜂2 ∥∇F(X 𝑘 ) − ∇F(X∗ )∥ 2 − 𝜇(2𝜂 − 𝜇𝜂2 )∥X 𝑘 − X∗ ∥ 2 . (B.17) Conditioned on the 𝑘the iteration, (i.e., conditioned on the gradient sampling in 𝑘th iteration), the inequality (B.12) becomes 𝜂2 E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾   ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + 𝑎 1 𝛼E∥X 𝑘+1 − X∗ ∥ 2 𝜂2 𝑘 + ∥D − D∗ ∥ 2M − 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 , (B.18) 𝛾 2 if the step size satisfies 𝜂 ≤ 𝜇+𝐿 . 
Rewriting (B.18), we have 𝜂2 (1 − 𝑎 1 𝛼)E∥X 𝑘+1 − X∗ ∥ 2 + E∥D 𝑘+1 − D∗ ∥ 2M + 𝜂2 E∥D 𝑘+1 − D∗ ∥ 2 + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝛾   𝜂2 ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 , (B.19) 𝛾 and thus 𝜂2 (1 − 𝑎 1 𝛼)E∥X − X ∥ + E∥D 𝑘+1 − D∗ ∥ 2M+𝛾I + 𝑎 1 E∥H 𝑘+1 − X∗ ∥ 2 𝑘+1 ∗ 2 𝛾   𝜂2 ≤ 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) ∥X 𝑘 − X∗ ∥ 2 + ∥D 𝑘 − D∗ ∥ 2M + 𝑎 1 (1 − 𝛼)∥H 𝑘 − X∗ ∥ 2 + 𝑛𝜂2 𝜎 2 . (B.20) 𝛾 With the definition of L 𝑘 in (3.17), we have EL 𝑘+1 ≤ 𝜌L 𝑘 + 𝑛𝜂2 𝜎 2 , (B.21) with 1 − 𝜇(2𝜂 − 𝜇𝜂2 ) 𝜆 max (M)   𝜌 = max , ,1−𝛼 . 1 − 𝑎1 𝛼 𝛾 + 𝜆 max (M) 113 where 𝜆 max (M) = 2𝜆 max ((I − W) † ) − 𝛾. Recall all the conditions on the parameters 𝑎 1 , 𝛼, and 𝛾 to make sure that 𝜌 < 1: 𝜆 𝑛−1 (M) 𝑎1 𝛼 ≤ , (B.22) 𝛾 + 2𝜆 𝑛−1 (M) 𝑎 1 𝛼 ≤ 𝜇(2𝜂 − 𝜇𝜂2 ), (B.23) √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) − 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≥ C 𝛼0 , (B.24) 2(1 + 𝐶)𝑎 1 √︃ 𝑎 1 (𝐶 𝛽𝛾 + 1) + 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 𝛼≤ C 𝛼1 . (B.25) 2(1 + 𝐶)𝑎 1 In the following, we show that there exist parameters that satisfy these conditions. Since we can choose any 𝑎 1 , we let 4(1 + 𝐶) 𝑎1 = , 𝐶 𝛽𝛾 + 2 such that 𝑎 21 (𝐶 𝛽𝛾 + 1) 2 − 4(1 + 𝐶)𝐶𝑎 1 𝛽𝛾 = 𝑎 21 . Then we have 𝐶 𝛽𝛾 𝛼0 = → 0, as 𝛾 → 0, 2(1 + 𝐶) 𝐶 𝛽𝛾 + 2 1 𝛼1 = → , as 𝛾 → 0. 2(1 + 𝐶) 1+𝐶 Conditions (B.24) and (B.25) show   2𝐶 𝛽𝛾 𝑎1 𝛼 ∈ , 2 → [0, 2], if 𝐶 = 0 or 𝛾 → 0. 𝐶 𝛽𝛾 + 2 Hence in order to make (B.22) and (B.23) satisfied, it’s sufficient to make (2 ) 2𝐶 𝛽𝛾  𝜆 𝑛−1 (M)  𝛽 −𝛾 ≤ min , 𝜇(2𝜂 − 𝜇𝜂2 ) = min 4 , 𝜇(2𝜂 − 𝜇𝜂2 ) . (B.26) 𝐶 𝛽𝛾 + 2 𝛾 + 2𝜆 𝑛−1 (M) 𝛽 − 𝛾 2 2 where we use 𝜆 𝑛−1 (M) = 𝜆 max (I−W) −𝛾 = 𝛽 − 𝛾. 114 When 𝐶 > 0, the condition (B.26) is equivalent to ( √︁ ) (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2𝜇𝜂(2 − 𝜇𝜂) 𝛾 ≤ min , . (B.27) 𝐶𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽 The first term can be simplified using √︁ (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 ≥ 𝐶𝛽 (3𝐶 + 1) 𝛽 √ due to 1 − 𝑥 ≤ 1 − 2𝑥 when 𝑥 ∈ (0, 1). Therefore, for a given stepsize 𝜂, if we choose   n 2 2𝜇𝜂(2 − 𝜇𝜂) o 𝛾 ∈ 0, min , (3𝐶 + 1) 𝛽 [2 − 𝜇𝜂(2 − 𝜇𝜂)]𝐶 𝛽 and  n 𝐶 𝛽𝛾 + 2 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2  𝐶 𝛽𝛾 𝐶 𝛽𝛾 + 2 o 𝛼∈ , min , , 𝜇𝜂(2 − 𝜇𝜂) , 2(1 + 𝐶) 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 4(1 + 𝐶) then, all conditions (B.22)-(B.25) hold. 2 2 Note that 𝛾 < (3𝐶+1) 𝛽 implies 𝛾 < 𝛽, which ensures the positive definiteness of M over span{I − W} in Lemma 13. 2 Note that 𝜂 ≤ 𝜇+𝐿 ensures 𝐶 𝛽𝛾 + 2 𝐶 𝛽𝛾 + 2 𝜇𝜂(2 − 𝜇𝜂) ≤ . (B.28) 4(1 + 𝐶) 2(1 + 𝐶) So, we can simplify the bound for 𝛼 as  n 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2  𝐶 𝛽𝛾 𝐶 𝛽𝛾 + 2 o 𝛼∈ , min , 𝜇𝜂(2 − 𝜇𝜂) . 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 4(1 + 𝐶) Lastly, taking the total expectation on both sides of (B.21) and using tower property, we complete the proof for 𝐶 > 0. 𝜆 max (I−W) Proof of Corollary 4. Let’s first define 𝜅 𝑓 = 𝐿 𝜇 and 𝜅 𝑔 = 𝜆+min (I−W) = 𝜆 max (I − W)𝜆 max ((I − W) † ). 115 1 We can choose the stepsize 𝜂 = 𝐿 such that the upper bound of 𝛾 is   2 1 n 2 𝜅𝑓 2 − 𝜅𝑓 2o  2 1  𝛾upper = min ,h  i , ≥ min , , (3𝐶 + 1) 𝛽 2 − 1 2 − 1 𝐶 𝛽 𝛽 (3𝐶 + 1) 𝛽 𝜅 𝑓 𝐶 𝛽 𝜅𝑓 𝜅𝑓 𝑥(2−𝑥) 𝑥 due to 2−𝑥(2−𝑥) ≥ 2−𝑥 ≥ 𝑥 when 𝑥 ∈ (0, 1). 1 1 Hence we can take 𝛾 = min{ (3𝐶+1) 𝛽 , 𝜅 𝑓 𝐶 𝛽 }. The bound of 𝛼 is    𝐶 𝛽𝛾 2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2 1 1 𝐶 𝛽𝛾 + 2 𝛼∈ , min , (2 − ) 2(1 + 𝐶) 4 − 𝛽𝛾 4(1 + 𝐶) 𝜅 𝑓 𝜅 𝑓 4(1 + 𝐶) 1 When 𝛾 is chosen as 𝜅 𝑓 𝐶𝛽 , pick 𝐶 𝛽𝛾 1 𝛼= = . (B.29) 2(1 + 𝐶) 2(1 + 𝐶)𝜅 𝑓 1 1 When (3𝐶+1) 𝛽 ≤ 𝜅 𝑓 𝐶𝛽 , the upper bound of 𝛼 is   2 − 𝛽𝛾 𝐶 𝛽𝛾 + 2 1 1 𝐶 𝛽𝛾 + 2 𝛼upper = min , (2 − ) 4 − 𝛽𝛾 4(1 + 𝐶) 𝜅 𝑓 𝜅 𝑓 4(1 + 𝐶)   6𝐶 + 1 1 1 7𝐶 + 2 = min , (2 − ) 12𝐶 + 3 𝜅 𝑓 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1)   6𝐶 + 1 1 7𝐶 + 2 ≥ min , . 12𝐶 + 3 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1) In this case, we pick   6𝐶 + 1 1 7𝐶 + 2 𝛼 = min , . 
(B.30) 12𝐶 + 3 𝜅 𝑓 4(𝐶 + 1)(3𝐶 + 1)   1 6𝐶+1 Note 𝛼 = O (1+𝐶)𝜅 𝑓 since 12𝐶+3 is lower bounded by 13 . Hence in both cases (Eq. (B.29) and   1 Eq. (B.30)), 𝛼 = O (1+𝐶)𝜅 𝑓 , and the third term of 𝜌 is upper bounded by     1 6𝐶 + 1 1 7𝐶 + 2 1 − 𝛼 ≤ max 1 − , 1 − min , 2(1 + 𝐶)𝜅 𝑓 12𝐶 + 3 𝜅 𝑓 4(1 + 𝐶)(3𝐶 + 1) In two cases of 𝛾, the second term of 𝜌 becomes   𝛾 1 1 1− = max 1 − ,1− 2𝜆 max ((I − W) † ) 2𝐶𝜅 𝑓 𝜅 𝑔 (1 + 3𝐶)𝜅 𝑔 116 Before analysing the first term of 𝜌, we look at 𝑎 1 𝛼 in two cases of 𝛾. 1 When 𝛾 = 𝜅 𝑓 𝐶𝛽 , we have 2𝐶 𝛽𝛾 2 1 𝑎1 𝛼 = = ≤ . 𝐶 𝛽𝛾 + 2 2𝜅 𝑓 + 1 𝜅 𝑓 1 When 𝛾 = (3𝐶+1) 𝛽 , we have   6𝐶 + 1 1 1 𝑎 1 𝛼 = min , ≤ . (12𝐶 + 3) 𝜅 𝑓 𝜅𝑓 1 In both cases, 𝑎 1 𝛼 ≤ 𝜅𝑓 . Therefore, the first term of 𝜌 becomes 1 − 𝜇𝜂(2 − 𝜇𝜂) 1 − 𝜅1𝑓 (2 − 1 𝜅𝑓 ) 1− 1 𝜅𝑓 1 ≤ =1− =1− . 1 − 𝑎1 𝛼 1 − 𝜅1𝑓 𝜅𝑓 − 1 𝜅𝑓 To summarize, we have     1 1 1 1 6𝐶 + 1 1 7𝐶 + 2 𝜌 ≤ 1 − min , , , , min , 𝜅 𝑓 2𝐶𝜅 𝑓 𝜅 𝑔 (1 + 3𝐶)𝜅 𝑔 2(1 + 𝐶)𝜅 𝑓 12𝐶 + 3 𝜅 𝑓 4(1 + 𝐶)(3𝐶 + 1) and therefore   1   1   1  𝜌 = max 1 − O ,1− O ,1− O . (1 + 𝐶)𝜅 𝑓 (1 + 𝐶)𝜅 𝑔 𝐶𝜅 𝑓 𝜅 𝑔 With full-gradient (i.e., 𝜎 = 0), we get 𝜖−accuracy solution with the total number of iterations 𝑘≥O e((1 + 𝐶)(𝜅 𝑓 + 𝜅 𝑔 ) + 𝐶𝜅 𝑓 𝜅 𝑔 ). When 𝐶 = 0, i.e., there is no compression, the iteration complexity recovers that of NIDS,  O e 𝜅 𝑓 + 𝜅𝑔 . 𝜅 𝑓 +𝜅 𝑔 When 𝐶 ≤ 𝜅 𝑓 𝜅 𝑔 +𝜅 𝑓 +𝜅 𝑔 , the complexity is improved to that of NIDS, i.e., the compression doesn’t harm the convergence in terms of the order of the coefficients. 117 Proof of Corollary 5. Note that (x 𝑘 ) ⊤ = X 𝑘 and 1𝑛×1 X∗ = X∗ , then 𝑛 ∑︁ 2 E∥x𝑖𝑘 − x 𝑘 ∥ 2 = E X 𝑘 − 1𝑛×1 X 𝑘 𝑖=1 2 = E X 𝑘 − X∗ + X∗ − 1𝑛×1 X 𝑘 𝑘 ∗ 1𝑛×1 1⊤𝑛×1  𝑘 ∗  =E X −X − X −X 𝑛 ≤ E∥X 𝑘 − X∗ ∥ 2 𝜌EL 𝑘−1 + 𝑛𝜂2 𝜎 2 (1 − 𝜌) −1 ≤ 1 − 𝑎1 𝛼 𝑛𝜂2 𝜎 2 ≤ 2𝜌 𝑘 L 0 + 2 . (B.31) 1−𝜌 The last inequality holds because we have 𝑎 1 𝛼 ≤ 1/2. Proof of Corollary 3. From the proof of Theorem 3, when 𝐶 = 0, we can set 𝛾 = 1, 𝛼 = 1, and 𝑎 1 = 0. Plug those values into 𝜌, and we obtain the convergence rate for NIDS. B.3.6 Proof of Theorem 4 𝐶 𝛽𝛾 Proof of Theorem 4. In order to get exact convergence, we pick diminishing step-size, set 𝛼 = 2(1+𝐶) , 2𝐶 𝛽𝛾 𝑘 1 𝐶𝛽 𝑎1 𝛼 = 𝐶 𝛽𝛾 𝑘 +2 , 𝜃1 = 2𝜆max ((I−W) † ) and 𝜃 2 = 2(1+𝐶) , then   𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 𝜌 𝑘 = max 1 − , 1 − 𝜃1 𝛾𝑘 , 1 − 𝜃2 𝛾𝑘 1 − 𝑎1 𝛼 If we further pick diminishing 𝜂 𝑘 and 𝛾 𝑘 such that 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 ≥ 𝑎 1 𝛼, then 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) − 𝑎 1 𝛼 𝑎1 𝛼 2𝐶 𝛽𝛾 𝑘 ≥ = ≥ 𝐶 𝛽𝛾 𝑘 . 1 − 𝑎1 𝛼 1 − 𝑎 1 𝛼 2 − 𝐶 𝛽𝛾 𝑘 √︁ Notice that 𝐶 𝛽𝛾 ≤ 23 since (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 is increasing in 𝐶 > 0 with limit 2 3 at ∞. In this case we only need, √︁ ! n (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ) 2o 𝛾 𝑘 ∈ 0, min , , . (B.32) 𝐶𝛽 [4 − 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 )]𝐶 𝛽 𝛽 And 𝜌 𝑘 ≤ max {1 − 𝐶 𝛽𝛾 𝑘 , 1 − 𝜃 1 𝛾 𝑘 , 1 − 𝜃 2 𝛾 𝑘 } ≤ 1 − 𝜃 3 𝛾 𝑘 118 if 𝜃 3 = min{𝜃 1 , 𝜃 2 } and note that 𝜃 2 ≤ 𝐶 𝛽. We define L 𝑘 B (1 − 𝑎 1 𝛼𝑘 )∥X 𝑘 − X∗ ∥ 2 + (2𝜂2𝑘 /𝛾 𝑘 )E∥D 𝑘+1 − D∗ ∥ 2(I−W) † + 𝑎 1 ∥H 𝑘 − X∗ ∥ 2 . Hence EL 𝑘+1 ≤ (1 − 𝜃 3 𝛾 𝑘 )EL 𝑘 + 𝑛𝜎 2 𝜂2𝑘 . 𝜇𝜂 𝑘 (2−𝜇𝜂 𝑘 ) From 𝑎 1 𝛼 ≤ 2 , we get 4𝐶 𝛽𝛾 𝑘 ≤ 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ). 𝐶 𝛽𝛾 𝑘 + 2 If we pick 𝛾 𝑘 = 𝜃 4 𝜂 𝑘 , then it’s sufficient to let 2𝐶 𝛽𝜃 4 𝜂 𝑘 ≤ 𝜇𝜂 𝑘 (2 − 𝜇𝜂 𝑘 ). 𝜇 2(𝜇−𝐶 𝛽𝜃 4 ) 𝛾𝑘 Hence if 𝜃 4 < 𝐶𝛽 and let 𝜂∗ = 𝜇2 , then 𝜂 𝑘 = 𝜃4 ∈ (0, 𝜂∗ ) guarantees the above discussion and EL 𝑘+1 ≤ (1 − 𝜃 3 𝜃 4 𝜂 𝑘 )EL 𝑘 + 𝑛𝜎 2 𝜂2𝑘 . 
So far all restrictions for 𝜂 𝑘 are   2 𝜂 𝑘 ≤ min , 𝜂∗ 𝜇+𝐿 and ( √︁ ) 1 (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 𝜂𝑘 ≤ min , 𝜃4 𝐶𝛽 𝛽  √  n o 2 (3𝐶+1)− (3𝐶+1) 2 −4𝐶 2 1 2 Let 𝜃 5 = min 𝜇+𝐿 , 𝜂∗ , 𝐶 𝛽𝜃 4 , 𝛽𝜃 4 , 𝜂 𝑘 = 𝐵𝑘+𝐴 and 𝐷 = max 𝐴L 0 , 2𝑛𝜎 𝜃3 𝜃4 , we 𝜃3 𝜃4 2 claim that if we pick 𝐵 = 2 and some 𝐴, by setting 𝜂 𝑘 = 𝜃 3 𝜃 4 𝑘+2𝐴 , we get 𝐷 EL 𝑘 ≤ . 𝐵𝑘 + 𝐴 Induction: When 𝑘 = 0, it’s obvious. Suppose previous 𝑘 inequalities hold. Then 4𝑛𝜎 2   𝑘+1 2𝜃 3 𝜃 4 2𝐷 EL ≤ 1− + . 𝜃 3 𝜃 4 𝑘 + 2𝐴 𝜃 3 𝜃 4 𝑘 + 2𝐴 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 119 Multiply 𝑀 B (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴)(𝜃 3 𝜃 4 𝑘 + 2𝐴)(2𝐷) −1 on both sides, we get 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴)   𝑘+1 2𝜃 3 𝜃 4 𝑀EL ≤ 1− (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) + 𝜃 3 𝜃 4 𝑘 + 2𝐴 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴 − 2𝜃 3 𝜃 4 )(𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) + 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 𝜃 3 𝜃 4 + 2𝐴) = 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 + 4𝑛𝜎 2 (𝜃 3 𝜃 4 𝑘 + 2𝐴) − 4𝐷𝜃 3 𝜃 4 (𝜃 3 𝜃 4 𝑘 + 2𝐴) + 2𝐷𝜃 3 𝜃 4 (𝜃 3 𝜃 4 𝑘 + 2𝐴) = 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) 2 2 −4𝐷 (𝜃 3 𝜃 4 ) + 4𝑛𝜎 𝜃 3 𝜃 4 + 2𝐷 (𝜃 3 𝜃 4 𝑘 + 2𝐴) ≤𝜃 3 𝜃 4 𝑘 + 2𝐴. Hence 2𝐷 EL 𝑘+1 ≤ 𝜃 3 𝜃 4 (𝑘 + 1) + 2𝐴 This induction holds for any 𝐴 such that 𝜂 𝑘 is feasible, i.e. 1 𝜂0 = ≤ 𝜃5. 𝐴 Here we summarize the definition of constant numbers: 1 𝐶𝛽 𝜃1 = † , 𝜃2 = , (B.33) 2𝜆max ((I − W) ) 2(1 + 𝐶)   𝜇 2(𝜇 − 𝐶 𝛽𝜃 4 ) 𝜃 3 = min{𝜃 1 , 𝜃 2 }, 𝜃 4 ∈ 0, , 𝜂∗ = , (B.34) 𝐶𝛽 𝜇2 ( √︁ ) 2 (3𝐶 + 1) − (3𝐶 + 1) 2 − 4𝐶 2 𝜃 5 = min , 𝜂∗ , , . (B.35) 𝜇+𝐿 𝐶 𝛽𝜃 4 𝛽𝜃 4 1 2𝜃 5 Therefore, let 𝐴 = 𝜃5 and 𝜂 𝑘 = 𝜃 3 𝜃 4 𝜃 5 𝑘+2 , we get n o 1 0 2𝜎 2 𝜃 5 1 2 max 𝑛 L , 𝜃3 𝜃4 EL 𝑘 ≤ . 𝑛 𝜃3𝜃4𝜃5 𝑘 + 2 Since 1 − 𝑎 1 𝛼 𝑘 ≥ 1/2, we complete the proof. 120 APPENDIX C GRAPH NEURAL NETWORKS WITH ADAPTIVE RESIDUAL C.1 Additional Results for the Preliminary Study In this section, we provide additional results on CiteSeer and PubMed datasets for the preliminary study in Section 4.2. The results on these two datasets are showed in Figure C.1, C.2, C.3 and C.4. It can be observed that residual connection helps obtain better performance on normal features but it is detrimental to abnormal features, which aligns with the findings in Section 4.2. 0.35 0.25 0.40 APPNP w/Res GCNII w/Res GCN w/Res 0.30 APPNP wo/Res GCNII wo/Res 0.35 GCN wo/Res 0.20 0.30 0.25 Accuracy Accuracy Accuracy 0.25 0.20 0.15 0.20 0.15 0.15 0.100 2 4 6 8 10 12 14 16 0.100 2 4 6 8 10 12 14 16 0.100 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.1: Node classification accuracy on abnormal nodes (CiteSeer). 0.65 0.7 0.7 0.6 0.6 0.5 0.5 Accuracy Accuracy Accuracy 0.60 0.4 0.4 0.3 0.3 APPNP w/Res 0.2 GCNII w/Res 0.2 GCN w/Res APPNP wo/Res GCNII wo/Res GCN wo/Res 0.550 2 4 6 8 10 12 14 16 0.10 2 4 6 8 10 12 14 16 0.10 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.2: Node classification accuracy on normal nodes (CiteSeer). 121 0.80 0.80 0.60 0.75 APPNP w/Res 0.75 GCNII w/Res GCN w/Res 0.70 APPNP wo/Res 0.70 GCNII wo/Res 0.55 GCN wo/Res 0.65 0.65 0.50 0.60 0.60 Accuracy Accuracy Accuracy 0.55 0.55 0.45 0.50 0.50 0.40 0.45 0.45 0.40 0.40 0.35 0.35 0.35 0.30 0.30 0.30 0.250 2 4 6 8 10 12 14 16 0.250 2 4 6 8 10 12 14 16 0.250 2 4 6 8 10 12 14 16 Number of layers Number of layers Number of layers (a) APPNP (b) GCNII (c) GCN Figure C.3: Node classification accuracy on abnormal nodes (PubMed). 
Figure C.4: Node classification accuracy on normal nodes (PubMed). (a) APPNP; (b) GCNII; (c) GCN.

C.2 Additional Experiments for the Proposed Method

In this section, we provide more experiments and an ablation study for the proposed AirGNN.

C.2.1 Experiments on More Datasets

In this subsection, we provide additional experiments for Section 4.4. In particular, we conduct the experiments for the noisy feature scenario on the following 5 datasets: Coauthor CS [95], Coauthor Physics [95], Amazon Computers [95], Amazon Photo [95], and ogbn-arxiv [113]. The node classification accuracy is shown in Figures C.5, C.6, C.7, C.8, and C.9, respectively. Specifically, the accuracy on abnormal nodes and normal nodes is plotted separately in (a) and (b), with respect to the ratio of noisy nodes. When the ratio of noisy nodes is within a reasonable range, we can observe that (1) AirGNN obtains much better accuracy on abnormal nodes on all datasets, which verifies its stronger resilience to abnormal features; and (2) AirGNN achieves better or sometimes comparable accuracy on normal nodes in most cases, which shows its capability to maintain good performance for normal nodes. However, when the noise ratio is very high, the performance of AirGNN drops quickly. This is because the modulation hyperparameter 𝜆 is tuned based on the clean dataset, so it is far from optimal for a highly noisy dataset. But it can be significantly improved by adjusting the hyperparameter 𝜆, as discussed in the next subsection. These results suggest the significant advantages of the adaptive residual in AirGNN and confirm the conclusion in the main paper. The adversarial attack on larger graphs is computationally expensive, so we omit the results on more datasets in the adversarial feature scenario. A sketch of this noisy-feature evaluation protocol is provided below.

Figure C.5: Node classification accuracy in noisy features scenario (Coauthor CS).

Figure C.6: Node classification accuracy in noisy features scenario (Coauthor Physics).
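The following is a minimal sketch of the evaluation protocol described above: a chosen ratio of nodes is treated as abnormal by perturbing their features, and accuracy is then reported separately on the abnormal and normal test nodes. The Gaussian noise model, the choice of perturbing test nodes only, the noise scale, and the PyTorch Geometric-style `data`/`model` interfaces are assumptions for illustration, not the exact setup used for Figures C.5–C.9.

```python
import torch

def evaluate_with_noisy_features(model, data, noise_ratio=0.1, noise_scale=1.0):
    # Pick a random subset of test nodes as "abnormal", perturb their features
    # with Gaussian noise, then report accuracy separately on abnormal (noisy)
    # and normal (clean) test nodes.
    test_idx = data.test_mask.nonzero(as_tuple=False).view(-1)
    num_noisy = int(noise_ratio * test_idx.numel())
    noisy_idx = test_idx[torch.randperm(test_idx.numel())[:num_noisy]]

    x = data.x.clone()
    x[noisy_idx] += noise_scale * torch.randn_like(x[noisy_idx])

    model.eval()
    with torch.no_grad():
        pred = model(x, data.edge_index).argmax(dim=-1)

    abnormal_mask = torch.zeros_like(data.test_mask)
    abnormal_mask[noisy_idx] = True
    normal_mask = data.test_mask & ~abnormal_mask
    acc_abnormal = (pred[abnormal_mask] == data.y[abnormal_mask]).float().mean().item()
    acc_normal = (pred[normal_mask] == data.y[normal_mask]).float().mean().item()
    return acc_abnormal, acc_normal
```

Sweeping `noise_ratio` over the values on the horizontal axes of the figures and averaging over several random seeds would reproduce the style of comparison reported in this subsection.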
0.7 0.9 GAT GAT 0.6 GCN 0.8 GCN GCNII GCNII 0.5 APPNP 0.7 APPNP AirGNN AirGNN 0.6 Accuracy Accuracy 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.2 0.01 2 3 4 5 8 10 15 20 25 30 0.11 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.8: Node classification accuracy in noisy features scenario (Amazon Photo). 0.7 0.8 GAT GAT 0.6 GCN 0.7 GCN GCNII GCNII 0.5 APPNP 0.6 APPNP AirGNN AirGNN Accuracy Accuracy 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.2 0.01 2 3 4 5 8 10 15 20 25 30 0.11 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.9: Node classification accuracy in noisy features scenario (ogbn-arxiv). C.2.2 AirGNN with Adjusted 𝜆 Note that in Figures C.5, C.6, C.7, C.8, and C.9, the performance of AirGNN drops significantly when the noise ratio is very large. This is because the modulation hyperparameter 𝜆 is tuned based 124 on the clean dataset such that it is far away from being optimal for highly noisy dataset. In fact, the performance of AirGNN can be significantly improved by adjusting 𝜆 during test time according to the performance on the validation set. Taking the Coauthor CS [95] dataset as an example, we compare AirGNN with APPNP and we tune the hyperparameter 𝜆 and 𝛼 for them (denoted as AirGNN-tuned and APPNP-tuned) for a fair comparison as showed in Figure C.10. The result verifies that AirGNN-tuned gets tremendous improvement on both abnormal and normal nodes by adjusting 𝜆. However, APPNP-tuned only focuses on improving global performance and overlooks the abnormal nodes after adjusting 𝛼 based on validation performance so that the performance on abnormal node are much worse. 1.0 1.0 APPNP 0.9 APPNP-tuned 0.8 AirGNN 0.9 AirGNN-tuned 0.7 0.6 0.8 Accuracy Accuracy 0.5 0.7 0.4 0.3 0.6 APPNP 0.2 APPNP-tuned 0.1 AirGNN 0.5 AirGNN-tuned 0.01 2 3 4 5 8 10 15 20 25 30 1 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.10: Node classification accuracy in noisy features scenario with adjustment (Coauthor CS). C.2.3 Detailed Comparison with APPNP Figure 4.1 in Section 4.2 shows that APPNP without residual performs well on the noisy nodes. Therefore, in order to demonstrate the advantages of AirGNN, it is of interest to make a detailed comparison between AirGNN and the two variants of APPNP (w/Res and wo/Res). We evaluate their performance on noisy nodes, normal nodes, and overall nodes on Cora dataset, and the results under varying noise ratio are summarized in Table C.1, Table C.2, and Table C.3. We can make the following observations: • In Table C.1, both AirGNN and APPNP wo/Res significantly outperform APPNP w/Res on noisy nodes, and AirGNN achieves comparable performance with APPNP wo/Res. This verifies 125 that the residual connection in GNN amplifies the vulnarability to abnormal features, and AirGNN is able to adaptively adjust the residual connections for abnormal nodes to reduce the vulnerability. • In Table C.2 and Table C.3, AirGNN consistently outperforms APPNP wo/Res, which verifies the importance of residual connections in maintaining good performance on normal nodes. AirGNN exhibits much better performance than APPNP w/Res, which shows the benefits of removing abnormal features by adaptive residual. • APPNP wo/Res is a special case of AirGNN with 𝜆 = 0. 
Moreover, as noted in Section C.2.2, the performance of AirGNN in Table C.1, Table C.2, and Table C.3 can be further improved by adjusting the modulation hyperparameter 𝜆 for each noise ratio according to validation performance. As discussed in Section 4.3, in existing GNNs such as APPNP and GCNII, the conflict between feature aggregation and residual connection can only be partially mitigated by adjusting the residual weight 𝛼. However, such global adjustment cannot be adaptive to a subset of the nodes, which explains the advantages of AirGNN in above observations. In the adversarial feature setting, we can make similar observations but here we omit the comparison. Table C.1: Comparison between APPNP and AirGNN on abnormal (noisy) nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.167 ± 0.034 0.170 ± 0.070 0.170 ± 0.027 0.193 ± 0.031 0.187 ± 0.024 0.178 ± 0.026 APPNP wo/Res 0.469 ± 0.035 0.442 ± 0.062 0.427 ± 0.038 0.381 ± 0.043 0.383 ± 0.045 0.354 ± 0.067 AirGNN 0.474 ± 0.048 0.433 ± 0.055 0.405 ± 0.050 0.362 ± 0.039 0.353 ± 0.050 0.337 ± 0.057 Table C.2: Comparison between APPNP and AirGNN on normal nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.773 ± 0.015 0.712 ± 0.024 0.669 ± 0.019 0.622 ± 0.024 0.580 ± 0.032 0.530 ± 0.029 APPNP wo/Res 0.761 ± 0.014 0.709 ± 0.025 0.664 ± 0.015 0.599 ± 0.025 0.556 ± 0.035 0.497 ± 0.049 AirGNN 0.791 ± 0.015 0.741 ± 0.021 0.688 ± 0.024 0.625 ± 0.034 0.571 ± 0.039 0.527 ± 0.042 126 Table C.3: Comparison between APPNP and AirGNN on all nodes (Cora). Noisy ratio 5% 10% 15% 20% 25% 30% APPNP w/Res 0.743 ± 0.015 0.657 ± 0.026 0.594 ± 0.017 0.536 ± 0.024 0.482 ± 0.025 0.425 ± 0.025 APPNP wo/Res 0.746 ± 0.013 0.682 ± 0.026 0.628 ± 0.015 0.556 ± 0.027 0.513 ± 0.034 0.455 ± 0.053 AirGNN 0.775 ± 0.015 0.710 ± 0.021 0.646 ± 0.025 0.572 ± 0.033 0.516 ± 0.038 0.470 ± 0.044 C.2.4 Comparison with Robust Model To further demonstrate the advantages of the proposed AirGNN, we compare it with a representative robust model, Robust GCN [128]. Tables C.11, C.12 and C.13 show the performance comparison between Robust GCN and AirGNN on Cora, Citeseer and PubMed, respectively. The accuracy on abnormal nodes and normal nodes are plotted separately in (a) and (b), with respect to the ratio of noisy nodes. These figures show that AirGNN achieves significant better performance than Robust GCN on both abnormal and normal nodes in the noisy feature scenario. 0.7 0.9 Robust GCN AirGNN 0.8 0.6 0.7 0.5 Accuracy Accuracy 0.6 0.4 0.5 0.3 0.4 0.2 0.3 Robust GCN AirGNN 0.11 2 3 4 5 8 10 15 20 25 30 0.21 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.11: Node classification accuracy in noisy features scenario (Cora). 0.4 0.7 0.3 0.6 Accuracy Accuracy 0.5 0.2 0.4 0.1 0.3 Robust GCN Robust GCN AirGNN AirGNN 0.01 2 3 4 5 8 10 15 20 25 30 0.21 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.12: Node classification accuracy in noisy features scenario (CiteSeer). 127 0.8 0.7 0.8 0.6 0.5 Accuracy 0.7 Accuracy 0.4 0.3 0.6 0.2 0.1 Robust GCN Robust GCN AirGNN AirGNN 0.01 2 3 4 5 8 10 15 20 25 30 0.51 2 3 4 5 8 10 15 20 25 30 Ratio of Noisy Nodes (%) Ratio of Noisy Nodes (%) (a) Accuracy on abnormal nodes (b) Accuracy on normal nodes Figure C.13: Node classification accuracy in noisy features scenario (PubMed). 
BIBLIOGRAPHY

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI’16, pages 265–283, 2016.
[2] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43, 2005.
[3] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
[5] Angelica I Aviles-Rivero, Nicolas Papadakis, Ruoteng Li, Samar M Alsaleh, Robby T Tan, and Carola-Bibiane Schonlieb. When labelled data hurts: Deep semi-supervised classification with the graph 1-laplacian. arXiv preprint arXiv:1906.08635, 2019.
[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.
[7] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SIGNSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 559–568, 2018.
[8] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT’2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.
[9] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[10] Xavier Bresson, Thomas Laurent, David Uminsky, and James H. von Brecht. An adaptive total variation algorithm for computing the balanced cut of a graph, 2013.
[11] Xavier Bresson, Thomas Laurent, David Uminsky, and James H Von Brecht. Multiclass total variation clustering. arXiv preprint arXiv:1306.1185, 2013.
[12] Thomas Bühler and Matthias Hein. Spectral clustering based on the graph p-laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 81–88, 2009.
[13] Ruggero Carli, Fabio Fagnani, Paolo Frasca, and Sandro Zampieri. Gossip consensus algorithms via quantized communication. Automatica, 46(1):70–80, 2010.
[14] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1725–1735. PMLR, 13–18 Jul 2020.
[15] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2):025011, 2013.
[16] Siheng Chen, Yonina C. Eldar, and Lingxiao Zhao. Graph unrolling networks: Interpretable neural networks for graph signal denoising, 2020.
[17] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[18] Yu Chen, Lingfei Wu, and Mohammed Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. Advances in Neural Information Processing Systems, 33, 2020.
[19] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
[20] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013.
[21] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
[22] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1223–1231, USA, 2012.
[23] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3844–3852, 2016.
[24] Tyler Derr, Yao Ma, Wenqi Fan, Xiaorui Liu, Charu Aggarwal, and Jiliang Tang. Epidemic graph convolutional network. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 160–168, 2020.
[25] Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.
[26] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, March 1975.
[27] Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. All you need is low (rank): defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 169–177, 2020.
[28] Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. Graph trend filtering networks for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–121, 2022.
[29] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272. PMLR, 2017.
[30] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[31] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.
[32] William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159, 2020.
[33] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.
[34] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[36] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
[37] Wei Jin, Yaxin Li, Han Xu, Yiqi Wang, Shuiwang Ji, Charu Aggarwal, and Jiliang Tang. Adversarial attacks and defenses on graphs. ACM SIGKDD Explorations Newsletter, 22(2):19–34, 2021.
[38] Wei Jin, Xiaorui Liu, Yao Ma, Charu Aggarwal, and Jiliang Tang. Towards feature overcorrelation in deeper graph neural networks. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2022.
[39] Wei Jin, Xiaorui Liu, Yao Ma, Tyler Derr, Charu Aggarwal, and Jiliang Tang. Graph feature gating networks. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM ’21, pages 813–822, New York, NY, USA, 2021. Association for Computing Machinery.
[40] Wei Jin, Xiaorui Liu, Xiangyu Zhao, Yao Ma, Neil Shah, and Jiliang Tang. Automated self-supervised learning for graphs. In International Conference on Learning Representations, 2022.
[41] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 66–74, 2020.
[42] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681, 2019.
[43] A. Jung, A. O. Hero, III, A. C. Mara, S. Jahromi, A. Heimowitz, and Y. C. Eldar. Semi-supervised learning in network-structured data via total variation minimization. IEEE Transactions on Signal Processing, 67(24):6256–6269, 2019.
[44] Alexander Jung, Alfred O Hero III, Alexandru Mara, and Saeed Jahromi. Semi-supervised learning via sparse label propagation. arXiv preprint arXiv:1612.01414, 2016.
[45] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Urban Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3252–3261. PMLR, 2019.
[46] Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM review, 51(2):339–360, 2009.
[47] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[48] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[49] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
[50] Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020.
[51] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the 36th International Conference on Machine Learning, pages 3479–3487. PMLR, 2019.
[52] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[53] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In The IEEE International Conference on Computer Vision (ICCV), 2019.
[54] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 583–598, Berkeley, CA, USA, 2014. USENIX Association.
[55] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[56] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized composite optimization with compression. arXiv preprint arXiv:2108.04448, 2021.
[57] Yao Li and Ming Yan. On linear convergence of two decentralized algorithms. arXiv preprint arXiv:1906.07225, 2019.
[58] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. DeepRobust: A PyTorch library for adversarial attacks and defenses, 2020.
[59] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
[60] Zhi Li and Ming Yan. New convergence analysis of a primal dual algorithm with large stepsizes. Advances in Computational Mathematics, 2021.
[61] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2737–2745. Curran Associates, Inc., 2015.
[62] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
[63] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(15):4051–4064, 2015.
[64] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil K. Jain, and Jiliang Tang. Trustworthy AI: A computational perspective. ACM Trans. Intell. Syst. Technol., June 2022. Just Accepted.
[65] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2020.
[66] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[67] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6837–6849. PMLR, 18–24 Jul 2021.
[68] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[69] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear convergent decentralized optimization with compression. In International Conference on Learning Representations, 2021.
[70] Ignace Loris and Caroline Verhoeven. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12):125007, 2011.
[71] Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[72] Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. Is homophily a necessity for graph neural networks? In International Conference on Learning Representations, 2022.
[73] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. Proceedings of the 30th ACM International Conference on Information and Knowledge Management, 2021.
[74] Yao Ma and Jiliang Tang. Deep Learning on Graphs. Cambridge University Press, 2020.
[75] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communication. IEEE Transactions on Signal Processing, 68:6101–6116, 2020.
[76] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
[77] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
[78] Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
[79] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[80] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[81] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[82] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and Martin Takac. SGD and Hogwild! Convergence without the bounded gradients assumption. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3750–3758, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[83] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Unsupervised and semi-supervised learning via ℓ1-norm graph. In 2011 International Conference on Computer Vision, pages 2268–2273. IEEE, 2011.
[84] Hoang Nt and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550, 2019.
[85] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
[86] Xuran Pan, Shiji Song, and Gao Huang. A unified framework for convolution-based graph neural networks. https://openreview.net/forum?id=zUMD–Fb9Bt, 2020.
[87] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Mathematical Programming, pages 1–49, 2020.
[88] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947, 2019.
[89] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pages 8388–8399, 2019.
[90] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
[91] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
[92] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.
[93] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[94] James Sharpnack, Aarti Singh, and Alessandro Rinaldo. Sparsistency of the edge lasso over graphs. In Artificial Intelligence and Statistics, pages 1028–1036. PMLR, 2012.
[95] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
[96] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[97] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 4452–4463, USA, 2018. Curran Associates Inc.
[98] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488–1492, 2015.
[99] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[100] Arthur Szlam and Xavier Bresson. Total variation, cheeger cuts. In ICML, 2010.
[101] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7652–7662. 2018.
[102] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. DeepSqueeze: Decentralization meets error-compensated compression. CoRR, abs/1907.07346, 2019.
[103] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D²: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, pages 4848–4856, 2018.
[104] Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6155–6165, 2019.
[105] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[106] Ryan J Tibshirani et al. Adaptive piecewise polynomial estimation via trend filtering. Annals of statistics, 42(1):285–323, 2014.
[107] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986.
[108] Rohan Varma, Harlin Lee, Jelena Kovačević, and Yuejie Chi. Vector-valued graph trend filtering with non-convex penalties. IEEE Transactions on Signal and Information Processing over Networks, 6:48–62, 2019.
[109] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[110] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636–3645, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[111] Yu-Xiang Wang, James Sharpnack, Alexander J Smola, and Ryan J Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17:1–41, 2016.
[112] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 1306–1316, USA, 2018. Curran Associates Inc.
[113] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
[114] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
[115] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861–6871. PMLR, 2019.
[116] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, 10–15 Jul 2018.
[117] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020.
[118] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
[119] Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11492–11501. PMLR, 18–24 Jul 2021.
[120] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Accelerated primal-dual algorithms for distributed smooth convex optimization over networks. In International Conference on Artificial Intelligence and Statistics, pages 2381–2391. PMLR, 2020.
[121] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5453–5462. PMLR, 10–15 Jul 2018.
[122] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, pages 1:1–1:10, New York, NY, USA, 2018. ACM.
[123] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
[124] Kun Yuan, Wei Xu, and Qing Ling. Can primal methods outperform primal-dual methods in decentralized dynamic optimization? arXiv preprint arXiv:2003.00816, 2020.
[125] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part i: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
[126] Lingxiao Zhao and Leman Akoglu. PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations, 2019.
[127] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. 2004.
[128] Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1399–1407, 2019.
[129] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework, 2021.
[130] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation.
[131] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. 2005.
[132] Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 2021.
[133] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.
[134] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In KDD. ACM, 2018.
[135] Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412, 2019.