EFFICIENT DISTRIBUTED ALGORITHMS: BETTER THEORY AND COMMUNICATION COMPRESSION By Yao Li A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics – Doctor of Philosophy Computational Mathematics, Science and Engineering – Dual Major 2022 ABSTRACT EFFICIENT DISTRIBUTED ALGORITHMS: BETTER THEORY AND COMMUNICATION COMPRESSION By Yao Li Large-scale machine learning models are often trained by distributed algorithms over ei- ther centralized or decentralized networks. The former uses a central server to aggregate the information of local computing agents and broadcast the averaged parameters in a master-slave architecture. The latter considers a connected network formed by all agents. The information can only be exchanged with accessible neighbors with a mixing matrix of communication weights encoding the network’s topology. Compared with centralized optimization, decentralization facilitates data privacy and reduces the communication burden of the single central agent due to model synchroniza- tion, but the connectivity of the communication network weakens the theoretical conver- gence complexity of the decentralized algorithms. Therefore, there are still gaps between decentralized and centralized algorithms in terms of convergence conditions and rates. In the first part of this dissertation, we consider two decentralized algorithms: EXTRA and NIDS, which both converge linearly with strongly convex objective functions and an- swer two questions regarding them. What are the optimal upper bounds for their stepsizes? Do decentralized algorithms require more properties on the functions for linear convergence than centralized ones? More specifically, we relax the required conditions for linear convergence of both algorithms. For EXTRA, we show that the stepsize is comparable to that of cen- tralized algorithms. For NIDS, the upper bound of the stepsize is shown to be exactly the same as the centralized ones. In addition, we relax the requirement for the objective functions and the mixing matrices. We provide the linear convergence results for both algorithms under the weakest conditions. As the number of computing agents and the dimension of the model increase, the communication cost of parameter synchronization becomes the major obstacle to effi- cient learning. Communication compression techniques have exhibited great potential as an antidote to accelerate distributed machine learning by mitigating the communication bottleneck. In the rest of the dissertation, we propose compressed residual communication frame- works for both centralized and decentralized optimization and design different algo- rithms to achieve efficient communication. For centralized optimization, we propose DORE, a modified parallel stochastic gradi- ent descent method with a bidirectional residual compression, to reduce over 95% of the overall communication. Our theoretical analysis demonstrates that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. Existing works mainly focus on smooth problems and compressing DGD-type algo- rithms for decentralized optimization. The class of smooth objective functions and the sublinear convergence rate under relatively strong assumptions limit these algorithms’ application and practical performance. 
Motivated by primal-dual algorithms, we pro- pose Prox-LEAD, a linear convergent decentralized algorithm with compression, to tackle strongly convex problems with a nonsmooth regularizer. Our theory describes the cou- pled dynamics of the inexact primal and dual update as well as compression error with- out assuming bounded gradients. The superiority of the proposed algorithm is demon- strated through the comparison with state-of-the-art algorithms in terms of convergence complexities and numerical experiments. Our algorithmic framework also generally en- lightens the compressed communication on other primal-dual algorithms by reducing the impact of inexact iterations. Copyright by YAO LI 2022 Dedicated to my parents and in memory of my grandmother, a woman of strength, kindness and love, who passed away before seeing me graduate. v ACKNOWLEDGEMENTS There are many people who encouraged me and shared great times with me throughout my doctoral career. Please accept my apologies if there are any names that have been forgotten. First of all, I would like to express my deep gratitude to my advisor and dissertation director, Professor Ming Yan. With his dedicated guidance and support over the past five years, I have developed a good taste and a broad vision of research. Again, thanks to my friend-like relationship with him, I could maintain a work-life balance and alleviate the pressure from life and studies. I would also like to thank all the other committee members, Professor Ekaterina Rapinchuk, Professor Kalyanmoy Deb, and Professor Mark Iwen, for their guidance and advice. They provided helpful feedback and suggestions on my path to graduation. I want to thank my most important research partner of mine, Dr. Xiaorui Liu. We collaborated with each other, discussed and shared insightful research ideas, and com- pleted many good papers. Without him, I would not have expanded my knowledge in deep learning and bridged the gap between theoretic algorithmic research and machine learning applications so quickly. I would also like to thank Dr. Brendt Wohlberg and Dr. Youzuo Lin for their short-term guidance in LANL. I also want to thank my friends Jing Huang, Jian Song, Mengzhi Chen, Menglun Wang, Ningyu Sha, Kai Huang, Jialin Qu, Runze Su, Rui Wang and Tao Feng for the precious time they spent with me. Last but not least, I would like to express my greatest gratitude to my parents, Youzhu Li and Yiyan Liu. Their selfless love has always been the greatest motivation for me to complete my doctoral career. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Decentralized Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Parallel SGD with Bidirectional Communication Compression . . . . . . . 2 1.3 Decentralized Optimization with Compression . . . . . . . . . . . . . . . . 3 1.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Other Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 CHAPTER 2 EXTRA AND NIDS: DECENTRALIZED ALGORITHMS WITH IMPROVED CONDITIONS . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 7 2.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Algorithms and Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 New Linear Convergence Results for EXTRA and NIDS . . . . . . . . . . . 12 2.4.1 Linear Convergence of EXTRA . . . . . . . . . . . . . . . . . . . . . 13 2.4.2 NIDS without Nonsmooth Term . . . . . . . . . . . . . . . . . . . . 20 2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 CHAPTER 3 DORE: A CENTRALIZED ALGORITHM WITH BIDIRECTIONAL COMMUNICATION COMPRESSION . . . . . . . . . . . . . . . . . 28 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Double Residual Compression SGD . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 The Proposed DORE . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 The Strongly Convex Case . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.2 The Nonconvex Case with R = 0 . . . . . . . . . . . . . . . . . . . . 38 3.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5.1 Strongly convex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5.2 Nonconvex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.5.3 Communication Efficiency . . . . . . . . . . . . . . . . . . . . . . . 43 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 vii CHAPTER 4 PROX-LEAD: A LINEAR CONVERGENT CONVERGENT AL- GORITHM WITH EFFICIENT COMMUNICATION FOR COM- POSITE PROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 The Proposed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2.1 Prox-LEAD as Inexact PUDA . . . . . . . . . . . . . . . . . . . . . . 52 4.2.2 Stochastic Gradient Oracle . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 Assumptions on Regularity . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.1 The General Stochastic Setting . . . . . . . . . . . . . . . . . . . . . 62 4.4.2 Finite-sum Setting with Variance Reduction . . . . . . . . . . . . . . 65 4.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 APPENDIX A SUPPLEMENTARY OF CHAPTER 3 . . . . . . . . . . . . . . 74 APPENDIX B SUPPLEMENTARY OF CHAPTER 4 . . . . . . . . . . . . . . 86 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 viii LIST OF TABLES Table 3.1: A comparison between related algorithms. 
The second column indi- cates whether algorithm compresses only the gradients or both the gradients and the model. DORE is able to converges linearly to the O(σ) neighborhood of optimal point like full-precision SGD and DI- ANA in the strongly convex case while achieving much better commu- nication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze but DORE doesn’t require the assumptions of bounded compression error or bounded gradient. . . . . . . . . . . . . 40 Table 4.1: Stochastic gradient oracle (SGO). . . . . . . . . . . . . . . . . . . . . . . 53 Table 4.2: Summary of the convergence compleixty for Prox-LEAD to achieve ϵ- accuracy with fiexed stepsizes. The first row is the complexity with the full gradient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Table B.1: The comparison of algorithms mentioned in Section B.2; κf := Lµ , κg := λmax (I−W) max wij λmin (I−W) and κeg := λmin(i,j)∈E (I−W) for nonnegative W. . . . . . . . . . . . . 88 ix LIST OF FIGURES k ∗ Figure 2.1: LEFT: the error ∥x −x ∥F ∥x0 −x∗ ∥F vs iterations for DGD with different step- sizes, EXTRA with three stepsizes, and NIDS. RIGHT: The random network with 10 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 k ∗ Figure 2.2: LEFT: the error ∥x −x ∥F ∥x0 −x∗ ∥F vs iterations for DGD with different step- sizes, EXTRA with three stepsizes, and NIDS. RIGHT: The random network with 10 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Figure 2.3: The comparison of proved stepsizes for EXTRA and NIDS with the optimal choice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 k ∗ Figure 2.4: The figure of residuals ∥x −x ∥F ∥x0 −x∗ ∥F with respect to iteration. The left ¯ graph is for strongly convex f (x) and the right one is for strongly convex f (x). re-EXTRA and re-NIDS stand for implementing EXTRA and NIDS over relaxed Wnew . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.1: An illustration of DORE. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure 3.2: Per iteration time cost on Resnet18 for SGD, QSGD, and DORE. It is tested in a shared cluster environment connected by Gigabit Ether- net interface. DORE speeds up the training process significantly by mitigating the communication bottleneck. . . . . . . . . . . . . . . . . . 41 Figure 3.3: Linear regression on synthetic data. When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, Dou- bleSqueeze, and DoubleSqueeze (topk) only converge to the neigh- borhood even when full gradient is available. . . . . . . . . . . . . . . . 42 Figure 3.4: LeNet trained on MNIST. DORE converges similarly as most base- lines. It outperforms DoubleSqueeze using the same compression method while has similar performance as DoubleSqueeze (topk). . . . 42 Figure 3.5: Resnet18 trained on CIFAR10. DORE achieves similar convergence and accuracy as most baselines. DoubeSuqeeze converges slower and suffers from the higher loss but it works well with topk compression. . 43 x Figure 4.1: Smooth logistic regression problem (λ1 = 0). In the full gradient case ((a) and (b)), LEAD (2bit) and LessBit (2bit) converge similarly as NIDS (32bit) and LEAD (32bit) in terms of epochs/iterations, but they requires much fewer bits in communication. 
In the stochastic case ((c) and (d)), the 2bit variants of LEAD match well with their 32bit variants in terms of the number of gradient evaluation, but they requires much fewer bits. Note that LEAD-SAGA requires more memory and more iterations/communication than LEAD-LSVRG, but it computes only one gradient in each iteration, while LEAD-LSVRG computes at least two gradients in each iteration. . . . . . . . . . . . . 69 Figure 4.2: Nonsmooth logistic regression problem (λ1 = 0.005). In the full gradi- ent case ((a) and (b)), Prox-LEAD (2bit) converges similarly as NIDS and Prox-LEAD (32bit) in terms of epochs/iterations, but it requires much fewer bits than the other three algorithms. In the stochastic case ((c) and (d)), the 2bit variants match well with their 32bit vari- ants in terms of the number of gradient evaluation, but they requires much fewer bits. Note that Prox-LEAD-SAGA requires more mem- ory and more iterations/communication than Prox-LEAD-LSVRG, but it computes only one gradient in each iteration, while Prox-LEAD- LSVRG computes at least two gradients in each iteration. . . . . . . . . 70 Figure A.1: The norm of compressed variable in linear regression. . . . . . . . . . . 74 Figure A.2: Training under different compression block sizes. . . . . . . . . . . . . 75 Figure A.3: Training under different α. . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure A.4: Training under different β. . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure A.5: Training under different η. . . . . . . . . . . . . . . . . . . . . . . . . . 76 xi LIST OF ALGORITHMS Algorithm 3.1: DORE1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Algorithm 4.1: Prox-LEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Algorithm 4.2: Compressed Communication Procedure (COMM) . . . . . . . . . . 52 Algorithm A.1: DORE with R(x) = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Algorithm B.1: LEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 xii CHAPTER 1 INTRODUCTION This chapter briefly overviews the problems considered and the results developed in the following separated chapters. The detailed introduction and claimed notations are con- tained throughout each chapter. 1.1 Decentralized Optimization Decentralized optimization problem is to minimize f¯(x) := 1 Pn n i=1 fi (x) collaboratively over a network of n agents. We consider convex and differentiable function fi : Rp → R, which is known only by the corresponding agent i. The whole system is decentralized because each agent has an estimation of the global variable and can only exchange the estimation with their accessible neighbors during each iteration. A symmetric mixing matrix W ∈ Rn×n is used to encode the communication weights between the agents and enforce the consensus. The minimum condition for W has one eigenvalue 1 with the all-one vector 1 being a corresponding eigenvector. All other eigenvalues of W are less than 1. We provide new and stronger linear convergence results for two state-of-the-art algorithms: EXTRA in [1] and NIDS in [2]. More specifically, by assuming L-smoothness of each fi , • We show the linear convergence of EXTRA under the strong convexity of f¯ and the relaxed condition λmin (W) > −5/3. 
The upper bound of the stepsize can be as large 5+3λmin (W) as 4L , which is shown to be optimal in [3] for general convex problems; • We show the linear convergence of NIDS under the same condition on f¯ and W as EXTRA with any network-independent stepsize α ∈ (0, 2/L). 1 1.2 Parallel SGD with Bidirectional Communication Compression We first consider centralized optimization with efficient communication over a parame- ter server architecture and propose algorithms based on the well-known parallel stochas- tic gradient descent (SGD) to achieve efficient communication. Stochastic gradient algo- rithms [4] are efficient at minimizing the objective function f : Rd → R which is usually defined as f (x) := Eξ∼D [ℓ(x, ξ)], where ℓ(x, ξ) is the objective function defined on data sample ξ and model parameter x. A basic stochastic gradient descent repeats the gradient “descent” step xk+1 = xk − γg(xk ) where xk is the current iteration and γ is the step size. The stochastic gradient g(xk ) is computed based on an i.i.d. sampled mini-batch from the training data distribution D and serves as the estimator of the full gradient ∇f (xk ). In large-scale machine learning, the number of data samples and the model size are usually huge. Distributed learning utilizes many computers/cores to perform the stochastic algo- rithms to reduce the training time. It has attracted extensive attention due to the demand for highly efficient model training [5, 6, 7, 8]. We focus on the data-parallel SGD [9, 10, 11], which provides a scalable solution to speed up the training process by distributing the whole data to multiple computing nodes, and consider the following problem n 1X minimize f (x) + R(x) = Eξ∼Di [ℓ(x, ξ)] +R(x), (1.1) x∈Rd n i=1 | {z } :=fi (x) where each fi (x) is a local objective function of the worker node i defined based on the allocated data under distribution Di and R : Rd → R is usually a closed convex regular- izer. We propose DORE, adapted by compressed communication on both pull and push directions of Parallel SGD, to reduce the bidirectional communication cost in distributed learning and provide a complete theoretical analysis of the algorithm’s behavior in both strongly convex and nonconvex settings. 2 1.3 Decentralized Optimization with Compression We then return to decentralized optimization and try to generalize the communication compression scheme of DORE over networks without a parameter server. We consider the general composite problems, i.e., the smooth problems with nonsmooth regularizer, over the connected n-node network G in the form of n 1 X  minimize Eξ ∼D fi (x, ξi ) +r(x) , (1.2) x∈Rp n i=1 | i i{z } =:fi (x) where fi (x, ξi ) is the objective function at node i defined on the data ξi sampled from the distribution Di , and x denotes the model parameters. We use fi (x) to define the overall local objective function at node i, and we will differentiate fi (x, ξi ) and fi (x) using differ- ent inputs. The data distributions {Di } can be heterogeneous. In other words, the data distributions can be different from node to node, and we do not make assumptions about data heterogeneity. The function fi (x, ξi ) is assumed to be convex and smooth, and r(x) is a proper, convex, and possibly nonsmooth function shared across the nodes. The graph G encodes the topology of the communication network where information is exchanged along the edges. 
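To make the setting in (1.2) concrete, the following is a minimal sketch of how the local objectives, the shared nonsmooth regularizer, and the mixing matrix fit together. The five-node ring topology, the quadratic local losses, and the ℓ1 regularizer are assumptions made only for this illustration; they are not the problems or networks studied in the later chapters.

```python
import numpy as np

# Minimal sketch of the decentralized composite setting (1.2).
# Illustrative assumptions: 5 nodes on a ring, local least-squares losses f_i,
# and an l1 regularizer r(x) shared by all nodes.
np.random.seed(0)
n, p = 5, 3                                        # nodes, model dimension
A = [np.random.randn(10, p) for _ in range(n)]     # local data at node i
b = [Ai @ np.ones(p) + 0.01 * np.random.randn(10) for Ai in A]

def f_i(i, x):                                     # local smooth objective
    return 0.5 * np.mean((A[i] @ x - b[i]) ** 2)

def r(x, lam=0.01):                                # shared nonsmooth regularizer
    return lam * np.sum(np.abs(x))

def global_objective(x):                           # (1/n) * sum_i f_i(x) + r(x)
    return np.mean([f_i(i, x) for i in range(n)]) + r(x)

# Symmetric mixing matrix for the ring: each node averages with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

print("row sums of W:", W.sum(axis=1))             # all ones: consensus preserved
print("eigenvalues of W:", np.linalg.eigvalsh(W))  # one eigenvalue 1, others < 1
print("objective at x = 0:", global_objective(np.zeros(p)))
```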
In recent years, various communication compression techniques, such as quantization and sparsification, have been developed to reduce communication costs. Notably, ex- tensive studies [12, 13, 14, 15, 16, 17, 18, 19] have utilized gradient compression to boost communication efficiency for centralized optimization significantly. They enable efficient large-scale optimization while maintaining comparable convergence rates and practical performance with their non-compressed counterparts. This remarkable success has sug- gested the potential and significance of communication compression in decentralized al- gorithms. While extensive attention has been paid to centralized optimization, communication compression is relatively less studied in decentralized algorithms because the algorithm design and analysis are more challenging to cover general communication topologies. 3 Recent efforts are trying to push this research direction. For instance, DCD-SGD and ECD-SGD [20] introduce difference compression and extrapolation compression to re- duce model compression error. [21, 22] introduce QDGD and QuanTimed-DSGD to have exact convergence with small stepsizes. DeepSqueeze [23] directly compresses the lo- cal model and compensates for the compression error in the next iteration. CHOCO- SGD [24, 25] presents a novel quantized gossip algorithm that reduces compression error by difference compression and preserves the model average. Nevertheless, most current works focus on the compression of primal-only algorithms, i.e., reduce to DGD [26, 27] or P-DSGD [28]. They are unsatisfying regarding convergence rate, stability, and the capa- bility to handle heterogeneous data. Part of the reason is that they inherit the drawback of DGD-type algorithms, whose convergence rate is slow in heterogeneous data scenarios where the data distributions are significantly different from agent to agent. In the literature on decentralized optimization, it has been proved that primal-dual algorithms can achieve faster converge rates and better support heterogeneous data [29, 1, 2, 30]. However, it is unknown whether communication compression is feasible for primal-dual algorithms and how fast the convergence can be with compression. This chapter attempts to bridge this gap by investigating the communication compression for primal-dual decentralized algorithms. We design a decentralized residual compression scheme and propose a novel decen- tralized algorithm with compression, Prox-LEAD, to achieve linear convergence under the strongly convex assumption and efficient communication. We derive the convergence complexity of Prox-LEAD for two types of fi , the general expectation of loss fi (x, ξi ) as defined in (1.2) and the finite-sum setting. We combine Prox-LEAD with Loopless SVRG and SAGA for the second case to achieve the exact linear convergence with stochastic gradients. 4 1.4 Publications Chapters 2-4 are based on the following papers respectively: Published papers: • [31] Yao Li and Ming Yan. On the linear convergence of two decentralized algo- rithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021 • [19] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 133–143. PMLR, 2020 In print: • [32] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized com- posite optimization with compression. 
arXiv preprint arXiv:2108.04448, 2021 1.5 Other Papers In addition to the above papers, I completed the following three papers during my doc- toral study. The first work also considers optimization with compression, which proposes an improved algorithmic framework with a compression scheme different from the one covered in Chapter 3 to achieve efficient communication for centralized learning. The second paper proposes LEAD to solve smooth problems with compressed communica- tion, which is the special case of Prox-LEAD in Chapter 4. The theoretical result of LEAD in [33] is extended and improved by Prox-LEAD in [32]. The last one considers the equiv- alence of several existing primal-dual algorithms and derives the improved bound on stepsizes for some of them, which is beyond the topic I would like to discuss here. For the consistency and the conciseness of the whole dissertation, I omit these papers. 5 • [34] Hanlin Tang, Yao Li, Ji Liu, and Ming Yan. ErrorCompensatedX: Error compen- sation for variance reduced algorithms. Advances in Neural Information Processing Systems, 34:18102–18113, 2021 • [33] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear con- vergent decentralized optimization with compression. In International Conference on Learning Representations, 2021 • [35] Yao Li and Ming Yan. On the improved conditions for some primal-dual algo- rithms. arXiv preprint arXiv:2201.00139, 2022 6 CHAPTER 2 EXTRA AND NIDS: DECENTRALIZED ALGORITHMS WITH IMPROVED CONDITIONS 2.1 Introduction Early decentralized methods based on decentralized gradient descent [26, 36, 37, 27, 38] have sublinear convergence for strongly convex objective functions, because of the dimin- ishing stepsize that is needed to obtain a consensual and optimal solution. This sublinear convergence rate is much slower than that for centralized ones. The first decentralized al- gorithm with linear convergence [39] is based on Alternate Direction Multiplier Method (ADMM) [40, 41]. Note that this type of algorithms has O(1/k) rate for general convex functions [42, 43, 44]. After that, many linearly convergent algorithms are proposed. Some examples are EXTRA [1], NIDS [2], DIGing [45, 46], ESOM [47], gradient track- ing methods [48, 49, 50, 46, 45, 51, 52], exact diffusion [53, 54], and dual optimal [55, 56]. There are also works on composite functions, where each private function is the sum of a smooth and a nonsmooth functions [57, 2, 58, 59]. Another topic of interest is decentral- ized optimization over directed and dynamic graphs [60, 61, 62, 63, 45, 64, 65]. Interested readers are referred to [66] and the references therein. This chapter focuses on two linear convergent algorithms: EXTRA and NIDS, and pro- vides better theoretical convergence results. EXact firsT-ordeR Algorithm (EXTRA) was proposed in [1], and its iteration is described in (2.2). For the general convex case, where each fi is convex and L-smooth (i.e., has a L-Lipschitz continuous gradient), the conver-   gence condition in [1] is α ∈ 0, 1+λmin L (W) . Therefore, there is an implicit condition for W that the smallest eigenvalue of W is larger than −1. Later the condition is relaxed to   α ∈ 0, 5+3λ4Lmin (W) in [3], and the corresponding requirement for W is that the smallest eigenvalue of W is larger than −5/3. In addition, this condition for the stepsize is shown 7 to be optimal, i.e., EXTRA may diverge if the condition is not satisfied. 
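For a concrete sense of how much the relaxed condition enlarges the admissible stepsizes, the short snippet below evaluates both upper bounds quoted above for a few values of λmin(W); the smoothness constant L = 10 and the sampled eigenvalues are arbitrary illustrative choices.

```python
# Compare the EXTRA stepsize upper bounds quoted above:
#   (1 + lambda_min(W)) / L        from [1]
#   (5 + 3*lambda_min(W)) / (4L)   from [3]
# L = 10 and the sampled eigenvalues are illustrative assumptions. Note that for
# lambda_min(W) <= -1 the first condition admits no stepsize, while the relaxed
# one still does as long as lambda_min(W) > -5/3.
L = 10.0
for lam_min in (-1.2, -0.9, -0.5, 0.0, 0.5):
    bound_old = (1 + lam_min) / L
    bound_new = (5 + 3 * lam_min) / (4 * L)
    print(f"lambda_min = {lam_min:+.1f}: "
          f"old bound = {bound_old:+.3f}, relaxed bound = {bound_new:+.3f}")
```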
Though we can al- ways manipulate W to change the smallest eigenvalue, the convergence speed of EXTRA depends on the matrix W. In the numerical experiment, we will see that it is beneficial to choose small eigenvalues for EXTRA in certain scenarios. The linear convergence of EXTRA requires additional conditions on the functions. There are mainly three types of conditions used in the literature: the strong convexity of f¯ (and some weaker variants) [1], the strong convexity of each fi (and some weaker vari- ants) [3], and the strong convexity of one function fi [54]. Note that the condition on f¯ is much weaker than the other two; there are cases where f¯ is strongly convex but none of fi ’s is. E.g., fi = ∥eTi x∥22 for p = n > 1, where ei is the vector whose ith component is 1 and all other components are 0. If f¯ is (restricted) strongly convex with parameter µf¯, the lin-   µ ¯(1+λ (W)) ear convergence of EXTRA is shown when α ∈ 0, f Lmin 2 in [1]. The upper bound for the stepsize is very conservative, and the better performance with a larger stepsize was shown numerically in [1] without proof. If each fi is strongly convex with parameter     µ, the linear convergence is shown when α ∈ 0, 1+λL+µ min (W) and α ∈ 0, 5+3λ4L min (W) in [58] and [3], respectively. One contribution of this chapter is to show the linear convergence   of EXTRA under the (restricted) strong convexity of f¯ and α ∈ 0, 5+3λ4L min (W) . The algorithm NIDS (Network InDependent Stepsize) was proposed in [2]. Though there is a small difference from EXTRA, NIDS can choose a stepsize that does not depend on the mixing matrices. The linear convergence of NIDS in [2] requires I ≽ W ≻ −I and strong convexity of f (x). In this chapter, we relax this condition for linear convergence to (restricted) strong convexity of f¯(x) and the relaxed mixing matrices with I ≽ W ≻ −(5/3)I. 2.2 Notation We let Xn f (x) := fi (xi ), (2.1) i=1 8 where each xi ∈ Rp is the local copy of the global variable x and the kth iterated point is xki . Since agent i has its own estimate xi of the global variable x, we put them together and define x = [x1 , x2 , · · · , xn ]⊤ ∈ Rn×p . The gradient of f is defined as ∇f (x) = [∇f1 (x1 ), ∇f2 (x2 ), · · · , ∇fn (xn )]⊤ ∈ Rn×p . We say that x is consensual if x1 = x2 = · · · = xn , i.e., x = 1x⊤ , where x ∈ Rp×1 and 1 = [1, 1, · · · , 1]⊤ ∈ Rn×1 . In this chapter, we use ∥ · ∥ and ⟨·, ·⟩ to denote the Frobenious norm and the corre- sponding inner product, respectively. For a given matrix M ∈ Rn×p and any positive (semi)definite matrix H, which is denoted as H ≻ 0 (H ≽ 0 for positive semidefinite), we p define ∥M∥H := tr(M⊤ HM). The largest and the smallest eigenvalues of a matrix A are defined as λmax (A) and λmin (A). For a symmetric positive semidefinite matrix A, we let λ+ † min (A) be the smallest nonzero eigenvalue. A is the pseudo inverse of A. For a matrix A ∈ Rn×n , we say a matrix B ∈ Rn×p is in Ker{A} if AB = 0n×p , and B is in Range{A} if there exists C ∈ Rn×p such that B = AC. For simplicity, we may use x+ and x to replace xk+1 and xk , respectively, in the proofs. 2.3 Algorithms and Prerequisites One iteration of EXTRA can be expressed as xk+2 = (I + W)xk+1 − Wx f k − α[∇f (xk+1 ) − ∇f (xk )]. (2.2) The stepsize α > 0, and the symmetric matrices W and W f satisfy I + W ≽ 2W f ≽ 2W. The initial value x0 is chosen arbitrarily, and x1 = Wx0 − α∇f (x0 ). In practice, we usually I+W let Wf= 2 . 
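As a quick illustration of the update (2.2), the following sketch runs EXTRA with W̃ = (I + W)/2 on a toy consensus-averaging problem with fi(x) = ½∥x − bi∥², so that ∇fi(x) = x − bi. The five-node ring, the data, and the stepsize chosen between the two bounds discussed above are assumptions made only for this example.

```python
import numpy as np

# Minimal sketch of the EXTRA iteration (2.2) with W_tilde = (I + W) / 2.
# Illustrative assumptions: consensus averaging, f_i(x) = 0.5 * ||x - b_i||^2,
# on a 5-node ring with a simple mixing matrix.
np.random.seed(1)
n, p = 5, 2
b = np.random.randn(n, p)                       # local targets; optimum is mean(b)

def grad_f(X):                                  # row i holds node i's gradient
    return X - b

W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
W_tilde = (np.eye(n) + W) / 2

L = 1.0                                         # each f_i is 1-smooth here
lam_min = np.linalg.eigvalsh(W).min()
bound_old = (1 + lam_min) / L                   # earlier bound
bound_new = (5 + 3 * lam_min) / (4 * L)         # relaxed bound
alpha = 0.5 * (bound_old + bound_new)           # a stepsize beyond the earlier bound

X_prev = np.zeros((n, p))                       # x^0
X = W @ X_prev - alpha * grad_f(X_prev)         # x^1 = W x^0 - alpha * grad f(x^0)
for _ in range(200):
    # x^{k+2} = (I+W) x^{k+1} - W_tilde x^k - alpha [grad f(x^{k+1}) - grad f(x^k)]
    X_next = (np.eye(n) + W) @ X - W_tilde @ X_prev \
             - alpha * (grad_f(X) - grad_f(X_prev))
    X_prev, X = X, X_next

print("distance to consensus optimum:", np.linalg.norm(X - b.mean(axis=0)))
```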
One iteration of NIDS is I+W (2.3)   xk+2 = 2 2xk+1 − xk −α(∇f (xk+1 ) − ∇f (xk )) , 9 I+W 0 where α > 0 is the stepsize. The initial value x0 is chosen arbitrarily, and x1 = 2 [x − α∇f (x0 )]. I+W If we choose W f= 2 in (2.2), the difference between EXTRA and NIDS in the above mathematical forms happens only in the communicated data, i.e.,whether we exchange the gradient information or not at each step. In practice, EXTRA can gain the advantage of time overlap by parallelizing communication and gradient evaluation, while NIDS evaluates the gradient and then communicates after the gradient is added. However, this small difference brings big changes in the convergence [2]. In order for both algorithms to converge, we have the following assumptions on W and W. f Assumption 2.3.1 (Mixing matrix). The connected network G = {V, E} consists of a set of nodes V = {1, 2, · · · , n} and a set of undirected edges E. An undirected edge (i, j) ∈ E means that there is a connection between agents i and j and both agents can exchange data. The mixing matrices W = [wij ] ∈ Rn×n and W f = [w eij ] ∈ Rn×n satisfy: 1. (Decentralized property): If i ̸= j and (i, j) ∈ / E, then wij = w eij = 0. 2. (Symmetry): W = W⊤ , W f =W f ⊤. 3. (Null space property): Null{W − W} f = span{1} ⊆ Null{I − W}. f 4. (Spectral property): I+W ≽W f ≻ − 1 I, W f ≽ W. 2 3 Remark 2.3.1. Parts 2-4 imply that the spectrum of W is enlarged to (− 53 , 1], while the original I+W assumption is (−1, 1] for doubly stochastic matrices. Therefore, in our assumption, 2 does not have to be positive definite. This assumption for W is strictly weaker than those in [1] and [2]. Remark 2.3.2. From [1, Proposition 2.2], Null{I − W} = span{1}. It is a critical result for both algorithms. Before showing their theoretical results, we reformulate both algorithms. 10 Reformulation of EXTRA: We reformulate EXTRA by introducing a variable y ∈ Rn×p as xk+1 = Wx f k + yk − α∇f (xk ), (2.4a) yk+1 = yk − (W f − W)xk+1 , (2.4b) with y0 = −(W f − W)x0 . Then (2.4) is equivalent to EXTRA (2.2). Proposition 2.3.1. Let the x-sequence generated by (2.4) with y0 = −(W f − W)x0 be {xk }∞ , k=1 then it’s identical to the sequence generated by EXTRA (2.2) with the same initial point x0 . Proof. From (2.4a), we have x1 =Wxf 0 + y0 − α∇f (x0 ) = Wx f 0 − (Wf − W)x0 − α∇f (x0 ) =Wx0 − α∇f (x0 ). For k ≥ 0, we have xk+2 =Wx f k+1 + yk+1 − α∇f (xk+1 ) = Wxk+1 + yk − α∇f (xk+1 ) =(I + W)xk+1 − Wx f k − α[f (xk+1 ) − f (xk )], where the second and last equalities are from (2.4b) and (2.4a), respectively. Remark 2.3.3. By (2.4b) and the assumption of y0 , yk ∈ Range{W f − W} for all k. Also, xk+1 = (W f − W)† (yk − yk+1 ) + zk+1 for some zk+1 ∈ Ker{W f − W}. Reformulation of NIDS: We adopt the following reformulation from [2]: I−W k dk+1 = dk + 2α [x − α∇f (xk ) − αdk ], (2.5a) xk+1 = xk − α∇f (xk ) − αdk+1 , (2.5b) with d0 = 0. The equivalence is shown in [2]. To establish the linear convergence of EXTRA and NIDS, we need the following two assumptions. 11 Assumption 2.3.2 (Uniqueness). There is a unique minimizer x∗ for f¯(x). Assumption 2.3.3 (L-smoothness and restricted strong convexity). Each function fi is a proper, closed and convex function with a Lipschitz continuous gradient: ∥∇fi (x) − ∇fi (ex)∥ ≤ L∥x − x e ∈ Rp , e∥, ∀x, x (2.6) where L > 0 is the Lipschitz constant. Furthermore, f¯(x) is (restricted) strongly convex with respect to x∗ : ⟨x − x∗ , ∇f¯(x) − ∇f¯(x∗ )⟩ ≥ µf¯∥x − x∗ ∥2 , ∀x ∈ Rp . 
(2.7) From [67, Theorem 2.1.5], the inequality (2.6) is equivalent to, for any x, x̃ ∈ Rn×p , ⟨x − x x)⟩ ≥ L−1 ∥∇f (x) − ∇f (e e, ∇f (x) − ∇f (e x)∥2 . (2.8) Proposition 2.3.2 ([1, Appendix A]). The following two statements are equivalent: 1. f¯(x) is (restricted) strongly convex with respect to x∗ ; 2. For any η > 0, g(x) := f (x) + η2 ∥x∥2I−W is µg (restricted) strongly convex with respect to x∗ = 1(x∗ )⊤ . Specially, we can characterize 2 + ( ) µf¯ µf¯λmin (I − W) µg = min , η . 2 µ2f¯ + 16L2 This proposition shows ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ + η∥x − x∗ ∥2I−W ≥ µg ∥x − x∗ ∥2 (2.9) for any x ∈ Rn×p . 2.4 New Linear Convergence Results for EXTRA and NIDS Throughout this section, we assume that Assumptions 2.3.1-2.3.3 hold. Two techniques are used to show the linear convergence: a) Proposition 2.3.2 serves as a bridge to connect f¯(x) and f (x). It is the key to the weaker assumption on objective functions. b) Both algo- rithms are equivalent to the extended Proximal Alternating Predictor-Corrector (PAPC) 12 in [3], and this equivalence is the key to relaxing the conditions on the mixing matrices W and W. f 2.4.1 Linear Convergence of EXTRA I+W When W f = 2 , EXTRA is recovered by applying the extended PAPC in [3] to the fol- lowing dual form of the decentralized consensus problem √ minimize f ∗ ( I − Wy), y where f ∗ is the conjugate function of f and y is the dual variable. In this case, EXTRA has the optimal bound of the stepsize over the relaxed mixing matrix W ≻ −(5/3)I. This fact enlightens us on the critical Lemma 2.4.3. For simplicity, we introduce some notations. Because of part 4 of Assumption 2.3.1, given the mixing matrices W and W, f there is a constant 3  1 i θ∈ , min ,1 4 1 − λmin (W) f such that W :=θW f + (1 − θ)I ≻ 0, (2.10) H :=W + (θ − 12 )(I − W) f = I+W ≻ 0, (2.11) f 2 M :=(W f − W)† ≽ 0, (2.12) G :=W + I − 2W f ≽ 0. (2.13) Based on (2.10), we have f = W − (1 − θ)(I − W). W f (2.14) Let (x∗ , y∗ ) be a fixed point of (2.4), it is straightforward to show that f − W)x∗ =0. (W (2.15) Part 3 of Assumption 2.3.1 shows that x∗ is consensual, i.e., x∗ = 1(x∗ )⊤ for certain x∗ ∈ Rp . The y-iteration in (2.4b) and the initialization of y0 show yk ∈ Range{W f − W} = 13 Ker{1⊤ }. Then we have 1⊤ y∗ = α1⊤ ∇f (x∗ ) = 0. Thus, x∗ is the unique minimizer of f¯(x). Lemma 2.4.1 (Norm over range space [2, Lemma 3]). Let A ∈ Rn×n be symmetric positive (semi)definite with rank r (r ≤ n) and λ1 ≥ λ2 ≥ · · · ≥ λr > 0 be its r eigenvalues. Then Range{A} is a rp-dimensional subspace in Rn×p and has a norm defined by ∥x∥2A† := ⟨x, A† x⟩, where A† is the pseudo inverse of A. In addition, λ−1 2 2 −1 2 1 ∥x∥ ≤ ∥x∥A† ≤ λr ∥x∥ for all x ∈ Range{A}. For simplicity, we let x+ and x stand for xk+1 and xk , respectively, in the proofs. The same simplification applies to yk . Lemma 2.4.2 (Norm equality). Let {(xk , yk )}∞ k=1 be the sequence generated by (2.4), then it satisfies ∥xk+1 − x∗ ∥2W−W f = ∥yk − yk+1 ∥2M . (2.16) Proof. From Remark 2.3.3, we have x+ = M(y − y+ ) + z+ (2.17) for z+ ∈ Ker{W f − W}. This equality and (2.15) give ∥x+ − x∗ ∥2W−W f =⟨x+ − x∗ , (W f − W)(x+ − x∗ )⟩ = ⟨x+ , (W f − W)x+ ⟩ =⟨M(y − y+ ), y − y+ ⟩ = ∥y − y+ ∥2M , where the third equality holds because of (2.12), (2.17), and y −y+ ∈ Range(W f −W). Lemma 2.4.3 (A key inequality for EXTRA). Let {(xk , yk )}∞ k=1 be generated by (2.4), then we have ∥xk+1 − x∗ ∥2H + ∥yk+1 − y∗ ∥2M ≤∥xk − x∗ ∥2H + ∥yk − y∗ ∥2M − ∥xk − xk+1 ∥2(θ− 3 )(I−W) f − ∥x k+1 − x∗ ∥2G 4 − ∥xk − xk+1 ∥W2 − 2α⟨xk+1 − x∗ , ∇f (xk ) − ∇f (x∗ )⟩. 
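To see numerically why the condition on f̄ is strictly weaker than per-function strong convexity, the following check instantiates the example mentioned in Section 2.1, fi(x) = ∥e_i^⊤x∥²₂ with p = n: every individual Hessian is rank one, yet the averaged Hessian is positive definite. The dimension n = 4 is an arbitrary illustrative choice.

```python
import numpy as np

# Example from Section 2.1: f_i(x) = (e_i^T x)^2 with p = n. No individual f_i
# is strongly convex, but the average f_bar(x) = (1/n) * ||x||^2 is.
# n = 4 is an arbitrary illustrative choice.
n = 4
for i in range(n):
    H_i = np.zeros((n, n))
    H_i[i, i] = 2.0                      # Hessian of f_i(x) = x_i^2 (rank one)
    print(f"min eigenvalue of Hessian of f_{i}:", np.linalg.eigvalsh(H_i).min())

H_bar = 2.0 * np.eye(n) / n              # Hessian of f_bar(x) = (1/n) * ||x||^2
print("min eigenvalue of Hessian of f_bar:", np.linalg.eigvalsh(H_bar).min())
```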
(2.18) 14 Proof. The iteration (2.4) and equation (2.14) show 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x+ − x∗ , W(x f − x+ ) + W(x f + − x∗ ) − (x+ − x∗ ) + (y − y∗ )⟩ =2⟨x+ − x∗ , W(x f − x+ ) + ( W f − I)(x+ − x∗ ) f − W)(x+ − x∗ ) + y+ − y + y − y∗ ⟩ + (W =2⟨x+ − x∗ , W(x f − x+ )⟩ + 2⟨x+ − x∗ , y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2 G =2⟨x+ − x∗ , W(x − x+ )⟩ − 2⟨x+ − x∗ , (1 − θ)(I − W)(x f − x+ )⟩ + 2⟨x+ − x∗ , y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2G , (2.19) where the first equality comes from (2.4a), the second one follows (2.4b), and the last one is from (2.14). From Remark 2.3.3, x+ −x∗ = M(y−y+ )+z+ −x∗ for some z+ ∈ Ker{W−W}. f Thus ⟨z+ − x∗ , y+ − y∗ ⟩ = 0, and the equality (2.19) can be rewritten as 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x+ − x∗ , W(x − x+ )⟩ − 2⟨x+ − x∗ , (1 − θ)(I − W)(x f − x+ )⟩ + 2⟨M(y − y+ ), y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2G . Using the basic equality 2⟨a − b, b − c⟩ = ∥a − c∥2 − ∥a − b∥2 − ∥b − c∥2 and Lemma 2.4.2, we have ∥x+ − x∗ ∥W2 − ∥x+ − x∗ ∥2(1−θ)(I−W) + f + ∥y − y ∥M ∗ 2 =∥x − x∗ ∥2W − ∥x − x∗ ∥2(1−θ)(I−W)f + ∥y − y ∥M ∗ 2 ∗ 2 − ∥x − x+ ∥2W + ∥x − x+ ∥2(1−θ)(I−W) + f − ∥x − x ∥W−W f − 2∥x+ − x∗ ∥2G − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.20) Note that the following inequality holds, 1 2 ∥x+ − x∗ ∥2W−W f ≤ ∥x+ − x∗ ∥2W−W f + 21 ∥x − x∗ ∥2W−W f − 14 ∥x − x+ ∥2W−W f . 15 Adding it onto both sides of (2.20), we have ∥x+ − x∗ ∥2H − 21 ∥x+ − x∗ ∥2G + ∥y+ − y∗ ∥2M ≤∥x − x∗ ∥2H − 21 ∥x − x∗ ∥2G + ∥y − y∗ ∥2M 1 − ∥x − x+ ∥2W − ∥x − x+ ∥2(θ− 3 )(I−W) + 2 f + ∥x − x ∥G 4 4 − 2∥x+ − x∗ ∥2G − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.21) Apply the inequality 41 ∥x−x+ ∥2G ≤ 21 ∥x−x∗ ∥2G + 12 ∥x+ −x∗ ∥2G , then the key inequality (2.18) is obtained. In the following theorem, we assume G ̸= 0 (i.e., W f ̸= (I + W)/2). It is easy to amend the proof to show the result for this special case. Theorem 2.4.1 (Q-linear convergence of EXTRA). Under Assumptions 2.3.1-2.3.3, we define 4θ−3 r1 = 4(1−θ)2 λ −1 (I−W)) > 0, (2.22) max (W f 1 r2 = 2λ −1 ) > 0, (2.23) max (GW r3 = r1 +rr21 +r r2 1 r2 ∈ (0, 1), (2.24) and choose two small parameters ξ and η such that  n o ξ ∈ 0, min 4λ r(WM) 3 ,1 , (2.25) max   η ∈ 0, 4αλ λmin(W)−2α(W)ξ 2L . (2.26) min In addition, we define P :=H + 2ξ (I − W) ≻ 0, Q :=M + (r3 − 2ξλmax (WM))W−1 ≻ 0. Then for any stepsize α ∈ (0, 2λminL(W) ), we have ∥xk+1 − x∗ ∥2P + ∥yk+1 − y∗ ∥2Q ≤ ρ(∥xk − x∗ ∥2P + ∥yk − y∗ ∥2Q ), (2.27) 16 where n     α2 L 2α2 L η ρ := max 1 − 2α − λmin (W) µg , 4α − λmin (W) ξ , o (2.28) r3 −4ξλmax (WM) 1 − r +(1−2ξ)λ (WM) < 1. 3 max Proof. From (2.18) in Lemma 2.4.3, we have ∥x+ − x∗ ∥2H + ∥y+ − y∗ ∥2M ≤∥x − x∗ ∥2H + ∥y − y∗ ∥2M − ∥x − x+ ∥2(θ− 3 )(I−W) + f − ∥x − x ∥G ∗ 2 4 − ∥x − x+ ∥2W − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.29) Then we find an upper bound of −∥x − x+ ∥W 2 − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. − ∥x − x+ ∥W 2 − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =α2 ∥∇f (x) − ∇f (x∗ )∥W 2 ∗ −1 − 2α⟨x − x , ∇f (x) − ∇f (x )⟩ ∗ − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 2L ≤ − 2α − λ α (W) ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  min − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 , where, the inequality comes from (2.8). Combining it with (2.29), we have ∥x+ − x∗ ∥2H + ∥y+ − y∗ ∥2M − ∥x − x∗ ∥2H − ∥y − y∗ ∥2M 2L ≤ − 2α − λ α (W) ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  min − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 ∗ 2 − ∥x − x+ ∥2(θ− 3 )(I−W) + f − ∥x − x ∥G . (2.30) 4 The inequality (2.30) shows that {(xk , yk )}∞ k=1 is a Cauchy sequence converging to the fixed point (x∗ , y∗ ) of (2.4). 
From (2.9), we can bound the first term on the right hand side of (2.30) as α2 L ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  − 2α − λmin (W) α2 L α2 L η∥x − x∗ ∥2I−W − 2α − µg ∥x − x∗ ∥2 .   ≤ 2α − λmin (W) λmin (W) (2.31) 17 Next, we bound the two terms involving successive iterated points, i.e., −∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥2W−1 − ∥x − x+ ∥2(θ− 3 )(I−W) f . Note that 4 W(x − x+ ) − α(∇f (x) − ∇f (x∗ )) =G(x+ − x∗ ) − (y+ − y∗ ) + (1 − θ)(I − W)(x f − x+ ). (2.32) We use T1 , T2 , and T3 to denote the three terms on the right hand side of (2.32), respec- tively. Using the definition of r1 in (2.22), we have 2 + 2 − ∥T1 + T2 + T3 ∥W −1 − ∥x − x ∥ (θ− 3 )(I−W) 4 f − 12 1 = − ∥T1 + T2 ∥2W−1 − 2⟨W (T1 + T2 ), W− 2 T3 ⟩ − ∥T3 ∥2W−1 4θ−3 − 4(1−θ) ∥x − x+ ∥2(1−θ)(I−W) f 1 1 ≤ − ∥T1 + T2 ∥2W−1 − 2⟨W− 2 (T1 + T2 ), W− 2 T3 ⟩ − (1 + r1 )∥T3 ∥2W−1 r1 ≤− 1+r1 ∥T1 + T2 ∥2W−1 , where the last inequality comes from the Cauchy inequality 1 −2⟨a, b⟩ ≤ 1+r1 ∥a∥2 + (1 + r1 )∥b∥2 . Combining it with the last term −∥x+ − x∗ ∥2G on the right hand side of (2.30), we have r1 − 1+r1 ∥T1 + T2 ∥2W−1 − ∥x+ − x∗ ∥2G 1 1 ≤− r1 1+r1 ∥T2 ∥2W−1 − 2r1 1+r1 ⟨W− 2 T1 , W− 2 T2 ⟩ − r1 1+r1 ∥T1 ∥2W−1 2 − r2 ∥T1 ∥W 1 −1 − 2 ∥x + − x∗ ∥2G ≤ − r3 ∥y+ − y∗ ∥2W−1 − 2ξ ∥x+ − x∗ ∥2G , (2.33) where ξ < 1 is a small positive parameter, and r2 and r3 are defined as (2.23) and (2.24), respectively. Since G = (I − W) − 2(W f − W), we have ∥x+ − x∗ ∥2G = ∥x+ − x∗ ∥2I−W − 2∥y − y+ ∥2M . (2.34) 18 Therefore r1 − 1+r1 ∥T1 2 + T2 ∥W −1 − ∥x + − x∗ ∥2G ξ ≤ − r3 ∥y+ − y∗ ∥W 2 −1 − 2 ∥x + − x∗ ∥2I−W − ξ∥y − y+ ∥2M ξ ≤ − r3 ∥y+ − y∗ ∥W 2 −1 − 2 ∥x + − x∗ ∥2I−W + 2ξ∥y+ − y∗ ∥2M + 2ξ∥y − y∗ ∥2M ≤ − (r3 /λmax (WM) − 2ξ)∥y+ − y∗ ∥2M − 2ξ ∥x+ − x∗ ∥2I−W + 2ξ∥y − y∗ ∥2M . (2.35) Let ξ < r3 /(4λmax (WM)), then we have r3 /λmax (WM) − 2ξ > 2ξ. Putting (2.31) and (2.35) together onto (2.30), we have ∥x+ − x∗ ∥2H + 2ξ ∥x+ − x∗ ∥2I−W + (1 + (r3 /λmax (WM) − 2ξ))∥y+ − y∗ ∥2M 2L 2L ≤ 1 − 2α − λ α (W) µg ∥x − x∗ ∥2H + 2α − λ α (W) η∥x − x∗ ∥2I−W    min min + (1 + 2ξ)∥y − y∗ ∥2M . Let ρ be defined as (2.28), we get (2.27). Note that the choice of ξ and η affects the defini- tion of P and Q, but not the algorithm. Hence for any α ∈ (0, 2λminL(W) ), Q-linear conver- gence is guaranteed for (xk − x∗ , yk − y∗ ). Because ∥xk −x∗ ∥2P ≤ ∥xk −x∗ ∥2P +∥yk −y∗ ∥2Q , the sequence {∥xk −x∗ ∥2P }∞ k=1 converges R-linearly to 0 at the rate of ρ. I+W Two special cases are not covered by the theorem: θ = 1 and W f = 2 . When θ = 1, r2 I+W we have r1 = ∞ and r3 = 1+r2 . When W f = 2 , i.e., G = 0, we have r2 = ∞ and r1 r3 = 1+r1 . In both cases, the linear convergence rate is n     2α2 L 4α2 L η ρ = max 1 − 2α − 2−θ+θλmin (W) µg , 4α − 2−θ+θλmin (W) ξ , o (2.36) βr3 −4ξ(2−θβ) 1− βr3 +(1−2ξ)(2−θβ) , where β = 1 − λ2 (W) is the spectral gap. It is exactly the limit of ρ in (2.28) with r1 or r2 approaching infinity. Remark 2.4.1. The upper bound for the stepsize α, 2(1 − θ + θλmin (W))/L, f is much larger than 2 that in [1] for ensuring linear convergence, 2µg λmin (W)/L f , when W f is positive definite. In the 19 f = (I + W)/2, we have α < (2 − θ + θλmin (W))/L. Since we can choose θ as special case W close as possible to 3/4, the upper bound of α attains (3λmin (W) + 5)/(4L), which coincides the optimal bound given in [3] for general convex functions. In [3], the linear convergence was shown under the strong convexity of all functions {fi }ni=1 . 2.4.2 NIDS without Nonsmooth Term We consider NIDS next. 
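Before the analysis, here is a minimal sketch of the NIDS update (2.3) on the same toy consensus-averaging problem used for EXTRA above; the topology, the data, and the stepsize taken just below the network-independent bound 2/L are assumptions for this illustration only.

```python
import numpy as np

# Minimal sketch of the NIDS iteration (2.3) on the illustrative
# consensus-averaging problem: f_i(x) = 0.5 * ||x - b_i||^2 on a 5-node ring.
# The data, topology, and stepsize choice are assumptions for this example only.
np.random.seed(1)
n, p = 5, 2
b = np.random.randn(n, p)

def grad_f(X):
    return X - b

W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
M = (np.eye(n) + W) / 2

L = 1.0
alpha = 0.95 * (2.0 / L)                        # network-independent stepsize in (0, 2/L)

X_prev = np.zeros((n, p))                       # x^0
X = M @ (X_prev - alpha * grad_f(X_prev))       # x^1 = (I+W)/2 [x^0 - alpha grad f(x^0)]
for _ in range(300):
    # x^{k+2} = (I+W)/2 [ 2 x^{k+1} - x^k - alpha (grad f(x^{k+1}) - grad f(x^k)) ]
    X_next = M @ (2 * X - X_prev - alpha * (grad_f(X) - grad_f(X_prev)))
    X_prev, X = X, X_next

print("distance to consensus optimum:", np.linalg.norm(X - b.mean(axis=0)))
```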
In the smooth case, NIDS can be recovered by PAPC applied to the primal form of the decentralized consensus problem √ minimize f (x), s.t. I − Wx = 0. x It motivates us to show the inequality in Lemma 2.4.5. [2, Lemma 1] shows that, with the initialization (d0 = 0, x0 ), the fixed point (d∗ ∈ Range(I − W), x∗ ) of (2.5) satisfies d∗ + ∇f (x∗ ) = 0, (2.37a) (I − W)x∗ = 0, (2.37b) and x∗ is the consensual solution to the problem (2.1). We will use the following important equality, which can be derived from (2.5) (I − I−W 2 )(dk+1 − dk ) = I−W 2α (xk+1 − x∗ ). (2.38) Motivated by the proof of EXTRA, we introduce another matrix to measure the dis- tance to the fixed point. We still pick θ ∈ ( 34 , 1] such that     I+W I−W θ 2 + (1 − θ)I = I − θ 2 ≻ 0. (2.39) Define a new symmetric matrix  † f = 2(I − W)† − θI = M I−W − θI. (2.40) 2 Then M f is a norm over Range(I − W). Note that M f = −θ1. In f is invertible because M1 the following proofs, we use the same simplification x and x+ . 20 Lemma 2.4.4 (Equality). Let {(dk , xk )}∞ k=1 be the sequence generated by (2.5), we have the fol- lowing two equalities: ⟨xk+1 − x∗ , dk+1 − d∗ ⟩ = α⟨dk+1 − dk , dk+1 − d∗ ⟩M−(1−θ)I f (2.41a) ⟨xk+1 − x∗ , dk+1 − dk ⟩ = α∥dk+1 − dk ∥2M−(1−θ)I f . (2.41b) Proof. Since d+ − d∗ ∈ Range(I − W), we have ⟨x+ − x∗ , d+ − d∗ ⟩ =⟨(I − W)(x+ − x∗ ), (I − W)† (d+ − d∗ )⟩ =α⟨(2I − (I − W))(d+ − d), (I − W)† (d+ − d∗ )⟩ =α⟨(2(I − W)† − I)(d+ − d), d+ − d∗ ⟩, (2.42) where the second equality follows (2.38). Replacing d∗ with d in (2.42), we get (2.41b) in the same way. Lemma 2.4.5 (A key inequality for NIDS). Let {(dk , xk )}∞ k=1 be generated by (2.5). We have, with any r4 ∈ (0, θ − 43 ), ∥xk+1 − x∗ ∥2 + α2 ∥dk+1 − d∗ ∥2M+(θ− 1 +2r4 )I f 2 ≤∥xk − x∗ ∥2 + α2 ∥dk − d∗ ∥2M+(θ− 1 −2r − α2 ∥dk − dk+1 ∥2M+(θ− 3 −r 4 )I 4 )I f f 2 4 + α2 ∥∇f (xk ) − ∇f (x∗ )∥2 − 2α⟨xk − x∗ , ∇f (xk ) − ∇f (x∗ )⟩. (2.43) Proof. The iteration (2.5) and the definition of M f in (2.40) show 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x − x∗ , x − x+ ⟩ − 2α⟨x − x∗ , d+ − d∗ ⟩ =2⟨x − x+ , x − x∗ ⟩ − 2α⟨x − x+ , d+ − d∗ ⟩ − 2α⟨x+ − x∗ , d+ − d∗ ⟩ =2⟨x − x+ , x − αd+ − x∗ + αd∗ ⟩ + 2α2 ⟨d − d+ , d+ − d∗ ⟩M−(1−θ)I f =2⟨x − x+ , x+ − x∗ + α∇f (x) − α∇f (x∗ )⟩ + 2α2 ⟨d − d+ , d+ − d∗ ⟩M−(1−θ)If , where the first and the last equalities use (2.5b) and the third one follows (2.41a). 21 From (2.5b), we obtain 2α⟨x − x+ , ∇f (x) − ∇f (x∗ )⟩ =∥x − x+ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − ∥x − x+ − α∇f (x) + α∇f (x∗ )∥2 =∥x − x+ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − α2 ∥d+ − d∗ ∥2 . (2.44) Together with the basic equality 2⟨a − b, b − c⟩ = ∥a − c∥2 − ∥b − c∥2 − ∥a − b∥2 , we get ∥x+ − x∗ ∥2 + α2 ∥d+ − d∗ ∥2M−(1−θ)I f =∥x − x∗ ∥2 + α2 ∥d − d∗ ∥2M−(1−θ)I f − α2 ∥d − d+ ∥2M−(1−θ)I f − α2 ∥d+ − d∗ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.45) 3 Since r4 < θ − 4 ≤ 1/4, the following inequality holds, −( 21 − 2r4 )∥d+ − d∗ ∥2 ≤ ( 21 − 2r4 )∥d − d∗ ∥2 − ( 41 − r4 )∥d − d+ ∥2 . Adding it onto both sides of (2.45), we get (2.43). Theorem 2.4.2 (Q-linear convergence for NIDS). Under Assumptions 2.3.1-2.3.3, we define  (λmax (I − W) − 2)2  r5 = max 2, . (2.46) 2 − ( 43 + r4 )λmax (I − W) 1 For any stepsize α ∈ (0, L2 ), we choose η ∈ (0, α(2−αL)r 5 ) and define n o 4r4 ρ3 = max 1 − α(2 − αL)µg , α(2 − αL)ηr5 , 1 − 2λmax ((I−W)+ )− 12 +2r4 < 1, (2.47) Then we have ∥xk+1 − x∗ ∥2I+ I−W + α2 ∥dk+1 − d∗ ∥2Q r5 (2.48) ≤ρ(∥x −k x∗ ∥2I+ I−W 2 + α ∥d − k d∗ ∥2Q ), r5 where Q := Mf + (θ − 1 + 2r4 )I ≻ 0. 2 22 Proof. 
Given any α ∈ (0, L2 ), we have α2 ∥∇f (x) − ∇f (x∗ )∥2 − 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ ≤ − α(2 − αL)⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ = − α(2 − αL)⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ − α(2 − αL)η∥x − x∗ ∥2I−W + α(2 − αL)η∥x − x∗ ∥2I−W ≤ − α(2 − αL)µg ∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W , where the first inequality is from (2.8) and the second one uses (restricted) strong convex- ity (2.9). Together with (2.43), we have ∥x+ − x∗ ∥2 + α2 ∥d+ − d∗ ∥2M+(θ− 1 +2r 4 )I f 2 ≤∥x − x∗ ∥2 + α2 ∥d − d∗ ∥2M+(θ− 1 −2r − α2 ∥d − d+ ∥2M+(θ− 3 −r 4 )I 4 )I f f 2 4 − α(2 − αL)µg ∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W , (2.49) The equality (2.38) gives ∥x+ − x∗ ∥2I−W =∥(I − W)(x+ − x∗ )∥2(I−W)† = α2 ∥(2I − (I − W))(d+ − d)∥2(I−W)† =α2 ∥d − d+ ∥2(2I−(I−W))(I−W)† (2I−(I−W)) = α2 ∥d − d+ ∥24(I−W)† −4I+(I−W) ≤α2 r5 ∥d − d+ ∥2M+(θ− 3 −r )I , (2.50) 4 f 4 where the second equality follows (2.38), the fourth one is from d − d+ ∈ Range(I − W), and the inequality holds with the definition of r5 in (2.46). Combing (2.49) and (2.50), we derive ∥x+ − x∗ ∥2 + 1 r5 ∥x+ − x∗ ∥2I−W + α2 ∥d+ − d∗ ∥2M+(θ− 1 +2r 4 )I f 2 ≤(1 − α(2 − αL)µg )∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W + α2 ∥d − d∗ ∥2M+(θ− 1 −2r )I . (2.51) 4 f 2 Let ρ3 be defined as (2.47), and we show (2.48). Meanwhile, the Q-linear convergence of (dk , xk ) implies the R-linear convergence of xk . 23 This theorem shows that NIDS is still linearly convergent over a relaxed W and keeps 2 the network-independent stepsize, which attains L practically. 2.5 Numerical Experiments In this section, we compare the performance of EXTRA and NIDS over the relaxed mixing matrices in the following two scenarios: • Comparison of Decentralized Gradient Descent (DGD), EXTRA, and NIDS with dif- ferent stepsizes for a doubly stochastic matrix W. • Comparison of EXTRA and NIDS with different stepsizes for a relaxed matrix W. We consider the following decentralized sensing problem. Each agent i ∈ {1, · · · , n} has its own private measured data Mi ∈ Rmi ×p and yi ∈ Rmi based on the unknown common variable x ∈ Rp . Suppose that yi = Mi x + ei with independently identically distributed random noise ei ∈ Rmi . The goal is to estimate x cooperatively over the network, and the problem is n 1X1 minimize f¯(x) = ∥Mi x − yi ∥22 . x n i=1 2 The data {Mi }ni=1 and x are generated from Gaussian distribution. We normalize each Mi such that ∥Mi⊤ Mi ∥ = 10, i.e., L = 10. In both scenarios, we set n = 10, p = 5, x0 = 0, and I+W W f= 2 for EXTRA. For the first scenario, we construct the matrix W based on the Metropolis constant edge weight matrix in [1, §2.4]. In this case W f is positive definite, and we can set θ = 3 . 4 5I+3W (1+λmin (W))µf¯ Then W = 8 . We implement EXTRA with three different stepsizes: α1 = 100 1+λmin (W) (the stepsize for linear convergence in [1]), α2 = 10 (the stepsize for convergence 5+3λmin (W) only in [1]), and α3 = 40 (our largest stepsize). For NIDS, the stepsize is set to 1 α4 = 5 although it is the upper bound of the stepsize which is not attainable in our proof theoretically. 24 The result with mi = 1 is illustrated in Fig. 2.1. Because we have n > p, the function f¯(x) is strongly convex with probability one. NIDS requires the least number of iteration to attain the expected tolerance. Meanwhile, EXTRA with our proposed stepsize has better performance than that given in [1]. 
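The experimental setup above can be reproduced in a few lines; the sketch below generates the decentralized sensing data and a mixing matrix for a random graph. The graph-sampling rule and the Metropolis-Hastings weight formula are stand-in assumptions for the Metropolis constant edge weight construction of [1, §2.4].

```python
import numpy as np

# Sketch of the decentralized sensing setup in this section. The random-graph
# sampling and the Metropolis-Hastings weights below are illustrative stand-ins
# for the Metropolis constant edge weight construction of [1, Section 2.4].
np.random.seed(2)
n, p, m_i = 10, 5, 1

x_true = np.random.randn(p)
M_list, y_list = [], []
for i in range(n):
    Mi = np.random.randn(m_i, p)
    Mi *= np.sqrt(10.0 / np.linalg.norm(Mi.T @ Mi, 2))   # normalize so ||Mi^T Mi|| = 10
    M_list.append(Mi)
    y_list.append(Mi @ x_true + 0.1 * np.random.randn(m_i))

# Random connected graph: a ring plus a few random chords.
edges = {(min(i, (i + 1) % n), max(i, (i + 1) % n)) for i in range(n)}
while len(edges) < 15:
    i, j = np.random.randint(n, size=2)
    if i != j:
        edges.add((min(i, j), max(i, j)))

deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:                       # Metropolis-Hastings weight on each edge
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))        # diagonal chosen so each row sums to one

print("W symmetric:", np.allclose(W, W.T))
print("eigenvalues of W:", np.round(np.linalg.eigvalsh(W), 3))
print("local gradient at x = 0, node 0:", -M_list[0].T @ y_list[0])
```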
Figure 2.1: LEFT: the error ∥x^k − x^*∥_F / ∥x^0 − x^*∥_F vs. iterations for DGD with different stepsizes, EXTRA with three stepsizes, and NIDS. RIGHT: the random network with 10 nodes.

Figure 2.2: LEFT: the error ∥x^k − x^*∥_F / ∥x^0 − x^*∥_F vs. iterations for DGD with different stepsizes, EXTRA with three stepsizes, and NIDS. RIGHT: the random network with 10 nodes.

Then, we set mi = 10 in Fig. 2.2. In this case, the individual functions fi(x) and f(x) are strongly convex. NIDS and EXTRA with the largest stepsize give the best performance. Here two of the EXTRA results coincide with that of NIDS although they use different stepsizes. This observation may indicate that there is an optimal choice of stepsize between α2 and α3 for both EXTRA and NIDS. By setting α5 = (5 + 3λmin(W))/(40 + µf̄) for EXTRA and α6 = 2/(10 + µf̄) for NIDS, we compare these algorithms in Fig. 2.3. This figure suggests that the optimal stepsize may depend on the problem/functions. How to find the optimal stepsize is an important research topic and beyond the scope of this chapter.

Figure 2.3: The comparison of the proved stepsizes for EXTRA and NIDS with the optimal choice.

Next, we turn to the relaxed mixing matrices. Based on the previously created W, we replace it by Wnew = (4W − I)/3 to scale the range of eigenvalues to (−5/3, 1]. In this case, some diagonal entries of Wnew may be negative. We consider the worst network topology, the line topology, i.e., each agent has at most two neighbors. In this experiment, we solve the same problem using EXTRA and NIDS on the unrelaxed and relaxed mixing matrices, respectively, over the line. For NIDS, since the stepsize is network-independent, we relax the mixing matrix W to Wnew more aggressively so that λmin(Wnew) approaches −5/3, and we compare the performance with the unrelaxed case of NIDS under α = 1/5. For EXTRA, we set the stepsize to α = (5 + 3λmin(W))/40 and compare the performance with the relaxed case under the stepsize α = (5 + 3λmin(Wnew))/40, where we only perturb W mildly so that λmin(Wnew) approaches −1. The result is shown in Fig. 2.4: if the topology of the network is weak, switching to the relaxed mixing matrix may offer better performance when using NIDS and EXTRA to solve the problem.
Over the unrelaxed mixing matrix, NIDS leads the performance in most cases and is the easiest to implement. If the topology of network is weak, using the relaxed mixing matrix can accelerate NIDS. For EXTRA, in general, we may not choose the mixing matrices to be relaxed due to the tiny improvement, but the larger stepsize derived in the relaxed case is competent to be considered. 27 CHAPTER 3 DORE: A CENTRALIZED ALGORITHM WITH BIDIRECTIONAL COMMUNICATION COMPRESSION 3.1 Introduction In the well-known parameter server framework [7, 11], each worker node evaluates its own stochastic gradient {∇fe i (xk )}ni=1 and send it to the master node, which collects all gradients and calculates their average (1/n) ni=1 ∇f k P e i (x ). Then the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes. It makes use of the computational resources from all nodes. In reality, the network bandwidth is often limited. Thus, the communication cost for the gradient transmission and model synchronization becomes the dominating bottle- necks as the number of nodes and the model size increase, which hinders the scalability and efficiency of SGD. One common way to reduce the communication cost is to compress the gradient infor- mation by either gradient sparsification or quantization [13, 12, 15, 68, 69, 70, 71, 72] such that many fewer bits of information are needed to be transmitted. However, little atten- tion has been paid on how to reduce the communication cost for model synchronization and the corresponding theoretical guarantees. Obviously, the model shares the same size as the gradient, so does the communication cost. Thus, merely compressing the gradient can reduce at most 50% of the communication cost, which suggests the importance of model compression. Notably, the compression of model parameters is much more chal- lenging than gradient compression. One key obstacle is that its compression error cannot be well controlled by the step size γ and thus it cannot diminish like that in the gradient compression [20]. In this chapter, we aim to bridge this gap by investigating algorithms to compress the full communication in the optimization process and understanding their 28 theoretical properties. Our contributions can be summarized as: • We proposed DORE, which can compress both the gradient and the model information such that more than 95% of the communication cost can be reduced. • We provided theoretical analyses to guarantee the convergence of DORE under strongly convex and nonconvex assumptions without the bounded gradient assumption. • Our experiments demonstrate the superior efficiency of DORE comparing with the state- of-art baselines without degrading the convergence speed and the model accuracy. 3.2 Background Recently, many works try to reduce the communication cost to speed up the distributed learning, especially for deep learning applications, where the size of the model is typi- cally very large (so is the size of the gradient) while the network bandwidth is relatively limited. Below we briefly review relevant papers. Gradient quantization and sparsification. Recent works [13, 12, 71, 17, 14] have shown that the information of the gradient can be quantized into a lower-precision vec- tor such that fewer bits are needed in communication without loss of accuracy. [12] pro- posed 1Bit SGD that keeps the sign of each element in the gradient only. 
It empirically works well, and [14] provided theoretical analysis systematically. QSGD [13] utilizes an unbiased multi-level random quantization to compress the gradient while Terngrad [71] quantizes the gradient into ternary numbers {0, ±1}. In DIANA [17], the gradient differ- ence is compressed and communicated contributing to the estimator of the gradient in the master node. Another effective strategy to reduce the communication cost is sparsification. [70] proposed a convex optimization formulation to minimize the coding length of stochastic gradients. A more aggressive sparsification method is to keep the elements with relatively larger magnitude in gradients, such as top-k sparsification [15, 68, 73]. 29 Model synchronization. The typical way for model synchronization is to broadcast model parameters to all worker nodes. Some works [69, 74] have been proposed to reduce model size by enforcing sparsity, but it cannot be applied to general optimization prob- lems. Some alternatives including QSGD [13] and ECQ-SGD [72] choose to broadcast all quantized gradients to all other workers such that every worker can perform model up- date independently. However, all-to-all communication is not efficient since the number of transmitted bits increases dramatically in large-scale networks. DoubleSqueeze [18] ap- plies compression on the averaged gradient with error compensation to speed up model synchronization. Error compensation. [12] applied error compensation on 1Bit-SGD and achieved negligible loss of accuracy empirically. Recently, error compensation was further stud- ied [72, 15, 16] to mitigate the error caused by compression. The general idea is to add the compressed error to the next compression step: ĝ = Q(g + e), e = (g + e) − ĝ. However, to the best of our knowledge, most of the algorithms with error compensa- tion [72, 15, 16, 18] need to assume bounded gradient, i.e., E∥g∥2 ≤ B, and the conver- gence rate depends on this bound. Contributions of DORE. The most related papers to DORE are DIANA [17] and Dou- bleSqueeze [18]. Similarly, DIANA compresses gradient difference on the worker side and achieves good convergence rate. However, it doesn’t consider the compression in model synchronization, so at most 50% of the communication cost can be saved. Doub- leSqueeze applies compression with error compensation on both worker and server sides, but it only considers nonconvex objective functions. Moreover, its analysis relies on a bounded gradient assumption, i.e., E∥g∥2 ≤ B, and the convergence error has a depen- dency on the gradient bound like most existed error compensation works. In general, the uniform bound on the norm of the stochastic gradient is a strong as- sumption which might not hold in some cases. For example, it is violated in the strongly 30 convex case [75, 76]. In this chapter, we design DORE, the first algorithm which utilizes gradient and model compression with error compensation without assuming bounded gradients. Unlike existing error compensation works, we provide a linear convergence rate to the O(σ) neighborhood of the optimal solution for strongly convex functions and a sublinear rate to the stationary point for nonconvex functions with linear speedup. In Table 3.1, we compare the asymptotic convergence rates of different quantized SGDs with DORE. 3.3 Double Residual Compression SGD In this section, we introduce the proposed DOuble REsidual compression SGD (DORE) algorithm. Before that, we introduce a common assumption for the compression operator. 
In this work, we adopt an assumption from [13, 71, 17] that the compression variance is linearly proportional to the magnitude.

Assumption 3.3.1. The stochastic compression operator Q : R^d → R^d is unbiased, i.e., EQ(x) = x, and satisfies

E∥Q(x) − x∥^2 ≤ C∥x∥^2, (3.1)

for a nonnegative constant C that is independent of x. We use x̂ to denote the compressed x, i.e., x̂ ∼ Q(x).

Many feasible compression operators can be applied to our algorithm since our theoretical analyses are built on this common assumption. Some examples of feasible stochastic compression operators include:

• No Compression: C = 0 when there is no compression.

• Stochastic Quantization: A real number x ∈ [a, b] (a < b) is set to a with probability (b − x)/(b − a) and to b with probability (x − a)/(b − a), where a and b are predefined quantization levels [13]. It satisfies Assumption 3.3.1 when ab > 0 and a < b.

• Stochastic Sparsification: A real number x is set to 0 with probability 1 − p and to x/p with probability p [71]. It satisfies Assumption 3.3.1 with C = 1/p − 1.

• p-norm Quantization: A vector x is quantized elementwise by Q_p(x) = ∥x∥_p sign(x) ◦ ξ, where ◦ is the Hadamard product and ξ is a Bernoulli random vector satisfying ξ_i ∼ Bernoulli(|x_i|/∥x∥_p). It satisfies Assumption 3.3.1 with C = max_{x∈R^d} ∥x∥_1 ∥x∥_p / ∥x∥_2^2 − 1 [17].

Figure 3.1: An illustration of DORE (the master and the worker nodes exchange a compressed gradient residual and a compressed model residual).

To decrease the constant C for higher accuracy, we can further decompose a vector x ∈ R^d into blocks, i.e., x = (x(1)^⊤, x(2)^⊤, · · · , x(m)^⊤)^⊤ with x(l) ∈ R^{d_l} and Σ_{l=1}^m d_l = d, and compress the blocks independently.

3.3.1 The Proposed DORE

Many previous works [13, 12, 71] reduce the communication cost of P-SGD by quantizing the stochastic gradient before sending it to the master node, but there are several intrinsic issues.

First, these algorithms incur extra optimization error intrinsically. Let us consider the case when the algorithm converges to the optimal point x^*. By the first-order optimality, we have (1/n) Σ_{i=1}^n ∇f_i(x^*) = 0. However, the data distributions may be different for different worker nodes in general, and thus we may have ∇f_i(x^*) ≠ ∇f_j(x^*) for all i, j ∈ {1, . . . , n} with i ≠ j. In other words, each individual ∇f_i(x^*) may be far away from zero. This will cause a large compression variance according to Assumption 3.3.1, which indicates that the upper bound of the compression variance E∥Q(x) − x∥^2 is linearly proportional to the magnitude of x.

Second, most existing algorithms [12, 13, 71, 14, 72, 17] need to broadcast the model or gradient to all worker nodes in each iteration. This is a considerable bottleneck for efficient optimization since the amount of bits to transmit is the same as for the uncompressed gradient. DoubleSqueeze [18] is able to apply compression on both the worker and server sides. However, its analysis depends on a strong bounded gradient assumption. Meanwhile, no theoretical guarantees are provided for convex problems.
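As a concrete instance of Assumption 3.3.1, the following NumPy sketch (illustrative only; the function names and the block size are ours, not from the text) implements the blockwise ∞-norm case of the Bernoulli p-norm quantization listed above and checks its unbiasedness empirically.

import numpy as np

def pnorm_quantize(x, rng=np.random.default_rng()):
    # Bernoulli quantization with p = infinity: Q(x) = ||x||_inf * sign(x) o xi,
    # where xi_i ~ Bernoulli(|x_i| / ||x||_inf); unbiased by construction (EQ(x) = x).
    s = np.linalg.norm(x, np.inf)
    if s == 0.0:
        return np.zeros_like(x)
    xi = (rng.random(x.shape) < np.abs(x) / s).astype(x.dtype)
    return s * np.sign(x) * xi

def blockwise_quantize(x, block=256, rng=np.random.default_rng()):
    # compressing blocks independently reduces the variance constant C
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        out[start:start + block] = pnorm_quantize(x[start:start + block], rng)
    return out

# empirical unbiasedness check: averaging many compressions recovers x
x = np.random.default_rng(0).standard_normal(512)
avg = np.mean([blockwise_quantize(x) for _ in range(2000)], axis=0)
print(np.max(np.abs(avg - x)))   # close to 0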
Algorithm 3.1: DORE
1: Input: stepsizes α, β, γ, η; initialize h^0 = h_i^0 = 0_d, x̂_i^0 = x̂^0, ∀i ∈ {1, . . . , n}.
2: for k = 1, 2, · · · , K − 1 do
3:   For each worker i ∈ {1, 2, · · · , n}:
4:     Sample g_i^k such that E[g_i^k | x̂_i^k] = ∇f_i(x̂_i^k)
5:     Gradient residual: Δ_i^k = g_i^k − h_i^k
6:     Compression: Δ̂_i^k = Q(Δ_i^k)
7:     h_i^{k+1} = h_i^k + α Δ̂_i^k
8:     {ĝ_i^k = h_i^k + Δ̂_i^k}
9:     Send Δ̂_i^k to the master
10:    Receive q̂^k from the master
11:    x̂_i^{k+1} = x̂_i^k + β q̂^k
12:   For the master:
13:    Receive {Δ̂_i^k} from the workers
14:    Δ̂^k = (1/n) Σ_{i=1}^n Δ̂_i^k
15:    ĝ^k = h^k + Δ̂^k  {= (1/n) Σ_{i=1}^n ĝ_i^k}
16:    x^{k+1} = prox_{γR}(x̂^k − γ ĝ^k)
17:    h^{k+1} = h^k + α Δ̂^k
18:    Model residual: q^k = x^{k+1} − x̂^k + η e^k
19:    Compression: q̂^k = Q(q^k)
20:    e^{k+1} = q^k − q̂^k
21:    x̂^{k+1} = x̂^k + β q̂^k
22:    Broadcast q̂^k to the workers
23: end for
24: Output: x̂^K or any x̂_i^K

We propose DORE to address all the aforementioned issues. Our motivation is that the gradient should change smoothly for smooth functions, so each worker node can keep a state variable h_i^k to track its previous gradient information. As a result, the residual between the new gradient and the state h_i^k should decrease, and the compression variance of the residual can be well bounded. On the other hand, as the algorithm converges, the model only changes slightly. Therefore, we propose to compress the model residual so that its compression variance is minimized and well bounded. We also compensate the model residual compression error into the next iteration to achieve better convergence. Due to the advantages of the proposed double residual compression scheme, we can derive the fastest convergence rate through analyses without the bounded gradient assumption.

Below are the key steps of our algorithm, as shown in Algorithm 3.1 and Figure 3.1:

[lines 4-9]: each worker node sends the compressed gradient residual (Δ̂_i^k) to the master node and updates its state h_i^k with Δ̂_i^k;

[lines 13-15]: the master node gathers the compressed gradient residuals {Δ̂_i^k} from all worker nodes and recovers the averaged gradient ĝ^k based on its state h^k;

[line 16]: the master node applies a gradient descent step (possibly with the proximal operator);

[lines 18-22]: the master node broadcasts the compressed model residual with error compensation (q̂^k) to all worker nodes and updates the model;

[lines 10-11]: each worker node receives the compressed model residual (q̂^k) and updates its model x̂_i^k.

In the algorithm, the state h_i^k serves as an exponential moving average of the local gradient in expectation, i.e., E_Q h_i^{k+1} = (1 − α)h_i^k + α g_i^k, as proved in Lemma A.4.1. Therefore, as the iterate approaches the optimum, h_i^k also approaches the local gradient ∇f_i(x^*) rapidly, which contributes to a small gradient residual and consequently a small compression variance. Similar difference compression techniques are also proposed in DIANA and its variance-reduced variant [17, 77].

Note: the equations in the curly brackets of Algorithm 3.1 are only notation used in the proofs and do not need to be computed.
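To make the data flow of Algorithm 3.1 concrete, here is a minimal single-process NumPy sketch (illustrative only, not the implementation used in the experiments) of one DORE iteration in the smooth case R = 0, with the compression operator Q passed in as a function.

import numpy as np

def dore_step(grads, state, Q, alpha, beta, gamma, eta):
    # One DORE iteration in the smooth case (R = 0).  `grads` holds the local
    # stochastic gradients evaluated at the workers' model copies state['x_hat_i'].
    n = len(grads)
    # workers: compress and send the gradient residuals (lines 4-9)
    deltas_hat = []
    for i, g in enumerate(grads):
        d_hat = Q(g - state['h_i'][i])            # compressed gradient residual
        state['h_i'][i] = state['h_i'][i] + alpha * d_hat
        deltas_hat.append(d_hat)
    delta_hat_avg = np.mean(deltas_hat, axis=0)
    # master: recover the averaged gradient and take a step (lines 13-17)
    g_hat = state['h'] + delta_hat_avg
    x_new = state['x_hat'] - gamma * g_hat        # prox_{gamma R} is the identity here
    state['h'] = state['h'] + alpha * delta_hat_avg
    # master: compress the model residual with error compensation (lines 18-22)
    q = x_new - state['x_hat'] + eta * state['e']
    q_hat = Q(q)
    state['e'] = q - q_hat
    state['x_hat'] = state['x_hat'] + beta * q_hat
    # workers: receive q_hat and update their model copies (lines 10-11)
    for i in range(n):
        state['x_hat_i'][i] = state['x_hat_i'][i] + beta * q_hat
    return state

Here `state` carries h, the h_i's, x̂, the x̂_i's, and the compensation error e (all initialized as in line 1 of Algorithm 3.1, with e set to zero), and Q is any operator satisfying Assumption 3.3.1; in an actual distributed run the worker loops execute on the workers and the remaining lines on the master.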
3.3.2 Discussion

In this subsection, we provide more detailed discussions about DORE, including the model initialization, the model update, the special smooth case, as well as the compression rate of the communication.

Initialization. It is important to take the identical initialization x̂^0 for all worker and master nodes. This is easy to ensure by either setting the same random seed or broadcasting the model once at the beginning. In this way, although we do not need to broadcast the model parameters directly, every worker node updates the model x̂^k in the same way, so we can keep their model parameters identical. Otherwise, the model inconsistency needs to be considered.

Model update. It is worth noting that although we could choose the accurate model x^{k+1} as the next iterate at the master node, we use x̂^{k+1} instead. In this way, we ensure that the gradient descent step is applied based on the exact stochastic gradient, which is evaluated at x̂_i^k on each worker node. This dispels the intricacy of dealing with an inexact gradient evaluated at x^k and thus simplifies the convergence analysis.

Smooth case. In the smooth case, i.e., R = 0, Algorithm 3.1 can be simplified. The master node quantizes the recovered averaged gradient with error compensation and broadcasts it to all worker nodes. The details of this simplified case can be found in Appendix A.3.

Compression rate. The compression of the gradient information can reduce at most 50% of the communication cost since it only considers compression during gradient aggregation while ignoring the model synchronization. However, DORE can further cut down the remaining 50% of the communication. Taking the blockwise p-norm quantization as an example, every element of x can be represented by 3/2 bits using the simple ternary coding {0, ±1}, along with one magnitude per block. For example, if we consider a uniform block size b, the number of bits needed to represent a d-dimensional vector of 32-bit floating-point numbers can be reduced from 32d bits to 32(d/b) + (3/2)d bits. As long as the block size b is relatively large with respect to the constant 32, the cost 32(d/b) for storing the floating-point magnitudes is relatively small, so the compression rate is close to 32d/((3/2)d) ≈ 21.3 times (for example, 19.7 times when b = 256). Applying this quantization, QSGD, Terngrad, MEM-SGD, and DIANA need to transmit 32d + 32(d/b) + (3/2)d bits per iteration, and thus they are able to cut down 47% of the overall 2 × 32d bits per iteration through gradient compression when b = 256. With DORE, we only need to transmit 2(32(d/b) + (3/2)d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and the model transmission. More efficient coding techniques such as Elias coding [78] can be applied to further reduce the number of bits per iteration.

3.4 Convergence Analysis

To show the convergence of DORE, we make the following commonly used assumptions.

Assumption 3.4.1. Each worker node samples an unbiased estimator of the gradient stochastically with bounded variance, i.e., for i = 1, 2, · · · , n and ∀x ∈ R^d,

E[g_i | x] = ∇f_i(x),  E∥g_i − ∇f_i(x)∥^2 ≤ σ_i^2, (3.2)

where g_i is the estimator of ∇f_i at x. In addition, we define σ^2 = (1/n) Σ_{i=1}^n σ_i^2.

Assumption 3.4.2. Each f_i is L-Lipschitz differentiable, i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≤ f_i(y) + ⟨∇f_i(y), x − y⟩ + (L/2)∥x − y∥^2. (3.3)

Assumption 3.4.3. Each f_i is µ-strongly convex (µ ≥ 0), i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≥ f_i(y) + ⟨∇f_i(y), x − y⟩ + (µ/2)∥x − y∥^2. (3.4)

For simplicity, we use the same compression operator for all worker nodes, and the master node can apply a different compression operator. We denote the constants in Assumption 3.3.1 as Cq and Cqm for the worker and master nodes, respectively. Then we set α and β in both algorithms to satisfy

(1 − √(1 − 4Cq(Cq + 1)/(nc))) / (2(Cq + 1)) ≤ α ≤ (1 + √(1 − 4Cq(Cq + 1)/(nc))) / (2(Cq + 1)),
0 < β ≤ 1/(Cqm + 1), (3.5)

with c ≥ 4Cq(Cq + 1)/n.
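As a quick numeric illustration of condition (3.5) (the values of Cq, Cqm, and n below are assumed, not taken from the text), the admissible range of α and the bound on β can be computed as follows.

import math

def dore_alpha_beta_ranges(Cq, Cqm, n, c=None):
    # admissible alpha interval and beta bound from (3.5); c must satisfy
    # c >= 4*Cq*(Cq+1)/n, and by default we take twice that minimum (Cq > 0 assumed)
    c_min = 4.0 * Cq * (Cq + 1.0) / n
    c = 2.0 * c_min if c is None else c
    assert c >= c_min
    root = math.sqrt(1.0 - 4.0 * Cq * (Cq + 1.0) / (n * c))
    alpha_lo = (1.0 - root) / (2.0 * (Cq + 1.0))
    alpha_hi = (1.0 + root) / (2.0 * (Cq + 1.0))
    beta_max = 1.0 / (Cqm + 1.0)
    return (alpha_lo, alpha_hi), beta_max

# example with assumed constants Cq = Cqm = 1 and n = 20 workers:
print(dore_alpha_beta_ranges(Cq=1.0, Cqm=1.0, n=20))
# -> alpha roughly in (0.073, 0.427), beta <= 0.5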
We consider two scenarios in the following two subsections: f is strongly convex with a convex regularizer R and f is nonconvex with R = 0. 3.4.1 The Strongly Convex Case Theorem 3.4.1. Under Assumptions 3.3.1-3.4.3, if α and β in Algorithm 3.1 satisfy (3.5), η and γ satisfy  q −Cqm + (Cqm )2 + 4(1 − (Cqm + 1)β) η < min  , 2Cqm  4µL 2 , (3.6) (µ + L) (1 + cα) − 4µL η(µ + L) 2 ≤γ ≤ , (3.7) 2(1 + η)µL (1 + cα)(µ + L) then we have (1 + η)(1 + ncα) 2 2 Vk+1 ≤ ρk V1 + βγ σ , (3.8) n(1 − ρ) with Vk =β(1 − (Cqm + 1)β)E∥qk−1 ∥2 + E∥x̂k − x∗ ∥2 n (1 + η)cβγ 2 X + E∥hki − ∇fi (x∗ )∥2 , n i=1  (η 2 + η)C m 2(1 + η)βγµL  q ρ = max , 1 + ηβ − , 1 − α < 1. 1 − (Cqm + 1)β µ+L 37 Corollary 3.4.1. When there is no error compensation and we set η = 0, then ρ = max(1 − 2βγµL µ+L ,1 − α). If we further set 1 1 4Cq (Cq + 1) α= , β= , c= , (3.9) 2(Cq + 1) Cqm +1 n 2 and choose the largest stepsize γ = (µ+L)(1+2Cq /n) , the convergent factor is  (µ + L)2  1 Cq  (1 − ρ)−1 = max 2(Cq + 1), (Cqm + 1) + . (3.10) 2µL 2 n Remark 3.4.1. In particular, suppose {∆i }ni=1 are compressed using the Bernoulli p-norm quan- 1 tization with the largest block size dmax , then Cq = αw − 1, with ∥x∥22 αw = min ≤ 1. 0̸=x∈Rdmax ∥x∥1 ∥x∥p 1 Similarly, q is compressed using the Bernoulli p-norm quantization with Cqm = αm − 1. Then the linear convergent factor is n 2 1 (µ + L)2  1 2 2 o −1 (1 − ρ) = max , − + . (3.11) αw αm µL 2 n nαw n  o 2 µ+L 1 1 1 While the result of DIANA in [17] is max αw , µ 2 − n + nαw , which is larger than (3.11) with αm = 1 (no compression for the model). When there is no compression for ∆i , i.e., αw = 1, the algorithm reduces to the gradient descent, and the linear convergent factor is the same as that of the gradient descent for strongly convex functions. Remark 3.4.2. Although error compensation often improves the convergence in practice, in the- ory, no compensation, i.e., η = 0, provides the best convergence rate. This is because we don’t have much information of the error being compensated. Filling this gap will be an interesting future direction. 3.4.2 The Nonconvex Case with R = 0 Theorem 3.4.2. Under Assumptions 3.3.1-3.4.2 and the additional assumption that each worker samples the gradient from the full dataset, we set α and β according to (3.5). By choosing q 2 2 m 2 n −1 + 1 + 48L β C(Cmq +1) 1 o q γ ≤ min , , 12Lβ(Cqm + 1) 6Lβ(1 + cα)(Cqm + 1) 38 we have β K 2 − 3(1 + cα)(Cqm + 1)Lβ 2 γ X E∥∇f (x̂k )∥2 K k=1 Λ1 − ΛK+1 3(Cqm + 1)(1 + ncα)Lβ 2 σ 2 γ ≤ + , (3.12) γK n where Λk =(Cqm + 1)Lβ 2 ∥qk−1 ∥2 + f (x̂k ) − f ∗ n 2 21 X + 3c(Cqm + 1)Lβ γ E∥hki ∥2 . (3.13) n i=1 1 1 4Cq (Cq +1) Corollary 3.4.2. Let α = 2(Cq +1) ,β = Cqm +1 , and c = n , then 1 + ncα is a fixed constant. 1 √ If γ = , when K is relatively large, we have 12L(1+cα)(1+ K/n) K 1 X 1 1 E∥∇f (x̂k )∥2 ≲ +√ . (3.14) K k=1 K Kn √ Remark 3.4.3. The dominant term in (3.14) is O(1/ Kn), which implies that the sample com- plexity of each worker node is O(1/(nϵ2 )) in average to achieve an ϵ-accurate solution. It shows that, same as DoubleSqueeze in [18], DORE is able to perform linear speedup. Furthermore, this convergence result is the same as the P-SGD without compression. Note that DoubleSqueeze 2 has an extra term (1/K) 3 , and its convergence requires the bounded variance of the compression operator. 3.5 Numerical Experiments In this section, we validate the theoretical results and demonstrate the superior perfor- mance of DORE. 
Our experimental results demonstrate that (1) DORE achieves similar convergence speed as full-precision SGD and state-of-the-art quantized SGD baselines and (2) its iteration time is much smaller than that of most existing algorithms, supporting the superior communication efficiency of DORE.

To make a fair comparison, we choose the same Bernoulli ∞-norm quantization as described in Section 3.3 because it is unbiased and commonly used, and the quantization block size is 256 for all experiments unless explicitly stated otherwise. The parameters α, β, η for DORE are chosen to be 0.1, 1, and 1, respectively.

Table 3.1: A comparison between related algorithms. The second column indicates whether the algorithm compresses only the gradients (Uni.) or both the gradients and the model (Bi.). DORE converges linearly to the O(σ) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE does not require the assumptions of bounded compression error or bounded gradient.

Algorithm       Direction   Compression           Linear Rate   Nonconvex Rate
QSGD            Uni.        2-norm quantization   N/A           1/K + B
DIANA           Uni.        p-norm quantization   ✓             1/√(Kn) + 1/K
DoubleSqueeze   Bi.         bounded variance      N/A           1/√(Kn) + 1/K^{2/3} + 1/K
DORE            Bi.         Assumption 3.3.1      ✓             1/√(Kn) + 1/K

The baselines we choose to compare include SGD, QSGD [13], MEM-SGD [15], DIANA [17], DoubleSqueeze, and DoubleSqueeze (topk) [18]. SGD is the vanilla SGD without any compression, and QSGD quantizes the gradient directly. MEM-SGD is QSGD with error compensation. DIANA, which only compresses and transmits the gradient difference, is a special case of the proposed DORE. DoubleSqueeze quantizes both the gradient on the workers and the averaged gradient on the server with error compensation. Although DoubleSqueeze is claimed to work well with both biased and unbiased compression, in our experiments it converges much slower and suffers a loss of accuracy with unbiased compression. Thus, we also compare with DoubleSqueeze using the top-k compression as presented in [18].

3.5.1 Strongly convex

To verify the convergence for strongly convex and smooth objective functions, we conduct the experiment on a linear regression problem: f(x) = ∥Ax − b∥^2 + λ∥x∥^2. The data matrix A ∈ R^{1200×500} and the optimal solution x^* ∈ R^{500} are randomly synthesized. Then we generate the prediction b by sampling from a Gaussian distribution whose mean is Ax^*. The rows of the data matrix A are allocated evenly to 20 worker nodes.
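For concreteness, the synthetic problem and its allocation to the workers can be generated along the following lines (a NumPy sketch; the noise scale and the value of λ are assumptions, since they are not specified here).

import numpy as np

rng = np.random.default_rng(0)
n_workers, n_samples, dim = 20, 1200, 500

# synthesize f(x) = ||Ax - b||^2 + lam * ||x||^2
A = rng.standard_normal((n_samples, dim))
x_star = rng.standard_normal(dim)
b = A @ x_star + rng.standard_normal(n_samples)   # mean Ax*, unit noise variance (assumed)
lam = 0.1                                          # assumed value of lambda

# allocate the rows of A (and the entries of b) evenly to the 20 workers
A_parts = np.array_split(A, n_workers)
b_parts = np.array_split(b, n_workers)

def local_full_gradient(i, x):
    # full local gradient of ||A_i x - b_i||^2 + (lam / n_workers) * ||x||^2,
    # splitting the regularizer evenly across workers (a design choice)
    Ai, bi = A_parts[i], b_parts[i]
    return 2.0 * Ai.T @ (Ai @ x - bi) + 2.0 * (lam / n_workers) * x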
Figure 3.2: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE (time in seconds vs. bandwidth in 1s/Mbit). It is tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.

To better verify the linear convergence to the O(σ) neighborhood around the optimal solution, we take the full gradient in each node for all algorithms to exclude the effect of the gradient variance (σ = 0). As shown in Figure 3.3, with full gradients and a constant learning rate, DORE converges linearly, the same as SGD and DIANA, but QSGD, MEM-SGD, DoubleSqueeze, as well as DoubleSqueeze (topk) only converge to a neighborhood of the optimal point. This is because these algorithms assume bounded gradients and their convergence errors depend on that bound. Although they converge to the optimal solution using a diminishing stepsize, their convergence rates will be much slower.

In addition, we also validate that the norms of the gradient and model residuals decrease exponentially, which explains the linear convergence behavior of DORE. For more details, please refer to Appendix A.1.

3.5.2 Nonconvex

To verify the convergence in the nonconvex case, we test the proposed DORE with two classical deep neural networks on two representative datasets, i.e., LeNet [79] on MNIST and Resnet18 [80] on CIFAR10. In the experiment, we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU. The batch size for each worker node is 256. We use 0.1 and 0.01 as the initial learning rates for LeNet and Resnet18, and decrease them by a factor of 0.1 after every 25 and 100 epochs, respectively. All parameter settings are the same for all algorithms.

Figure 3.3: Linear regression on synthetic data (∥x^k − x^*∥^2 vs. iterations; (a) learning rate = 0.05, (b) learning rate = 0.025). When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.

Figure 3.4: LeNet trained on MNIST ((a) training loss, (b) test loss). DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method while having similar performance to DoubleSqueeze (topk).

Figures 3.4 and 3.5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on the CIFAR10 dataset. The results indicate that in the nonconvex case, even with both compressed gradient and model information, DORE can still achieve similar convergence speed as full-precision SGD and other quantized SGD variants. DORE achieves a much better convergence speed than DoubleSqueeze using the same compression method and converges similarly to DoubleSqueeze with top-k compression as presented in [18]. We also validate via the parameter sensitivity analysis in Appendix A.2 that DORE performs consistently well under different parameter settings such as the compression block size, α, β, and η.

Figure 3.5: Resnet18 trained on CIFAR10 ((a) training loss, (b) test loss). DORE achieves similar convergence and accuracy as most baselines. DoubleSqueeze converges slower and suffers from a higher loss, but it works well with top-k compression.
3.5.3 Communication Efficiency In terms of communication cost, DORE enjoys the benefit of extremely efficient commu- nication. As one example, under the same setting as the Resnet18 experiment described in the previous section, we test the time cost per iteration for SGD, QSGD, and DORE un- der varied network bandwidth. We didn’t test MEM-SGD, DIANA, and DoubleSqueeze because MEM-SGD, DIANA have similar time cost as QSGD while DoubleSqueeze has similar time cost as DORE. The result showed in Figure 3.2 indicates that as the band- width becomes worse, with both gradient and model compression, the advantage of DORE becomes more remarkable compared to the baselines that don’t apply compres- sion for model synchronization . 43 3.6 Conclusion Communication cost is the severe bottleneck for distributed training of modern large- scale machine learning models. Extensive works have compressed the gradient informa- tion to be transferred during the training process, but model compression is rather limited due to its intrinsic difficulty. In this chapter, we proposed the Double Residual Compres- sion SGD named DORE to compress both gradient and model communication that can mitigate this bottleneck prominently. The theoretical analyses suggest good convergence rate of DORE under weak assumptions. Furthermore, DORE is able to reduce 95% of the communication cost while maintaining similar convergence rate and model accuracy compared with the full-precision SGD. 44 CHAPTER 4 PROX-LEAD: A LINEAR CONVERGENT CONVERGENT ALGORITHM WITH EFFICIENT COMMUNICATION FOR COMPOSITE PROBLEMS 4.1 Introduction In recent years, the communication cost has become the bottleneck in the distributed training of machine learning models, given that the computation becomes much faster with powerful computing devices such as GPUs and TPUs. Therefore, the communi- cation efficiency gains increasing attention in the algorithm design. On the one hand, decentralized communication has been shown to be an effective and important direc- tion for improving communication efficiency [28]. On the other hand, various com- munication compression techniques, such as quantization and sparsification, which are originally developed for centralized settings [12, 13, 14, 15, 16, 17, 19, 18], have been shown to be significant in reducing the communication cost for decentralized optimiza- tion [20, 21, 24, 25, 33, 81]. These works have exhibited the great potential of decentralized optimization with communication compression in speeding up decentralized machine learning. The composite problem (1.2) abstracts many important applications such as regular- ized empirical risk minimization in statistics and machine learning, and optimal con- trol of multi-agent systems. More specifically, we consider two different settings on the smooth components of the objective function, i.e., the general stochastic setting and the finite-sum setting. In the general stochastic setting, the problem follows (1.2) where the local functions {fi } are defined as the expectation over the general local sample distribu- tions {Di }. In the finite-sum setting, we consider the discrete distribution on local nodes, and the local functions {fi } are defined as the unweighted average over local samples, fi (x) = m1 m P j=1 fij (x), where each fij stands for the loss function defined on the jth batch 45 of samples at node i. In this work, we assume that the number of batches m is the same for all nodes for simplicity. 
Note that it can be easily generalized to the case when the nodes have different numbers of batches. Contribution. The main contribution of this chapter is a Proximal gradient LinEAr convergent Decentralized algorithm with compression, Prox-LEAD, which solves the problem (1.2). Specifically, the contributions can be summarized as follows: • We propose the first decentralized stochastic proximal gradient algorithm with com- pressed communication, Prox-LEAD. It converges linearly up to the neighborhood of the optimal solution in the general stochastic setting. In the finite-sum setting, we establish the linear convergence to the exact solution for Prox-LEAD’s two variance reduction variants, i.e., Loopless SVRG and SAGA. • We provide a rigorous theory of Prox-LEAD on the compressed communication and convergence complexities in different settings on the smooth component of the problem. Without the restriction on data heterogeneity and gradient bound- edness, Prox-LEAD maintains a comparable convergence rate compared to the un- compressed counterpart. Our theorems indicate that Prox-LEAD works with arbi- trary compression precision. Moreover, with reasonably aggressive compression, Prox-LEAD significantly reduces the communication cost almost for free. • Our algorithmic framework builds bridges between many known algorithms. With- out involving compression, it provides stochastic and variance reduction variants of some deterministic algorithms, and it has a better convergence complexity against the existing non-accelerated stochastic decentralized algorithms. When the nons- mooth regularizer is absent, it reduces to LEAD and achieves a better convergence complexity. The framework also enlightens other primal-dual algorithms to apply 46 compressed communication and reduce the impact of inexact primal and dual iter- ations. • We present comprehensive experiments to verify our theorems and the effectiveness of the proposed algorithm in different settings. The comparison with state-of-art al- gorithms demonstrates the superiority of Prox-LEAD and its stochastic variants. Moreover, it is robust to parameter tuning, which exhibits great advantages in prac- tice. The rest of this chapter is organized as follows. Section 4.1.1 and Section 4.1.2 sum- marize related works and introduce the notations used in this chapter, respectively. In Section 4.2, the proposed algorithm Prox-LEAD, motivation and derivation, as well as convergence complexities are presented. The assumptions on regularity are introduced in Section 4.3, and the convergence analyses and major theorems are illustrated in Sec- tion 4.4. Importantly, we also detail the connection with existing algorithms in Section B.2. Finally, numerical experiments are presented in Section 4.5. 4.1.1 Related Work Many algorithms were proposed to solve the decentralized optimization of the average of functions defined over agents in networks. The early decentralized algorithms can be traced back to the work by [82]. DGD [26] is the most classical decentralized algo- rithm. It is intuitive and simple but converges slowly due to the diminishing stepsize that is needed to obtain the optimal solution [27]. Its stochastic version D-PSGD [28] has been shown effective for training nonconvex deep learning models. 
Algorithms based on primal-dual formulations or gradient tracking are proposed to eliminate the convergence bias in DGD-type algorithms and improve the convergence rate, such as D-ADMM [83], DLM [29], EXTRA [1], NIDS [2], D2 [84], Exact Diffusion [53], NEXT [50], DIGing [45], Harnessing [46], SONATA [85], GSGT [86], OPTRA [87], etc. There are also dual-based 47 methods which apply gradient methods on the dual formulation [55, 88, 56]. These algo- rithms are able to achieve optimal bounds but requires computing the non-trivial gradient of the dual function. To improve the communication efficiency, communication compression is first applied to decentralized settings by [20]. It proposes two algorithms, i.e., DCD-SGD and ECD- SGD, which require compression of high accuracy and are not stable with aggressive com- pression. [21, 22] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize and the convergence is slow. DeepSqueeze [23] compensates the com- pression error to the compression in the next iteration. Motivated by the quantized av- erage consensus algorithms, such as the work in [89], the quantized gossip algorithm Choco-Gossip [24] converges linearly to the consensual solution. Combining Choco- Gossip and D-PSGD leads to a decentralized algorithm with compression, Choco-SGD, which converges sublinearly under the strong convexity and gradient boundedness as- sumptions. Its nonconvex variant is further analyzed in [25]. A new compression scheme using the modulo operation is introduced in [90] for decentralized optimization. A gen- eral algorithmic framework aiming to maintain the linear convergence of distributed opti- mization under compressed communication is considered in [91]. It requires a contractive property that is not satisfied by many decentralized algorithms including the algorithms proposed in this chapter. A preliminary version of this work [33] proposes the first linear convergent decentralized algorithm with compression, LEAD. However, the composite problem is not considered in LEAD, and LEAD is deficient in dealing with stochastic gra- dients. [81] introduces a linear convergent algorithm for decentralized optimization with communication compression based on a primal-dual decentralized algorithm [92]. Com- munication compression is also proposed for gradient tracking algorithms [93, 94, 95], but they require double communication cost since two vectors need to be transmitted in each communication run. Variance reduction techniques such as SAG [96], SVRG [97], SAGA [98] and Loop- 48 less SVRG [99] have been introduced to accelerate stochastic optimization problems with finite-sum structure. SVRG-type gradient estimator requires more gradient evaluation but they are memory friendly. SAGA reduces the number of gradient evaluation in each iteration but it requires more memory space. Variance reduction has been applied to de- centralized optimization [100, 101, 102, 103, 104, 81]. Decentralized algorithms such as PG-EXTRA [57] and NIDS [2] are proposed for com- posite optimization, and the sublinear rate is proved when the smooth component is strongly convex and smooth. [58] introduces a proximal gradient algorithm (P2D2), [105] enhances the convergece rate of SONATA, and the linear convergence rate is proved in both literature when the nonsmooth component is shared across all nodes. 
A proximal unified decentralized algorithm (PUDA) [106] and another proximal decentralized algo- rithmic framework [107] unify many existing algorithms and establish linear convergence for composite optimization when the nonsmooth component is shared. A decentralized accelerated proximal gradient descent algorithm in proposed in [108]. However, commu- nication compression and stochastic optimization are not considered in these algorithms. 4.1.2 Notation We clarify commonly used notation in this section. We use bold lower-case letters to denote column vectors in Rp and bold upper-case letters for matrices in Rn×p . The lower- case letter with a subscript will be the corresponding row of a matrix denoted by the same letter in the upper-case. For example, in the algorithm, we use xi ∈ Rp for the local copy of the model parameters at node i and X = [x1 , x2 , · · · , xn ]⊤ is the matrix, whose rows are these local copies. Throughout the chapter, without any specification, we use ⟨·, ·⟩ as standard vector/matrix inner product, whose form is dependent on context. Similarly, p ∥ · ∥P is a vector/matrix norm defined as ⟨·, P(·)⟩ for a positive semi-definite matrix P ∈ Rn×n , with some specific domain. We use M† to denote the pseudo-inverse of a matrix M ∈ Rn×n . If M is symmetric, we use λmax (M), λi (M) and λmin (M) to denote the 49 largest, the ith-largest and the smallest nonzero eigenvalues of M, respectively. For the stochastic approximate of the local gradient, we use ∇fi (xi , ξi ) ∈ Rp as the estimate of the deterministic gradient ∇fi (xi ) ∈ Rp of xi at node i. With the collection of all local estimates {∇fi (xi , ξi )}ni=1 for {∇fi (xi )}ni=1 , we define ∇F(X) ∈ Rn×p and ∇F(X, ξ) ∈ Rn×p as the compact matrix form of them. The commonly used all-zero vector(matrix) and all-one vector are 0 ∈ Rp (Rn×p ) and 1 ∈ Rn respectively. The proximal operator with 1 parameter η > 0 of a function r : Rp → R is proxηr = arg minz∈Rp r(z) + 2η ∥z − x∥2 . Finally, we use [n] to replace {1, · · · , n} for abbreviation. 4.2 The Proposed Algorithms An equivalent form of the problem (1.2) reformulating the decentralized constraint via a mixing matrix is provided as follows: Xn Xn X∗ = arg min fi (xi ) + r(xi ), (4.1) (I−W)X=0 i=1 i=1 X∈Rn×p | {z } | {z } =:F(X) =:R(X) where X = [x1 , · · · , xn ]⊤ is the collection of local xi s and W is a symmetric matrix which restricts the feasible region of the above problem to the subspace {1x⊤ ∈ Rn×p | ∀x ∈ Rp }. The optimal solution X∗ = 1(x∗ )⊤ is consensual and provides an optimal solution x∗ ∈ Rp to the problem (1.2). The detailed assumptions on W are shown below. Assumption 4.2.1 (Mixing matrix). The graph G = {V, E} is undirected and connected with V = [n] := {1, 2, · · · , n}. The mixing matrix W = [wij ] ∈ Rn×n is symmetric and satisfies 1. wij = 0 if i ̸= j, (i, j) ̸∈ E and wij > 0 if (i, j) ∈ E. 2. −1 < λn (W) ≤ · · · ≤ λ2 (W) < λ1 (W) = 1 and W1 = 1. We propose a Proximal gradient LinEAr convergent Decentralized algorithm with compressed communication, Prox-LEAD, to solve the problem (4.1). Our algorithm ex- tends LEAD proposed in [33] to deal with the nonsmooth component via the proximal 50 operator, and to accelerate the stochastic optimization in the finite-sum setting. Moreover, when the compression is absent and the gradient oracle returns full gradient, Prox-LEAD is shown to be covered by the proximal unified decentralized framework in [106]. 
Algo- rithm 4.1 illustrates the prototype of Prox-LEAD in the compact form, and it reduces to LEAD by setting R = 0. In Algorithm 4.1, line 6 uses the gradient from the stochastic gradient oracle to create an auxiliary variable Z as the information to be communicated. The COMM procedure compresses the difference between Z and a state variable H by a unbiased operator satis- fying the following condition. Assumption 4.2.2 (Unbiased compression operator). The stochastic operator Q : Rp → Rp is unbiased, i.e., EQ(x) = x, and there exists C ≥ 0 such that E∥x − Q(x)∥2 ≤ C∥x∥2 , ∀x ∈ Rp . In particular, when C = 0, we treat Q as the identity operator. The benefit of this difference compression is to reduce the compression error asympo- totically [17, 33]. Since the following variance of the stochastic estimation is dependent on the distance between Zk+1 and Hk , E∥Ẑk+1 − Zk+1 ∥2 = E∥Q(Zk+1 − Hk ) − (Zk+1 − Hk )∥2 , | {z } =O(∥Zk+1 −Hk ∥) the variance of the compression will vanish as Z and H converge to the same point. The updates in line 8 and line 9 essentially compensate the compression error locally. Note that the COMM procedure is first proposed in the preliminary version of this work [33], and please to refer to Section 3.1 in [33] for a detailed explanation for the reduced com- pression error and implicit error compensation. Lines 8 and 9 of Algorithm 4.1 proceed in parallel after the compressed communi- cation is exchanged. The proximal mapping in Line 10 is applied to the rows of Vk+1 51 separately, which is defined as      (xk+1 1 ) ⊤   proxηr (v1k+1 )⊤   ..   = ..  .   .   .      ⊤ (xk+1 n ) proxηr (vnk+1 )⊤ Algorithm 4.1: Prox-LEAD Input: Stepsize η, parameter α, γ, initial X0 , H1 , D1 = 0 Output: XK or 1/n ni=1 XK P i 1: H1w = WH1 2: Z1 = X0 − η∇F(X0 , ξ 0 ) 3: X1 = proxηR (Z1 ) 4: for k = 1, 2, · · · , K − 1 do 5: Gk = SGO(Xk ) 6: Zk+1 = Xk − ηGk − ηDk 7: Ẑk+1 , Ẑk+1 w ,H k+1 , Hk+1 w = COMM(Zk+1 , Hk , Hkw ) γ 8: D k+1 k = D + 2η (Ẑ k+1 − Ẑk+1 W ) 9: Vk+1 = Zk+1 − γ2 (Ẑk+1− Ẑk+1 W ) 10: Xk+1 = proxηR Vk+1 11: end for Algorithm 4.2: Compressed Communication Procedure (COMM) 1: procedure COMM(Zk+1 , Hk , Hkw ) 2: Qk = Q(Zk+1 − Hk ) ▷ Compression k+1 k k 3: Ẑ =H +Q 4: Ẑw = Hkw + WQk k+1 ▷ Communication 5: Hk+1 = (1 − α)Hk + αẐ 6: Hk+1 w = (1 − α)Hkw + αẐw 7: Return: Ẑk+1 , Ẑk+1 w ,H k+1 , Hk+1 w 8: end procedure 4.2.1 Prox-LEAD as Inexact PUDA The problem (4.1) can be reformulated into the following three operators splitting 1 minimize F(X) + ι0 (B 2 X) + R(X), (4.2) X∈R n×p 52 The finite-sum setting The general setting Loopless SVRG SAGA Sample ξi ∼ Di Sample l ∈ [m] ∼ Pi randomly Sample ω ∈ {0, 1} ∼ Bernoulli (p), 1 1 gi = (∇fil (xi ) − ∇fil (x̃il )) gi = (∇fil (xi ) − ∇fil (x̃i )) mpil gi = ∇fi (xi , ξi ). mpil m 1 X + ∇fi (x̃i ), + ∇fij (x̃ij ), m j=1 x̃i = ω · xi + (1 − ω) · x̃i . x̃il = xi . Table 4.1: Stochastic gradient oracle (SGO). where ι0 is the indicator function taking zero value at 0 and infinity elsewhere. Many existing schemes such as Condat-Vu in [109, 110], PDFP in [111] and PD3O in [112] are applicable to this problem but we choose to adapt the inexact PDHG with a single proximal gradient step and derive the following iteration which is not the realiza- tion of any scheme mentioned above.   k+1 X  = Xk − η∇F(Xk ) − ηDk , λ   k+1 D = Dk + (I − W)Xk+1 ,   2  k+1 V = Xk − η∇F(Xk ) − ηDk+1 (4.3)    ηλ  = I− (I − W) Xk+1 ,  2    Xk+1 = proxηR (Vk+1 ). 
Compared to LEAD, the third step is reformulated to fully depend on X but the num- ber of communication is unchanged, and a proximal map is applied on V directly in the final step. The iteration (4.3) can be shown as a special case of PUDA in [106] which has global linear convergence for strongly convex F. The benefit of this iteration is to maintain the consensus of X in optimality, which further implies the consensus of V and X. As shown in Section 4.4, the consensus of X is the key to the linear convergence of Prox-LEAD. It also explains why we need the same nonsmooth function r for all nodes. 53 If we compress the only communication step involving X via COMM procedure, (D, V) is updated by the inexact information with the controllable compression error. Therefore, we regard Prox-LEAD as inexact PUDA in terms of the inexact dual and prox- imal steps. 4.2.2 Stochastic Gradient Oracle As shown in Algorithm 4.1, Prox-LEAD uses a stochastic gradient oracle (SGO) to esti- mate the gradient, and different stochastic estimators are listed in Table 4.1 for gradient estimation. The stochastic gradient oracle returns three types of estimation. In the general setting, each node uses sample distribution Di to provide an unbiased stochastic gradient and the variance exists. In the finite-sum setting, we assume each node will construct a discrete distribution Pi = {pil : l ∈ [m]} for m mini-batches and the stochastic gradi- ent will be corrected by two different variance reduction schemes: Loopless SVRG and SAGA. For Loopless SVRG (LSVRG), each node will have a reference point x̃i where the full gradient is evaluated after a random period. A random variable l ∈ [m] will be sam- pled first with distribution Pi , then a Bernoulli random variable ω will be sampled to determine whether the reference point will be update or not. If ω = 1, the reference is replaced by the latest xi , otherwise unchanged. For SAGA, each node will have m ref- erence points, {x̃ij : j ∈ [m]}. After the index l is sampled, x̃il will be replaced by the latest xi while remaining reference points are unchanged. Both schemes use reference points to correct the gradient by variance reduction. Loopless SVRG is memory-friendly but requires more gradient evaluations. Empirically, SAGA converges faster in terms of the number of gradient evaluations, but it requires more memory space. The following Table 4.2 summarizes the convergence complexity of Prox-LEAD in different settings. 54 Algorithm Convergence complexity Prox-LEAD O(C e + κf + κg + Cκf κg ) Theorem 4.4.1 Prox-LEAD LSVRG e + κf + κg + Cκf κg + p−1 ) O(C Theorem 4.4.2 Prox-LEAD SAGA O(C e + κf + κg + Cκf κg + m) Theorem 4.4.3 Table 4.2: Summary of the convergence compleixty for Prox-LEAD to achieve ϵ-accuracy with fiexed stepsizes. The first row is the complexity with the full gradient. We define condition numbers, κf and κg , as follows L λmax (I − W) κf = , κg = , µ λmin (I − W) where L and µ are regularity constants of the objective function assumed in Assump- tion 4.3.2. When the compression procedure and the regularizer are removed, the com- plexities of LEAD LSVRG and LEAD SAGA are reduced to O(κ e f + κg + p−1 ) and O(κ e f+ κg + m) respectively, which are better than LessBit-Option D in [81] in terms of the condi- tional numbers, and it improves over the stochastic LEAD in [33]. 
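To make the finite-sum oracle in Table 4.1 concrete, the following is a minimal NumPy sketch (illustrative only; the class name and interface are ours) of the Loopless SVRG estimator for a single node with uniform sampling probabilities p_il = 1/m.

import numpy as np

class LooplessSVRG:
    # Loopless SVRG oracle for one node (cf. Table 4.1), uniform p_il = 1/m
    def __init__(self, grad_fns, x0, p=0.1, rng=np.random.default_rng()):
        self.grad_fns = grad_fns              # the m batch-gradient functions of f_ij
        self.m = len(grad_fns)
        self.p = p                            # reference refresh probability
        self.rng = rng
        self.x_ref = x0.copy()                # reference point x_tilde_i
        self.full_ref = self._full_grad(self.x_ref)

    def _full_grad(self, x):
        return sum(g(x) for g in self.grad_fns) / self.m

    def gradient(self, x):
        l = self.rng.integers(self.m)         # sample a batch index uniformly
        g = self.grad_fns[l](x) - self.grad_fns[l](self.x_ref) + self.full_ref
        if self.rng.random() < self.p:        # omega ~ Bernoulli(p): refresh the reference
            self.x_ref = x.copy()
            self.full_ref = self._full_grad(self.x_ref)
        return g

As described above, the SAGA variant keeps one reference point per batch and replaces only the sampled one at each call, trading extra memory for fewer gradient evaluations per iteration.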
4.3 Assumptions on Regularity In the general stochastic setting, each fi (x) in (1.2) is the expectation of the local loss func- tion under the sample distribution Di , and we make the following inter-node assumption. Assumption 4.3.1 (Locally bounded gradient variance). In the general stochastic setting, each local stochastic gradient ∇fi (x, ξi ) is an unbiased estimate, i.e., Eξi ∇fi (x, ξi ) = ∇fi (x), and satisfies E∥∇fi (x∗ , ξi ) − ∇fi (x∗ )∥2 ≤ σi2 , where x∗ is the optimal solution to the problem (1.2). This locally bounded variance at the optimal point is strictly weaker than the uni- formly bounded variance assumption [33]. 55 Given a smooth convex function f , we define the Bregman distance with respect to f as Vf (x, y) = f (x) − f (y) − ⟨∇f (y), x − y⟩, ∀x, y ∈ Rp . With the above definition, we impose different assumptions to the regularity of smooth function component in two settings. Assumption 4.3.2 (Strong convexity and smoothness). Each fi is a smooth, µ-strongly con- vex function, i.e., ∀x, y ∈ Rp , µ fi (x) − fi (y) − ⟨∇fi (y), x − y⟩ ≥ ∥x − y∥2 . (4.4) | {z } 2 =Vfi (x,y) In the general stochastic setting, each fi is L-smooth in expectation, i.e., ∀x, y ∈ Rp , E∥∇fi (x, ξi ) − ∇fi (y, ξi )∥2 ≤ 2L[fi (x) − fi (y) − ⟨∇fi (y), x − y⟩] . (4.5) | {z } =2LVfi (x,y) In the finite-sum setting, the L-smoothness is imposed on each fij instead, i.e., ∀x, y ∈ Rp , ∥∇fij (x) − ∇fij (y)∥2 ≤ 2L[fij (x) − fij (y) − ⟨∇fij (y), x − y⟩] . (4.6) | {z } =2LVfij (x,y) Note the inequalities in (4.4) and (4.5) are equivalent to, ∀X, Y ∈ Rn×p , µ VF (X, Y) = F(X) − F(Y)− ⟨∇F(Y), X − Y⟩ ≥ ∥X − Y∥2 , 2 Xn 2 E∥∇F(X, ξ) − ∇F(Y, ξ)∥ ≤ 2L Vfi (xi , yi ) = 2LVF (X, Y). i=1 Remark 4.3.1. In [76], the above two types of L-smoothness assumptions are discussed, and the latter is used to derive an improved convergence rate of SGD. The expected L-smoothness is shown to be significantly weaker than the most commonly-used L-smoothness of fi (x, ξi ) for all ξi . 4.4 Convergence Analysis In this section, we present the convergence of Prox-LEAD in both stochastic scenarios. We first show two fundamental lemmas regarding the conditional expectation on the 56 compression operator, then we present the two scenarios in Sections 4.4.1 and 4.4.2, re- spectively. More specifically, in Section 4.4.1, the linear convergence to the neighborhood will be shown under the general stochastic setting, while in Section 4.4.2, two variance reduction schemes are used to exploit the exact linear convergence in the problems with finite-sum structure. The stochastic actions such as compression and gradient estimation generate two se- quences of σ-algebra where the stochastic variables in this procedure are adapted. We use F k to denote the σ-algebra of gradient estimation at kth step and Hk is the σ-algebra of stochastic compression at the same step. {F k } and {Hk } satisfy F 1 ⊂ H1 ⊂ F 2 ⊂ H2 ⊂ · · · ⊂ F k ⊂ Hk ⊂ · · · . With these notations, we can clarify the stochastic dependencies among the variables gen- erated by the algorithm. For example, tuple (Gk , Zk+1 ) is measurable in F k and tuple (Ẑk+1 , Hk+1 , Dk+1 , Vk+1 , Xk+1 ) is measurable in Hk . Throughout the section, we use E to denote the conditional expectations EF k and EHk given the context for simplicity. Then we define some auxiliary constants related to the optimal solution X∗ to the problem (4.1). X∗ is consensual, i.e., each row x∗ solves the problem (1.2). We let z∗ = x∗ − (η/n) ni=1 ∇fi (x∗ ), which is in the pre-image of proxηr at P x∗ . 
In addition, we let η ⊤ Z∗ = 1(z∗ )⊤ = X∗ − 11 ∇F(X∗ ), (4.7) n 1 D∗ = (X∗ − Z∗ ) − ∇F(X∗ ) η  1  = − I − 11⊤ ∇F(X∗ ). (4.8) n To show the convergence of the proposed algorithm, we characterizes the decrease of ∥Zk − Z∗ ∥, ∥Dk − D∗ ∥(I−W)† , ∥Xk − X∗ ∥, and ∥Hk − Z∗ ∥. The convergence of Hk to Z∗ shows the decrease of the compression error to zero. The following lemma shows the connection of those values for the k and k + 1-th iteration. 57 Lemma 4.4.1 (One-step progress). Let {(Zk , Dk , Xk )} be the sequence generated from Algo- rithm 4.1. Under Assumption 4.2.1, taking the expectation conditioned on the stochastic compres- 2 sion operator at the k-th iteration, we have for any γ ≤ λmax (I−W) , ∥Zk+1 − Z∗ ∥2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + η 2 ∥Dk − D∗ ∥2 − 2η⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ ⟩, (4.9) E∥Dk+1 − D∗ ∥2(I−W)† γ2 γ2 = ∥Dk − D∗ ∥2(I−W)† − γ∥Dk − D∗ ∥2 + ∥Z k+1 − Z ∗ 2 ∥ I−W + E∥Ẑk+1 − Zk+1 ∥2I−W 4η 2 4η 2 γ + ⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ )⟩, (4.10) η and γ2 E∥Xk+1 − X∗ ∥2 ≤ ∥Zk+1 − Z∗ ∥2I− γ (I−W) + E∥Ẑk+1 − Zk+1 ∥2(I−W)2 . (4.11) 2 4 Proof. (i) The equality (4.9) is shown from Line 6 of Algorithm 4.1 and (4.8) directly. (ii) Note that (I − W)Z∗ = 0. Then from Line 8 of Algorithm 4.1, we have γ Dk+1 − D∗ = Dk − D∗ + (I − W)(Ẑk+1 − Z∗ ), 2η which gives E∥Dk+1 − D∗ ∥2(I−W)† γ γ2 = ∥Dk − D∗ ∥2(I−W)† + ⟨Dk − D∗ , Zk+1 − Z∗ ⟩ + 2 ∥Zk+1 − Z∗ ∥2I−W η 4η 2 γ + 2 E∥Ẑk+1 − Zk+1 ∥2I−W 4η γ = ∥Dk − D∗ ∥2(I−W)† − γ∥Dk − D∗ ∥2 + ⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ )⟩ η 2 2 γ γ + 2 ∥Zk+1 − Z∗ ∥2I−W + 2 E∥Ẑk+1 − Zk+1 ∥2I−W , 4η 4η where the second equality comes from Line 6 of Algorithm 4.1. (iii) From the definition of Z∗ , we have X∗ = proxηR (Z∗ ). 58 Therefore, E∥Xk+1 − X∗ ∥2 = ∥proxηR (Vk+1 ) − proxηR (Z∗ )∥2 ≤ E∥Vk+1 − Z∗ ∥2 γ  2 = E Zk+1 − Z∗ − (I − W)(Ẑk+1 − Z∗ ) 2  γ  2 γ2 = I − (I − W) (Zk+1 − Z∗ ) + E∥(I − W)Ẑk+1 − Zk+1 ∥2 2 4 2 γ ≤ ∥Zk+1 − Z∗ ∥2I− γ (I−W) + E∥(I − W)Ẑk+1 − Zk+1 ∥2 , 2 4 where the first inequality comes from the non-expansiveness of proxηR and the last in- equality follows from  γ 2 γ2 I − (I − W) = I − γ(I − W) + (I − W)2 2 4 γ γ 1  γ  1 = I − (I − W) − (I − W) 2 I − (I − W) (I − W) 2 2 2 2 γ ≼ I − (I − W), 2 since γ2 (I − W) ≼ I. The inequality (4.11) is obtained. The next lemma builds the critical inequality for Algorithm 4.1 and serves as the key step in the proofs. We define the following two positive (semi)definite matrices γ P := I − (I − W), Q := (I − W)† 2 to measure the convergence. Lemma 4.4.2 (The key inequality). Under Assumptions 4.2.1 and 4.2.2, we choose α ∈ (0, (1 + C)−1 ) such that ∆(α) := α − (1 + C)α2 > 0 and  2  1 2∆(α)ηµ  η ∈ 0, , γ ∈ 0, . µ λmax (I − W) ∆(α)ηµ + 4αC 59 Then the sequence {(Ẑk , Zk , Dk , Hk )} generated by Algorithm 4.1 satisfies  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 , (4.12) where the expectation is conditioned on the stochastic compression at k-th step and (1 − γ2 λmax (I − W))ηµ M := > 0. 2αC In particular, when C = 0, i.e., Ẑk = Zk , (4.12) still holds as  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q + CM E∥Hk+1 − Z∗ ∥2 2 γ 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I + (1 − α)CM ∥Hk − Z∗ ∥2 (4.13) γ 2 with (1 − γ2 λmax (I − W))ηµ CM := > 0. 2α 2 Proof. 
Letting (4.9) + (4.10)× 2ηγ , we get 2η 2 ∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q γ 2η 2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q − η 2 ∥Dk − D∗ ∥2 γ γ + E∥Ẑk+1 − Zk+1 ∥2I−W 2 2η 2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I + E∥Ẑk+1 − Zk+1 ∥2I−P , (4.14) γ 2 60 From the update of H in Line 5 of the compression procedure ‘COMM’, we obtain E∥Hk+1 − Z∗ ∥2 = ∥(1 − α)(Hk − Z∗ ) + α(Zk+1 − Z∗ )∥2 + α2 E∥Qk − EQk ∥2 = (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − α(1 − α)∥Zk+1 − Hk ∥2 + α2 E∥Ẑk+1 − Zk+1 ∥2 ≤ (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − α(1 − α)∥Zk+1 − Hk ∥2 + α2 CE∥Zk+1 − Hk ∥2 = (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − ∆(α)∥Zk+1 − Hk ∥2 where the second equality uses ∥(1 − α)x + αy∥2 = (1 − α)∥x∥2 + α∥y∥2 − α(1 − α)∥x − y∥2 , and the inequality is from Assumption 4.2.2. Multiplying the H-inequality by C and adding it to the following inequality ∆(α)E∥Ẑk+1 − Zk+1 ∥2 ≤ ∆(α)C∥Zk+1 − Hk ∥2 , we have ∆(α)E∥Ẑk+1 − Zk+1 ∥2 + CE∥Hk+1 − Z∗ ∥2 ≤ (1 − α)C∥Hk − Z∗ ∥2 + αC∥Zk+1 − Z∗ ∥2 αC ≤ (1 − α)C∥Hk − Z∗ ∥2 + ∥Zk+1 − Z∗ ∥2P , (4.15) λmin (P) if γ < 2 λmax (I−W) since P ≻ 0 and λmax (P−1 ) = λ−1 min (P). (1− γ λ (I−W))ηµ When C > 0, let M = λmin2αC(P)ηµ = 2 max 2αC > 0, then (4.14) + M × (4.15) gives  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2(∆(α)M −1)I+P 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 . Notice that for the given range of γ, we have ∆(α)M I ≽ 2(I − P), 61 hence (4.12) is proved. Lastly, when C = 0, we have Ẑk+1 = Zk+1 for all k. Multiplying the H-inequality by (1− γ2 λmax (I−W))ηµ CM := 2α and adding it to (4.14), we complete the proof. 4.4.1 The General Stochastic Setting In the general stochastic setting with fi (xi ) = Eξi ∼Di fi (xi , ξi ), the assumptions on fi and the variance of the stochastic gradient at the optimal point allow us to show the linear convergence of Algorithm 4.1 up to a neighborhood of the optimal point. Theorem 4.4.1 (Prox-LEAD). Under Assumptions 4.2.1–4.3.2, let {(Xk , Dk , Zk , Ẑk , Hk )} be the sequence generated by Algorithm 4.1 and σ 2 = n1 ni=1 σi2 . Set α ∈ (0, (1 + C)−1 ) such that P ∆(α) = α − (1 + C)α2 > 0 and we can choose  1 i  1 2∆(α)ηµ  η ∈ 0, , γ ∈ 0, , 2L λmax (I − W) ∆(α)ηµ + 4αC then, in total expectation, 1 1 2η 2 σ 2 EΦk+1 ≤ ρk EΦ1 + , n n 1−ρ where  ηµ  k 2η 2 k Φ := 1 − ∥Z − Z∗ ∥2P + ∥Dk − D∗ ∥2Q + CM ∥Hk − Z∗ ∥2 + ∥Ẑk − Zk ∥2I−P 2 γ and   ηµ γ γλmax (I − W) ρ = max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) < 1. 2 − ηµ 2 2 Proof. In Lemma 4.4.2, we derive the one-step progress inequality in expectation condi- tioned on the stochastic compression at kth step and we now focus on the term involving X and Gk in (4.12). 
62 Take the conditional expectation on stochastic gradient at kth step, we have E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 = ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + η 2 E∥∇F(Xk , ξ k ) − ∇F(X∗ )∥2 ≤ ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + 2η 2 E∥∇F(Xk , ξ k ) − ∇F(X∗ , ξ k )∥2 + 2η 2 E∥∇F(X∗ , ξ k ) − ∇F(X∗ )∥2 ≤ ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + 4η 2 LVF (Xk , X∗ ) + 2nη 2 σ 2 = ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk )⟩ + 2η(F(Xk ) − F(X∗ ) − VF (Xk , X∗ )) + 4η 2 LVF (Xk , X∗ ) + 2nη 2 σ 2 = ∥Xk − X∗ ∥2 − 2η(F(X∗ ) − F(Xk ) − ⟨X∗ − Xk , ∇F(Xk )⟩) − 2η(1 − 2ηL)VF (Xk , X∗ ) + 2nη 2 σ 2 ≤ (1 − ηµ)∥Xk − X∗ ∥2 + 2nη 2 σ 2 , where the first equality uses the unbiasedness of stochastic gradient, the second inequal- ity follows the expected Lipschitz property in Assumption 4.3.2, and the last inequality is 1 due to the strong convexity and η ≤ 2L . Now we use the power property to take the conditional expectation on the stochastic gradient for (4.12) and plug the above inequality into it, then  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ (1 − ηµ)∥Xk − X∗ ∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 + 2nη 2 σ 2 . 63 Taking the total expectation and using (4.11), we get  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + E∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM E∥Hk − Z∗ ∥2 + (1 − ηµ)E∥Ẑk − Zk ∥2(I−P)2 + 2nη 2 σ 2 . (4.16) Let Φk be defined as above, by (4.16), we have  k+1 ηµ γ EΦ ≤ max 1 − , 1 − λmin (I − W), 1 − α, 2 − ηµ 2  γλmax (I − W) (1 − ηµ) EΦk + 2nη 2 σ 2 . 2 Finally, by taking the telescopic sum, we complete the proof. When there is no compression, Prox-LEAD is reduced to the special case of PUDA with the stochastic gradient. Corollary 4.4.1 shows the linear convergence to the neigh- borhood of the optimal solution and when the gradient is deterministic, the convergence rate matches that given in [106]. Corollary 4.4.1 (Stochastic PUDA). When there is no compression, i.e., C = 0, under Assump- 1  tions 4.2.1, 4.3.1 and 4.3.2, we can pick α = 1 and γ = 1. Then, for any η ∈ 0, 2L   ηµ λmin (I − W) EΦ k+1 ≤ max 1 − ,1 − EΦk + 2nη 2 σ 2 , 2 − ηµ 2 with  ηµ  k ∗ 2 ∗ 2  λmax (I − W)  ηµ k k Φ := 1 − 2 k ∥Z − Z ∥ I+W + 2η ∥D − D ∥Q + 1 − ∥Z − Z∗ ∥2 . 2 2 2 2 Proof. Similar to the proof of Theorem 4.4.1, when C = 0, we considers inequality (4.13) instead. Notice that in this case (4.11) becomes ∥Xk+1 − X∗ ∥2 ≤ ∥Zk+1 − Z∗ ∥2P . 64 Hence we can get  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 2η 2 ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + E∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM E∥Hk − Z∗ ∥2 + 2nη 2 σ 2 , (1− γ2 λmax (I−W))ηµ where CM = 2α . We can take α = 1 − ϵ for some ϵ ∈ (0, 1). Note that ∆(α) approaches 1 − ϵ as C goes 2 to 0. The upper bound of γ is reduced to λmax (I−W) , which is strictly greater than 1 due to Assumption 4.2.1. Hence we can take γ = 1 and the convergence rate becomes   ηµ λmin (I − W) max 1 − ,1 − ,ϵ . 2 − ηµ 2 Let ϵ approach 0 and notice that Hk = Zk for all k if α = 1, then we complete the proof. Corollary 4.4.1 can be viewed as the limit case of Theorem 4.4.1 with (1 − γ2 λmax (I − W))ηµ CM = , Ẑ = Z, H = Z. 2α 1 1 2 By taking α = 2(1+C) ,η = 2L and γ = (1+16Cκf )λmax (I−W) , Theorem 4.4.1 shows the conver- gence complexity of Prox-LEAD is O(C e +κf +κg +Cκf κg ) when full gradient is used. This complexity significantly improves that of LEAD, which is O((1 e + C)(κf + κg ) + Cκf κg ). 
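As a small illustration of the parameter choice behind this complexity statement, the sketch below computes α = 1/(2(1+C)), η = 1/(2L), and γ = 2/((1 + 16Cκ_f)λ_max(I − W)), together with κ_f and κ_g, for a given mixing matrix. The 8-node ring with weight 1/3 matches the setup of Section 4.5; the values of C, L, and µ are assumed for illustration.

import numpy as np

def prox_lead_params(C, L, mu, W):
    # parameter choice used for the complexity statement above
    eigs = np.linalg.eigvalsh(np.eye(W.shape[0]) - W)
    lam_max = eigs.max()
    lam_min_nz = eigs[eigs > 1e-12].min()     # smallest nonzero eigenvalue of I - W
    kappa_f, kappa_g = L / mu, lam_max / lam_min_nz
    alpha = 1.0 / (2.0 * (1.0 + C))
    eta = 1.0 / (2.0 * L)
    gamma = 2.0 / ((1.0 + 16.0 * C * kappa_f) * lam_max)
    return alpha, eta, gamma, kappa_f, kappa_g

# assumed example: 8-node ring with mixing weight 1/3, C = 1, L = 1, mu = 0.1
n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
print(prox_lead_params(C=1.0, L=1.0, mu=0.1, W=W))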
A complete comparison can be found in Table B.1.

4.4.2 Finite-sum Setting with Variance Reduction

In the finite-sum setting, we assume that each f_i is the average of m functions f_ij and impose the commonly assumed L-Lipschitz continuity on ∇f_ij. We keep using Φ^k defined in Theorem 4.4.1. The following theorems implicitly assume C > 0, while the convergence results extend to the case without compression. The detailed proofs of the two theorems are given in Sections B.3 and B.4 of the appendix.

Theorem 4.4.2 (Loopless SVRG). Under Assumptions 4.2.1, 4.2.2, and 4.3.2, for any p ∈ (0, 1), set p_ij = 1/m for all i ∈ [n] and j ∈ [m], and
η = 1/(6L),   α = 1/(2(1 + C)),   γ = 2 / (λ_max(I − W)(1 + 48Cκ_f)).
Then, in total expectation, we have
E Φ̃^{k+1} ≤ (1 − [max{12κ_f − 1, κ_g + 48Cκ_f κ_g, 1 + C, 6κ_f, 2/p}]^{−1})^{k+1} Φ̃^0,
where Φ̃^k := Φ^k + (2/(9pL)) Σ_{i=1}^n V_{f_i}(x̃_i^k, x^*).

Theorem 4.4.3 (SAGA). Under Assumptions 4.2.1, 4.2.2, and 4.3.2, set p_ij = 1/m for all i ∈ [n] and j ∈ [m], and
η = 1/(6L),   α = 1/(2(1 + C)),   γ = 2 / (λ_max(I − W)(1 + 48Cκ_f)).
Then, in total expectation, we have
E Φ̃^{k+1} ≤ (1 − [max{12κ_f − 1, κ_g + 48Cκ_f κ_g, 1 + C, 6κ_f, 2m}]^{−1})^{k+1} Φ̃^0,
where Φ̃^k := Φ^k + (2/(9L)) Σ_{i=1}^n Σ_{j=1}^m V_{f_ij}(x̃_ij^k, x^*).

Using Õ(·) as the abbreviation of O((·) log(1/ε)), we simplify the above two theorems into the following corollary.

Corollary 4.4.2. We can achieve ∥x_i^k − x^*∥^2 ≤ ε in expectation on each node after
• K = Õ(C + κ_f + κ_g + Cκ_f κ_g + p^{−1}) iterations for Loopless SVRG and
• K = Õ(C + κ_f + κ_g + Cκ_f κ_g + m) iterations for SAGA.

Proof. From the definition of Φ^k, we have
∥Z^k − Z^*∥_P^2 ≤ (12κ_f / (12κ_f − 1)) Φ^k
and
(γ^2/4) ∥Ẑ^k − Z^k∥_{(I−W)^2}^2 ≤ ∥Ẑ^k − Z^k∥_{I−P}^2 ≤ Φ^k.
Using (4.11) in Lemma 4.4.1, we have
(1/n) Σ_{i=1}^n E∥x_i^k − x^*∥^2 ≤ (1/n) E∥Z^k − Z^*∥_P^2 + (γ^2/(4n)) E∥Ẑ^k − Z^k∥_{(I−W)^2}^2 ≤ ((24κ_f − 1)/(n(12κ_f − 1))) EΦ^k.
Lastly, the convergence complexity follows from Theorem 4.4.2 and Theorem 4.4.3.

Remark 4.4.1. In particular, when there is no compression, i.e., C = 0, and the network is fully connected, i.e., κ_g = 1, the complexity of Loopless SVRG reduces to Õ(κ_f + p^{−1}), which matches that given in [99], and the complexity of SAGA reduces to Õ(κ_f + m), as shown in [98].

4.5 Numerical Experiments

In this section, we present numerical experiments to validate the convergence of the proposed algorithms, including LEAD in the smooth case and Prox-LEAD in the nonsmooth case, as well as their stochastic variants.

4.5.1 Experimental Setting

Baselines. To demonstrate the effectiveness of the proposed algorithms, we compare them with the following baselines: 1) two state-of-the-art decentralized algorithms with compression: Choco [24] and LessBit [81]; 2) three non-compressed algorithms: DGD [27], NIDS [2], and P2D2 [58]. Note that NIDS and P2D2 support the nonsmooth case. In the stochastic case, we also include LessBit with Option C (LessBit-SGD) and Option D (LessBit-LSVRG).

Setup. We consider eight machines connected in a ring topology network. Each agent can only exchange information with its two 1-hop neighbors, and the mixing weight is simply set as 1/3. For compression, we use the unbiased b-bit quantization method with ∞-norm scaling
Q_∞(x) := (∥x∥_∞ sign(x) / 2^{b−1}) · ⌊ 2^{b−1} |x| / ∥x∥_∞ + u ⌋,   (4.17)
where · is the Hadamard product, |x| is the elementwise absolute value of x, and u is a random vector uniformly distributed in [0, 1]^p. Only sign(x), the norm ∥x∥_∞, and the integers inside the floor brackets need to be transmitted. Note that this quantization method is similar to the quantization used in QSGD [13] and Choco [24], but we use the ∞-norm scaling instead of the 2-norm.
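To make the compression operator concrete, the following is a minimal NumPy sketch of the unbiased b-bit quantization with ∞-norm scaling in (4.17) and its blockwise application; the function names are hypothetical and this is only an illustration of the operator, not the implementation used for the experiments reported below.

import numpy as np

def quantize_inf_norm(x, b=2, rng=np.random.default_rng()):
    # Unbiased b-bit quantization with infinity-norm scaling, cf. (4.17):
    # only sign(x), the scalar ||x||_inf, and small integers are transmitted.
    norm = np.max(np.abs(x))
    if norm == 0.0:
        return np.zeros_like(x)
    levels = 2 ** (b - 1)
    u = rng.uniform(0.0, 1.0, size=x.shape)        # dithering noise in [0, 1)
    q = np.floor(levels * np.abs(x) / norm + u)    # stochastic rounding to integer levels
    return norm / levels * np.sign(x) * q          # unbiased: E[Q(x)] = x

def quantize_blockwise(x, b=2, block=256):
    # Apply the quantizer block by block (block size 256 in the experiments).
    out = np.empty_like(x, dtype=float)
    for s in range(0, x.size, block):
        out[s:s + block] = quantize_inf_norm(x[s:s + block], b)
    return out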
This small change brings a significant improvement in compression precision, as justified both theoretically and empirically in Appendix C of [33]. In this section, we choose 2-bit quantization and quantize the data in a blockwise manner (block size = 256). For all experiments, we tune the stepsize η in the range [0.01, 0.1]. For LEAD and Prox-LEAD, we simply fix α = 0.5 and γ = 1.0 in all experiments since they are very robust to the parameter settings. The parameters γ in Choco and θ in LessBit are tuned from {0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 1.0}.

Logistic regression. We consider a regularized logistic regression problem with a cross-entropy objective function
f(X) = −(1/m) Σ_{i=1}^m Σ_{j=1}^C y_{i,j} log(a_i^⊤ X_j) + λ_1 ∥X∥_1 + λ_2 ∥X∥_2^2,
where a_i ∈ R^p, y ∈ R^{m×C}, X ∈ R^{p×C}, and C is the number of classes. We use the MNIST dataset and distribute the samples equally to all machines in a non-iid way, sorted by their labels. Note that this is the heterogeneous data setting, where the data distributions of the agents are very different; it is more challenging than the homogeneous setting, where all agents share the same data distribution. In the smooth case, we set λ_1 = 0 and λ_2 = 0.005, and in the nonsmooth case, we set λ_1 = 0.005 and λ_2 = 0.005. In the stochastic case, the training data in each agent are evenly divided into 15 mini-batches. The performance is measured by the training suboptimality, i.e., ∥X^k − X^*∥_F^2, where X^* denotes the optimal solution.

[Figure 4.1 about here: training suboptimality in four panels — (a) full gradient versus epoch, (b) full gradient versus communicated bits, (c) stochastic gradient versus number of gradient evaluations, (d) stochastic gradient versus communicated bits — comparing DGD (32bit), NIDS (32bit), Choco (2bit), LessBit/LessBit-SGD/LessBit-LSVRG (2bit), and the 32bit and 2bit versions of LEAD, LEAD-SGD, LEAD-SAGA, and LEAD-LSVRG.]

Figure 4.1: Smooth logistic regression problem (λ_1 = 0). In the full gradient case ((a) and (b)), LEAD (2bit) and LessBit (2bit) converge similarly to NIDS (32bit) and LEAD (32bit) in terms of epochs/iterations, but they require much fewer bits in communication. In the stochastic case ((c) and (d)), the 2bit variants of LEAD match well with their 32bit variants in terms of the number of gradient evaluations, but they require much fewer bits. Note that LEAD-SAGA requires more memory and more iterations/communication than LEAD-LSVRG, but it computes only one gradient in each iteration, while LEAD-LSVRG computes at least two gradients in each iteration.
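For completeness, the communication and data setup described in Section 4.5.1 — a ring of eight agents with mixing weight 1/3 and a label-sorted non-iid split of the data — can be sketched as follows; the helper names are hypothetical and the snippet is only illustrative, not the code used to produce the figures.

import numpy as np

def ring_mixing_matrix(n=8, w=1.0 / 3.0):
    # Symmetric, doubly stochastic mixing matrix for a ring: each agent
    # averages itself and its two 1-hop neighbors with weight 1/3.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = w
        W[i, (i - 1) % n] = w
        W[i, (i + 1) % n] = w
    return W

def non_iid_split(labels, n_agents=8):
    # Sort sample indices by label and deal out equal contiguous chunks,
    # so each agent only sees a few classes (heterogeneous data setting).
    order = np.argsort(labels)
    return np.array_split(order, n_agents)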
[Figure 4.2 about here: training suboptimality in four panels — (a) full gradient versus epoch, (b) full gradient versus communicated bits, (c) stochastic gradient versus number of gradient evaluations, (d) stochastic gradient versus communicated bits — comparing P2D2 (32bit), NIDS (32bit), and the 32bit and 2bit versions of Prox-LEAD, Prox-LEAD-SGD, Prox-LEAD-SAGA, and Prox-LEAD-LSVRG.]

Figure 4.2: Nonsmooth logistic regression problem (λ_1 = 0.005). In the full gradient case ((a) and (b)), Prox-LEAD (2bit) converges similarly to NIDS and Prox-LEAD (32bit) in terms of epochs/iterations, but it requires much fewer bits than the other three algorithms. In the stochastic case ((c) and (d)), the 2bit variants match well with their 32bit variants in terms of the number of gradient evaluations, but they require much fewer bits. Note that Prox-LEAD-SAGA requires more memory and more iterations/communication than Prox-LEAD-LSVRG, but it computes only one gradient in each iteration, while Prox-LEAD-LSVRG computes at least two gradients in each iteration.

Smooth case. The experiments in the smooth case are shown in Fig. 4.1. From Fig. 4.1a, we observe that when full gradients are available, NIDS, LessBit and LEAD converge linearly to the optimal solution, while DGD and Choco suffer from the convergence bias. Fig. 4.1b demonstrates the benefit of communication compression when considering the suboptimality with respect to the communication bits. Note that the performance of LEAD (32bit) matches well with LEAD (2bit), which validates that the compression does not hurt the convergence of LEAD, while the communication bits are significantly reduced.

Fig. 4.1c and Fig. 4.1d show the performance of the algorithms with stochastic gradients. We make the following observations: 1) The performance of LEAD-SGD (2bit), LEAD-SAGA (2bit) and LEAD-LSVRG (2bit) matches well with that of LEAD-SGD (32bit), LEAD-SAGA (32bit) and LEAD-LSVRG (32bit), respectively, which indicates that the compression in LEAD does not hurt the convergence; 2) The linear convergence of the variance-reduction variants, such as LEAD-SAGA (2bit) and LEAD-LSVRG (2bit), verifies our theoretical analyses¹; 3) LEAD-SAGA (2bit) and LEAD-LSVRG (2bit) significantly outperform all baselines². The benefit of communication compression is clearly illustrated by Fig. 4.1d.

Nonsmooth case. The experiments in the nonsmooth case are shown in Fig. 4.2. Fig. 4.2a shows that Prox-LEAD (2bit) achieves linear convergence to the optimal solution with the full gradient, and its performance matches well with that of the non-compressed version Prox-LEAD (32bit). It also converges similarly to other non-compressed baselines such as P2D2 and NIDS. Fig. 4.2b demonstrates the tremendous advantage of communication compression in Prox-LEAD (2bit) when considering the communication bits.

Fig. 4.2c and Fig. 4.2d present the performance with stochastic gradients. It can be observed from Fig. 4.2c that: 1) Prox-LEAD-SAGA (2bit) and Prox-LEAD-LSVRG (2bit) maintain linear convergence with communication compression and stochastic gradients; 2) The compressed versions of Prox-LEAD all match well with the non-compressed versions. Fig. 4.2d shows that the advantages of communication compression in Prox-LEAD are also very significant in terms of communication bits.

¹ LEAD-SAGA outperforms LEAD-LSVRG in terms of the number of gradient computations since LEAD-SAGA computes fewer gradient evaluations in each iteration by sacrificing memory space. However, LEAD-LSVRG outperforms LEAD-SAGA in terms of communication bits since the extra gradient computation in LEAD-LSVRG does not increase the communication cost but improves the convergence speed. A similar phenomenon is also observed in the nonsmooth case in Fig. 4.2c and Fig. 4.2d.
² LEAD-SGD (2bit) and LEAD-LSVRG (2bit) outperform LessBit-SGD (2bit) and LessBit-LSVRG (2bit), respectively, which shows the advantage of the extra gradient descent step in LEAD, as discussed in Section B.2. Although LessBit-LSVRG (2bit) has the same communication cost in bits as LEAD-SAGA (2bit), LEAD-SAGA (2bit) requires about half as many gradient evaluations as LessBit-LSVRG (2bit).

To summarize, the experiments in this section verify the theoretical linear convergence of the proposed algorithm when a nonsmooth objective, stochastic gradients and communication compression are present. They also suggest state-of-the-art performance in comparison with strong baseline algorithms.

4.6 Conclusion

In this chapter, we consider the decentralized stochastic composite optimization problem. A decentralized proximal stochastic gradient algorithm with communication compression, Prox-LEAD, is proposed to improve the communication efficiency and convergence rates. We provide rigorous theoretical analyses and convergence complexities for the proposed algorithm in the general stochastic setting and the finite-sum setting. We establish the linear convergence rate with variance reduction schemes and well-controlled compression error. Both the theorems and the numerical experiments demonstrate the effectiveness of Prox-LEAD in reducing the communication cost and its advantages over existing algorithms. Moreover, our algorithmic framework builds bridges between many known algorithms, and it potentially enlightens the communication compression for other primal-dual algorithms.

APPENDICES

APPENDIX A

SUPPLEMENTARY OF CHAPTER 3

A.1 Compression Error

The property of the compression operator indicates that the compression error is proportional to the squared norm of the variable being compressed: E∥Q(x) − x∥^2 ≤ C∥x∥^2. We visualize the norm of the variables being compressed, i.e., the gradient residual (on the worker side) and the model residual (on the master side) for DORE, as well as the error-compensated gradient (on the worker side) and the averaged gradient (on the master side) for DoubleSqueeze. As shown in Figure A.1, the gradient and model residuals of DORE decrease exponentially, so the compression errors vanish. For DoubleSqueeze, however, these norms only decrease to a certain level, and the compression error does not vanish. This explains why algorithms without residual compression cannot converge linearly to the O(σ) neighborhood of the optimal solution in the strongly convex case.

[Figure A.1 about here: norm versus iteration on (a) the worker side and (b) the master side for DoubleSqueeze, DoubleSqueeze (topk), and DORE.]

Figure A.1: The norm of the compressed variable in linear regression.
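The bound E∥Q(x) − x∥^2 ≤ C∥x∥^2 above is also easy to check numerically for a concrete operator; the short sketch below (NumPy, with a hypothetical 2-bit ∞-norm quantizer in the spirit of (4.17)) estimates the ratio by Monte Carlo and is meant only as an illustration.

import numpy as np

def q_inf(x, b=2, rng=np.random.default_rng()):
    # Unbiased b-bit quantizer with infinity-norm scaling (cf. (4.17)).
    s = np.max(np.abs(x))
    levels = 2 ** (b - 1)
    return s / levels * np.sign(x) * np.floor(levels * np.abs(x) / s + rng.uniform(size=x.shape))

x = np.random.default_rng(1).standard_normal(10_000)
# Monte Carlo estimate of E||Q(x) - x||^2 / ||x||^2; the ratio stays bounded
# by a constant C that does not depend on the scale of x.
ratio = np.mean([np.sum((q_inf(x) - x) ** 2) for _ in range(200)]) / np.sum(x ** 2)
print(ratio)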
A.2 Parameter Sensitivity

Continuing the MNIST experiment in Section 3.5, we further conduct a parameter analysis on DORE. The basic settings for the block size, learning rate, α, β, and η are 256, 0.1, 0.1, 1, and 1, respectively. We change each parameter individually. Figures A.2, A.3, A.4, and A.5 demonstrate that DORE performs consistently well under different parameter settings.

[Figure A.2 about here: (a) training loss and (b) test loss versus epoch for block sizes 128, 256, and 512.]
Figure A.2: Training under different compression block sizes.

[Figure A.3 about here: (a) training loss and (b) test loss versus epoch for α ∈ {0.1, 0.2, 0.3}.]
Figure A.3: Training under different α.

[Figure A.4 about here: (a) training loss and (b) test loss versus epoch for β ∈ {0.8, 0.9, 1.0}.]
Figure A.4: Training under different β.

[Figure A.5 about here: (a) training loss and (b) test loss versus epoch for η ∈ {0.8, 0.9, 1.0}.]
Figure A.5: Training under different η.

A.3 DORE in the Smooth Case

Algorithm A.1: DORE with R(x) = 0
1: Input: stepsizes α, β, γ, η; initialize h^0 = h_i^0 = 0^d and x̂_i^0 = x̂^0 for all i ∈ {1, . . . , n}.
2: for k = 1, 2, · · · , K − 1 do
3:   For each worker i = 1, 2, · · · , n:
4:     Sample g_i^k such that E[g_i^k | x̂_i^k] = ∇f_i(x̂_i^k)
5:     Gradient residual: ∆_i^k = g_i^k − h_i^k
6:     Compression: ∆̂_i^k = Q(∆_i^k)
7:     h_i^{k+1} = h_i^k + α ∆̂_i^k
8:     {ĝ_i^k = h_i^k + ∆̂_i^k}
9:     Send ∆̂_i^k to the master
10:    Receive q̂^k from the master
11:    x̂_i^{k+1} = x̂_i^k + β q̂^k
12:   For the master:
13:    Receive the ∆̂_i^k's from the workers
14:    ∆̂^k = (1/n) Σ_{i=1}^n ∆̂_i^k
15:    ĝ^k = h^k + ∆̂^k {= (1/n) Σ_{i=1}^n ĝ_i^k}
16:    h^{k+1} = h^k + α ∆̂^k
17:    q^k = −γ ĝ^k + η e^k
18:    Compression: q̂^k = Q(q^k)
19:    e^{k+1} = q^k − q̂^k
20:    Broadcast q̂^k to the workers
21: end for
22: Output: any x̂_i^K

A.4 Proof of Theorem 3.4.1

We first provide two lemmas. We define E_Q, E_k, and E to be the expectation taken over the quantization, over the kth iteration conditioned on x̂^k, and the overall expectation, respectively.

Lemma A.4.1. For every i, we can estimate the first two moments of h_i^{k+1} as
E_Q h_i^{k+1} = (1 − α) h_i^k + α g_i^k,   (A.1)
E_Q ∥h_i^{k+1} − s_i∥^2 ≤ (1 − α) ∥h_i^k − s_i∥^2 + α ∥g_i^k − s_i∥^2 + α[(C_q + 1)α − 1] ∥∆_i^k∥^2.   (A.2)

Proof. The first equality follows from lines 5–7 of Algorithm 3.1 and Assumption 3.3.1. For the second inequality, we use the variance decomposition
E∥X∥^2 = ∥EX∥^2 + E∥X − EX∥^2   (A.3)
for any random vector X. By taking X = h_i^{k+1} − s_i, we get
E_Q ∥h_i^{k+1} − s_i∥^2 = ∥(1 − α)(h_i^k − s_i) + α(g_i^k − s_i)∥^2 + α^2 E_Q ∥∆̂_i^k − ∆_i^k∥^2.   (A.4)
Using the basic equality
∥λa + (1 − λ)b∥^2 + λ(1 − λ)∥a − b∥^2 = λ∥a∥^2 + (1 − λ)∥b∥^2   (A.5)
for all a, b ∈ R^d and λ ∈ [0, 1], as well as Assumption 3.3.1, we have
E_Q ∥h_i^{k+1} − s_i∥^2 ≤ (1 − α)∥h_i^k − s_i∥^2 + α∥g_i^k − s_i∥^2 − α(1 − α)∥∆_i^k∥^2 + α^2 C_q ∥∆_i^k∥^2,   (A.6)
which is the inequality (A.2).

Next, from the variance decomposition (A.3), we also derive Lemma A.4.2.

Lemma A.4.2. The following inequality holds:
E∥ĝ^k − h^*∥^2 ≤ E∥∇f(x̂^k) − h^*∥^2 + (C_q / n^2) Σ_{i=1}^n E∥∆_i^k∥^2 + σ^2 / n,   (A.7)
where h^* = ∇f(x^*) = (1/n) Σ_{i=1}^n h_i^* and σ^2 = (1/n) Σ_{i=1}^n σ_i^2.

Proof. By taking the expectation over the quantization of g, we have
E∥ĝ^k − h^*∥^2 = E∥g^k − h^*∥^2 + E∥ĝ^k − g^k∥^2 ≤ E∥g^k − h^*∥^2 + (C_q / n^2) Σ_{i=1}^n E∥∆_i^k∥^2,   (A.8)
where the inequality is from Assumption 3.3.1.
For ∥gk − h∗ ∥, we take the expectation over the sampling of gradients and derive E∥gk − h∗ ∥2 = E∥∇f (x̂k ) − h∗ ∥2 + E∥gk − ∇f (x̂k )∥2 σ2 ≤ E∥∇f (x̂k ) − h∗ ∥2 + (A.9) n by Assumption 3.4.1. Combining (A.8) with (A.9) gives (A.7). Proof of Theorem 3.4.1. We consider xk+1 − x∗ first. Since x∗ is the solution of (1.1), it satis- fies x∗ = proxγR (x∗ − γh∗ ). (A.10) 78 Hence E∥xk+1 − x∗ ∥2 =E∥proxγR (x̂k − γĝk ) − proxγR (x∗ − γh∗ )∥2 ≤E∥x̂k − x∗ − γ(ĝk − h∗ )∥2 =E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ĝk − h∗ ⟩ + γ 2 E∥ĝk − h∗ ∥2 =E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ + γ 2 E∥ĝk − h∗ ∥2 , (A.11) where the inequality comes from the non-expansiveness of the proximal operator and the last equality is derived by taking the expectation of the stochastic gradient ĝk . Combining (A.7) and (A.11), we have E∥xk+1 − x∗ ∥2 ≤E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ n n γ2 X ∗ 2 Cq γ 2 X γ2 + k E∥∇fi (x̂ ) − hi ∥ + 2 E∥∆ki ∥2 + σ 2 . (A.12) n i=1 n i=1 n Then we consider E∥x̂k+1 − x∗ ∥2 . According to Algorithm 3.1, we have: EQ [x̂k+1 − x∗ ] = x̂k + βqk − x∗ = (1 − β)(x̂k − x∗ ) + β(xk+1 − x∗ + ηek ) (A.13) where the expectation is taken on the quantization of qk . By variance decomposition (A.3) and the basic equality (A.5), E∥x̂k+1 − x∗ ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + βE∥xk+1 + ηek − x∗ ∥2 − β(1 − β)E∥qk ∥2 + β 2 Cqm E∥qk ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + (1 + η 2 ϵ)βE∥xk+1 − x∗ ∥2 − β(1 − (Cqm + 1)β)E∥qk ∥2 1 + (η 2 + )βCqm E∥qk−1 ∥2 , (A.14) ϵ where ϵ is generated from Cauchy inequality of inner product. For convenience, we let ϵ = η1 . 79 1 Choose a β such that 0 < β ≤ 1+Cqm . Then we have β(1 − (Cqm + 1)β)E∥qk ∥2 + E∥x̂k+1 − x∗ ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + (1 + η)βE∥xk+1 − x∗ ∥2 + (η 2 + η)βCqm E∥qk−1 ∥2 . (A.15) Letting si = h∗i in (A.2), we have n (1 + η)cβγ 2 X E∥hk+1 i − h∗i ∥2 n i=1 n n (1 + η)(1 − α)cβγ 2 X k ∗ 2 (1 + η)αcβγ 2 X k ≤ ∥hi − hi ∥ + ∥gi − h∗i ∥2 n i=1 n i=1 n (1 + η)α[(Cq + 1)α − 1]cβγ 2 X k 2 + ∥∆i ∥ . (A.16) n i=1 Then we let Rk = β(1 − (Cqm + 1)β)E∥qk ∥2 and define Vk = Rk−1 + E∥x̂k − x∗ ∥2 + (1+η)cβγ 2 Pn k ∗ 2 n i=1 E∥hi − hi ∥ . Thus, we obtain Vk+1 ≤(η 2 + η)βCqm E∥qk−1 ∥2 + (1 + ηβ)E∥x̂k − x∗ ∥2 n ∗ ∗ (1 + η)(1 − α)cβγ 2 X k k − 2(1 + η)βγE⟨x̂ − x , ∇f (x̂ ) − h ⟩ + E∥hki − h∗i ∥2 n i=1 2h n (1 + η)βγ 2 i X + nc(C q + 1)α − ncα + C q E∥∆ki ∥2 n2 i=1 n (1 + η)(1 + cα) 2 X (1 + η)(1 + ncα) 2 2 + βγ E∥∇fi (x̂k ) − h∗i ∥2 + βγ σ . (A.17) n i=1 n The E∥∆ki ∥2 -term can be ignored if nc(Cq + 1)α2 − ncα + Cq ≤ 0, which can be guaran- 4Cq (Cq +1) teed by c ≥ n and  q q  4Cq (Cq +1) 4Cq (Cq +1) 1− 1− nc 1+ 1− nc α∈ , . 2(Cq + 1) 2(Cq + 1) Given that each fi is L-Lipschitz differentiable and µ-strongly convex, we have n µL 1 1X E⟨∇f (x̂k ) − h∗ , x̂k − x∗ ⟩ ≥ E∥x̂k − x∗ ∥2 + E∥∇fi (x̂k ) − h∗i ∥2 . (A.18) µ+L µ + L n i=1 80 Hence Vk+1 ≤ρ1 Rk−1 + (1 + ηβ)E∥x̂k − x∗ ∥2 − 2(1 + η)βγE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ n n (1 + η)(1 − α)cβγ 2 X ∗ 2 (1 + η)(1 + cα) 2 X + k E∥hi − hi ∥ + βγ E∥∇fi (x̂k ) − h∗i ∥2 n i=1 n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ n h 2(1 + η)βγµL i ≤ρ1 Rk−1 + 1 + ηβ − E∥x̂k − x∗ ∥2 µ+L 2 X n (1 + η)(1 − α)cβγ + E∥hki − h∗i ∥2 n i=1 n h 2(1 + η)βγ i 1 X + (1 + η)(1 + cα)βγ − 2 E∥∇fi (x̂k ) − h∗i ∥2 µ+L n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ n n ∗ 2 (1 + η)(1 − α)cβγ 2 X ≤ρ1 R k−1 k + ρ2 E∥x̂ − x ∥ + E∥hki − h∗i ∥2 n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ (A.19) n where (η 2 + η)Cqm ρ1 = , 1 − (Cqm + 1)β 2(1 + η)βγµL ρ2 =1 + ηβ − . µ+L Here we let γ ≤ 2 (1+cα)(µ+L) such that (1+η)(1+cα)βγ 2 − 2(1+η)βγ µ+L ≤ 0 and the last inequality holds. 
In order to get max(ρ1 , ρ2 , 1 − α) < 1, we have the following conditions 0 ≤ (η 2 + η)Cqm ≤1 − (Cqm + 1)β, 2(1 + η)γµL η< . µ+L Therefore, the condition for γ is η(µ + L) 2 ≤γ≤ , 2(1 + η)µL (1 + cα)(µ + L) 81 which implies an additional condition for η. Therefore, the condition for η is   q  −Cqm + (Cqm )2 + 4(1 − (Cqm + 1)β) 4µL η ∈ 0, min  m , 2  . 2Cq (µ + L) (1 + cα) − 4µL 4µL η(µ+L) 2 where η ≤ (µ+L)2 (1+cα)−4µL is to ensure 2(1+η)µL ≤ (1+cα)(µ+L) such that we don’t get an empty set for γ. If we define ρ = max{ρ1 , ρ2 , 1 − α}, we obtain (1 + η)(1 + ncα) 2 2 Vk+1 ≤ ρVk + βγ σ (A.20) n and the proof is completed by applying (A.20) recurrently. A.5 Proof of Theorem 3.4.2 Proof. In Algorithm A.1, we can show E∥x̂k+1 − x̂k ∥2 = β 2 E∥q̂k ∥2 = β 2 E∥Eq̂k ∥2 + β 2 E∥q̂k − Eq̂k ∥2 = β 2 E∥qk ∥2 + β 2 E∥q̂k − qk ∥2 (A.21) ≤ (1 + Cqm )β 2 E∥qk ∥2 . and E∥qk ∥2 = E∥ − γĝk + ηek ∥2 ≤ 2γ 2 E∥ĝk ∥2 + 2η 2 E∥ek ∥2 ≤ 2γ 2 E∥ĝk ∥2 + 2Cqm η 2 E∥qk−1 ∥2 . (A.22) 82 Using (A.21)(A.22) and the Lipschitz continuity of ∇f (x), we have Ef (x̂k+1 ) + (Cqm + 1)Lβ 2 E∥qk ∥2 L ≤Ef (x̂k ) + E⟨∇f (x̂k ), x̂k+1 − x̂k ⟩ + E∥x̂k+1 − x̂k ∥2 + (Cqm + 1)Lβ 2 E∥qk ∥2 2 (1 + Cqm )Lβ 2 =Ef (x̂k ) + βE⟨∇f (x̂k ), −γĝk + ηek ⟩ + E∥qk ∥2 + (Cqm + 1)Lβ 2 E∥qk ∥2 2 k k k k 3(Cqm + 1)Lβ 2 =Ef (x̂ ) + βE⟨∇f (x̂ ), −γ∇f (x̂ ) + ηe ⟩ + E∥qk ∥2 2 βη βη ≤Ef (x̂k ) − βγE∥∇f (x̂k )∥2 + E∥∇f (x̂k )∥2 + E∥ek ∥2 2 2 h i + 3(Cqm + 1)Lβ 2 γ 2 E∥ĝk ∥2 + Cqm η 2 E∥qk−1 ∥2 h βη i ≤Ef (x̂k ) − βγ − − 3(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 n 3Cq (Cq + 1)Lβ 2 γ 2 X m k 2 3(Cqm + 1)Lβ 2 γ 2 2 + E∥∆ i ∥ + σ n2 i=1 n h βηC m i q + + (3Cqm + 1)Cqm Lβ 2 η 2 E∥qk−1 ∥2 , (A.23) 2 where the last inequality is from (A.7) with h∗ = 0. Letting si = 0 in (A.2), we have EQ ∥hk+1 i ∥2 ≤(1 − α)∥hki ∥2 + α∥gik ∥2 + α[(Cq + 1)α − 1]∥∆ki ∥2 . (A.24) Due to the assumption that each worker samples the gradient from the full dataset, we have Egik = E∇f (x̂k ), E∥gik ∥2 ≤ E∥∇f (x̂k )∥2 + σi2 . (A.25) Pn Define Λk = (Cqm + 1)Lβ 2 ∥qk−1 ∥2 + f (x̂k ) − f ∗ + 3c(Cqm + 1)Lβ 2 γ 2 n1 i=1 E∥hki ∥2 , and 83 from (A.23), (A.24), and (A.25), we have n 1X EΛk+1 ≤Ef (x̂k ) − f ∗ + 3(1 − α)c(Cqm + 1)Lβ 2 γ 2 E∥hki ∥2 n i=1 h βη i − βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 n (Cq + 1)Lβ 2 γ 2 h m 2 iX + 3nc(Cq + 1)α − 3ncα + 3Cq E∥∆ki ∥2 n2 i=1 (Cqm + 1)Lβ 2 γ 2 σ 2 + 3(1 + ncα) n h βηC m i q + + 3(Cqm + 1)Cqm Lβ 2 η 2 E∥qk−1 ∥2 . (A.26) 2 4Cq (Cq +1) If we let c = n , then the condition of α in (3.5) gives 3nc(Cq + 1)α2 − 3ncα + 3Cq ≤ 0 and n 2 21 X k+1 k ∗ EΛ ≤Ef (x̂ ) − f + 3(1 − α)c(Cqm + 1)Lβ γ E∥hki ∥2 n i=1 h βη i − βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 (Cqm + 1)Lβ 2 γ 2 σ 2 + 3(1 + ncα) n βηCqm +[ + 3(Cqm + 1)Cqm Lβ 2 η 2 ]E∥qk−1 ∥2 . (A.27) 2 1 Let η = γ and βγ ≤ 6(1+cα)(Cqm +1)L , we have βη βγ βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 = − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 ≥ 0. 2 2 r m +1)2 48L2 β 2 (Cq n −1+ 1+ m o Cq 1 Take γ ≤ min 12Lβ(Cqm +1) , 6Lβ(1+cα)(C m +1) will guarantee q h βηC m i q + 3(Cqm + 1)Cqm Lβ 2 η 2 ≤ (Cqm + 1)Lβ 2 . 2 Hence we obtain h βγ i (Cqm + 1)Lβ 2 γ 2 σ 2 EΛk+1 ≤ EΛk − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 + 3(1 + ncα) . 2 n (A.28) Taking the telescoping sum and plugging the initial conditions, we derive (3.12). 84 A.6 Proof of Corollary 3.4.2 1 4Cq (Cq +1) Proof. With α = 2(Cq +1) and c = n , 1 + ncα = 1 + 2Cq is a constant. r 2 n −1+ 1+ 48L Cm o 1 q 1 √ We set β = Cqm +1 and γ = min 12L , . 
In general, Cqm is 12L(1+cα)(1+ K/n) 1 √ bounded which makes the first bound negligible, i.e., γ = when K is 12L(1+cα)(1+ K/n) large enough. Therefore, we have β 1 − 6(1 + cα)Lγ 1 − 3(1 + cα)(Cqm + 1)Lβ 2 γ = m ≤ m . (A.29) 2 2(Cq + 1) 4(Cq + 1) From Theorem 3.4.2, we derive K 1 X E∥∇f (x̂k )∥2 K k=1 4(Cqm + 1)(EΛ1 − EΛK+1 ) 12(1 + ncα)Lσ 2 γ ≤ + γK n 1 1 (1 + ncα)σ 2 1 ≤48L(Cqm + 1)(1 + cα)(EΛ1 − EΛK+1 )( + √ )+ √ , (A.30) K nK (1 + cα) nK which completes the proof. 85 APPENDIX B SUPPLEMENTARY OF CHAPTER 4 B.1 LEAD: Smooth Case of Prox-LEAD Algorithm B.1: LEAD Input: Stepsize η, parameterPn (α, γ), X0 , H1 , D1 = (I − W)Z for any Z K K Output: X or 1/n i=1 Xi 1: H1w = WH1 9: procedure C OMM(Y, H, Hw ) 2: X1 = X0 − η∇F(X0 ; ξ 0 ) 10: Q = C OMPRESSY − H 3: for k = 1, 2, · · · , K − 1 do 11: Ŷ = H + Q 4: Yk = Xk − η∇F(Xk ; ξ k ) − ηDk 12: Ŷw = Hw + WQ 5: Ŷk , Ŷwk , Hk+1 , Hk+1 w = C OMM(Yk , Hk , Hkw ) 13: H = (1 − α)H + αŶ γ 6: D k+1 = D + 2η (Ŷ − Ŷwk ) k k 14: Hw = (1 − α)Hw + αŶw 7: Xk+1 = Xk − η∇F(Xk ; ξ k ) − ηDk+1 15: Return: Ŷ, Ŷw , H, Hw 8: end for 16: end procedure B.2 Connection with Existing Algorithms In Section 4.2, we have discussed the motivation of LEAD and Prox-LEAD. We now turn to the relation between LEAD and some other existing algorithms from the perspective of the dual problem. We look at the problem (4.1) without the non-smooth regularizer, i.e., R(X) = 0. Consider the Fenchel conjugate of F defined as F∗ (Y) = supX∈Rn×p Y⊤ X − F(X), then we obtain the dual problem as ∗ √ minn×p F (− I − WS). S∈R If we apply the gradient descent method to the above problem, we need to evaluate the √ √ √ gradient − I − W∇F∗ (− I − WS). Let D = I − WS, then the iteration follows Dk+1 = Dk + θ(I − W)∇F∗ (−Dk ). When the gradient of the dual function is available, the communication proceeds after the gradient evaluation. If we compress the only communication step, the algorithm leads 86 to Option A in [81] with the single exception that the quantization procedure is slightly different. Since D belongs to the dual space of the original optimized variable, we derive the solution of the primal problem via the relation Xk+1 = ∇F∗ (−Dk ) (B.1) and the linear convergence is guaranteed under the strongly convex assumption on F. In most cases, it is difficult to evaluate the conjugate function, so we rewrite the rela- tion (B.1) into the following minimization problem Xk+1 = arg min F(X) + ⟨Dk , X⟩, X∈Rn×p and get an inexact estimate using the gradient method. Applying one step of gradient de- scent method to the subproblem and inserting the update into the original dual iterations, we get the following primal-dual iteration   k+1 X  = Xk − η∇F(Xk ) − ηDk ,  Dk+1 = Dk + θ(I − W)Xk+1 , where η, θ are stepsizes for primal and dual update respectively. It can be shown that this iteration is a special case of the incremental primal-dual gradient method (PDGM) in [92] and the linear convergence rate can be guaranteed. Furthermore, when the communi- cation of Xk+1 in the dual update is conducted by the compression procedure (and the stochastic gradient estimation is involved), the algorithm recovers Option B (Option C) of LessBit in [81]. If we proceed one more gradient descent step in the subproblem, we get   k+1 X  = Xk − η∇F(Xk ) − ηDk ,   k+1 X = Xk+1 − η∇F(Xk+1 ) − ηDk ,   Dk+1 = Dk + θ(I − W)Xk+1 . 
87 Algorithm Non-smooth R ∇F Compression Convergence complexity Dual Gradient Descent ✗ ✗ ✗ O(κ e f κg ) LessBit-Option A ✗ ✗ ✓ O(C e + κf κg + Cκf κ fg ) [81] PDGM ✗ ✓ ✗ O(κ e f + κf κg ) [92] LessBit-Option B ✗ ✓ ✓ O(C e + κf κg + Cκf κ fg ) [81] NIDS ✗ ✓ ✗ O(κ e f + κg ) [31] LEAD ✗ ✓ ✓ O((1 e + C)(κf + κg ) + Cκf κg ) [33] PUDA ✓ ✓ ✗ O(κ e f + κg ) [106] Prox-LEAD ✓ ✓ ✓ O(C e + κf + κg + Cκf κg ) Algorithm 4.1 L Table B.1: The comparison of algorithms mentioned in Section B.2; κf := µ , κg := λmax (I−W) max wij λmin (I−W) and κeg := λmin(i,j)∈E (I−W) for nonnegative W. The addition of the step does not increase the computation of the gradient ∇F because it can be reused in the next iteration. So, we switch the order of the iteration and derive   k+1 X  = Xk − η∇F(Xk ) − ηDk ,  = Dk + θ(I − W)Xk+1 ,  k+1 D   Xk+1 = Xk − η∇F(Xk ) − ηDk+1 . By setting θ = 1, the above algorithm recovers NIDS of the smooth problems in [2, 31]. It has been shown that the additional step in NIDS improves the linear convergence rate of the previous two algorithms in terms of the condition numbers of the objective function and the network. The detailed comparison is listed in Table B.1. Compared to LEAD, Prox-LEAD improves the complexity by reducing O(C(κ e f + κg )) to O(C). e LessBit-Option B has complexity O(C e + κf κg + Cκf κeg ), which has a better en- tangled term involving C since κeg is less than κg . However, for all C ∈ [0, κg /(κg − κeg )), the complexity of Prox-LEAD is always better than the one of LessBit-Option B. 88 B.3 Proof of Theorem 4.4.2 Proof. Linear convergence. The following proof is applicable to the general choice of {pij } 1 and for simplicity, we only consider the uniform sampling case with pij = m . Lemma 4.4.2 still holds with different Gk = [g1k , · · · , gnk ]⊤ in procedure SGO, then we focus on the following term E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 X n  ∇f (xk ) − ∇f (x̃k )  2 il il = E xki − x∗ − η i i + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 mpil n  k k  ∗ ∇fil (xi ) − ∇fil (x̃i ) X k ∗ 2 k k ∗ = ∥X − X ∥ − 2η E xi − x , + ∇fi (x̃i ) − ∇fi (x ) i=1 mpil n 2 X ∇fil (xki ) − ∇fil (x̃ki ) +η 2 E + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 mpil X n = ∥Xk − X∗ ∥2 − 2η ⟨xki − x∗ , ∇fi (xki ) − ∇fi (x∗ )⟩ i=1 n X m 2 X ∇fij (xki ) − ∇fij (x̃ki ) +η 2 pij + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 j=1 mpij X n X n ∗ 2 ∗ k = ∥X − X ∥ − 2η ⟨xki −x , ∇fi (xki )⟩ + 2η (fi (xki ) − fi (x∗ ) − Vfi (xki , x∗ )) i=1 i=1 n X m ∗ ∗ 2 X ∇fij (xki ) − ∇fij (x ) fij (x ) − ∇fij (x̃ki ) +η 2 pij + + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 j=1 mpij mpij n n X m ∗ X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) + 2η 2 ∥∇fij (xki ) − ∇fij (x∗ )∥2 i=1 i=1 j=1 m2 p ij n Xm 2 X ∇fij (x∗ ) − ∇fij (x̃ki ) + 2η 2 pij + ∇fi (x̃ki ) − ∇fi (x∗ ) , i=1 j=1 mpij where the inequality uses the strong convexity of fij in Assumption 4.3.2 and (a + b)2 ≤ 2a2 + 2b2 . n o Let ui be the random variable taking values in 1 mpil (∇fil (x̃ki ) − ∇fil (x∗ )) : l ∈ [m] with distribution Pi = {pil : l ∈ [m]}, then the last term is actually the summation of the 89 variance of ui s due to m X pij Eui = (∇fij (x̃∗i ) − ∇fij (x∗ )) = ∇fi (x̃ki ) − ∇fi (x∗ ). 
j=1 mp ij Applying the inequality E∥ui − Eui ∥2 ≤ E∥ui ∥2 to the last term, we get E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n ∗ X 4η 2 L X k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 mpmin i=1 i i n X m X 1 + 2η 2 ∥∇fij (x̃ki ) − ∇fij (x∗ )∥2 i=1 j=1 m2 p ij X n ∗ k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) i=1 2 n n 4η L X 4η 2 L X + Vfi (xki , x∗ ) + Vf (x̃k , x∗ ), mpmin i=1 mpmin i=1 i i where pmin := mini,j {pij } and the last inequality uses the Lipschitz smoothness of fij in Assumption 4.3.2. From the update of x̃k+1 i , we have Ex̃k+1 i = pxki + (1 − p)x̃ki . Hence EVfi (x̃k+1 i , x∗ ) = pVfi (xki , x∗ ) + (1 − p)Vfi (x̃ki , x∗ ). Combine them with the inequality (4.12) and (4.11), after taking the total expectation, we get X n EΦ k+1 + c̃ EVfi (x̃k+1i , x∗ ) i=1 n 4η 2 L  X ≤ (1 − ηµ)E∥Z − k Z∗ ∥2P − 2η − − c̃p EVfi (xki , x∗ ) mpmin i=1 n 4η 2 L X + EVfi (x̃ki , x∗ ) + (1 − α)CM E∥Hk − Z∗ ∥2 mpmin i=1 2η 2  γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 Xn k k 2 + (1 − ηµ)E∥Ẑ − Z ∥(I−P)2 + c̃(1 − p) EVfi (x̃ki , x∗ ), i=1 90 8η 2 L where c̃ = pmpmin . By the choice of η and {pij }, we have 4η 2 L 1 1 + c̃p = (2 + 4) = = 2η. mpmin 18L 3L Therefore, X n EΦk+1 + c̃ EVfi (x̃k+1 i , x∗ ) ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + (1 − α)CM E∥Hk − Z∗ ∥2 i=1 2η 2  γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 γλmax (I − W) + (1 − ηµ) E∥Ẑk − Zk ∥2I−P 2  n 4η 2 L X  + c̃ 1 − p + EVfi (x̃ki , x∗ ). c̃mpmin i=1 Note that 4η 2 L p 2 1−p+ =1− , c̃ = , c̃mpmin 2 9pL then we get n n 2 X ∗  2 X  EΦ k+1 + k+1 EVfi (x̃i , x ) ≤ ρ̃ EΦ + k EVfi (x̃ki , x∗ ) , 9pL i=1 9pL i=1 where   ηµ γ γλmax (I − W) p ρ̃ := max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) ,1 − . 2 − ηµ 2 2 2 Complexity. The linear convergence requires 1 α< , 1+C 1 2∆(α) γ≤ . λmax (I − W) ∆(α) + 24αCκf 1 α 1 Take α = 2(1+C) , then ∆(α) = 2 = 4(1+C) and 1 2 γ≤ . λmax (I − W) 1 + 48Cκf By taking γ equal to the upper bound and plugging η, α, γ into ρ̃, we get  −1 n 2o ρ̃ ≤ 1 − max 12κf − 1, κg + 48Cκf κg , 1 + C, 6κf , p and the proof is complete. 91 B.4 Proof of Theorem 4.4.3 Proof. Linear convergence. The following proof can be adapted to to show the conver- gence with the general {pij } while for simplicity, we focus on uniform sampling case. We start from the gradient term of the key inequality in Lemma 4.4.2 and replace ∇F(Xk , ξ k ) by Gk = [g1k , · · · , gnk ]⊤ , E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n  ∇f (xk ) − ∇f (x̃k ) Pm ∇fij (x̃k ) − ∇fij (x∗ )  2 ij X il il j=1 k ∗ i il = E xi − x − η + i=1 mp il m n Pm k ∗  j=1 ∇fij (x̃ij ) − ∇fij (x ) k k  k ∗ 2 X k ∗ ∇f il (x i ) − ∇f il (x̃ il ) = ∥X − X ∥ − 2η E xi − x , + i=1 mpil m n P m k ∗ 2 ∇fil (xki ) − ∇fil (x̃kil ) j=1 ∇fij (x̃ij ) − ∇fij (x ) X 2 +η E + i=1 mpil m Xn ∗ 2 k = ∥X − X ∥ − 2η ⟨xki − x∗ , ∇fi (xki ) − ∇fi (x∗ )⟩ i=1 n X m Pm 2 X ∇fij (xki ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 +η pij + i=1 j=1 mpij m Xn Xn ∗ 2 k = ∥X − X ∥ − 2η ⟨xki −x ∗ , ∇fi (xki )⟩ + 2η (fi (xki ) − fi (x∗ ) − Vfi (xki , x∗ ))+ i=1 i=1 n X m Pm 2 X ∇fij (xki ) ∗ ∗ − ∇fij (x ) fij (x ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 η pij + + . i=1 j=1 mpij mpij m Using the strong convexity of fij in Assumption 4.3.2 and applying (a + b)2 ≤ 2a2 + 2b2 to the last term, we have E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n X m X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + 2η 2 ∥∇fij (xki ) − ∇fij (x∗ )∥2 i=1 i=1 j=1 m2 p ij n X m Pm 2 X ∇fij (x∗ ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 + 2η pij + . 
i=1 j=1 mpij m n o Let ui be the random variable taking values in 1 mpil (∇fil (x̃kil ) − ∇fil (x̃∗ )) : l ∈ [m] with distribution Pi = {pil : l ∈ [m]}, then the last term of the above inequality can be 92 upper bounded by Xn X n 2η 2 E∥ui − Eui ∥2 ≤ 2η 2 E∥ui ∥2 i=1 i=1 n X m 2 X 1 = 2η 2 pij (∇fij (x̃kij ) − ∇fij (x̃∗ )) . i=1 j=1 mpij Combining the inequality with the upper bound, we get E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n m X 4η 2 L X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 m2 i=1 j=1 pij ij i n X m X 1 + 2η 2 ∥∇fij (x̃kij ) − ∇fij (x∗ )∥2 i=1 j=1 m2 p ij n n m X 4η 2 L X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 m2 i=1 j=1 pij ij i n X m 4η 2 L X 1 + Vf (x̃k , x∗ ), m2 i=1 j=1 pij ij ij where the last inequality uses the Lipschitz smoothness of fij in Assumption 4.3.2. From the update of x̃ij , we have ∗ ∗ ∗ EVfij (x̃k+1 k k ij , x ) = pij Vfij (xi , x ) + (1 − pij )Vfij (x̃ij , x ). 93 Similar to the proof of Theorem 4.4.2, in total expectation, we can get X n X m ∗ EΦ k+1 + c̃ EVfij (x̃k+1 ij , x ) i=1 j=1 n X m X 1 k ≤ (1 − ηµ)E∥X − X ∥ − 2η ∗ 2 EVfij (xki , x∗ ) i=1 j=1 m n m n m 4η 2 L X X 1 ∗ XX + EV f ij (x k i , x ) + c̃ pij EVfij (xki , x∗ ) m2 i=1 j=1 pij i=1 j=1 n X m n m X 4η 2 L X X 1 + c̃ (1 − pij )EVfij (x̃kij , x∗ ) + EVfij (x̃kij , x∗ ) i=1 j=1 m2 i=1 j=1 pij 2  2η γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q + (1 − α)CM E∥Hk − Z∗ ∥2 γ 2 2η 2  γ  ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 n m n m XX ∗ 4η 2 L X X 1 + c̃ k (1 − pij )EVfij (x̃ij , x ) + 2 EVfij (x̃kij , x∗ ) i=1 j=1 m i=1 j=1 p ij (1 − ηµ)γλmax (I − W) + (1 − α)CM E∥Hk − Z∗ ∥2 + E∥Ẑk − Z∗ ∥2I−P , 2 where the last inequality is guaranteed by 2η 4η 2 L − 2 − c̃pij ≥ 0. m m pij 8η 2 L Define pmin = mini,j {pij } and take c̃ = m2 p2min , then by the choice of {pij } and η, the above condition is satisfied due to 2η 4η 2 L 1 1 2 − 2 − c̃pij = − − = 0. m m pij 3Lm 9Lm 9Lm Note that 4η 2 L ! 4η 2 L m2 pij  1  c̃(1 − pij ) + 2 = c̃ 1 − pij + = c̃ 1 − . m pij c̃ 2m Therefore, n m n m 2 XX  2 XX  EΦk+1 + k+1 ∗ EVfij (x̃ij , x ) ≤ ρ̃ EΦ + k EVfij (x̃kij , x∗ ) , 9L i=1 j=1 9L i=1 j=1 where   ηµ γ γλmax (I − W) 1 ρ̃ := max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) ,1 − . 2 − ηµ 2 2 2m 94 Complexity. The complexity analysis is identical to that in Theorem 4.4.2 with the single exception that we replace p by m−1 so we omit it here. 95 BIBLIOGRAPHY 96 BIBLIOGRAPHY [1] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015. [2] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019. [3] Zhi Li and Ming Yan. New convergence analysis of a primal-dual algorithm with large stepsizes. Advances in Computational Mathematics, 47(1):1–20, 2021. [4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT’2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD. [5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manju- nath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 
TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI’16, pages 265–283, 2016. [6] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and ef- ficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015. [7] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 583–598, Berkeley, CA, USA, 2014. USENIX Association. [8] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Ima- geNet training in minutes. In Proceedings of the 47th International Conference on Par- allel Processing, ICPP 2018, pages 1:1–1:10, New York, NY, USA, 2018. ACM. [9] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1223–1231, USA, 2012. 97 [10] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2737–2745. Curran Associates, Inc., 2015. [11] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010. [12] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014. [13] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017. [14] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018. [15] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 4452–4463, USA, 2018. Curran Associates Inc. [16] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Urban Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Pro- ceedings of Machine Learning Research, pages 3252–3261. PMLR, 2019. [17] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019. [18] Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 
DoubleSqueeze: Par- allel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6155–6165, 2019. [19] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 133–143. PMLR, 2020. [20] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 7663–7673, 2018. 98 [21] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947, 2019. [22] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pages 8388–8399, 2019. [23] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. Deepsqueeze: Decentralization meets error-compensated compression. CoRR, abs/1907.07346, 2019. [24] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceed- ings of the 36th International Conference on Machine Learning, pages 3479–3487. PMLR, 2019. [25] Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Confer- ence on Learning Representations, 2020. [26] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi- agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009. [27] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. [28] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for de- centralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017. [29] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized lin- earized alternating direction method of multipliers. IEEE Transactions on Signal Pro- cessing, 63(15):4051–4064, 2015. [30] Kun Yuan, Wei Xu, and Qing Ling. Can primal methods outperform primal-dual methods in decentralized dynamic optimization? IEEE Transactions on Signal Pro- cessing, 68:4466–4480, 2020. [31] Yao Li and Ming Yan. On the linear convergence of two decentralized algorithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021. [32] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized compos- ite optimization with compression. arXiv preprint arXiv:2108.04448, 2021. [33] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear convergent decentralized optimization with compression. In International Conference on Learning Representations, 2021. 99 [34] Hanlin Tang, Yao Li, Ji Liu, and Ming Yan. ErrorCompensatedX: Error compen- sation for variance reduced algorithms. 
Advances in Neural Information Processing Systems, 34:18102–18113, 2021. [35] Yao Li and Ming Yan. On the improved conditions for some primal-dual algo- rithms. arXiv preprint arXiv:2201.00139, 2022. [36] S Sundhar Ram, Angelia Nedić, and Venugopal V Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of optimization theory and applications, 147(3):516–545, 2010. [37] Angelia Nedic. Asynchronous broadcast-based convex optimization over a net- work. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2010. [38] D. Jakovetic, J. Xavier, and J. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59:1131–1146, 2014. [39] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Pro- cessing, 62(7):1750–1761, 2014. [40] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975. [41] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Dis- tributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011. [42] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In Global Conference on Signal and Infor- mation Processing (GlobalSIP), 2013 IEEE, pages 551–554. IEEE, 2013. [43] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015. [44] Mingyi Hong and Tsung-Hui Chang. Stochastic proximal gradient consensus over random networks. IEEE Transactions on Signal Processing, 65(11):2933–2948, 2017. [45] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017. [46] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimiza- tion. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017. 100 [47] Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. A decentralized second-order method for dynamic optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6036–6043. IEEE, 2016. [48] Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Au- tomatica, 46(2):322–329, 2010. [49] J. Xu, S. Zhu, Y. Soh, and L. Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060, 2015. [50] Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimiza- tion. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120– 136, 2016. [51] Angelia Nedić, Alex Olshevsky, Wei Shi, and César A Uribe. Geometrically conver- gent distributed optimization with uncoordinated step-sizes. In American Control Conference (ACC), 2017, pages 3950–3955. IEEE, 2017. [52] Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedić. A push-pull gradient method for distributed optimization in networks. 
In 2018 IEEE Conference on Decision and Control (CDC), pages 3385–3390. IEEE, 2018. [53] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for dis- tributed optimization and learning—part i: Algorithm development. IEEE Transac- tions on Signal Processing, 67(3):708–723, 2018. [54] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part ii: Convergence analysis. IEEE Trans- actions on Signal Processing, 67(3):724–739, 2018. [55] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning- Volume 70, pages 3027–3036. JMLR, 2017. [56] César A Uribe, Soomin Lee, Alexander Gasnikov, and Angelia Nedić. A dual ap- proach for optimal algorithms in distributed optimization over networks. In 2020 Information Theory and Applications Workshop (ITA), pages 1–37. IEEE, 2020. [57] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015. [58] Sulaiman Alghunaim, Kun Yuan, and Ali H Sayed. A linearly convergent proximal gradient algorithm for decentralized optimization. In Advances in Neural Information Processing Systems, pages 2848–2858, 2019. 101 [59] A. Chen and A. Ozdaglar. A fast distributed proximal-gradient method. In the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601–608, 2012. [60] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. In The 52nd IEEE Annual Conference on Decision and Control, pages 6855– 6860, 2013. [61] C. Xi and U. Khan. On the linear convergence of distributed optimization over directed graphs. arXiv preprint arXiv:1510.02149, 2015. [62] J. Zeng and W. Yin. ExtraPush for convex smooth decentralized optimization over directed networks. Journal of Computational Mathematics, Special Issue on Compressed Sensing, Optimization, and Structured Solutions, 35(4):381–394, 2017. [63] Ying Sun, Gesualdo Scutari, and Daniel Palomar. Distributed nonconvex multia- gent optimization over time-varying networks. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 788–794. IEEE, 2016. [64] Qing Ling and Alejandro Ribeiro. Decentralized dynamic optimization through the alternating direction method of multipliers. In 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 170–174. IEEE, 2013. [65] Angelia Nedić and Alex Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016. [66] Angelia Nedić. Distributed optimization over networks. In Multi-agent Optimiza- tion, pages 1–84. Springer, 2018. [67] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, vol- ume 87. Springer Science & Business Media, 2013. [68] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud com- puting. In INTERSPEECH, pages 1488–1492, 2015. [69] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. 
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Ma- chine Learning Research, pages 3636–3645, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. [70] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 32Nd Interna- tional Conference on Neural Information Processing Systems, NIPS’18, pages 1306–1316, USA, 2018. Curran Associates Inc. [71] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017. 102 [72] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compen- sated quantized SGD and its applications to large-scale distributed optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Con- ference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, 10–15 Jul 2018. [73] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gra- dient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, Copenhagen, Denmark, September 2017. Asso- ciation for Computational Linguistics. [74] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681, 2019. [75] Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. SGD and Hogwild! Convergence without the bounded gradi- ents assumption. In International Conference on Machine Learning, pages 3750–3758. PMLR, 2018. [76] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. 36th International Conference on Machine Learning, ICML 2019, 2019-June:9090–9112, 2019. [77] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Pe- ter Richtárik. Stochastic distributed learning with gradient quantization and vari- ance reduction. arXiv preprint arXiv:1904.05115, 2019. [78] P. Elias. Universal codeword sets and representations of the integers. IEEE Transac- tions on Information Theory, 21(2):194–203, March 1975. [79] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. [80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 770–778, 2016. [81] Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, and Sebastian Stich. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics, pages 4087–4095. PMLR, 2021. [82] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986. 103 [83] Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. 
IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013. [84] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2 : Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, pages 4848–4856, 2018. [85] Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176(1):497–544, 2019. [86] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Math- ematical Programming, pages 1–49, 2020. [87] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Accelerated primal-dual algo- rithms for distributed smooth convex optimization over networks. In International Conference on Artificial Intelligence and Statistics, pages 2381–2391. PMLR, 2020. [88] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Lee, and Laurent Massoulié. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1–31, 2019. [89] Ruggero Carli, Fabio Fagnani, Paolo Frasca, and Sandro Zampieri. Gossip consen- sus algorithms via quantized communication. Automatica, 46(1):70–80, 2010. [90] Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proceedings of the 37th International Conference on Machine Learning, 2020. [91] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communica- tion. IEEE Transactions on Signal Processing, 68:6101–6116, 2020. [92] Sulaiman A Alghunaim and Ali H Sayed. Linear convergence of primal–dual gradient methods and their performance in distributed optimization. Automatica, 117:109003, 2020. [93] Zhuorui Li, Yiwei Liao, Kun Huang, and Shi Pu. Compressed gradient track- ing for decentralized optimization with linear convergence. arXiv preprint arXiv:2103.13748, 2021. [94] Yongyang Xiong, Ligang Wu, Keyou You, and Lihua Xie. Quantized distributed gradient tracking algorithm with linear convergence in directed networks. arXiv preprint arXiv:2104.03649, 2021. [95] Zhuoqing Song, Lei Shi, Shi Pu, and Ming Yan. Compressed gradient tracking for decentralized optimization over general directed networks. IEEE Transactions on Signal Processing, 70:1775–1787, 2022. 104 [96] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017. [97] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using pre- dictive variance reduction. Advances in neural information processing systems, 26:315– 323, 2013. [98] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ad- vances in neural information processing systems, 27, 2014. [99] Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don’t jump through hoops and remove those loops: Svrg and katyusha are better without the outer loop. In Algorithmic Learning Theory, pages 451–467. PMLR, 2020. [100] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic aver- aging gradient algorithm. The Journal of Machine Learning Research, 17(1):2165–2199, 2016. [101] Kun Yuan, Bicheng Ying, Jiageng Liu, and Ali H Sayed. Variance-reduced stochas- tic learning by networked agents under random reshuffling. IEEE Transactions on Signal Processing, 67(2):351–366, 2018. 
[102] Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing, 68:6255–6271, 2020. [103] Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient dis- tributed optimization in networks with gradient tracking and variance reduction. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third In- ternational Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1662–1672. PMLR, 26–28 Aug 2020. [104] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Dual-free stochastic de- centralized optimization with variance reduction. Advances in Neural Information Processing Systems, 33:19455–19466, 2020. [105] Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022. [106] Sulaiman A Alghunaim, Ernest Ryu, Kun Yuan, and Ali H Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 2020. [107] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: unified framework and convergence analysis. IEEE Trans- actions on Signal Processing, 69:3555–3570, 2021. 105 [108] Haishan Ye, Ziang Zhou, Luo Luo, and Tong Zhang. Decentralized accelerated proximal gradient descent. Advances in Neural Information Processing Systems, 2020, 2020. [109] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013. [110] Bang Cong Vu. A splitting algorithm for dual monotone inclusions involving coco- ercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013. [111] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal-dual fixed point algo- rithm for minimization of the sum of three convex separable functions. Fixed Point Theory and Applications, 2016(1):1–18, 2016. [112] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76(3):1698–1717, 2018. 106