EFFICIENT DISTRIBUTED ALGORITHMS: BETTER THEORY AND COMMUNICATION COMPRESSION By Yao Li A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics – Doctor of Philosophy Computational Mathematics, Science and Engineering – Dual Major 2022 ABSTRACT EFFICIENT DISTRIBUTED ALGORITHMS: BETTER THEORY AND COMMUNICATION COMPRESSION By Yao Li Large-scale machine learning models are often trained by distributed algorithms over ei- ther centralized or decentralized networks. The former uses a central server to aggregate the information of local computing agents and broadcast the averaged parameters in a master-slave architecture. The latter considers a connected network formed by all agents. The information can only be exchanged with accessible neighbors with a mixing matrix of communication weights encoding the network’s topology. Compared with centralized optimization, decentralization facilitates data privacy and reduces the communication burden of the single central agent due to model synchroniza- tion, but the connectivity of the communication network weakens the theoretical conver- gence complexity of the decentralized algorithms. Therefore, there are still gaps between decentralized and centralized algorithms in terms of convergence conditions and rates. In the first part of this dissertation, we consider two decentralized algorithms: EXTRA and NIDS, which both converge linearly with strongly convex objective functions and an- swer two questions regarding them. What are the optimal upper bounds for their stepsizes? Do decentralized algorithms require more properties on the functions for linear convergence than centralized ones? More specifically, we relax the required conditions for linear convergence of both algorithms. For EXTRA, we show that the stepsize is comparable to that of cen- tralized algorithms. For NIDS, the upper bound of the stepsize is shown to be exactly the same as the centralized ones. In addition, we relax the requirement for the objective functions and the mixing matrices. We provide the linear convergence results for both algorithms under the weakest conditions. As the number of computing agents and the dimension of the model increase, the communication cost of parameter synchronization becomes the major obstacle to effi- cient learning. Communication compression techniques have exhibited great potential as an antidote to accelerate distributed machine learning by mitigating the communication bottleneck. In the rest of the dissertation, we propose compressed residual communication frame- works for both centralized and decentralized optimization and design different algo- rithms to achieve efficient communication. For centralized optimization, we propose DORE, a modified parallel stochastic gradi- ent descent method with a bidirectional residual compression, to reduce over 95% of the overall communication. Our theoretical analysis demonstrates that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. Existing works mainly focus on smooth problems and compressing DGD-type algo- rithms for decentralized optimization. The class of smooth objective functions and the sublinear convergence rate under relatively strong assumptions limit these algorithms’ application and practical performance. 
Motivated by primal-dual algorithms, we pro- pose Prox-LEAD, a linear convergent decentralized algorithm with compression, to tackle strongly convex problems with a nonsmooth regularizer. Our theory describes the cou- pled dynamics of the inexact primal and dual update as well as compression error with- out assuming bounded gradients. The superiority of the proposed algorithm is demon- strated through the comparison with state-of-the-art algorithms in terms of convergence complexities and numerical experiments. Our algorithmic framework also generally en- lightens the compressed communication on other primal-dual algorithms by reducing the impact of inexact iterations. Copyright by YAO LI 2022 Dedicated to my parents and in memory of my grandmother, a woman of strength, kindness and love, who passed away before seeing me graduate. v ACKNOWLEDGEMENTS There are many people who encouraged me and shared great times with me throughout my doctoral career. Please accept my apologies if there are any names that have been forgotten. First of all, I would like to express my deep gratitude to my advisor and dissertation director, Professor Ming Yan. With his dedicated guidance and support over the past five years, I have developed a good taste and a broad vision of research. Again, thanks to my friend-like relationship with him, I could maintain a work-life balance and alleviate the pressure from life and studies. I would also like to thank all the other committee members, Professor Ekaterina Rapinchuk, Professor Kalyanmoy Deb, and Professor Mark Iwen, for their guidance and advice. They provided helpful feedback and suggestions on my path to graduation. I want to thank my most important research partner of mine, Dr. Xiaorui Liu. We collaborated with each other, discussed and shared insightful research ideas, and com- pleted many good papers. Without him, I would not have expanded my knowledge in deep learning and bridged the gap between theoretic algorithmic research and machine learning applications so quickly. I would also like to thank Dr. Brendt Wohlberg and Dr. Youzuo Lin for their short-term guidance in LANL. I also want to thank my friends Jing Huang, Jian Song, Mengzhi Chen, Menglun Wang, Ningyu Sha, Kai Huang, Jialin Qu, Runze Su, Rui Wang and Tao Feng for the precious time they spent with me. Last but not least, I would like to express my greatest gratitude to my parents, Youzhu Li and Yiyan Liu. Their selfless love has always been the greatest motivation for me to complete my doctoral career. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Decentralized Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Parallel SGD with Bidirectional Communication Compression . . . . . . . 2 1.3 Decentralized Optimization with Compression . . . . . . . . . . . . . . . . 3 1.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Other Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 CHAPTER 2 EXTRA AND NIDS: DECENTRALIZED ALGORITHMS WITH IMPROVED CONDITIONS . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 7 2.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Algorithms and Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 New Linear Convergence Results for EXTRA and NIDS . . . . . . . . . . . 12 2.4.1 Linear Convergence of EXTRA . . . . . . . . . . . . . . . . . . . . . 13 2.4.2 NIDS without Nonsmooth Term . . . . . . . . . . . . . . . . . . . . 20 2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 CHAPTER 3 DORE: A CENTRALIZED ALGORITHM WITH BIDIRECTIONAL COMMUNICATION COMPRESSION . . . . . . . . . . . . . . . . . 28 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Double Residual Compression SGD . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 The Proposed DORE . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 The Strongly Convex Case . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.2 The Nonconvex Case with R = 0 . . . . . . . . . . . . . . . . . . . . 38 3.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5.1 Strongly convex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5.2 Nonconvex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.5.3 Communication Efficiency . . . . . . . . . . . . . . . . . . . . . . . 43 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 vii CHAPTER 4 PROX-LEAD: A LINEAR CONVERGENT CONVERGENT AL- GORITHM WITH EFFICIENT COMMUNICATION FOR COM- POSITE PROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 The Proposed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2.1 Prox-LEAD as Inexact PUDA . . . . . . . . . . . . . . . . . . . . . . 52 4.2.2 Stochastic Gradient Oracle . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 Assumptions on Regularity . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.1 The General Stochastic Setting . . . . . . . . . . . . . . . . . . . . . 62 4.4.2 Finite-sum Setting with Variance Reduction . . . . . . . . . . . . . . 65 4.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 APPENDIX A SUPPLEMENTARY OF CHAPTER 3 . . . . . . . . . . . . . . 74 APPENDIX B SUPPLEMENTARY OF CHAPTER 4 . . . . . . . . . . . . . . 86 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 viii LIST OF TABLES Table 3.1: A comparison between related algorithms. 
The second column indi- cates whether algorithm compresses only the gradients or both the gradients and the model. DORE is able to converges linearly to the O(σ) neighborhood of optimal point like full-precision SGD and DI- ANA in the strongly convex case while achieving much better commu- nication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze but DORE doesn’t require the assumptions of bounded compression error or bounded gradient. . . . . . . . . . . . . 40 Table 4.1: Stochastic gradient oracle (SGO). . . . . . . . . . . . . . . . . . . . . . . 53 Table 4.2: Summary of the convergence compleixty for Prox-LEAD to achieve ϵ- accuracy with fiexed stepsizes. The first row is the complexity with the full gradient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Table B.1: The comparison of algorithms mentioned in Section B.2; κf := Lµ , κg := λmax (I−W) max wij λmin (I−W) and κeg := λmin(i,j)∈E (I−W) for nonnegative W. . . . . . . . . . . . . 88 ix LIST OF FIGURES k ∗ Figure 2.1: LEFT: the error ∥x −x ∥F ∥x0 −x∗ ∥F vs iterations for DGD with different step- sizes, EXTRA with three stepsizes, and NIDS. RIGHT: The random network with 10 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 k ∗ Figure 2.2: LEFT: the error ∥x −x ∥F ∥x0 −x∗ ∥F vs iterations for DGD with different step- sizes, EXTRA with three stepsizes, and NIDS. RIGHT: The random network with 10 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Figure 2.3: The comparison of proved stepsizes for EXTRA and NIDS with the optimal choice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 k ∗ Figure 2.4: The figure of residuals ∥x −x ∥F ∥x0 −x∗ ∥F with respect to iteration. The left ¯ graph is for strongly convex f (x) and the right one is for strongly convex f (x). re-EXTRA and re-NIDS stand for implementing EXTRA and NIDS over relaxed Wnew . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.1: An illustration of DORE. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure 3.2: Per iteration time cost on Resnet18 for SGD, QSGD, and DORE. It is tested in a shared cluster environment connected by Gigabit Ether- net interface. DORE speeds up the training process significantly by mitigating the communication bottleneck. . . . . . . . . . . . . . . . . . 41 Figure 3.3: Linear regression on synthetic data. When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, Dou- bleSqueeze, and DoubleSqueeze (topk) only converge to the neigh- borhood even when full gradient is available. . . . . . . . . . . . . . . . 42 Figure 3.4: LeNet trained on MNIST. DORE converges similarly as most base- lines. It outperforms DoubleSqueeze using the same compression method while has similar performance as DoubleSqueeze (topk). . . . 42 Figure 3.5: Resnet18 trained on CIFAR10. DORE achieves similar convergence and accuracy as most baselines. DoubeSuqeeze converges slower and suffers from the higher loss but it works well with topk compression. . 43 x Figure 4.1: Smooth logistic regression problem (λ1 = 0). In the full gradient case ((a) and (b)), LEAD (2bit) and LessBit (2bit) converge similarly as NIDS (32bit) and LEAD (32bit) in terms of epochs/iterations, but they requires much fewer bits in communication. 
In the stochastic case ((c) and (d)), the 2bit variants of LEAD match well with their 32bit variants in terms of the number of gradient evaluation, but they requires much fewer bits. Note that LEAD-SAGA requires more memory and more iterations/communication than LEAD-LSVRG, but it computes only one gradient in each iteration, while LEAD-LSVRG computes at least two gradients in each iteration. . . . . . . . . . . . . 69 Figure 4.2: Nonsmooth logistic regression problem (λ1 = 0.005). In the full gradi- ent case ((a) and (b)), Prox-LEAD (2bit) converges similarly as NIDS and Prox-LEAD (32bit) in terms of epochs/iterations, but it requires much fewer bits than the other three algorithms. In the stochastic case ((c) and (d)), the 2bit variants match well with their 32bit vari- ants in terms of the number of gradient evaluation, but they requires much fewer bits. Note that Prox-LEAD-SAGA requires more mem- ory and more iterations/communication than Prox-LEAD-LSVRG, but it computes only one gradient in each iteration, while Prox-LEAD- LSVRG computes at least two gradients in each iteration. . . . . . . . . 70 Figure A.1: The norm of compressed variable in linear regression. . . . . . . . . . . 74 Figure A.2: Training under different compression block sizes. . . . . . . . . . . . . 75 Figure A.3: Training under different α. . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure A.4: Training under different β. . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure A.5: Training under different η. . . . . . . . . . . . . . . . . . . . . . . . . . 76 xi LIST OF ALGORITHMS Algorithm 3.1: DORE1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Algorithm 4.1: Prox-LEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Algorithm 4.2: Compressed Communication Procedure (COMM) . . . . . . . . . . 52 Algorithm A.1: DORE with R(x) = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Algorithm B.1: LEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 xii CHAPTER 1 INTRODUCTION This chapter briefly overviews the problems considered and the results developed in the following separated chapters. The detailed introduction and claimed notations are con- tained throughout each chapter. 1.1 Decentralized Optimization Decentralized optimization problem is to minimize f¯(x) := 1 Pn n i=1 fi (x) collaboratively over a network of n agents. We consider convex and differentiable function fi : Rp → R, which is known only by the corresponding agent i. The whole system is decentralized because each agent has an estimation of the global variable and can only exchange the estimation with their accessible neighbors during each iteration. A symmetric mixing matrix W ∈ Rn×n is used to encode the communication weights between the agents and enforce the consensus. The minimum condition for W has one eigenvalue 1 with the all-one vector 1 being a corresponding eigenvector. All other eigenvalues of W are less than 1. We provide new and stronger linear convergence results for two state-of-the-art algorithms: EXTRA in [1] and NIDS in [2]. More specifically, by assuming L-smoothness of each fi , • We show the linear convergence of EXTRA under the strong convexity of f¯ and the relaxed condition λmin (W) > −5/3. 
The upper bound of the stepsize can be as large 5+3λmin (W) as 4L , which is shown to be optimal in [3] for general convex problems; • We show the linear convergence of NIDS under the same condition on f¯ and W as EXTRA with any network-independent stepsize α ∈ (0, 2/L). 1 1.2 Parallel SGD with Bidirectional Communication Compression We first consider centralized optimization with efficient communication over a parame- ter server architecture and propose algorithms based on the well-known parallel stochas- tic gradient descent (SGD) to achieve efficient communication. Stochastic gradient algo- rithms [4] are efficient at minimizing the objective function f : Rd → R which is usually defined as f (x) := Eξ∼D [ℓ(x, ξ)], where ℓ(x, ξ) is the objective function defined on data sample ξ and model parameter x. A basic stochastic gradient descent repeats the gradient “descent” step xk+1 = xk − γg(xk ) where xk is the current iteration and γ is the step size. The stochastic gradient g(xk ) is computed based on an i.i.d. sampled mini-batch from the training data distribution D and serves as the estimator of the full gradient ∇f (xk ). In large-scale machine learning, the number of data samples and the model size are usually huge. Distributed learning utilizes many computers/cores to perform the stochastic algo- rithms to reduce the training time. It has attracted extensive attention due to the demand for highly efficient model training [5, 6, 7, 8]. We focus on the data-parallel SGD [9, 10, 11], which provides a scalable solution to speed up the training process by distributing the whole data to multiple computing nodes, and consider the following problem n 1X minimize f (x) + R(x) = Eξ∼Di [ℓ(x, ξ)] +R(x), (1.1) x∈Rd n i=1 | {z } :=fi (x) where each fi (x) is a local objective function of the worker node i defined based on the allocated data under distribution Di and R : Rd → R is usually a closed convex regular- izer. We propose DORE, adapted by compressed communication on both pull and push directions of Parallel SGD, to reduce the bidirectional communication cost in distributed learning and provide a complete theoretical analysis of the algorithm’s behavior in both strongly convex and nonconvex settings. 2 1.3 Decentralized Optimization with Compression We then return to decentralized optimization and try to generalize the communication compression scheme of DORE over networks without a parameter server. We consider the general composite problems, i.e., the smooth problems with nonsmooth regularizer, over the connected n-node network G in the form of n 1 X  minimize Eξ ∼D fi (x, ξi ) +r(x) , (1.2) x∈Rp n i=1 | i i{z } =:fi (x) where fi (x, ξi ) is the objective function at node i defined on the data ξi sampled from the distribution Di , and x denotes the model parameters. We use fi (x) to define the overall local objective function at node i, and we will differentiate fi (x, ξi ) and fi (x) using differ- ent inputs. The data distributions {Di } can be heterogeneous. In other words, the data distributions can be different from node to node, and we do not make assumptions about data heterogeneity. The function fi (x, ξi ) is assumed to be convex and smooth, and r(x) is a proper, convex, and possibly nonsmooth function shared across the nodes. The graph G encodes the topology of the communication network where information is exchanged along the edges. 
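To make the setting in (1.2) concrete, the following is a minimal sketch of how the local objectives, the shared nonsmooth regularizer, and the mixing matrix fit together. The five-node ring topology, the quadratic local losses, and the ℓ1 regularizer are assumptions made only for this illustration; they are not the problems or networks studied in the later chapters.

```python
import numpy as np

# Minimal sketch of the decentralized composite setting (1.2).
# Illustrative assumptions: 5 nodes on a ring, local least-squares losses f_i,
# and an l1 regularizer r(x) shared by all nodes.
np.random.seed(0)
n, p = 5, 3                                        # nodes, model dimension
A = [np.random.randn(10, p) for _ in range(n)]     # local data at node i
b = [Ai @ np.ones(p) + 0.01 * np.random.randn(10) for Ai in A]

def f_i(i, x):                                     # local smooth objective
    return 0.5 * np.mean((A[i] @ x - b[i]) ** 2)

def r(x, lam=0.01):                                # shared nonsmooth regularizer
    return lam * np.sum(np.abs(x))

def global_objective(x):                           # (1/n) * sum_i f_i(x) + r(x)
    return np.mean([f_i(i, x) for i in range(n)]) + r(x)

# Symmetric mixing matrix for the ring: each node averages with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

print("row sums of W:", W.sum(axis=1))             # all ones: consensus preserved
print("eigenvalues of W:", np.linalg.eigvalsh(W))  # one eigenvalue 1, others < 1
print("objective at x = 0:", global_objective(np.zeros(p)))
```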
In recent years, various communication compression techniques, such as quantization and sparsification, have been developed to reduce communication costs. Notably, ex- tensive studies [12, 13, 14, 15, 16, 17, 18, 19] have utilized gradient compression to boost communication efficiency for centralized optimization significantly. They enable efficient large-scale optimization while maintaining comparable convergence rates and practical performance with their non-compressed counterparts. This remarkable success has sug- gested the potential and significance of communication compression in decentralized al- gorithms. While extensive attention has been paid to centralized optimization, communication compression is relatively less studied in decentralized algorithms because the algorithm design and analysis are more challenging to cover general communication topologies. 3 Recent efforts are trying to push this research direction. For instance, DCD-SGD and ECD-SGD [20] introduce difference compression and extrapolation compression to re- duce model compression error. [21, 22] introduce QDGD and QuanTimed-DSGD to have exact convergence with small stepsizes. DeepSqueeze [23] directly compresses the lo- cal model and compensates for the compression error in the next iteration. CHOCO- SGD [24, 25] presents a novel quantized gossip algorithm that reduces compression error by difference compression and preserves the model average. Nevertheless, most current works focus on the compression of primal-only algorithms, i.e., reduce to DGD [26, 27] or P-DSGD [28]. They are unsatisfying regarding convergence rate, stability, and the capa- bility to handle heterogeneous data. Part of the reason is that they inherit the drawback of DGD-type algorithms, whose convergence rate is slow in heterogeneous data scenarios where the data distributions are significantly different from agent to agent. In the literature on decentralized optimization, it has been proved that primal-dual algorithms can achieve faster converge rates and better support heterogeneous data [29, 1, 2, 30]. However, it is unknown whether communication compression is feasible for primal-dual algorithms and how fast the convergence can be with compression. This chapter attempts to bridge this gap by investigating the communication compression for primal-dual decentralized algorithms. We design a decentralized residual compression scheme and propose a novel decen- tralized algorithm with compression, Prox-LEAD, to achieve linear convergence under the strongly convex assumption and efficient communication. We derive the convergence complexity of Prox-LEAD for two types of fi , the general expectation of loss fi (x, ξi ) as defined in (1.2) and the finite-sum setting. We combine Prox-LEAD with Loopless SVRG and SAGA for the second case to achieve the exact linear convergence with stochastic gradients. 4 1.4 Publications Chapters 2-4 are based on the following papers respectively: Published papers: • [31] Yao Li and Ming Yan. On the linear convergence of two decentralized algo- rithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021 • [19] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 133–143. PMLR, 2020 In print: • [32] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized com- posite optimization with compression. 
arXiv preprint arXiv:2108.04448, 2021 1.5 Other Papers In addition to the above papers, I completed the following three papers during my doc- toral study. The first work also considers optimization with compression, which proposes an improved algorithmic framework with a compression scheme different from the one covered in Chapter 3 to achieve efficient communication for centralized learning. The second paper proposes LEAD to solve smooth problems with compressed communica- tion, which is the special case of Prox-LEAD in Chapter 4. The theoretical result of LEAD in [33] is extended and improved by Prox-LEAD in [32]. The last one considers the equiv- alence of several existing primal-dual algorithms and derives the improved bound on stepsizes for some of them, which is beyond the topic I would like to discuss here. For the consistency and the conciseness of the whole dissertation, I omit these papers. 5 • [34] Hanlin Tang, Yao Li, Ji Liu, and Ming Yan. ErrorCompensatedX: Error compen- sation for variance reduced algorithms. Advances in Neural Information Processing Systems, 34:18102–18113, 2021 • [33] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear con- vergent decentralized optimization with compression. In International Conference on Learning Representations, 2021 • [35] Yao Li and Ming Yan. On the improved conditions for some primal-dual algo- rithms. arXiv preprint arXiv:2201.00139, 2022 6 CHAPTER 2 EXTRA AND NIDS: DECENTRALIZED ALGORITHMS WITH IMPROVED CONDITIONS 2.1 Introduction Early decentralized methods based on decentralized gradient descent [26, 36, 37, 27, 38] have sublinear convergence for strongly convex objective functions, because of the dimin- ishing stepsize that is needed to obtain a consensual and optimal solution. This sublinear convergence rate is much slower than that for centralized ones. The first decentralized al- gorithm with linear convergence [39] is based on Alternate Direction Multiplier Method (ADMM) [40, 41]. Note that this type of algorithms has O(1/k) rate for general convex functions [42, 43, 44]. After that, many linearly convergent algorithms are proposed. Some examples are EXTRA [1], NIDS [2], DIGing [45, 46], ESOM [47], gradient track- ing methods [48, 49, 50, 46, 45, 51, 52], exact diffusion [53, 54], and dual optimal [55, 56]. There are also works on composite functions, where each private function is the sum of a smooth and a nonsmooth functions [57, 2, 58, 59]. Another topic of interest is decentral- ized optimization over directed and dynamic graphs [60, 61, 62, 63, 45, 64, 65]. Interested readers are referred to [66] and the references therein. This chapter focuses on two linear convergent algorithms: EXTRA and NIDS, and pro- vides better theoretical convergence results. EXact firsT-ordeR Algorithm (EXTRA) was proposed in [1], and its iteration is described in (2.2). For the general convex case, where each fi is convex and L-smooth (i.e., has a L-Lipschitz continuous gradient), the conver-   gence condition in [1] is α ∈ 0, 1+λmin L (W) . Therefore, there is an implicit condition for W that the smallest eigenvalue of W is larger than −1. Later the condition is relaxed to   α ∈ 0, 5+3λ4Lmin (W) in [3], and the corresponding requirement for W is that the smallest eigenvalue of W is larger than −5/3. In addition, this condition for the stepsize is shown 7 to be optimal, i.e., EXTRA may diverge if the condition is not satisfied. 
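For a concrete sense of how much the relaxed condition enlarges the admissible stepsizes, the short snippet below evaluates both upper bounds quoted above for a few values of λmin(W); the smoothness constant L = 10 and the sampled eigenvalues are arbitrary illustrative choices.

```python
# Compare the EXTRA stepsize upper bounds quoted above:
#   (1 + lambda_min(W)) / L        from [1]
#   (5 + 3*lambda_min(W)) / (4L)   from [3]
# L = 10 and the sampled eigenvalues are illustrative assumptions. Note that for
# lambda_min(W) <= -1 the first condition admits no stepsize, while the relaxed
# one still does as long as lambda_min(W) > -5/3.
L = 10.0
for lam_min in (-1.2, -0.9, -0.5, 0.0, 0.5):
    bound_old = (1 + lam_min) / L
    bound_new = (5 + 3 * lam_min) / (4 * L)
    print(f"lambda_min = {lam_min:+.1f}: "
          f"old bound = {bound_old:+.3f}, relaxed bound = {bound_new:+.3f}")
```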
Though we can al- ways manipulate W to change the smallest eigenvalue, the convergence speed of EXTRA depends on the matrix W. In the numerical experiment, we will see that it is beneficial to choose small eigenvalues for EXTRA in certain scenarios. The linear convergence of EXTRA requires additional conditions on the functions. There are mainly three types of conditions used in the literature: the strong convexity of f¯ (and some weaker variants) [1], the strong convexity of each fi (and some weaker vari- ants) [3], and the strong convexity of one function fi [54]. Note that the condition on f¯ is much weaker than the other two; there are cases where f¯ is strongly convex but none of fi ’s is. E.g., fi = ∥eTi x∥22 for p = n > 1, where ei is the vector whose ith component is 1 and all other components are 0. If f¯ is (restricted) strongly convex with parameter µf¯, the lin-   µ ¯(1+λ (W)) ear convergence of EXTRA is shown when α ∈ 0, f Lmin 2 in [1]. The upper bound for the stepsize is very conservative, and the better performance with a larger stepsize was shown numerically in [1] without proof. If each fi is strongly convex with parameter     µ, the linear convergence is shown when α ∈ 0, 1+λL+µ min (W) and α ∈ 0, 5+3λ4L min (W) in [58] and [3], respectively. One contribution of this chapter is to show the linear convergence   of EXTRA under the (restricted) strong convexity of f¯ and α ∈ 0, 5+3λ4L min (W) . The algorithm NIDS (Network InDependent Stepsize) was proposed in [2]. Though there is a small difference from EXTRA, NIDS can choose a stepsize that does not depend on the mixing matrices. The linear convergence of NIDS in [2] requires I ≽ W ≻ −I and strong convexity of f (x). In this chapter, we relax this condition for linear convergence to (restricted) strong convexity of f¯(x) and the relaxed mixing matrices with I ≽ W ≻ −(5/3)I. 2.2 Notation We let Xn f (x) := fi (xi ), (2.1) i=1 8 where each xi ∈ Rp is the local copy of the global variable x and the kth iterated point is xki . Since agent i has its own estimate xi of the global variable x, we put them together and define x = [x1 , x2 , · · · , xn ]⊤ ∈ Rn×p . The gradient of f is defined as ∇f (x) = [∇f1 (x1 ), ∇f2 (x2 ), · · · , ∇fn (xn )]⊤ ∈ Rn×p . We say that x is consensual if x1 = x2 = · · · = xn , i.e., x = 1x⊤ , where x ∈ Rp×1 and 1 = [1, 1, · · · , 1]⊤ ∈ Rn×1 . In this chapter, we use ∥ · ∥ and ⟨·, ·⟩ to denote the Frobenious norm and the corre- sponding inner product, respectively. For a given matrix M ∈ Rn×p and any positive (semi)definite matrix H, which is denoted as H ≻ 0 (H ≽ 0 for positive semidefinite), we p define ∥M∥H := tr(M⊤ HM). The largest and the smallest eigenvalues of a matrix A are defined as λmax (A) and λmin (A). For a symmetric positive semidefinite matrix A, we let λ+ † min (A) be the smallest nonzero eigenvalue. A is the pseudo inverse of A. For a matrix A ∈ Rn×n , we say a matrix B ∈ Rn×p is in Ker{A} if AB = 0n×p , and B is in Range{A} if there exists C ∈ Rn×p such that B = AC. For simplicity, we may use x+ and x to replace xk+1 and xk , respectively, in the proofs. 2.3 Algorithms and Prerequisites One iteration of EXTRA can be expressed as xk+2 = (I + W)xk+1 − Wx f k − α[∇f (xk+1 ) − ∇f (xk )]. (2.2) The stepsize α > 0, and the symmetric matrices W and W f satisfy I + W ≽ 2W f ≽ 2W. The initial value x0 is chosen arbitrarily, and x1 = Wx0 − α∇f (x0 ). In practice, we usually I+W let Wf= 2 . 
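As a quick illustration of the update (2.2), the following sketch runs EXTRA with W̃ = (I + W)/2 on a toy consensus-averaging problem with fi(x) = ½∥x − bi∥², so that ∇fi(x) = x − bi. The five-node ring, the data, and the stepsize chosen between the two bounds discussed above are assumptions made only for this example.

```python
import numpy as np

# Minimal sketch of the EXTRA iteration (2.2) with W_tilde = (I + W) / 2.
# Illustrative assumptions: consensus averaging, f_i(x) = 0.5 * ||x - b_i||^2,
# on a 5-node ring with a simple mixing matrix.
np.random.seed(1)
n, p = 5, 2
b = np.random.randn(n, p)                       # local targets; optimum is mean(b)

def grad_f(X):                                  # row i holds node i's gradient
    return X - b

W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
W_tilde = (np.eye(n) + W) / 2

L = 1.0                                         # each f_i is 1-smooth here
lam_min = np.linalg.eigvalsh(W).min()
bound_old = (1 + lam_min) / L                   # earlier bound
bound_new = (5 + 3 * lam_min) / (4 * L)         # relaxed bound
alpha = 0.5 * (bound_old + bound_new)           # a stepsize beyond the earlier bound

X_prev = np.zeros((n, p))                       # x^0
X = W @ X_prev - alpha * grad_f(X_prev)         # x^1 = W x^0 - alpha * grad f(x^0)
for _ in range(200):
    # x^{k+2} = (I+W) x^{k+1} - W_tilde x^k - alpha [grad f(x^{k+1}) - grad f(x^k)]
    X_next = (np.eye(n) + W) @ X - W_tilde @ X_prev \
             - alpha * (grad_f(X) - grad_f(X_prev))
    X_prev, X = X, X_next

print("distance to consensus optimum:", np.linalg.norm(X - b.mean(axis=0)))
```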
One iteration of NIDS is I+W (2.3)   xk+2 = 2 2xk+1 − xk −α(∇f (xk+1 ) − ∇f (xk )) , 9 I+W 0 where α > 0 is the stepsize. The initial value x0 is chosen arbitrarily, and x1 = 2 [x − α∇f (x0 )]. I+W If we choose W f= 2 in (2.2), the difference between EXTRA and NIDS in the above mathematical forms happens only in the communicated data, i.e.,whether we exchange the gradient information or not at each step. In practice, EXTRA can gain the advantage of time overlap by parallelizing communication and gradient evaluation, while NIDS evaluates the gradient and then communicates after the gradient is added. However, this small difference brings big changes in the convergence [2]. In order for both algorithms to converge, we have the following assumptions on W and W. f Assumption 2.3.1 (Mixing matrix). The connected network G = {V, E} consists of a set of nodes V = {1, 2, · · · , n} and a set of undirected edges E. An undirected edge (i, j) ∈ E means that there is a connection between agents i and j and both agents can exchange data. The mixing matrices W = [wij ] ∈ Rn×n and W f = [w eij ] ∈ Rn×n satisfy: 1. (Decentralized property): If i ̸= j and (i, j) ∈ / E, then wij = w eij = 0. 2. (Symmetry): W = W⊤ , W f =W f ⊤. 3. (Null space property): Null{W − W} f = span{1} ⊆ Null{I − W}. f 4. (Spectral property): I+W ≽W f ≻ − 1 I, W f ≽ W. 2 3 Remark 2.3.1. Parts 2-4 imply that the spectrum of W is enlarged to (− 53 , 1], while the original I+W assumption is (−1, 1] for doubly stochastic matrices. Therefore, in our assumption, 2 does not have to be positive definite. This assumption for W is strictly weaker than those in [1] and [2]. Remark 2.3.2. From [1, Proposition 2.2], Null{I − W} = span{1}. It is a critical result for both algorithms. Before showing their theoretical results, we reformulate both algorithms. 10 Reformulation of EXTRA: We reformulate EXTRA by introducing a variable y ∈ Rn×p as xk+1 = Wx f k + yk − α∇f (xk ), (2.4a) yk+1 = yk − (W f − W)xk+1 , (2.4b) with y0 = −(W f − W)x0 . Then (2.4) is equivalent to EXTRA (2.2). Proposition 2.3.1. Let the x-sequence generated by (2.4) with y0 = −(W f − W)x0 be {xk }∞ , k=1 then it’s identical to the sequence generated by EXTRA (2.2) with the same initial point x0 . Proof. From (2.4a), we have x1 =Wxf 0 + y0 − α∇f (x0 ) = Wx f 0 − (Wf − W)x0 − α∇f (x0 ) =Wx0 − α∇f (x0 ). For k ≥ 0, we have xk+2 =Wx f k+1 + yk+1 − α∇f (xk+1 ) = Wxk+1 + yk − α∇f (xk+1 ) =(I + W)xk+1 − Wx f k − α[f (xk+1 ) − f (xk )], where the second and last equalities are from (2.4b) and (2.4a), respectively. Remark 2.3.3. By (2.4b) and the assumption of y0 , yk ∈ Range{W f − W} for all k. Also, xk+1 = (W f − W)† (yk − yk+1 ) + zk+1 for some zk+1 ∈ Ker{W f − W}. Reformulation of NIDS: We adopt the following reformulation from [2]: I−W k dk+1 = dk + 2α [x − α∇f (xk ) − αdk ], (2.5a) xk+1 = xk − α∇f (xk ) − αdk+1 , (2.5b) with d0 = 0. The equivalence is shown in [2]. To establish the linear convergence of EXTRA and NIDS, we need the following two assumptions. 11 Assumption 2.3.2 (Uniqueness). There is a unique minimizer x∗ for f¯(x). Assumption 2.3.3 (L-smoothness and restricted strong convexity). Each function fi is a proper, closed and convex function with a Lipschitz continuous gradient: ∥∇fi (x) − ∇fi (ex)∥ ≤ L∥x − x e ∈ Rp , e∥, ∀x, x (2.6) where L > 0 is the Lipschitz constant. Furthermore, f¯(x) is (restricted) strongly convex with respect to x∗ : ⟨x − x∗ , ∇f¯(x) − ∇f¯(x∗ )⟩ ≥ µf¯∥x − x∗ ∥2 , ∀x ∈ Rp . 
(2.7) From [67, Theorem 2.1.5], the inequality (2.6) is equivalent to, for any x, x̃ ∈ Rn×p , ⟨x − x x)⟩ ≥ L−1 ∥∇f (x) − ∇f (e e, ∇f (x) − ∇f (e x)∥2 . (2.8) Proposition 2.3.2 ([1, Appendix A]). The following two statements are equivalent: 1. f¯(x) is (restricted) strongly convex with respect to x∗ ; 2. For any η > 0, g(x) := f (x) + η2 ∥x∥2I−W is µg (restricted) strongly convex with respect to x∗ = 1(x∗ )⊤ . Specially, we can characterize 2 + ( ) µf¯ µf¯λmin (I − W) µg = min , η . 2 µ2f¯ + 16L2 This proposition shows ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ + η∥x − x∗ ∥2I−W ≥ µg ∥x − x∗ ∥2 (2.9) for any x ∈ Rn×p . 2.4 New Linear Convergence Results for EXTRA and NIDS Throughout this section, we assume that Assumptions 2.3.1-2.3.3 hold. Two techniques are used to show the linear convergence: a) Proposition 2.3.2 serves as a bridge to connect f¯(x) and f (x). It is the key to the weaker assumption on objective functions. b) Both algo- rithms are equivalent to the extended Proximal Alternating Predictor-Corrector (PAPC) 12 in [3], and this equivalence is the key to relaxing the conditions on the mixing matrices W and W. f 2.4.1 Linear Convergence of EXTRA I+W When W f = 2 , EXTRA is recovered by applying the extended PAPC in [3] to the fol- lowing dual form of the decentralized consensus problem √ minimize f ∗ ( I − Wy), y where f ∗ is the conjugate function of f and y is the dual variable. In this case, EXTRA has the optimal bound of the stepsize over the relaxed mixing matrix W ≻ −(5/3)I. This fact enlightens us on the critical Lemma 2.4.3. For simplicity, we introduce some notations. Because of part 4 of Assumption 2.3.1, given the mixing matrices W and W, f there is a constant 3  1 i θ∈ , min ,1 4 1 − λmin (W) f such that W :=θW f + (1 − θ)I ≻ 0, (2.10) H :=W + (θ − 12 )(I − W) f = I+W ≻ 0, (2.11) f 2 M :=(W f − W)† ≽ 0, (2.12) G :=W + I − 2W f ≽ 0. (2.13) Based on (2.10), we have f = W − (1 − θ)(I − W). W f (2.14) Let (x∗ , y∗ ) be a fixed point of (2.4), it is straightforward to show that f − W)x∗ =0. (W (2.15) Part 3 of Assumption 2.3.1 shows that x∗ is consensual, i.e., x∗ = 1(x∗ )⊤ for certain x∗ ∈ Rp . The y-iteration in (2.4b) and the initialization of y0 show yk ∈ Range{W f − W} = 13 Ker{1⊤ }. Then we have 1⊤ y∗ = α1⊤ ∇f (x∗ ) = 0. Thus, x∗ is the unique minimizer of f¯(x). Lemma 2.4.1 (Norm over range space [2, Lemma 3]). Let A ∈ Rn×n be symmetric positive (semi)definite with rank r (r ≤ n) and λ1 ≥ λ2 ≥ · · · ≥ λr > 0 be its r eigenvalues. Then Range{A} is a rp-dimensional subspace in Rn×p and has a norm defined by ∥x∥2A† := ⟨x, A† x⟩, where A† is the pseudo inverse of A. In addition, λ−1 2 2 −1 2 1 ∥x∥ ≤ ∥x∥A† ≤ λr ∥x∥ for all x ∈ Range{A}. For simplicity, we let x+ and x stand for xk+1 and xk , respectively, in the proofs. The same simplification applies to yk . Lemma 2.4.2 (Norm equality). Let {(xk , yk )}∞ k=1 be the sequence generated by (2.4), then it satisfies ∥xk+1 − x∗ ∥2W−W f = ∥yk − yk+1 ∥2M . (2.16) Proof. From Remark 2.3.3, we have x+ = M(y − y+ ) + z+ (2.17) for z+ ∈ Ker{W f − W}. This equality and (2.15) give ∥x+ − x∗ ∥2W−W f =⟨x+ − x∗ , (W f − W)(x+ − x∗ )⟩ = ⟨x+ , (W f − W)x+ ⟩ =⟨M(y − y+ ), y − y+ ⟩ = ∥y − y+ ∥2M , where the third equality holds because of (2.12), (2.17), and y −y+ ∈ Range(W f −W). Lemma 2.4.3 (A key inequality for EXTRA). Let {(xk , yk )}∞ k=1 be generated by (2.4), then we have ∥xk+1 − x∗ ∥2H + ∥yk+1 − y∗ ∥2M ≤∥xk − x∗ ∥2H + ∥yk − y∗ ∥2M − ∥xk − xk+1 ∥2(θ− 3 )(I−W) f − ∥x k+1 − x∗ ∥2G 4 − ∥xk − xk+1 ∥W2 − 2α⟨xk+1 − x∗ , ∇f (xk ) − ∇f (x∗ )⟩. 
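To see numerically why the condition on f̄ is strictly weaker than per-function strong convexity, the following check instantiates the example mentioned in Section 2.1, fi(x) = ∥e_i^⊤x∥²₂ with p = n: every individual Hessian is rank one, yet the averaged Hessian is positive definite. The dimension n = 4 is an arbitrary illustrative choice.

```python
import numpy as np

# Example from Section 2.1: f_i(x) = (e_i^T x)^2 with p = n. No individual f_i
# is strongly convex, but the average f_bar(x) = (1/n) * ||x||^2 is.
# n = 4 is an arbitrary illustrative choice.
n = 4
for i in range(n):
    H_i = np.zeros((n, n))
    H_i[i, i] = 2.0                      # Hessian of f_i(x) = x_i^2 (rank one)
    print(f"min eigenvalue of Hessian of f_{i}:", np.linalg.eigvalsh(H_i).min())

H_bar = 2.0 * np.eye(n) / n              # Hessian of f_bar(x) = (1/n) * ||x||^2
print("min eigenvalue of Hessian of f_bar:", np.linalg.eigvalsh(H_bar).min())
```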
(2.18) 14 Proof. The iteration (2.4) and equation (2.14) show 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x+ − x∗ , W(x f − x+ ) + W(x f + − x∗ ) − (x+ − x∗ ) + (y − y∗ )⟩ =2⟨x+ − x∗ , W(x f − x+ ) + ( W f − I)(x+ − x∗ ) f − W)(x+ − x∗ ) + y+ − y + y − y∗ ⟩ + (W =2⟨x+ − x∗ , W(x f − x+ )⟩ + 2⟨x+ − x∗ , y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2 G =2⟨x+ − x∗ , W(x − x+ )⟩ − 2⟨x+ − x∗ , (1 − θ)(I − W)(x f − x+ )⟩ + 2⟨x+ − x∗ , y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2G , (2.19) where the first equality comes from (2.4a), the second one follows (2.4b), and the last one is from (2.14). From Remark 2.3.3, x+ −x∗ = M(y−y+ )+z+ −x∗ for some z+ ∈ Ker{W−W}. f Thus ⟨z+ − x∗ , y+ − y∗ ⟩ = 0, and the equality (2.19) can be rewritten as 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x+ − x∗ , W(x − x+ )⟩ − 2⟨x+ − x∗ , (1 − θ)(I − W)(x f − x+ )⟩ + 2⟨M(y − y+ ), y+ − y∗ ⟩ − 2∥x+ − x∗ ∥2G . Using the basic equality 2⟨a − b, b − c⟩ = ∥a − c∥2 − ∥a − b∥2 − ∥b − c∥2 and Lemma 2.4.2, we have ∥x+ − x∗ ∥W2 − ∥x+ − x∗ ∥2(1−θ)(I−W) + f + ∥y − y ∥M ∗ 2 =∥x − x∗ ∥2W − ∥x − x∗ ∥2(1−θ)(I−W)f + ∥y − y ∥M ∗ 2 ∗ 2 − ∥x − x+ ∥2W + ∥x − x+ ∥2(1−θ)(I−W) + f − ∥x − x ∥W−W f − 2∥x+ − x∗ ∥2G − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.20) Note that the following inequality holds, 1 2 ∥x+ − x∗ ∥2W−W f ≤ ∥x+ − x∗ ∥2W−W f + 21 ∥x − x∗ ∥2W−W f − 14 ∥x − x+ ∥2W−W f . 15 Adding it onto both sides of (2.20), we have ∥x+ − x∗ ∥2H − 21 ∥x+ − x∗ ∥2G + ∥y+ − y∗ ∥2M ≤∥x − x∗ ∥2H − 21 ∥x − x∗ ∥2G + ∥y − y∗ ∥2M 1 − ∥x − x+ ∥2W − ∥x − x+ ∥2(θ− 3 )(I−W) + 2 f + ∥x − x ∥G 4 4 − 2∥x+ − x∗ ∥2G − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.21) Apply the inequality 41 ∥x−x+ ∥2G ≤ 21 ∥x−x∗ ∥2G + 12 ∥x+ −x∗ ∥2G , then the key inequality (2.18) is obtained. In the following theorem, we assume G ̸= 0 (i.e., W f ̸= (I + W)/2). It is easy to amend the proof to show the result for this special case. Theorem 2.4.1 (Q-linear convergence of EXTRA). Under Assumptions 2.3.1-2.3.3, we define 4θ−3 r1 = 4(1−θ)2 λ −1 (I−W)) > 0, (2.22) max (W f 1 r2 = 2λ −1 ) > 0, (2.23) max (GW r3 = r1 +rr21 +r r2 1 r2 ∈ (0, 1), (2.24) and choose two small parameters ξ and η such that  n o ξ ∈ 0, min 4λ r(WM) 3 ,1 , (2.25) max   η ∈ 0, 4αλ λmin(W)−2α(W)ξ 2L . (2.26) min In addition, we define P :=H + 2ξ (I − W) ≻ 0, Q :=M + (r3 − 2ξλmax (WM))W−1 ≻ 0. Then for any stepsize α ∈ (0, 2λminL(W) ), we have ∥xk+1 − x∗ ∥2P + ∥yk+1 − y∗ ∥2Q ≤ ρ(∥xk − x∗ ∥2P + ∥yk − y∗ ∥2Q ), (2.27) 16 where n     α2 L 2α2 L η ρ := max 1 − 2α − λmin (W) µg , 4α − λmin (W) ξ , o (2.28) r3 −4ξλmax (WM) 1 − r +(1−2ξ)λ (WM) < 1. 3 max Proof. From (2.18) in Lemma 2.4.3, we have ∥x+ − x∗ ∥2H + ∥y+ − y∗ ∥2M ≤∥x − x∗ ∥2H + ∥y − y∗ ∥2M − ∥x − x+ ∥2(θ− 3 )(I−W) + f − ∥x − x ∥G ∗ 2 4 − ∥x − x+ ∥2W − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.29) Then we find an upper bound of −∥x − x+ ∥W 2 − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩. − ∥x − x+ ∥W 2 − 2α⟨x+ − x∗ , ∇f (x) − ∇f (x∗ )⟩ =α2 ∥∇f (x) − ∇f (x∗ )∥W 2 ∗ −1 − 2α⟨x − x , ∇f (x) − ∇f (x )⟩ ∗ − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 2L ≤ − 2α − λ α (W) ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  min − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 , where, the inequality comes from (2.8). Combining it with (2.29), we have ∥x+ − x∗ ∥2H + ∥y+ − y∗ ∥2M − ∥x − x∗ ∥2H − ∥y − y∗ ∥2M 2L ≤ − 2α − λ α (W) ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  min − ∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥W 2 −1 ∗ 2 − ∥x − x+ ∥2(θ− 3 )(I−W) + f − ∥x − x ∥G . (2.30) 4 The inequality (2.30) shows that {(xk , yk )}∞ k=1 is a Cauchy sequence converging to the fixed point (x∗ , y∗ ) of (2.4). 
From (2.9), we can bound the first term on the right hand side of (2.30) as α2 L ⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩  − 2α − λmin (W) α2 L α2 L η∥x − x∗ ∥2I−W − 2α − µg ∥x − x∗ ∥2 .   ≤ 2α − λmin (W) λmin (W) (2.31) 17 Next, we bound the two terms involving successive iterated points, i.e., −∥W(x − x+ ) − α(∇f (x) − ∇f (x∗ ))∥2W−1 − ∥x − x+ ∥2(θ− 3 )(I−W) f . Note that 4 W(x − x+ ) − α(∇f (x) − ∇f (x∗ )) =G(x+ − x∗ ) − (y+ − y∗ ) + (1 − θ)(I − W)(x f − x+ ). (2.32) We use T1 , T2 , and T3 to denote the three terms on the right hand side of (2.32), respec- tively. Using the definition of r1 in (2.22), we have 2 + 2 − ∥T1 + T2 + T3 ∥W −1 − ∥x − x ∥ (θ− 3 )(I−W) 4 f − 12 1 = − ∥T1 + T2 ∥2W−1 − 2⟨W (T1 + T2 ), W− 2 T3 ⟩ − ∥T3 ∥2W−1 4θ−3 − 4(1−θ) ∥x − x+ ∥2(1−θ)(I−W) f 1 1 ≤ − ∥T1 + T2 ∥2W−1 − 2⟨W− 2 (T1 + T2 ), W− 2 T3 ⟩ − (1 + r1 )∥T3 ∥2W−1 r1 ≤− 1+r1 ∥T1 + T2 ∥2W−1 , where the last inequality comes from the Cauchy inequality 1 −2⟨a, b⟩ ≤ 1+r1 ∥a∥2 + (1 + r1 )∥b∥2 . Combining it with the last term −∥x+ − x∗ ∥2G on the right hand side of (2.30), we have r1 − 1+r1 ∥T1 + T2 ∥2W−1 − ∥x+ − x∗ ∥2G 1 1 ≤− r1 1+r1 ∥T2 ∥2W−1 − 2r1 1+r1 ⟨W− 2 T1 , W− 2 T2 ⟩ − r1 1+r1 ∥T1 ∥2W−1 2 − r2 ∥T1 ∥W 1 −1 − 2 ∥x + − x∗ ∥2G ≤ − r3 ∥y+ − y∗ ∥2W−1 − 2ξ ∥x+ − x∗ ∥2G , (2.33) where ξ < 1 is a small positive parameter, and r2 and r3 are defined as (2.23) and (2.24), respectively. Since G = (I − W) − 2(W f − W), we have ∥x+ − x∗ ∥2G = ∥x+ − x∗ ∥2I−W − 2∥y − y+ ∥2M . (2.34) 18 Therefore r1 − 1+r1 ∥T1 2 + T2 ∥W −1 − ∥x + − x∗ ∥2G ξ ≤ − r3 ∥y+ − y∗ ∥W 2 −1 − 2 ∥x + − x∗ ∥2I−W − ξ∥y − y+ ∥2M ξ ≤ − r3 ∥y+ − y∗ ∥W 2 −1 − 2 ∥x + − x∗ ∥2I−W + 2ξ∥y+ − y∗ ∥2M + 2ξ∥y − y∗ ∥2M ≤ − (r3 /λmax (WM) − 2ξ)∥y+ − y∗ ∥2M − 2ξ ∥x+ − x∗ ∥2I−W + 2ξ∥y − y∗ ∥2M . (2.35) Let ξ < r3 /(4λmax (WM)), then we have r3 /λmax (WM) − 2ξ > 2ξ. Putting (2.31) and (2.35) together onto (2.30), we have ∥x+ − x∗ ∥2H + 2ξ ∥x+ − x∗ ∥2I−W + (1 + (r3 /λmax (WM) − 2ξ))∥y+ − y∗ ∥2M 2L 2L ≤ 1 − 2α − λ α (W) µg ∥x − x∗ ∥2H + 2α − λ α (W) η∥x − x∗ ∥2I−W    min min + (1 + 2ξ)∥y − y∗ ∥2M . Let ρ be defined as (2.28), we get (2.27). Note that the choice of ξ and η affects the defini- tion of P and Q, but not the algorithm. Hence for any α ∈ (0, 2λminL(W) ), Q-linear conver- gence is guaranteed for (xk − x∗ , yk − y∗ ). Because ∥xk −x∗ ∥2P ≤ ∥xk −x∗ ∥2P +∥yk −y∗ ∥2Q , the sequence {∥xk −x∗ ∥2P }∞ k=1 converges R-linearly to 0 at the rate of ρ. I+W Two special cases are not covered by the theorem: θ = 1 and W f = 2 . When θ = 1, r2 I+W we have r1 = ∞ and r3 = 1+r2 . When W f = 2 , i.e., G = 0, we have r2 = ∞ and r1 r3 = 1+r1 . In both cases, the linear convergence rate is n     2α2 L 4α2 L η ρ = max 1 − 2α − 2−θ+θλmin (W) µg , 4α − 2−θ+θλmin (W) ξ , o (2.36) βr3 −4ξ(2−θβ) 1− βr3 +(1−2ξ)(2−θβ) , where β = 1 − λ2 (W) is the spectral gap. It is exactly the limit of ρ in (2.28) with r1 or r2 approaching infinity. Remark 2.4.1. The upper bound for the stepsize α, 2(1 − θ + θλmin (W))/L, f is much larger than 2 that in [1] for ensuring linear convergence, 2µg λmin (W)/L f , when W f is positive definite. In the 19 f = (I + W)/2, we have α < (2 − θ + θλmin (W))/L. Since we can choose θ as special case W close as possible to 3/4, the upper bound of α attains (3λmin (W) + 5)/(4L), which coincides the optimal bound given in [3] for general convex functions. In [3], the linear convergence was shown under the strong convexity of all functions {fi }ni=1 . 2.4.2 NIDS without Nonsmooth Term We consider NIDS next. 
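Before the analysis, here is a minimal sketch of the NIDS update (2.3) on the same toy consensus-averaging problem used for EXTRA above; the topology, the data, and the stepsize taken just below the network-independent bound 2/L are assumptions for this illustration only.

```python
import numpy as np

# Minimal sketch of the NIDS iteration (2.3) on the illustrative
# consensus-averaging problem: f_i(x) = 0.5 * ||x - b_i||^2 on a 5-node ring.
# The data, topology, and stepsize choice are assumptions for this example only.
np.random.seed(1)
n, p = 5, 2
b = np.random.randn(n, p)

def grad_f(X):
    return X - b

W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
M = (np.eye(n) + W) / 2

L = 1.0
alpha = 0.95 * (2.0 / L)                        # network-independent stepsize in (0, 2/L)

X_prev = np.zeros((n, p))                       # x^0
X = M @ (X_prev - alpha * grad_f(X_prev))       # x^1 = (I+W)/2 [x^0 - alpha grad f(x^0)]
for _ in range(300):
    # x^{k+2} = (I+W)/2 [ 2 x^{k+1} - x^k - alpha (grad f(x^{k+1}) - grad f(x^k)) ]
    X_next = M @ (2 * X - X_prev - alpha * (grad_f(X) - grad_f(X_prev)))
    X_prev, X = X, X_next

print("distance to consensus optimum:", np.linalg.norm(X - b.mean(axis=0)))
```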
In the smooth case, NIDS can be recovered by PAPC applied to the primal form of the decentralized consensus problem √ minimize f (x), s.t. I − Wx = 0. x It motivates us to show the inequality in Lemma 2.4.5. [2, Lemma 1] shows that, with the initialization (d0 = 0, x0 ), the fixed point (d∗ ∈ Range(I − W), x∗ ) of (2.5) satisfies d∗ + ∇f (x∗ ) = 0, (2.37a) (I − W)x∗ = 0, (2.37b) and x∗ is the consensual solution to the problem (2.1). We will use the following important equality, which can be derived from (2.5) (I − I−W 2 )(dk+1 − dk ) = I−W 2α (xk+1 − x∗ ). (2.38) Motivated by the proof of EXTRA, we introduce another matrix to measure the dis- tance to the fixed point. We still pick θ ∈ ( 34 , 1] such that     I+W I−W θ 2 + (1 − θ)I = I − θ 2 ≻ 0. (2.39) Define a new symmetric matrix  † f = 2(I − W)† − θI = M I−W − θI. (2.40) 2 Then M f is a norm over Range(I − W). Note that M f = −θ1. In f is invertible because M1 the following proofs, we use the same simplification x and x+ . 20 Lemma 2.4.4 (Equality). Let {(dk , xk )}∞ k=1 be the sequence generated by (2.5), we have the fol- lowing two equalities: ⟨xk+1 − x∗ , dk+1 − d∗ ⟩ = α⟨dk+1 − dk , dk+1 − d∗ ⟩M−(1−θ)I f (2.41a) ⟨xk+1 − x∗ , dk+1 − dk ⟩ = α∥dk+1 − dk ∥2M−(1−θ)I f . (2.41b) Proof. Since d+ − d∗ ∈ Range(I − W), we have ⟨x+ − x∗ , d+ − d∗ ⟩ =⟨(I − W)(x+ − x∗ ), (I − W)† (d+ − d∗ )⟩ =α⟨(2I − (I − W))(d+ − d), (I − W)† (d+ − d∗ )⟩ =α⟨(2(I − W)† − I)(d+ − d), d+ − d∗ ⟩, (2.42) where the second equality follows (2.38). Replacing d∗ with d in (2.42), we get (2.41b) in the same way. Lemma 2.4.5 (A key inequality for NIDS). Let {(dk , xk )}∞ k=1 be generated by (2.5). We have, with any r4 ∈ (0, θ − 43 ), ∥xk+1 − x∗ ∥2 + α2 ∥dk+1 − d∗ ∥2M+(θ− 1 +2r4 )I f 2 ≤∥xk − x∗ ∥2 + α2 ∥dk − d∗ ∥2M+(θ− 1 −2r − α2 ∥dk − dk+1 ∥2M+(θ− 3 −r 4 )I 4 )I f f 2 4 + α2 ∥∇f (xk ) − ∇f (x∗ )∥2 − 2α⟨xk − x∗ , ∇f (xk ) − ∇f (x∗ )⟩. (2.43) Proof. The iteration (2.5) and the definition of M f in (2.40) show 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ =2⟨x − x∗ , x − x+ ⟩ − 2α⟨x − x∗ , d+ − d∗ ⟩ =2⟨x − x+ , x − x∗ ⟩ − 2α⟨x − x+ , d+ − d∗ ⟩ − 2α⟨x+ − x∗ , d+ − d∗ ⟩ =2⟨x − x+ , x − αd+ − x∗ + αd∗ ⟩ + 2α2 ⟨d − d+ , d+ − d∗ ⟩M−(1−θ)I f =2⟨x − x+ , x+ − x∗ + α∇f (x) − α∇f (x∗ )⟩ + 2α2 ⟨d − d+ , d+ − d∗ ⟩M−(1−θ)If , where the first and the last equalities use (2.5b) and the third one follows (2.41a). 21 From (2.5b), we obtain 2α⟨x − x+ , ∇f (x) − ∇f (x∗ )⟩ =∥x − x+ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − ∥x − x+ − α∇f (x) + α∇f (x∗ )∥2 =∥x − x+ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − α2 ∥d+ − d∗ ∥2 . (2.44) Together with the basic equality 2⟨a − b, b − c⟩ = ∥a − c∥2 − ∥b − c∥2 − ∥a − b∥2 , we get ∥x+ − x∗ ∥2 + α2 ∥d+ − d∗ ∥2M−(1−θ)I f =∥x − x∗ ∥2 + α2 ∥d − d∗ ∥2M−(1−θ)I f − α2 ∥d − d+ ∥2M−(1−θ)I f − α2 ∥d+ − d∗ ∥2 + α2 ∥∇f (x) − ∇f (x∗ )∥2 − 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩. (2.45) 3 Since r4 < θ − 4 ≤ 1/4, the following inequality holds, −( 21 − 2r4 )∥d+ − d∗ ∥2 ≤ ( 21 − 2r4 )∥d − d∗ ∥2 − ( 41 − r4 )∥d − d+ ∥2 . Adding it onto both sides of (2.45), we get (2.43). Theorem 2.4.2 (Q-linear convergence for NIDS). Under Assumptions 2.3.1-2.3.3, we define  (λmax (I − W) − 2)2  r5 = max 2, . (2.46) 2 − ( 43 + r4 )λmax (I − W) 1 For any stepsize α ∈ (0, L2 ), we choose η ∈ (0, α(2−αL)r 5 ) and define n o 4r4 ρ3 = max 1 − α(2 − αL)µg , α(2 − αL)ηr5 , 1 − 2λmax ((I−W)+ )− 12 +2r4 < 1, (2.47) Then we have ∥xk+1 − x∗ ∥2I+ I−W + α2 ∥dk+1 − d∗ ∥2Q r5 (2.48) ≤ρ(∥x −k x∗ ∥2I+ I−W 2 + α ∥d − k d∗ ∥2Q ), r5 where Q := Mf + (θ − 1 + 2r4 )I ≻ 0. 2 22 Proof. 
Given any α ∈ (0, L2 ), we have α2 ∥∇f (x) − ∇f (x∗ )∥2 − 2α⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ ≤ − α(2 − αL)⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ = − α(2 − αL)⟨x − x∗ , ∇f (x) − ∇f (x∗ )⟩ − α(2 − αL)η∥x − x∗ ∥2I−W + α(2 − αL)η∥x − x∗ ∥2I−W ≤ − α(2 − αL)µg ∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W , where the first inequality is from (2.8) and the second one uses (restricted) strong convex- ity (2.9). Together with (2.43), we have ∥x+ − x∗ ∥2 + α2 ∥d+ − d∗ ∥2M+(θ− 1 +2r 4 )I f 2 ≤∥x − x∗ ∥2 + α2 ∥d − d∗ ∥2M+(θ− 1 −2r − α2 ∥d − d+ ∥2M+(θ− 3 −r 4 )I 4 )I f f 2 4 − α(2 − αL)µg ∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W , (2.49) The equality (2.38) gives ∥x+ − x∗ ∥2I−W =∥(I − W)(x+ − x∗ )∥2(I−W)† = α2 ∥(2I − (I − W))(d+ − d)∥2(I−W)† =α2 ∥d − d+ ∥2(2I−(I−W))(I−W)† (2I−(I−W)) = α2 ∥d − d+ ∥24(I−W)† −4I+(I−W) ≤α2 r5 ∥d − d+ ∥2M+(θ− 3 −r )I , (2.50) 4 f 4 where the second equality follows (2.38), the fourth one is from d − d+ ∈ Range(I − W), and the inequality holds with the definition of r5 in (2.46). Combing (2.49) and (2.50), we derive ∥x+ − x∗ ∥2 + 1 r5 ∥x+ − x∗ ∥2I−W + α2 ∥d+ − d∗ ∥2M+(θ− 1 +2r 4 )I f 2 ≤(1 − α(2 − αL)µg )∥x − x∗ ∥2 + α(2 − αL)η∥x − x∗ ∥2I−W + α2 ∥d − d∗ ∥2M+(θ− 1 −2r )I . (2.51) 4 f 2 Let ρ3 be defined as (2.47), and we show (2.48). Meanwhile, the Q-linear convergence of (dk , xk ) implies the R-linear convergence of xk . 23 This theorem shows that NIDS is still linearly convergent over a relaxed W and keeps 2 the network-independent stepsize, which attains L practically. 2.5 Numerical Experiments In this section, we compare the performance of EXTRA and NIDS over the relaxed mixing matrices in the following two scenarios: • Comparison of Decentralized Gradient Descent (DGD), EXTRA, and NIDS with dif- ferent stepsizes for a doubly stochastic matrix W. • Comparison of EXTRA and NIDS with different stepsizes for a relaxed matrix W. We consider the following decentralized sensing problem. Each agent i ∈ {1, · · · , n} has its own private measured data Mi ∈ Rmi ×p and yi ∈ Rmi based on the unknown common variable x ∈ Rp . Suppose that yi = Mi x + ei with independently identically distributed random noise ei ∈ Rmi . The goal is to estimate x cooperatively over the network, and the problem is n 1X1 minimize f¯(x) = ∥Mi x − yi ∥22 . x n i=1 2 The data {Mi }ni=1 and x are generated from Gaussian distribution. We normalize each Mi such that ∥Mi⊤ Mi ∥ = 10, i.e., L = 10. In both scenarios, we set n = 10, p = 5, x0 = 0, and I+W W f= 2 for EXTRA. For the first scenario, we construct the matrix W based on the Metropolis constant edge weight matrix in [1, §2.4]. In this case W f is positive definite, and we can set θ = 3 . 4 5I+3W (1+λmin (W))µf¯ Then W = 8 . We implement EXTRA with three different stepsizes: α1 = 100 1+λmin (W) (the stepsize for linear convergence in [1]), α2 = 10 (the stepsize for convergence 5+3λmin (W) only in [1]), and α3 = 40 (our largest stepsize). For NIDS, the stepsize is set to 1 α4 = 5 although it is the upper bound of the stepsize which is not attainable in our proof theoretically. 24 The result with mi = 1 is illustrated in Fig. 2.1. Because we have n > p, the function f¯(x) is strongly convex with probability one. NIDS requires the least number of iteration to attain the expected tolerance. Meanwhile, EXTRA with our proposed stepsize has better performance than that given in [1]. 
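The experimental setup above can be reproduced in a few lines; the sketch below generates the decentralized sensing data and a mixing matrix for a random graph. The graph-sampling rule and the Metropolis-Hastings weight formula are stand-in assumptions for the Metropolis constant edge weight construction of [1, §2.4].

```python
import numpy as np

# Sketch of the decentralized sensing setup in this section. The random-graph
# sampling and the Metropolis-Hastings weights below are illustrative stand-ins
# for the Metropolis constant edge weight construction of [1, Section 2.4].
np.random.seed(2)
n, p, m_i = 10, 5, 1

x_true = np.random.randn(p)
M_list, y_list = [], []
for i in range(n):
    Mi = np.random.randn(m_i, p)
    Mi *= np.sqrt(10.0 / np.linalg.norm(Mi.T @ Mi, 2))   # normalize so ||Mi^T Mi|| = 10
    M_list.append(Mi)
    y_list.append(Mi @ x_true + 0.1 * np.random.randn(m_i))

# Random connected graph: a ring plus a few random chords.
edges = {(min(i, (i + 1) % n), max(i, (i + 1) % n)) for i in range(n)}
while len(edges) < 15:
    i, j = np.random.randint(n, size=2)
    if i != j:
        edges.add((min(i, j), max(i, j)))

deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:                       # Metropolis-Hastings weight on each edge
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))        # diagonal chosen so each row sums to one

print("W symmetric:", np.allclose(W, W.T))
print("eigenvalues of W:", np.round(np.linalg.eigvalsh(W), 3))
print("local gradient at x = 0, node 0:", -M_list[0].T @ y_list[0])
```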
Figure 2.1: LEFT: the error ∥x^k − x^*∥_F / ∥x^0 − x^*∥_F vs. iterations for DGD with different stepsizes, EXTRA with three stepsizes, and NIDS. RIGHT: the random network with 10 nodes.

Figure 2.2: LEFT: the error ∥x^k − x^*∥_F / ∥x^0 − x^*∥_F vs. iterations for DGD with different stepsizes, EXTRA with three stepsizes, and NIDS. RIGHT: the random network with 10 nodes.

Then, we set mi = 10 in Fig. 2.2. In this case, the individual functions fi(x) and f(x) are strongly convex. NIDS and EXTRA with the largest stepsize give the best performance. Here two of the EXTRA results coincide with that of NIDS although they use different stepsizes. This observation may indicate that there is an optimal choice of stepsize between α2 and α3 for both EXTRA and NIDS. By setting α5 = (5 + 3λmin(W))/(40 + µf̄) for EXTRA and α6 = 2/(10 + µf̄) for NIDS, we compare these algorithms in Fig. 2.3. This figure suggests that the optimal stepsize may depend on the problem/functions. How to find the optimal stepsize is an important research topic and beyond the scope of this chapter.

Figure 2.3: The comparison of the proved stepsizes for EXTRA and NIDS with the optimal choice.

Next, we turn to the relaxed mixing matrices. Based on the previously created W, we replace it by Wnew = (4W − I)/3 to scale the range of eigenvalues to (−5/3, 1]. In this case, some diagonal entries of Wnew may be negative. We consider the worst network topology, the line topology, i.e., each agent has at most two neighbors. In this experiment, we solve the same problem using EXTRA and NIDS on the unrelaxed and relaxed mixing matrices, respectively, over the line. For NIDS, since the stepsize is network-independent, we relax the mixing matrix W to Wnew more aggressively so that λmin(Wnew) approaches −5/3, and we compare the performance with the unrelaxed case of NIDS under α = 1/5. For EXTRA, we set the stepsize to α = (5 + 3λmin(W))/40 and compare the performance with the relaxed case under the stepsize α = (5 + 3λmin(Wnew))/40, where we only perturb W mildly so that λmin(Wnew) approaches −1. The result is shown in Fig. 2.4: if the topology of the network is weak, switching to the relaxed mixing matrix may offer better performance when using NIDS and EXTRA to solve the problem.
Over the unrelaxed mixing matrix, NIDS leads the performance in most cases and is the easiest to implement. If the topology of network is weak, using the relaxed mixing matrix can accelerate NIDS. For EXTRA, in general, we may not choose the mixing matrices to be relaxed due to the tiny improvement, but the larger stepsize derived in the relaxed case is competent to be considered. 27 CHAPTER 3 DORE: A CENTRALIZED ALGORITHM WITH BIDIRECTIONAL COMMUNICATION COMPRESSION 3.1 Introduction In the well-known parameter server framework [7, 11], each worker node evaluates its own stochastic gradient {∇fe i (xk )}ni=1 and send it to the master node, which collects all gradients and calculates their average (1/n) ni=1 ∇f k P e i (x ). Then the master node further takes the gradient descent step with the averaged gradient and broadcasts the new model parameter xk+1 to all worker nodes. It makes use of the computational resources from all nodes. In reality, the network bandwidth is often limited. Thus, the communication cost for the gradient transmission and model synchronization becomes the dominating bottle- necks as the number of nodes and the model size increase, which hinders the scalability and efficiency of SGD. One common way to reduce the communication cost is to compress the gradient infor- mation by either gradient sparsification or quantization [13, 12, 15, 68, 69, 70, 71, 72] such that many fewer bits of information are needed to be transmitted. However, little atten- tion has been paid on how to reduce the communication cost for model synchronization and the corresponding theoretical guarantees. Obviously, the model shares the same size as the gradient, so does the communication cost. Thus, merely compressing the gradient can reduce at most 50% of the communication cost, which suggests the importance of model compression. Notably, the compression of model parameters is much more chal- lenging than gradient compression. One key obstacle is that its compression error cannot be well controlled by the step size γ and thus it cannot diminish like that in the gradient compression [20]. In this chapter, we aim to bridge this gap by investigating algorithms to compress the full communication in the optimization process and understanding their 28 theoretical properties. Our contributions can be summarized as: • We proposed DORE, which can compress both the gradient and the model information such that more than 95% of the communication cost can be reduced. • We provided theoretical analyses to guarantee the convergence of DORE under strongly convex and nonconvex assumptions without the bounded gradient assumption. • Our experiments demonstrate the superior efficiency of DORE comparing with the state- of-art baselines without degrading the convergence speed and the model accuracy. 3.2 Background Recently, many works try to reduce the communication cost to speed up the distributed learning, especially for deep learning applications, where the size of the model is typi- cally very large (so is the size of the gradient) while the network bandwidth is relatively limited. Below we briefly review relevant papers. Gradient quantization and sparsification. Recent works [13, 12, 71, 17, 14] have shown that the information of the gradient can be quantized into a lower-precision vec- tor such that fewer bits are needed in communication without loss of accuracy. [12] pro- posed 1Bit SGD that keeps the sign of each element in the gradient only. 
It empirically works well, and [14] provided theoretical analysis systematically. QSGD [13] utilizes an unbiased multi-level random quantization to compress the gradient while Terngrad [71] quantizes the gradient into ternary numbers {0, ±1}. In DIANA [17], the gradient differ- ence is compressed and communicated contributing to the estimator of the gradient in the master node. Another effective strategy to reduce the communication cost is sparsification. [70] proposed a convex optimization formulation to minimize the coding length of stochastic gradients. A more aggressive sparsification method is to keep the elements with relatively larger magnitude in gradients, such as top-k sparsification [15, 68, 73]. 29 Model synchronization. The typical way for model synchronization is to broadcast model parameters to all worker nodes. Some works [69, 74] have been proposed to reduce model size by enforcing sparsity, but it cannot be applied to general optimization prob- lems. Some alternatives including QSGD [13] and ECQ-SGD [72] choose to broadcast all quantized gradients to all other workers such that every worker can perform model up- date independently. However, all-to-all communication is not efficient since the number of transmitted bits increases dramatically in large-scale networks. DoubleSqueeze [18] ap- plies compression on the averaged gradient with error compensation to speed up model synchronization. Error compensation. [12] applied error compensation on 1Bit-SGD and achieved negligible loss of accuracy empirically. Recently, error compensation was further stud- ied [72, 15, 16] to mitigate the error caused by compression. The general idea is to add the compressed error to the next compression step: ĝ = Q(g + e), e = (g + e) − ĝ. However, to the best of our knowledge, most of the algorithms with error compensa- tion [72, 15, 16, 18] need to assume bounded gradient, i.e., E∥g∥2 ≤ B, and the conver- gence rate depends on this bound. Contributions of DORE. The most related papers to DORE are DIANA [17] and Dou- bleSqueeze [18]. Similarly, DIANA compresses gradient difference on the worker side and achieves good convergence rate. However, it doesn’t consider the compression in model synchronization, so at most 50% of the communication cost can be saved. Doub- leSqueeze applies compression with error compensation on both worker and server sides, but it only considers nonconvex objective functions. Moreover, its analysis relies on a bounded gradient assumption, i.e., E∥g∥2 ≤ B, and the convergence error has a depen- dency on the gradient bound like most existed error compensation works. In general, the uniform bound on the norm of the stochastic gradient is a strong as- sumption which might not hold in some cases. For example, it is violated in the strongly 30 convex case [75, 76]. In this chapter, we design DORE, the first algorithm which utilizes gradient and model compression with error compensation without assuming bounded gradients. Unlike existing error compensation works, we provide a linear convergence rate to the O(σ) neighborhood of the optimal solution for strongly convex functions and a sublinear rate to the stationary point for nonconvex functions with linear speedup. In Table 3.1, we compare the asymptotic convergence rates of different quantized SGDs with DORE. 3.3 Double Residual Compression SGD In this section, we introduce the proposed DOuble REsidual compression SGD (DORE) algorithm. Before that, we introduce a common assumption for the compression operator. 
In this work, we adopt an assumption from [13, 71, 17] that the compression variance is linearly proportional to the magnitude.

Assumption 3.3.1. The stochastic compression operator Q : R^d → R^d is unbiased, i.e., EQ(x) = x, and satisfies

E∥Q(x) − x∥^2 ≤ C∥x∥^2, (3.1)

for a nonnegative constant C that is independent of x. We use x̂ to denote the compressed x, i.e., x̂ ∼ Q(x).

Many feasible compression operators can be applied to our algorithm since our theoretical analyses are built on this common assumption. Some examples of feasible stochastic compression operators include:

• No Compression: C = 0 when there is no compression.

• Stochastic Quantization: A real number x ∈ [a, b] (a < b) is set to a with probability (b − x)/(b − a) and to b with probability (x − a)/(b − a), where a and b are predefined quantization levels [13]. It satisfies Assumption 3.3.1 when ab > 0 and a < b.

• Stochastic Sparsification: A real number x is set to 0 with probability 1 − p and to x/p with probability p [71]. It satisfies Assumption 3.3.1 with C = 1/p − 1.

• p-norm Quantization: A vector x is quantized elementwise by Q_p(x) = ∥x∥_p sign(x) ◦ ξ, where ◦ is the Hadamard product and ξ is a Bernoulli random vector satisfying ξ_i ∼ Bernoulli(|x_i|/∥x∥_p). It satisfies Assumption 3.3.1 with C = max_{x∈R^d} ∥x∥_1 ∥x∥_p / ∥x∥_2^2 − 1 [17].

Figure 3.1: An illustration of DORE (the master and the worker nodes exchange a compressed gradient residual and a compressed model residual).

To decrease the constant C for higher accuracy, we can further decompose a vector x ∈ R^d into blocks, i.e., x = (x(1)^⊤, x(2)^⊤, · · · , x(m)^⊤)^⊤ with x(l) ∈ R^{d_l} and Σ_{l=1}^m d_l = d, and compress the blocks independently.

3.3.1 The Proposed DORE

Many previous works [13, 12, 71] reduce the communication cost of P-SGD by quantizing the stochastic gradient before sending it to the master node, but there are several intrinsic issues.

First, these algorithms incur extra optimization error intrinsically. Let us consider the case when the algorithm converges to the optimal point x^*. By the first-order optimality, we have (1/n) Σ_{i=1}^n ∇f_i(x^*) = 0. However, the data distributions may be different for different worker nodes in general, and thus we may have ∇f_i(x^*) ≠ ∇f_j(x^*) for all i, j ∈ {1, . . . , n} with i ≠ j. In other words, each individual ∇f_i(x^*) may be far away from zero. This will cause a large compression variance according to Assumption 3.3.1, which indicates that the upper bound of the compression variance E∥Q(x) − x∥^2 is linearly proportional to the magnitude of x.

Second, most existing algorithms [12, 13, 71, 14, 72, 17] need to broadcast the model or gradient to all worker nodes in each iteration. This is a considerable bottleneck for efficient optimization since the amount of bits to transmit is the same as for the uncompressed gradient. DoubleSqueeze [18] is able to apply compression on both the worker and server sides. However, its analysis depends on a strong bounded gradient assumption. Meanwhile, no theoretical guarantees are provided for convex problems.
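As a concrete instance of Assumption 3.3.1, the following NumPy sketch (illustrative only; the function names and the block size are ours, not from the text) implements the blockwise ∞-norm case of the Bernoulli p-norm quantization listed above and checks its unbiasedness empirically.

import numpy as np

def pnorm_quantize(x, rng=np.random.default_rng()):
    # Bernoulli quantization with p = infinity: Q(x) = ||x||_inf * sign(x) o xi,
    # where xi_i ~ Bernoulli(|x_i| / ||x||_inf); unbiased by construction (EQ(x) = x).
    s = np.linalg.norm(x, np.inf)
    if s == 0.0:
        return np.zeros_like(x)
    xi = (rng.random(x.shape) < np.abs(x) / s).astype(x.dtype)
    return s * np.sign(x) * xi

def blockwise_quantize(x, block=256, rng=np.random.default_rng()):
    # compressing blocks independently reduces the variance constant C
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        out[start:start + block] = pnorm_quantize(x[start:start + block], rng)
    return out

# empirical unbiasedness check: averaging many compressions recovers x
x = np.random.default_rng(0).standard_normal(512)
avg = np.mean([blockwise_quantize(x) for _ in range(2000)], axis=0)
print(np.max(np.abs(avg - x)))   # close to 0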
Algorithm 3.1: DORE
1: Input: stepsizes α, β, γ, η; initialize h^0 = h_i^0 = 0_d, x̂_i^0 = x̂^0, ∀i ∈ {1, . . . , n}.
2: for k = 1, 2, · · · , K − 1 do
3:   For each worker i ∈ {1, 2, · · · , n}:
4:     Sample g_i^k such that E[g_i^k | x̂_i^k] = ∇f_i(x̂_i^k)
5:     Gradient residual: Δ_i^k = g_i^k − h_i^k
6:     Compression: Δ̂_i^k = Q(Δ_i^k)
7:     h_i^{k+1} = h_i^k + α Δ̂_i^k
8:     {ĝ_i^k = h_i^k + Δ̂_i^k}
9:     Send Δ̂_i^k to the master
10:    Receive q̂^k from the master
11:    x̂_i^{k+1} = x̂_i^k + β q̂^k
12:   For the master:
13:    Receive {Δ̂_i^k} from the workers
14:    Δ̂^k = (1/n) Σ_{i=1}^n Δ̂_i^k
15:    ĝ^k = h^k + Δ̂^k  {= (1/n) Σ_{i=1}^n ĝ_i^k}
16:    x^{k+1} = prox_{γR}(x̂^k − γ ĝ^k)
17:    h^{k+1} = h^k + α Δ̂^k
18:    Model residual: q^k = x^{k+1} − x̂^k + η e^k
19:    Compression: q̂^k = Q(q^k)
20:    e^{k+1} = q^k − q̂^k
21:    x̂^{k+1} = x̂^k + β q̂^k
22:    Broadcast q̂^k to the workers
23: end for
24: Output: x̂^K or any x̂_i^K

We propose DORE to address all the aforementioned issues. Our motivation is that the gradient should change smoothly for smooth functions, so each worker node can keep a state variable h_i^k to track its previous gradient information. As a result, the residual between the new gradient and the state h_i^k should decrease, and the compression variance of the residual can be well bounded. On the other hand, as the algorithm converges, the model only changes slightly. Therefore, we propose to compress the model residual so that its compression variance is minimized and well bounded. We also compensate the model residual compression error into the next iteration to achieve better convergence. Due to the advantages of the proposed double residual compression scheme, we can derive the fastest convergence rate through analyses without the bounded gradient assumption.

Below are the key steps of our algorithm, as shown in Algorithm 3.1 and Figure 3.1:

[lines 4-9]: each worker node sends the compressed gradient residual (Δ̂_i^k) to the master node and updates its state h_i^k with Δ̂_i^k;

[lines 13-15]: the master node gathers the compressed gradient residuals {Δ̂_i^k} from all worker nodes and recovers the averaged gradient ĝ^k based on its state h^k;

[line 16]: the master node applies a gradient descent step (possibly with the proximal operator);

[lines 18-22]: the master node broadcasts the compressed model residual with error compensation (q̂^k) to all worker nodes and updates the model;

[lines 10-11]: each worker node receives the compressed model residual (q̂^k) and updates its model x̂_i^k.

In the algorithm, the state h_i^k serves as an exponential moving average of the local gradient in expectation, i.e., E_Q h_i^{k+1} = (1 − α)h_i^k + α g_i^k, as proved in Lemma A.4.1. Therefore, as the iterate approaches the optimum, h_i^k also approaches the local gradient ∇f_i(x^*) rapidly, which contributes to a small gradient residual and consequently a small compression variance. Similar difference compression techniques are also proposed in DIANA and its variance-reduced variant [17, 77].

Note: the equations in the curly brackets of Algorithm 3.1 are only notation used in the proofs and do not need to be computed.
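To make the data flow of Algorithm 3.1 concrete, here is a minimal single-process NumPy sketch (illustrative only, not the implementation used in the experiments) of one DORE iteration in the smooth case R = 0, with the compression operator Q passed in as a function.

import numpy as np

def dore_step(grads, state, Q, alpha, beta, gamma, eta):
    # One DORE iteration in the smooth case (R = 0).  `grads` holds the local
    # stochastic gradients evaluated at the workers' model copies state['x_hat_i'].
    n = len(grads)
    # workers: compress and send the gradient residuals (lines 4-9)
    deltas_hat = []
    for i, g in enumerate(grads):
        d_hat = Q(g - state['h_i'][i])            # compressed gradient residual
        state['h_i'][i] = state['h_i'][i] + alpha * d_hat
        deltas_hat.append(d_hat)
    delta_hat_avg = np.mean(deltas_hat, axis=0)
    # master: recover the averaged gradient and take a step (lines 13-17)
    g_hat = state['h'] + delta_hat_avg
    x_new = state['x_hat'] - gamma * g_hat        # prox_{gamma R} is the identity here
    state['h'] = state['h'] + alpha * delta_hat_avg
    # master: compress the model residual with error compensation (lines 18-22)
    q = x_new - state['x_hat'] + eta * state['e']
    q_hat = Q(q)
    state['e'] = q - q_hat
    state['x_hat'] = state['x_hat'] + beta * q_hat
    # workers: receive q_hat and update their model copies (lines 10-11)
    for i in range(n):
        state['x_hat_i'][i] = state['x_hat_i'][i] + beta * q_hat
    return state

Here `state` carries h, the h_i's, x̂, the x̂_i's, and the compensation error e (all initialized as in line 1 of Algorithm 3.1, with e set to zero), and Q is any operator satisfying Assumption 3.3.1; in an actual distributed run the worker loops execute on the workers and the remaining lines on the master.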
3.3.2 Discussion

In this subsection, we provide more detailed discussions about DORE, including the model initialization, the model update, the special smooth case, as well as the compression rate of the communication.

Initialization. It is important to take the identical initialization x̂^0 for all worker and master nodes. This is easy to ensure by either setting the same random seed or broadcasting the model once at the beginning. In this way, although we do not need to broadcast the model parameters directly, every worker node updates the model x̂^k in the same way, so we can keep their model parameters identical. Otherwise, the model inconsistency needs to be considered.

Model update. It is worth noting that although we could choose the accurate model x^{k+1} as the next iterate at the master node, we use x̂^{k+1} instead. In this way, we ensure that the gradient descent step is applied based on the exact stochastic gradient, which is evaluated at x̂_i^k on each worker node. This dispels the intricacy of dealing with an inexact gradient evaluated at x^k and thus simplifies the convergence analysis.

Smooth case. In the smooth case, i.e., R = 0, Algorithm 3.1 can be simplified. The master node quantizes the recovered averaged gradient with error compensation and broadcasts it to all worker nodes. The details of this simplified case can be found in Appendix A.3.

Compression rate. The compression of the gradient information can reduce at most 50% of the communication cost since it only considers compression during gradient aggregation while ignoring the model synchronization. However, DORE can further cut down the remaining 50% of the communication. Taking the blockwise p-norm quantization as an example, every element of x can be represented by 3/2 bits using the simple ternary coding {0, ±1}, along with one magnitude per block. For example, if we consider a uniform block size b, the number of bits needed to represent a d-dimensional vector of 32-bit floating-point numbers can be reduced from 32d bits to 32(d/b) + (3/2)d bits. As long as the block size b is relatively large with respect to the constant 32, the cost 32(d/b) for storing the floating-point magnitudes is relatively small, so the compression rate is close to 32d/((3/2)d) ≈ 21.3 times (for example, 19.7 times when b = 256). Applying this quantization, QSGD, Terngrad, MEM-SGD, and DIANA need to transmit 32d + 32(d/b) + (3/2)d bits per iteration, and thus they are able to cut down 47% of the overall 2 × 32d bits per iteration through gradient compression when b = 256. With DORE, we only need to transmit 2(32(d/b) + (3/2)d) bits per iteration. Thus DORE can reduce over 95% of the total communication by compressing both the gradient and the model transmission. More efficient coding techniques such as Elias coding [78] can be applied to further reduce the number of bits per iteration.

3.4 Convergence Analysis

To show the convergence of DORE, we make the following commonly used assumptions.

Assumption 3.4.1. Each worker node samples an unbiased estimator of the gradient stochastically with bounded variance, i.e., for i = 1, 2, · · · , n and ∀x ∈ R^d,

E[g_i | x] = ∇f_i(x),  E∥g_i − ∇f_i(x)∥^2 ≤ σ_i^2, (3.2)

where g_i is the estimator of ∇f_i at x. In addition, we define σ^2 = (1/n) Σ_{i=1}^n σ_i^2.

Assumption 3.4.2. Each f_i is L-Lipschitz differentiable, i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≤ f_i(y) + ⟨∇f_i(y), x − y⟩ + (L/2)∥x − y∥^2. (3.3)

Assumption 3.4.3. Each f_i is µ-strongly convex (µ ≥ 0), i.e., for i = 1, 2, · · · , n and ∀x, y ∈ R^d,

f_i(x) ≥ f_i(y) + ⟨∇f_i(y), x − y⟩ + (µ/2)∥x − y∥^2. (3.4)

For simplicity, we use the same compression operator for all worker nodes, and the master node can apply a different compression operator. We denote the constants in Assumption 3.3.1 as Cq and Cqm for the worker and master nodes, respectively. Then we set α and β in both algorithms to satisfy

(1 − √(1 − 4Cq(Cq + 1)/(nc))) / (2(Cq + 1)) ≤ α ≤ (1 + √(1 − 4Cq(Cq + 1)/(nc))) / (2(Cq + 1)),
0 < β ≤ 1/(Cqm + 1), (3.5)

with c ≥ 4Cq(Cq + 1)/n.
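As a quick numeric illustration of condition (3.5) (the values of Cq, Cqm, and n below are assumed, not taken from the text), the admissible range of α and the bound on β can be computed as follows.

import math

def dore_alpha_beta_ranges(Cq, Cqm, n, c=None):
    # admissible alpha interval and beta bound from (3.5); c must satisfy
    # c >= 4*Cq*(Cq+1)/n, and by default we take twice that minimum (Cq > 0 assumed)
    c_min = 4.0 * Cq * (Cq + 1.0) / n
    c = 2.0 * c_min if c is None else c
    assert c >= c_min
    root = math.sqrt(1.0 - 4.0 * Cq * (Cq + 1.0) / (n * c))
    alpha_lo = (1.0 - root) / (2.0 * (Cq + 1.0))
    alpha_hi = (1.0 + root) / (2.0 * (Cq + 1.0))
    beta_max = 1.0 / (Cqm + 1.0)
    return (alpha_lo, alpha_hi), beta_max

# example with assumed constants Cq = Cqm = 1 and n = 20 workers:
print(dore_alpha_beta_ranges(Cq=1.0, Cqm=1.0, n=20))
# -> alpha roughly in (0.073, 0.427), beta <= 0.5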
We consider two scenarios in the following two subsections: f is strongly convex with a convex regularizer R and f is nonconvex with R = 0. 3.4.1 The Strongly Convex Case Theorem 3.4.1. Under Assumptions 3.3.1-3.4.3, if α and β in Algorithm 3.1 satisfy (3.5), η and γ satisfy  q −Cqm + (Cqm )2 + 4(1 − (Cqm + 1)β) η < min  , 2Cqm  4µL 2 , (3.6) (µ + L) (1 + cα) − 4µL η(µ + L) 2 ≤γ ≤ , (3.7) 2(1 + η)µL (1 + cα)(µ + L) then we have (1 + η)(1 + ncα) 2 2 Vk+1 ≤ ρk V1 + βγ σ , (3.8) n(1 − ρ) with Vk =β(1 − (Cqm + 1)β)E∥qk−1 ∥2 + E∥x̂k − x∗ ∥2 n (1 + η)cβγ 2 X + E∥hki − ∇fi (x∗ )∥2 , n i=1  (η 2 + η)C m 2(1 + η)βγµL  q ρ = max , 1 + ηβ − , 1 − α < 1. 1 − (Cqm + 1)β µ+L 37 Corollary 3.4.1. When there is no error compensation and we set η = 0, then ρ = max(1 − 2βγµL µ+L ,1 − α). If we further set 1 1 4Cq (Cq + 1) α= , β= , c= , (3.9) 2(Cq + 1) Cqm +1 n 2 and choose the largest stepsize γ = (µ+L)(1+2Cq /n) , the convergent factor is  (µ + L)2  1 Cq  (1 − ρ)−1 = max 2(Cq + 1), (Cqm + 1) + . (3.10) 2µL 2 n Remark 3.4.1. In particular, suppose {∆i }ni=1 are compressed using the Bernoulli p-norm quan- 1 tization with the largest block size dmax , then Cq = αw − 1, with ∥x∥22 αw = min ≤ 1. 0̸=x∈Rdmax ∥x∥1 ∥x∥p 1 Similarly, q is compressed using the Bernoulli p-norm quantization with Cqm = αm − 1. Then the linear convergent factor is n 2 1 (µ + L)2  1 2 2 o −1 (1 − ρ) = max , − + . (3.11) αw αm µL 2 n nαw n  o 2 µ+L 1 1 1 While the result of DIANA in [17] is max αw , µ 2 − n + nαw , which is larger than (3.11) with αm = 1 (no compression for the model). When there is no compression for ∆i , i.e., αw = 1, the algorithm reduces to the gradient descent, and the linear convergent factor is the same as that of the gradient descent for strongly convex functions. Remark 3.4.2. Although error compensation often improves the convergence in practice, in the- ory, no compensation, i.e., η = 0, provides the best convergence rate. This is because we don’t have much information of the error being compensated. Filling this gap will be an interesting future direction. 3.4.2 The Nonconvex Case with R = 0 Theorem 3.4.2. Under Assumptions 3.3.1-3.4.2 and the additional assumption that each worker samples the gradient from the full dataset, we set α and β according to (3.5). By choosing q 2 2 m 2 n −1 + 1 + 48L β C(Cmq +1) 1 o q γ ≤ min , , 12Lβ(Cqm + 1) 6Lβ(1 + cα)(Cqm + 1) 38 we have β K 2 − 3(1 + cα)(Cqm + 1)Lβ 2 γ X E∥∇f (x̂k )∥2 K k=1 Λ1 − ΛK+1 3(Cqm + 1)(1 + ncα)Lβ 2 σ 2 γ ≤ + , (3.12) γK n where Λk =(Cqm + 1)Lβ 2 ∥qk−1 ∥2 + f (x̂k ) − f ∗ n 2 21 X + 3c(Cqm + 1)Lβ γ E∥hki ∥2 . (3.13) n i=1 1 1 4Cq (Cq +1) Corollary 3.4.2. Let α = 2(Cq +1) ,β = Cqm +1 , and c = n , then 1 + ncα is a fixed constant. 1 √ If γ = , when K is relatively large, we have 12L(1+cα)(1+ K/n) K 1 X 1 1 E∥∇f (x̂k )∥2 ≲ +√ . (3.14) K k=1 K Kn √ Remark 3.4.3. The dominant term in (3.14) is O(1/ Kn), which implies that the sample com- plexity of each worker node is O(1/(nϵ2 )) in average to achieve an ϵ-accurate solution. It shows that, same as DoubleSqueeze in [18], DORE is able to perform linear speedup. Furthermore, this convergence result is the same as the P-SGD without compression. Note that DoubleSqueeze 2 has an extra term (1/K) 3 , and its convergence requires the bounded variance of the compression operator. 3.5 Numerical Experiments In this section, we validate the theoretical results and demonstrate the superior perfor- mance of DORE. 
Our experimental results demonstrate that (1) DORE achieves similar convergence speed as full-precision SGD and state-of-the-art quantized SGD baselines and (2) its iteration time is much smaller than that of most existing algorithms, supporting the superior communication efficiency of DORE.

To make a fair comparison, we choose the same Bernoulli ∞-norm quantization as described in Section 3.3 because it is unbiased and commonly used, and the quantization block size is 256 for all experiments unless explicitly stated otherwise. The parameters α, β, η for DORE are chosen to be 0.1, 1, and 1, respectively.

Table 3.1: A comparison between related algorithms. The second column indicates whether the algorithm compresses only the gradients (Uni.) or both the gradients and the model (Bi.). DORE converges linearly to the O(σ) neighborhood of the optimal point like full-precision SGD and DIANA in the strongly convex case while achieving much better communication efficiency. DORE also admits linear speedup in the nonconvex case like DoubleSqueeze, but DORE does not require the assumptions of bounded compression error or bounded gradient.

Algorithm       Direction   Compression           Linear Rate   Nonconvex Rate
QSGD            Uni.        2-norm quantization   N/A           1/K + B
DIANA           Uni.        p-norm quantization   ✓             1/√(Kn) + 1/K
DoubleSqueeze   Bi.         bounded variance      N/A           1/√(Kn) + 1/K^{2/3} + 1/K
DORE            Bi.         Assumption 3.3.1      ✓             1/√(Kn) + 1/K

The baselines we choose to compare include SGD, QSGD [13], MEM-SGD [15], DIANA [17], DoubleSqueeze, and DoubleSqueeze (topk) [18]. SGD is the vanilla SGD without any compression, and QSGD quantizes the gradient directly. MEM-SGD is QSGD with error compensation. DIANA, which only compresses and transmits the gradient difference, is a special case of the proposed DORE. DoubleSqueeze quantizes both the gradient on the workers and the averaged gradient on the server with error compensation. Although DoubleSqueeze is claimed to work well with both biased and unbiased compression, in our experiments it converges much slower and suffers a loss of accuracy with unbiased compression. Thus, we also compare with DoubleSqueeze using the top-k compression as presented in [18].

3.5.1 Strongly convex

To verify the convergence for strongly convex and smooth objective functions, we conduct the experiment on a linear regression problem: f(x) = ∥Ax − b∥^2 + λ∥x∥^2. The data matrix A ∈ R^{1200×500} and the optimal solution x^* ∈ R^{500} are randomly synthesized. Then we generate the prediction b by sampling from a Gaussian distribution whose mean is Ax^*. The rows of the data matrix A are allocated evenly to 20 worker nodes.
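For concreteness, the synthetic problem and its allocation to the workers can be generated along the following lines (a NumPy sketch; the noise scale and the value of λ are assumptions, since they are not specified here).

import numpy as np

rng = np.random.default_rng(0)
n_workers, n_samples, dim = 20, 1200, 500

# synthesize f(x) = ||Ax - b||^2 + lam * ||x||^2
A = rng.standard_normal((n_samples, dim))
x_star = rng.standard_normal(dim)
b = A @ x_star + rng.standard_normal(n_samples)   # mean Ax*, unit noise variance (assumed)
lam = 0.1                                          # assumed value of lambda

# allocate the rows of A (and the entries of b) evenly to the 20 workers
A_parts = np.array_split(A, n_workers)
b_parts = np.array_split(b, n_workers)

def local_full_gradient(i, x):
    # full local gradient of ||A_i x - b_i||^2 + (lam / n_workers) * ||x||^2,
    # splitting the regularizer evenly across workers (a design choice)
    Ai, bi = A_parts[i], b_parts[i]
    return 2.0 * Ai.T @ (Ai @ x - bi) + 2.0 * (lam / n_workers) * x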
Figure 3.2: Per-iteration time cost on Resnet18 for SGD, QSGD, and DORE (time in seconds vs. bandwidth in 1s/Mbit). It is tested in a shared cluster environment connected by a Gigabit Ethernet interface. DORE speeds up the training process significantly by mitigating the communication bottleneck.

To better verify the linear convergence to the O(σ) neighborhood around the optimal solution, we take the full gradient in each node for all algorithms to exclude the effect of the gradient variance (σ = 0). As shown in Figure 3.3, with full gradients and a constant learning rate, DORE converges linearly, the same as SGD and DIANA, but QSGD, MEM-SGD, DoubleSqueeze, as well as DoubleSqueeze (topk) only converge to a neighborhood of the optimal point. This is because these algorithms assume bounded gradients and their convergence errors depend on that bound. Although they converge to the optimal solution using a diminishing stepsize, their convergence rates will be much slower.

In addition, we also validate that the norms of the gradient and model residuals decrease exponentially, which explains the linear convergence behavior of DORE. For more details, please refer to Appendix A.1.

3.5.2 Nonconvex

To verify the convergence in the nonconvex case, we test the proposed DORE with two classical deep neural networks on two representative datasets, i.e., LeNet [79] on MNIST and Resnet18 [80] on CIFAR10. In the experiment, we use 1 parameter server and 10 workers, each of which is equipped with an NVIDIA Tesla K80 GPU. The batch size for each worker node is 256. We use 0.1 and 0.01 as the initial learning rates for LeNet and Resnet18, and decrease them by a factor of 0.1 after every 25 and 100 epochs, respectively. All parameter settings are the same for all algorithms.

Figure 3.3: Linear regression on synthetic data (∥x^k − x^*∥^2 vs. iterations; (a) learning rate = 0.05, (b) learning rate = 0.025). When the learning rate is 0.05, DoubleSqueeze diverges. In both cases, DORE, SGD, and DIANA converge linearly to the optimal point, while QSGD, MEM-SGD, DoubleSqueeze, and DoubleSqueeze (topk) only converge to the neighborhood even when the full gradient is available.

Figure 3.4: LeNet trained on MNIST ((a) training loss, (b) test loss). DORE converges similarly to most baselines. It outperforms DoubleSqueeze using the same compression method while having similar performance to DoubleSqueeze (topk).

Figures 3.4 and 3.5 show the training loss and test loss for each epoch during the training of LeNet on the MNIST dataset and Resnet18 on the CIFAR10 dataset. The results indicate that in the nonconvex case, even with both compressed gradient and model information, DORE can still achieve similar convergence speed as full-precision SGD and other quantized SGD variants. DORE achieves a much better convergence speed than DoubleSqueeze using the same compression method and converges similarly to DoubleSqueeze with top-k compression as presented in [18]. We also validate via the parameter sensitivity analysis in Appendix A.2 that DORE performs consistently well under different parameter settings such as the compression block size, α, β, and η.

Figure 3.5: Resnet18 trained on CIFAR10 ((a) training loss, (b) test loss). DORE achieves similar convergence and accuracy as most baselines. DoubleSqueeze converges slower and suffers from a higher loss, but it works well with top-k compression.
3.5.3 Communication Efficiency In terms of communication cost, DORE enjoys the benefit of extremely efficient commu- nication. As one example, under the same setting as the Resnet18 experiment described in the previous section, we test the time cost per iteration for SGD, QSGD, and DORE un- der varied network bandwidth. We didn’t test MEM-SGD, DIANA, and DoubleSqueeze because MEM-SGD, DIANA have similar time cost as QSGD while DoubleSqueeze has similar time cost as DORE. The result showed in Figure 3.2 indicates that as the band- width becomes worse, with both gradient and model compression, the advantage of DORE becomes more remarkable compared to the baselines that don’t apply compres- sion for model synchronization . 43 3.6 Conclusion Communication cost is the severe bottleneck for distributed training of modern large- scale machine learning models. Extensive works have compressed the gradient informa- tion to be transferred during the training process, but model compression is rather limited due to its intrinsic difficulty. In this chapter, we proposed the Double Residual Compres- sion SGD named DORE to compress both gradient and model communication that can mitigate this bottleneck prominently. The theoretical analyses suggest good convergence rate of DORE under weak assumptions. Furthermore, DORE is able to reduce 95% of the communication cost while maintaining similar convergence rate and model accuracy compared with the full-precision SGD. 44 CHAPTER 4 PROX-LEAD: A LINEAR CONVERGENT CONVERGENT ALGORITHM WITH EFFICIENT COMMUNICATION FOR COMPOSITE PROBLEMS 4.1 Introduction In recent years, the communication cost has become the bottleneck in the distributed training of machine learning models, given that the computation becomes much faster with powerful computing devices such as GPUs and TPUs. Therefore, the communi- cation efficiency gains increasing attention in the algorithm design. On the one hand, decentralized communication has been shown to be an effective and important direc- tion for improving communication efficiency [28]. On the other hand, various com- munication compression techniques, such as quantization and sparsification, which are originally developed for centralized settings [12, 13, 14, 15, 16, 17, 19, 18], have been shown to be significant in reducing the communication cost for decentralized optimiza- tion [20, 21, 24, 25, 33, 81]. These works have exhibited the great potential of decentralized optimization with communication compression in speeding up decentralized machine learning. The composite problem (1.2) abstracts many important applications such as regular- ized empirical risk minimization in statistics and machine learning, and optimal con- trol of multi-agent systems. More specifically, we consider two different settings on the smooth components of the objective function, i.e., the general stochastic setting and the finite-sum setting. In the general stochastic setting, the problem follows (1.2) where the local functions {fi } are defined as the expectation over the general local sample distribu- tions {Di }. In the finite-sum setting, we consider the discrete distribution on local nodes, and the local functions {fi } are defined as the unweighted average over local samples, fi (x) = m1 m P j=1 fij (x), where each fij stands for the loss function defined on the jth batch 45 of samples at node i. In this work, we assume that the number of batches m is the same for all nodes for simplicity. 
Note that it can be easily generalized to the case when the nodes have different numbers of batches. Contribution. The main contribution of this chapter is a Proximal gradient LinEAr convergent Decentralized algorithm with compression, Prox-LEAD, which solves the problem (1.2). Specifically, the contributions can be summarized as follows: • We propose the first decentralized stochastic proximal gradient algorithm with com- pressed communication, Prox-LEAD. It converges linearly up to the neighborhood of the optimal solution in the general stochastic setting. In the finite-sum setting, we establish the linear convergence to the exact solution for Prox-LEAD’s two variance reduction variants, i.e., Loopless SVRG and SAGA. • We provide a rigorous theory of Prox-LEAD on the compressed communication and convergence complexities in different settings on the smooth component of the problem. Without the restriction on data heterogeneity and gradient bound- edness, Prox-LEAD maintains a comparable convergence rate compared to the un- compressed counterpart. Our theorems indicate that Prox-LEAD works with arbi- trary compression precision. Moreover, with reasonably aggressive compression, Prox-LEAD significantly reduces the communication cost almost for free. • Our algorithmic framework builds bridges between many known algorithms. With- out involving compression, it provides stochastic and variance reduction variants of some deterministic algorithms, and it has a better convergence complexity against the existing non-accelerated stochastic decentralized algorithms. When the nons- mooth regularizer is absent, it reduces to LEAD and achieves a better convergence complexity. The framework also enlightens other primal-dual algorithms to apply 46 compressed communication and reduce the impact of inexact primal and dual iter- ations. • We present comprehensive experiments to verify our theorems and the effectiveness of the proposed algorithm in different settings. The comparison with state-of-art al- gorithms demonstrates the superiority of Prox-LEAD and its stochastic variants. Moreover, it is robust to parameter tuning, which exhibits great advantages in prac- tice. The rest of this chapter is organized as follows. Section 4.1.1 and Section 4.1.2 sum- marize related works and introduce the notations used in this chapter, respectively. In Section 4.2, the proposed algorithm Prox-LEAD, motivation and derivation, as well as convergence complexities are presented. The assumptions on regularity are introduced in Section 4.3, and the convergence analyses and major theorems are illustrated in Sec- tion 4.4. Importantly, we also detail the connection with existing algorithms in Section B.2. Finally, numerical experiments are presented in Section 4.5. 4.1.1 Related Work Many algorithms were proposed to solve the decentralized optimization of the average of functions defined over agents in networks. The early decentralized algorithms can be traced back to the work by [82]. DGD [26] is the most classical decentralized algo- rithm. It is intuitive and simple but converges slowly due to the diminishing stepsize that is needed to obtain the optimal solution [27]. Its stochastic version D-PSGD [28] has been shown effective for training nonconvex deep learning models. 
Algorithms based on primal-dual formulations or gradient tracking are proposed to eliminate the convergence bias in DGD-type algorithms and improve the convergence rate, such as D-ADMM [83], DLM [29], EXTRA [1], NIDS [2], D2 [84], Exact Diffusion [53], NEXT [50], DIGing [45], Harnessing [46], SONATA [85], GSGT [86], OPTRA [87], etc. There are also dual-based 47 methods which apply gradient methods on the dual formulation [55, 88, 56]. These algo- rithms are able to achieve optimal bounds but requires computing the non-trivial gradient of the dual function. To improve the communication efficiency, communication compression is first applied to decentralized settings by [20]. It proposes two algorithms, i.e., DCD-SGD and ECD- SGD, which require compression of high accuracy and are not stable with aggressive com- pression. [21, 22] introduce QDGD and QuanTimed-DSGD to achieve exact convergence with small stepsize and the convergence is slow. DeepSqueeze [23] compensates the com- pression error to the compression in the next iteration. Motivated by the quantized av- erage consensus algorithms, such as the work in [89], the quantized gossip algorithm Choco-Gossip [24] converges linearly to the consensual solution. Combining Choco- Gossip and D-PSGD leads to a decentralized algorithm with compression, Choco-SGD, which converges sublinearly under the strong convexity and gradient boundedness as- sumptions. Its nonconvex variant is further analyzed in [25]. A new compression scheme using the modulo operation is introduced in [90] for decentralized optimization. A gen- eral algorithmic framework aiming to maintain the linear convergence of distributed opti- mization under compressed communication is considered in [91]. It requires a contractive property that is not satisfied by many decentralized algorithms including the algorithms proposed in this chapter. A preliminary version of this work [33] proposes the first linear convergent decentralized algorithm with compression, LEAD. However, the composite problem is not considered in LEAD, and LEAD is deficient in dealing with stochastic gra- dients. [81] introduces a linear convergent algorithm for decentralized optimization with communication compression based on a primal-dual decentralized algorithm [92]. Com- munication compression is also proposed for gradient tracking algorithms [93, 94, 95], but they require double communication cost since two vectors need to be transmitted in each communication run. Variance reduction techniques such as SAG [96], SVRG [97], SAGA [98] and Loop- 48 less SVRG [99] have been introduced to accelerate stochastic optimization problems with finite-sum structure. SVRG-type gradient estimator requires more gradient evaluation but they are memory friendly. SAGA reduces the number of gradient evaluation in each iteration but it requires more memory space. Variance reduction has been applied to de- centralized optimization [100, 101, 102, 103, 104, 81]. Decentralized algorithms such as PG-EXTRA [57] and NIDS [2] are proposed for com- posite optimization, and the sublinear rate is proved when the smooth component is strongly convex and smooth. [58] introduces a proximal gradient algorithm (P2D2), [105] enhances the convergece rate of SONATA, and the linear convergence rate is proved in both literature when the nonsmooth component is shared across all nodes. 
A proximal unified decentralized algorithm (PUDA) [106] and another proximal decentralized algo- rithmic framework [107] unify many existing algorithms and establish linear convergence for composite optimization when the nonsmooth component is shared. A decentralized accelerated proximal gradient descent algorithm in proposed in [108]. However, commu- nication compression and stochastic optimization are not considered in these algorithms. 4.1.2 Notation We clarify commonly used notation in this section. We use bold lower-case letters to denote column vectors in Rp and bold upper-case letters for matrices in Rn×p . The lower- case letter with a subscript will be the corresponding row of a matrix denoted by the same letter in the upper-case. For example, in the algorithm, we use xi ∈ Rp for the local copy of the model parameters at node i and X = [x1 , x2 , · · · , xn ]⊤ is the matrix, whose rows are these local copies. Throughout the chapter, without any specification, we use ⟨·, ·⟩ as standard vector/matrix inner product, whose form is dependent on context. Similarly, p ∥ · ∥P is a vector/matrix norm defined as ⟨·, P(·)⟩ for a positive semi-definite matrix P ∈ Rn×n , with some specific domain. We use M† to denote the pseudo-inverse of a matrix M ∈ Rn×n . If M is symmetric, we use λmax (M), λi (M) and λmin (M) to denote the 49 largest, the ith-largest and the smallest nonzero eigenvalues of M, respectively. For the stochastic approximate of the local gradient, we use ∇fi (xi , ξi ) ∈ Rp as the estimate of the deterministic gradient ∇fi (xi ) ∈ Rp of xi at node i. With the collection of all local estimates {∇fi (xi , ξi )}ni=1 for {∇fi (xi )}ni=1 , we define ∇F(X) ∈ Rn×p and ∇F(X, ξ) ∈ Rn×p as the compact matrix form of them. The commonly used all-zero vector(matrix) and all-one vector are 0 ∈ Rp (Rn×p ) and 1 ∈ Rn respectively. The proximal operator with 1 parameter η > 0 of a function r : Rp → R is proxηr = arg minz∈Rp r(z) + 2η ∥z − x∥2 . Finally, we use [n] to replace {1, · · · , n} for abbreviation. 4.2 The Proposed Algorithms An equivalent form of the problem (1.2) reformulating the decentralized constraint via a mixing matrix is provided as follows: Xn Xn X∗ = arg min fi (xi ) + r(xi ), (4.1) (I−W)X=0 i=1 i=1 X∈Rn×p | {z } | {z } =:F(X) =:R(X) where X = [x1 , · · · , xn ]⊤ is the collection of local xi s and W is a symmetric matrix which restricts the feasible region of the above problem to the subspace {1x⊤ ∈ Rn×p | ∀x ∈ Rp }. The optimal solution X∗ = 1(x∗ )⊤ is consensual and provides an optimal solution x∗ ∈ Rp to the problem (1.2). The detailed assumptions on W are shown below. Assumption 4.2.1 (Mixing matrix). The graph G = {V, E} is undirected and connected with V = [n] := {1, 2, · · · , n}. The mixing matrix W = [wij ] ∈ Rn×n is symmetric and satisfies 1. wij = 0 if i ̸= j, (i, j) ̸∈ E and wij > 0 if (i, j) ∈ E. 2. −1 < λn (W) ≤ · · · ≤ λ2 (W) < λ1 (W) = 1 and W1 = 1. We propose a Proximal gradient LinEAr convergent Decentralized algorithm with compressed communication, Prox-LEAD, to solve the problem (4.1). Our algorithm ex- tends LEAD proposed in [33] to deal with the nonsmooth component via the proximal 50 operator, and to accelerate the stochastic optimization in the finite-sum setting. Moreover, when the compression is absent and the gradient oracle returns full gradient, Prox-LEAD is shown to be covered by the proximal unified decentralized framework in [106]. 
Algo- rithm 4.1 illustrates the prototype of Prox-LEAD in the compact form, and it reduces to LEAD by setting R = 0. In Algorithm 4.1, line 6 uses the gradient from the stochastic gradient oracle to create an auxiliary variable Z as the information to be communicated. The COMM procedure compresses the difference between Z and a state variable H by a unbiased operator satis- fying the following condition. Assumption 4.2.2 (Unbiased compression operator). The stochastic operator Q : Rp → Rp is unbiased, i.e., EQ(x) = x, and there exists C ≥ 0 such that E∥x − Q(x)∥2 ≤ C∥x∥2 , ∀x ∈ Rp . In particular, when C = 0, we treat Q as the identity operator. The benefit of this difference compression is to reduce the compression error asympo- totically [17, 33]. Since the following variance of the stochastic estimation is dependent on the distance between Zk+1 and Hk , E∥Ẑk+1 − Zk+1 ∥2 = E∥Q(Zk+1 − Hk ) − (Zk+1 − Hk )∥2 , | {z } =O(∥Zk+1 −Hk ∥) the variance of the compression will vanish as Z and H converge to the same point. The updates in line 8 and line 9 essentially compensate the compression error locally. Note that the COMM procedure is first proposed in the preliminary version of this work [33], and please to refer to Section 3.1 in [33] for a detailed explanation for the reduced com- pression error and implicit error compensation. Lines 8 and 9 of Algorithm 4.1 proceed in parallel after the compressed communi- cation is exchanged. The proximal mapping in Line 10 is applied to the rows of Vk+1 51 separately, which is defined as      (xk+1 1 ) ⊤   proxηr (v1k+1 )⊤   ..   = ..  .   .   .      ⊤ (xk+1 n ) proxηr (vnk+1 )⊤ Algorithm 4.1: Prox-LEAD Input: Stepsize η, parameter α, γ, initial X0 , H1 , D1 = 0 Output: XK or 1/n ni=1 XK P i 1: H1w = WH1 2: Z1 = X0 − η∇F(X0 , ξ 0 ) 3: X1 = proxηR (Z1 ) 4: for k = 1, 2, · · · , K − 1 do 5: Gk = SGO(Xk ) 6: Zk+1 = Xk − ηGk − ηDk 7: Ẑk+1 , Ẑk+1 w ,H k+1 , Hk+1 w = COMM(Zk+1 , Hk , Hkw ) γ 8: D k+1 k = D + 2η (Ẑ k+1 − Ẑk+1 W ) 9: Vk+1 = Zk+1 − γ2 (Ẑk+1− Ẑk+1 W ) 10: Xk+1 = proxηR Vk+1 11: end for Algorithm 4.2: Compressed Communication Procedure (COMM) 1: procedure COMM(Zk+1 , Hk , Hkw ) 2: Qk = Q(Zk+1 − Hk ) ▷ Compression k+1 k k 3: Ẑ =H +Q 4: Ẑw = Hkw + WQk k+1 ▷ Communication 5: Hk+1 = (1 − α)Hk + αẐ 6: Hk+1 w = (1 − α)Hkw + αẐw 7: Return: Ẑk+1 , Ẑk+1 w ,H k+1 , Hk+1 w 8: end procedure 4.2.1 Prox-LEAD as Inexact PUDA The problem (4.1) can be reformulated into the following three operators splitting 1 minimize F(X) + ι0 (B 2 X) + R(X), (4.2) X∈R n×p 52 The finite-sum setting The general setting Loopless SVRG SAGA Sample ξi ∼ Di Sample l ∈ [m] ∼ Pi randomly Sample ω ∈ {0, 1} ∼ Bernoulli (p), 1 1 gi = (∇fil (xi ) − ∇fil (x̃il )) gi = (∇fil (xi ) − ∇fil (x̃i )) mpil gi = ∇fi (xi , ξi ). mpil m 1 X + ∇fi (x̃i ), + ∇fij (x̃ij ), m j=1 x̃i = ω · xi + (1 − ω) · x̃i . x̃il = xi . Table 4.1: Stochastic gradient oracle (SGO). where ι0 is the indicator function taking zero value at 0 and infinity elsewhere. Many existing schemes such as Condat-Vu in [109, 110], PDFP in [111] and PD3O in [112] are applicable to this problem but we choose to adapt the inexact PDHG with a single proximal gradient step and derive the following iteration which is not the realiza- tion of any scheme mentioned above.   k+1 X  = Xk − η∇F(Xk ) − ηDk , λ   k+1 D = Dk + (I − W)Xk+1 ,   2  k+1 V = Xk − η∇F(Xk ) − ηDk+1 (4.3)    ηλ  = I− (I − W) Xk+1 ,  2    Xk+1 = proxηR (Vk+1 ). 
Compared to LEAD, the third step is reformulated to fully depend on X but the num- ber of communication is unchanged, and a proximal map is applied on V directly in the final step. The iteration (4.3) can be shown as a special case of PUDA in [106] which has global linear convergence for strongly convex F. The benefit of this iteration is to maintain the consensus of X in optimality, which further implies the consensus of V and X. As shown in Section 4.4, the consensus of X is the key to the linear convergence of Prox-LEAD. It also explains why we need the same nonsmooth function r for all nodes. 53 If we compress the only communication step involving X via COMM procedure, (D, V) is updated by the inexact information with the controllable compression error. Therefore, we regard Prox-LEAD as inexact PUDA in terms of the inexact dual and prox- imal steps. 4.2.2 Stochastic Gradient Oracle As shown in Algorithm 4.1, Prox-LEAD uses a stochastic gradient oracle (SGO) to esti- mate the gradient, and different stochastic estimators are listed in Table 4.1 for gradient estimation. The stochastic gradient oracle returns three types of estimation. In the general setting, each node uses sample distribution Di to provide an unbiased stochastic gradient and the variance exists. In the finite-sum setting, we assume each node will construct a discrete distribution Pi = {pil : l ∈ [m]} for m mini-batches and the stochastic gradi- ent will be corrected by two different variance reduction schemes: Loopless SVRG and SAGA. For Loopless SVRG (LSVRG), each node will have a reference point x̃i where the full gradient is evaluated after a random period. A random variable l ∈ [m] will be sam- pled first with distribution Pi , then a Bernoulli random variable ω will be sampled to determine whether the reference point will be update or not. If ω = 1, the reference is replaced by the latest xi , otherwise unchanged. For SAGA, each node will have m ref- erence points, {x̃ij : j ∈ [m]}. After the index l is sampled, x̃il will be replaced by the latest xi while remaining reference points are unchanged. Both schemes use reference points to correct the gradient by variance reduction. Loopless SVRG is memory-friendly but requires more gradient evaluations. Empirically, SAGA converges faster in terms of the number of gradient evaluations, but it requires more memory space. The following Table 4.2 summarizes the convergence complexity of Prox-LEAD in different settings. 54 Algorithm Convergence complexity Prox-LEAD O(C e + κf + κg + Cκf κg ) Theorem 4.4.1 Prox-LEAD LSVRG e + κf + κg + Cκf κg + p−1 ) O(C Theorem 4.4.2 Prox-LEAD SAGA O(C e + κf + κg + Cκf κg + m) Theorem 4.4.3 Table 4.2: Summary of the convergence compleixty for Prox-LEAD to achieve ϵ-accuracy with fiexed stepsizes. The first row is the complexity with the full gradient. We define condition numbers, κf and κg , as follows L λmax (I − W) κf = , κg = , µ λmin (I − W) where L and µ are regularity constants of the objective function assumed in Assump- tion 4.3.2. When the compression procedure and the regularizer are removed, the com- plexities of LEAD LSVRG and LEAD SAGA are reduced to O(κ e f + κg + p−1 ) and O(κ e f+ κg + m) respectively, which are better than LessBit-Option D in [81] in terms of the condi- tional numbers, and it improves over the stochastic LEAD in [33]. 
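To make the finite-sum oracle in Table 4.1 concrete, the following is a minimal NumPy sketch (illustrative only; the class name and interface are ours) of the Loopless SVRG estimator for a single node with uniform sampling probabilities p_il = 1/m.

import numpy as np

class LooplessSVRG:
    # Loopless SVRG oracle for one node (cf. Table 4.1), uniform p_il = 1/m
    def __init__(self, grad_fns, x0, p=0.1, rng=np.random.default_rng()):
        self.grad_fns = grad_fns              # the m batch-gradient functions of f_ij
        self.m = len(grad_fns)
        self.p = p                            # reference refresh probability
        self.rng = rng
        self.x_ref = x0.copy()                # reference point x_tilde_i
        self.full_ref = self._full_grad(self.x_ref)

    def _full_grad(self, x):
        return sum(g(x) for g in self.grad_fns) / self.m

    def gradient(self, x):
        l = self.rng.integers(self.m)         # sample a batch index uniformly
        g = self.grad_fns[l](x) - self.grad_fns[l](self.x_ref) + self.full_ref
        if self.rng.random() < self.p:        # omega ~ Bernoulli(p): refresh the reference
            self.x_ref = x.copy()
            self.full_ref = self._full_grad(self.x_ref)
        return g

As described above, the SAGA variant keeps one reference point per batch and replaces only the sampled one at each call, trading extra memory for fewer gradient evaluations per iteration.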
4.3 Assumptions on Regularity In the general stochastic setting, each fi (x) in (1.2) is the expectation of the local loss func- tion under the sample distribution Di , and we make the following inter-node assumption. Assumption 4.3.1 (Locally bounded gradient variance). In the general stochastic setting, each local stochastic gradient ∇fi (x, ξi ) is an unbiased estimate, i.e., Eξi ∇fi (x, ξi ) = ∇fi (x), and satisfies E∥∇fi (x∗ , ξi ) − ∇fi (x∗ )∥2 ≤ σi2 , where x∗ is the optimal solution to the problem (1.2). This locally bounded variance at the optimal point is strictly weaker than the uni- formly bounded variance assumption [33]. 55 Given a smooth convex function f , we define the Bregman distance with respect to f as Vf (x, y) = f (x) − f (y) − ⟨∇f (y), x − y⟩, ∀x, y ∈ Rp . With the above definition, we impose different assumptions to the regularity of smooth function component in two settings. Assumption 4.3.2 (Strong convexity and smoothness). Each fi is a smooth, µ-strongly con- vex function, i.e., ∀x, y ∈ Rp , µ fi (x) − fi (y) − ⟨∇fi (y), x − y⟩ ≥ ∥x − y∥2 . (4.4) | {z } 2 =Vfi (x,y) In the general stochastic setting, each fi is L-smooth in expectation, i.e., ∀x, y ∈ Rp , E∥∇fi (x, ξi ) − ∇fi (y, ξi )∥2 ≤ 2L[fi (x) − fi (y) − ⟨∇fi (y), x − y⟩] . (4.5) | {z } =2LVfi (x,y) In the finite-sum setting, the L-smoothness is imposed on each fij instead, i.e., ∀x, y ∈ Rp , ∥∇fij (x) − ∇fij (y)∥2 ≤ 2L[fij (x) − fij (y) − ⟨∇fij (y), x − y⟩] . (4.6) | {z } =2LVfij (x,y) Note the inequalities in (4.4) and (4.5) are equivalent to, ∀X, Y ∈ Rn×p , µ VF (X, Y) = F(X) − F(Y)− ⟨∇F(Y), X − Y⟩ ≥ ∥X − Y∥2 , 2 Xn 2 E∥∇F(X, ξ) − ∇F(Y, ξ)∥ ≤ 2L Vfi (xi , yi ) = 2LVF (X, Y). i=1 Remark 4.3.1. In [76], the above two types of L-smoothness assumptions are discussed, and the latter is used to derive an improved convergence rate of SGD. The expected L-smoothness is shown to be significantly weaker than the most commonly-used L-smoothness of fi (x, ξi ) for all ξi . 4.4 Convergence Analysis In this section, we present the convergence of Prox-LEAD in both stochastic scenarios. We first show two fundamental lemmas regarding the conditional expectation on the 56 compression operator, then we present the two scenarios in Sections 4.4.1 and 4.4.2, re- spectively. More specifically, in Section 4.4.1, the linear convergence to the neighborhood will be shown under the general stochastic setting, while in Section 4.4.2, two variance reduction schemes are used to exploit the exact linear convergence in the problems with finite-sum structure. The stochastic actions such as compression and gradient estimation generate two se- quences of σ-algebra where the stochastic variables in this procedure are adapted. We use F k to denote the σ-algebra of gradient estimation at kth step and Hk is the σ-algebra of stochastic compression at the same step. {F k } and {Hk } satisfy F 1 ⊂ H1 ⊂ F 2 ⊂ H2 ⊂ · · · ⊂ F k ⊂ Hk ⊂ · · · . With these notations, we can clarify the stochastic dependencies among the variables gen- erated by the algorithm. For example, tuple (Gk , Zk+1 ) is measurable in F k and tuple (Ẑk+1 , Hk+1 , Dk+1 , Vk+1 , Xk+1 ) is measurable in Hk . Throughout the section, we use E to denote the conditional expectations EF k and EHk given the context for simplicity. Then we define some auxiliary constants related to the optimal solution X∗ to the problem (4.1). X∗ is consensual, i.e., each row x∗ solves the problem (1.2). We let z∗ = x∗ − (η/n) ni=1 ∇fi (x∗ ), which is in the pre-image of proxηr at P x∗ . 
In addition, we let η ⊤ Z∗ = 1(z∗ )⊤ = X∗ − 11 ∇F(X∗ ), (4.7) n 1 D∗ = (X∗ − Z∗ ) − ∇F(X∗ ) η  1  = − I − 11⊤ ∇F(X∗ ). (4.8) n To show the convergence of the proposed algorithm, we characterizes the decrease of ∥Zk − Z∗ ∥, ∥Dk − D∗ ∥(I−W)† , ∥Xk − X∗ ∥, and ∥Hk − Z∗ ∥. The convergence of Hk to Z∗ shows the decrease of the compression error to zero. The following lemma shows the connection of those values for the k and k + 1-th iteration. 57 Lemma 4.4.1 (One-step progress). Let {(Zk , Dk , Xk )} be the sequence generated from Algo- rithm 4.1. Under Assumption 4.2.1, taking the expectation conditioned on the stochastic compres- 2 sion operator at the k-th iteration, we have for any γ ≤ λmax (I−W) , ∥Zk+1 − Z∗ ∥2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + η 2 ∥Dk − D∗ ∥2 − 2η⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ ⟩, (4.9) E∥Dk+1 − D∗ ∥2(I−W)† γ2 γ2 = ∥Dk − D∗ ∥2(I−W)† − γ∥Dk − D∗ ∥2 + ∥Z k+1 − Z ∗ 2 ∥ I−W + E∥Ẑk+1 − Zk+1 ∥2I−W 4η 2 4η 2 γ + ⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ )⟩, (4.10) η and γ2 E∥Xk+1 − X∗ ∥2 ≤ ∥Zk+1 − Z∗ ∥2I− γ (I−W) + E∥Ẑk+1 − Zk+1 ∥2(I−W)2 . (4.11) 2 4 Proof. (i) The equality (4.9) is shown from Line 6 of Algorithm 4.1 and (4.8) directly. (ii) Note that (I − W)Z∗ = 0. Then from Line 8 of Algorithm 4.1, we have γ Dk+1 − D∗ = Dk − D∗ + (I − W)(Ẑk+1 − Z∗ ), 2η which gives E∥Dk+1 − D∗ ∥2(I−W)† γ γ2 = ∥Dk − D∗ ∥2(I−W)† + ⟨Dk − D∗ , Zk+1 − Z∗ ⟩ + 2 ∥Zk+1 − Z∗ ∥2I−W η 4η 2 γ + 2 E∥Ẑk+1 − Zk+1 ∥2I−W 4η γ = ∥Dk − D∗ ∥2(I−W)† − γ∥Dk − D∗ ∥2 + ⟨Dk − D∗ , Xk − X∗ − ηGk + η∇F(X∗ )⟩ η 2 2 γ γ + 2 ∥Zk+1 − Z∗ ∥2I−W + 2 E∥Ẑk+1 − Zk+1 ∥2I−W , 4η 4η where the second equality comes from Line 6 of Algorithm 4.1. (iii) From the definition of Z∗ , we have X∗ = proxηR (Z∗ ). 58 Therefore, E∥Xk+1 − X∗ ∥2 = ∥proxηR (Vk+1 ) − proxηR (Z∗ )∥2 ≤ E∥Vk+1 − Z∗ ∥2 γ  2 = E Zk+1 − Z∗ − (I − W)(Ẑk+1 − Z∗ ) 2  γ  2 γ2 = I − (I − W) (Zk+1 − Z∗ ) + E∥(I − W)Ẑk+1 − Zk+1 ∥2 2 4 2 γ ≤ ∥Zk+1 − Z∗ ∥2I− γ (I−W) + E∥(I − W)Ẑk+1 − Zk+1 ∥2 , 2 4 where the first inequality comes from the non-expansiveness of proxηR and the last in- equality follows from  γ 2 γ2 I − (I − W) = I − γ(I − W) + (I − W)2 2 4 γ γ 1  γ  1 = I − (I − W) − (I − W) 2 I − (I − W) (I − W) 2 2 2 2 γ ≼ I − (I − W), 2 since γ2 (I − W) ≼ I. The inequality (4.11) is obtained. The next lemma builds the critical inequality for Algorithm 4.1 and serves as the key step in the proofs. We define the following two positive (semi)definite matrices γ P := I − (I − W), Q := (I − W)† 2 to measure the convergence. Lemma 4.4.2 (The key inequality). Under Assumptions 4.2.1 and 4.2.2, we choose α ∈ (0, (1 + C)−1 ) such that ∆(α) := α − (1 + C)α2 > 0 and  2  1 2∆(α)ηµ  η ∈ 0, , γ ∈ 0, . µ λmax (I − W) ∆(α)ηµ + 4αC 59 Then the sequence {(Ẑk , Zk , Dk , Hk )} generated by Algorithm 4.1 satisfies  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 , (4.12) where the expectation is conditioned on the stochastic compression at k-th step and (1 − γ2 λmax (I − W))ηµ M := > 0. 2αC In particular, when C = 0, i.e., Ẑk = Zk , (4.12) still holds as  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q + CM E∥Hk+1 − Z∗ ∥2 2 γ 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I + (1 − α)CM ∥Hk − Z∗ ∥2 (4.13) γ 2 with (1 − γ2 λmax (I − W))ηµ CM := > 0. 2α 2 Proof. 
Letting (4.9) + (4.10)× 2ηγ , we get 2η 2 ∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q γ 2η 2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q − η 2 ∥Dk − D∗ ∥2 γ γ + E∥Ẑk+1 − Zk+1 ∥2I−W 2 2η 2 = ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I + E∥Ẑk+1 − Zk+1 ∥2I−P , (4.14) γ 2 60 From the update of H in Line 5 of the compression procedure ‘COMM’, we obtain E∥Hk+1 − Z∗ ∥2 = ∥(1 − α)(Hk − Z∗ ) + α(Zk+1 − Z∗ )∥2 + α2 E∥Qk − EQk ∥2 = (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − α(1 − α)∥Zk+1 − Hk ∥2 + α2 E∥Ẑk+1 − Zk+1 ∥2 ≤ (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − α(1 − α)∥Zk+1 − Hk ∥2 + α2 CE∥Zk+1 − Hk ∥2 = (1 − α)∥Hk − Z∗ ∥2 + α∥Zk+1 − Z∗ ∥2 − ∆(α)∥Zk+1 − Hk ∥2 where the second equality uses ∥(1 − α)x + αy∥2 = (1 − α)∥x∥2 + α∥y∥2 − α(1 − α)∥x − y∥2 , and the inequality is from Assumption 4.2.2. Multiplying the H-inequality by C and adding it to the following inequality ∆(α)E∥Ẑk+1 − Zk+1 ∥2 ≤ ∆(α)C∥Zk+1 − Hk ∥2 , we have ∆(α)E∥Ẑk+1 − Zk+1 ∥2 + CE∥Hk+1 − Z∗ ∥2 ≤ (1 − α)C∥Hk − Z∗ ∥2 + αC∥Zk+1 − Z∗ ∥2 αC ≤ (1 − α)C∥Hk − Z∗ ∥2 + ∥Zk+1 − Z∗ ∥2P , (4.15) λmin (P) if γ < 2 λmax (I−W) since P ≻ 0 and λmax (P−1 ) = λ−1 min (P). (1− γ λ (I−W))ηµ When C > 0, let M = λmin2αC(P)ηµ = 2 max 2αC > 0, then (4.14) + M × (4.15) gives  ηµ  k+1 2η 2 1− ∥Z − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2(∆(α)M −1)I+P 2η 2 ≤ ∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 . Notice that for the given range of γ, we have ∆(α)M I ≽ 2(I − P), 61 hence (4.12) is proved. Lastly, when C = 0, we have Ẑk+1 = Zk+1 for all k. Multiplying the H-inequality by (1− γ2 λmax (I−W))ηµ CM := 2α and adding it to (4.14), we complete the proof. 4.4.1 The General Stochastic Setting In the general stochastic setting with fi (xi ) = Eξi ∼Di fi (xi , ξi ), the assumptions on fi and the variance of the stochastic gradient at the optimal point allow us to show the linear convergence of Algorithm 4.1 up to a neighborhood of the optimal point. Theorem 4.4.1 (Prox-LEAD). Under Assumptions 4.2.1–4.3.2, let {(Xk , Dk , Zk , Ẑk , Hk )} be the sequence generated by Algorithm 4.1 and σ 2 = n1 ni=1 σi2 . Set α ∈ (0, (1 + C)−1 ) such that P ∆(α) = α − (1 + C)α2 > 0 and we can choose  1 i  1 2∆(α)ηµ  η ∈ 0, , γ ∈ 0, , 2L λmax (I − W) ∆(α)ηµ + 4αC then, in total expectation, 1 1 2η 2 σ 2 EΦk+1 ≤ ρk EΦ1 + , n n 1−ρ where  ηµ  k 2η 2 k Φ := 1 − ∥Z − Z∗ ∥2P + ∥Dk − D∗ ∥2Q + CM ∥Hk − Z∗ ∥2 + ∥Ẑk − Zk ∥2I−P 2 γ and   ηµ γ γλmax (I − W) ρ = max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) < 1. 2 − ηµ 2 2 Proof. In Lemma 4.4.2, we derive the one-step progress inequality in expectation condi- tioned on the stochastic compression at kth step and we now focus on the term involving X and Gk in (4.12). 
62 Take the conditional expectation on stochastic gradient at kth step, we have E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 = ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + η 2 E∥∇F(Xk , ξ k ) − ∇F(X∗ )∥2 ≤ ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + 2η 2 E∥∇F(Xk , ξ k ) − ∇F(X∗ , ξ k )∥2 + 2η 2 E∥∇F(X∗ , ξ k ) − ∇F(X∗ )∥2 ≤ ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk ) − ∇F(X∗ )⟩ + 4η 2 LVF (Xk , X∗ ) + 2nη 2 σ 2 = ∥Xk − X∗ ∥2 − 2η⟨Xk − X∗ , ∇F(Xk )⟩ + 2η(F(Xk ) − F(X∗ ) − VF (Xk , X∗ )) + 4η 2 LVF (Xk , X∗ ) + 2nη 2 σ 2 = ∥Xk − X∗ ∥2 − 2η(F(X∗ ) − F(Xk ) − ⟨X∗ − Xk , ∇F(Xk )⟩) − 2η(1 − 2ηL)VF (Xk , X∗ ) + 2nη 2 σ 2 ≤ (1 − ηµ)∥Xk − X∗ ∥2 + 2nη 2 σ 2 , where the first equality uses the unbiasedness of stochastic gradient, the second inequal- ity follows the expected Lipschitz property in Assumption 4.3.2, and the last inequality is 1 due to the strong convexity and η ≤ 2L . Now we use the power property to take the conditional expectation on the stochastic gradient for (4.12) and plug the above inequality into it, then  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ (1 − ηµ)∥Xk − X∗ ∥2 + ∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM ∥Hk − Z∗ ∥2 + 2nη 2 σ 2 . 63 Taking the total expectation and using (4.11), we get  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 + E∥Ẑk+1 − Zk+1 ∥2I−P 2η 2 ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + E∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM E∥Hk − Z∗ ∥2 + (1 − ηµ)E∥Ẑk − Zk ∥2(I−P)2 + 2nη 2 σ 2 . (4.16) Let Φk be defined as above, by (4.16), we have  k+1 ηµ γ EΦ ≤ max 1 − , 1 − λmin (I − W), 1 − α, 2 − ηµ 2  γλmax (I − W) (1 − ηµ) EΦk + 2nη 2 σ 2 . 2 Finally, by taking the telescopic sum, we complete the proof. When there is no compression, Prox-LEAD is reduced to the special case of PUDA with the stochastic gradient. Corollary 4.4.1 shows the linear convergence to the neigh- borhood of the optimal solution and when the gradient is deterministic, the convergence rate matches that given in [106]. Corollary 4.4.1 (Stochastic PUDA). When there is no compression, i.e., C = 0, under Assump- 1  tions 4.2.1, 4.3.1 and 4.3.2, we can pick α = 1 and γ = 1. Then, for any η ∈ 0, 2L   ηµ λmin (I − W) EΦ k+1 ≤ max 1 − ,1 − EΦk + 2nη 2 σ 2 , 2 − ηµ 2 with  ηµ  k ∗ 2 ∗ 2  λmax (I − W)  ηµ k k Φ := 1 − 2 k ∥Z − Z ∥ I+W + 2η ∥D − D ∥Q + 1 − ∥Z − Z∗ ∥2 . 2 2 2 2 Proof. Similar to the proof of Theorem 4.4.1, when C = 0, we considers inequality (4.13) instead. Notice that in this case (4.11) becomes ∥Xk+1 − X∗ ∥2 ≤ ∥Zk+1 − Z∗ ∥2P . 64 Hence we can get  ηµ  2η 2 1− E∥Zk+1 − Z∗ ∥2P + E∥Dk+1 − D∗ ∥2Q 2 γ + CM E∥Hk+1 − Z∗ ∥2 2η 2 ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + E∥Dk − D∗ ∥2Q− γ I γ 2 + (1 − α)CM E∥Hk − Z∗ ∥2 + 2nη 2 σ 2 , (1− γ2 λmax (I−W))ηµ where CM = 2α . We can take α = 1 − ϵ for some ϵ ∈ (0, 1). Note that ∆(α) approaches 1 − ϵ as C goes 2 to 0. The upper bound of γ is reduced to λmax (I−W) , which is strictly greater than 1 due to Assumption 4.2.1. Hence we can take γ = 1 and the convergence rate becomes   ηµ λmin (I − W) max 1 − ,1 − ,ϵ . 2 − ηµ 2 Let ϵ approach 0 and notice that Hk = Zk for all k if α = 1, then we complete the proof. Corollary 4.4.1 can be viewed as the limit case of Theorem 4.4.1 with (1 − γ2 λmax (I − W))ηµ CM = , Ẑ = Z, H = Z. 2α 1 1 2 By taking α = 2(1+C) ,η = 2L and γ = (1+16Cκf )λmax (I−W) , Theorem 4.4.1 shows the conver- gence complexity of Prox-LEAD is O(C e +κf +κg +Cκf κg ) when full gradient is used. This complexity significantly improves that of LEAD, which is O((1 e + C)(κf + κg ) + Cκf κg ). 
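As a small illustration of the parameter choice behind this complexity statement, the sketch below computes α = 1/(2(1+C)), η = 1/(2L), and γ = 2/((1 + 16Cκ_f)λ_max(I − W)), together with κ_f and κ_g, for a given mixing matrix. The 8-node ring with weight 1/3 matches the setup of Section 4.5; the values of C, L, and µ are assumed for illustration.

import numpy as np

def prox_lead_params(C, L, mu, W):
    # parameter choice used for the complexity statement above
    eigs = np.linalg.eigvalsh(np.eye(W.shape[0]) - W)
    lam_max = eigs.max()
    lam_min_nz = eigs[eigs > 1e-12].min()     # smallest nonzero eigenvalue of I - W
    kappa_f, kappa_g = L / mu, lam_max / lam_min_nz
    alpha = 1.0 / (2.0 * (1.0 + C))
    eta = 1.0 / (2.0 * L)
    gamma = 2.0 / ((1.0 + 16.0 * C * kappa_f) * lam_max)
    return alpha, eta, gamma, kappa_f, kappa_g

# assumed example: 8-node ring with mixing weight 1/3, C = 1, L = 1, mu = 0.1
n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
print(prox_lead_params(C=1.0, L=1.0, mu=0.1, W=W))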
A complete comparison can be found in Table B.1.

4.4.2 Finite-sum Setting with Variance Reduction

In the finite-sum setting, we assume that each f_i is the average of m functions f_ij and impose the commonly assumed L-Lipschitz continuity on ∇f_ij. We keep using Φ^k defined in Theorem 4.4.1. The following theorems implicitly assume C > 0, while the convergence results extend to the case without compression. The detailed proofs of the two theorems are given in Sections B.3 and B.4 of the appendix.

Theorem 4.4.2 (Loopless SVRG). Under Assumptions 4.2.1, 4.2.2, and 4.3.2, for any p ∈ (0, 1), set p_ij = 1/m for all i ∈ [n] and j ∈ [m], and
η = 1/(6L),   α = 1/(2(1 + C)),   γ = 2 / (λ_max(I − W)(1 + 48Cκ_f)).
Then, in total expectation, we have
E Φ̃^{k+1} ≤ (1 − [max{12κ_f − 1, κ_g + 48Cκ_f κ_g, 1 + C, 6κ_f, 2/p}]^{−1})^{k+1} Φ̃^0,
where Φ̃^k := Φ^k + (2/(9pL)) Σ_{i=1}^n V_{f_i}(x̃_i^k, x^*).

Theorem 4.4.3 (SAGA). Under Assumptions 4.2.1, 4.2.2, and 4.3.2, set p_ij = 1/m for all i ∈ [n] and j ∈ [m], and
η = 1/(6L),   α = 1/(2(1 + C)),   γ = 2 / (λ_max(I − W)(1 + 48Cκ_f)).
Then, in total expectation, we have
E Φ̃^{k+1} ≤ (1 − [max{12κ_f − 1, κ_g + 48Cκ_f κ_g, 1 + C, 6κ_f, 2m}]^{−1})^{k+1} Φ̃^0,
where Φ̃^k := Φ^k + (2/(9L)) Σ_{i=1}^n Σ_{j=1}^m V_{f_ij}(x̃_ij^k, x^*).

Using Õ(·) as the abbreviation of O((·) log(1/ε)), we simplify the above two theorems into the following corollary.

Corollary 4.4.2. We can achieve ∥x_i^k − x^*∥^2 ≤ ε in expectation on each node after
• K = Õ(C + κ_f + κ_g + Cκ_f κ_g + p^{−1}) iterations for Loopless SVRG and
• K = Õ(C + κ_f + κ_g + Cκ_f κ_g + m) iterations for SAGA.

Proof. From the definition of Φ^k, we have
∥Z^k − Z^*∥_P^2 ≤ (12κ_f / (12κ_f − 1)) Φ^k
and
(γ^2/4) ∥Ẑ^k − Z^k∥_{(I−W)^2}^2 ≤ ∥Ẑ^k − Z^k∥_{I−P}^2 ≤ Φ^k.
Using (4.11) in Lemma 4.4.1, we have
(1/n) Σ_{i=1}^n E∥x_i^k − x^*∥^2 ≤ (1/n) E∥Z^k − Z^*∥_P^2 + (γ^2/(4n)) E∥Ẑ^k − Z^k∥_{(I−W)^2}^2 ≤ ((24κ_f − 1)/(n(12κ_f − 1))) EΦ^k.
Lastly, the convergence complexity follows from Theorem 4.4.2 and Theorem 4.4.3.

Remark 4.4.1. In particular, when there is no compression, i.e., C = 0, and the network is fully connected, i.e., κ_g = 1, the complexity of Loopless SVRG reduces to Õ(κ_f + p^{−1}), which matches that given in [99], and the complexity of SAGA reduces to Õ(κ_f + m), as shown in [98].

4.5 Numerical Experiments

In this section, we present numerical experiments to validate the convergence of the proposed algorithms, including LEAD in the smooth case and Prox-LEAD in the nonsmooth case, as well as their stochastic variants.

4.5.1 Experimental Setting

Baselines. To demonstrate the effectiveness of the proposed algorithms, we compare them with the following baselines: 1) two state-of-the-art decentralized algorithms with compression: Choco [24] and LessBit [81]; 2) three non-compressed algorithms: DGD [27], NIDS [2], and P2D2 [58]. Note that NIDS and P2D2 support the nonsmooth case. In the stochastic case, we also include LessBit with Option C (LessBit-SGD) and Option D (LessBit-LSVRG).

Setup. We consider eight machines connected in a ring topology network. Each agent can only exchange information with its two 1-hop neighbors, and the mixing weight is simply set as 1/3. For compression, we use the unbiased b-bit quantization method with ∞-norm scaling
Q_∞(x) := (∥x∥_∞ sign(x) / 2^{b−1}) · ⌊ 2^{b−1} |x| / ∥x∥_∞ + u ⌋,   (4.17)
where · is the Hadamard product, |x| is the elementwise absolute value of x, and u is a random vector uniformly distributed in [0, 1]^p. Only sign(x), the norm ∥x∥_∞, and the integers inside the floor brackets need to be transmitted. Note that this quantization method is similar to the quantization used in QSGD [13] and Choco [24], but we use the ∞-norm scaling instead of the 2-norm.
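To make the compression operator concrete, the following is a minimal NumPy sketch of the unbiased b-bit quantization with ∞-norm scaling in (4.17) and its blockwise application; the function names are hypothetical and this is only an illustration of the operator, not the implementation used for the experiments reported below.

import numpy as np

def quantize_inf_norm(x, b=2, rng=np.random.default_rng()):
    # Unbiased b-bit quantization with infinity-norm scaling, cf. (4.17):
    # only sign(x), the scalar ||x||_inf, and small integers are transmitted.
    norm = np.max(np.abs(x))
    if norm == 0.0:
        return np.zeros_like(x)
    levels = 2 ** (b - 1)
    u = rng.uniform(0.0, 1.0, size=x.shape)        # dithering noise in [0, 1)
    q = np.floor(levels * np.abs(x) / norm + u)    # stochastic rounding to integer levels
    return norm / levels * np.sign(x) * q          # unbiased: E[Q(x)] = x

def quantize_blockwise(x, b=2, block=256):
    # Apply the quantizer block by block (block size 256 in the experiments).
    out = np.empty_like(x, dtype=float)
    for s in range(0, x.size, block):
        out[s:s + block] = quantize_inf_norm(x[s:s + block], b)
    return out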
This small change brings a significant improvement in compression precision, as justified both theoretically and empirically in Appendix C of [33]. In this section, we choose 2-bit quantization and quantize the data in a blockwise manner (block size = 256). For all experiments, we tune the stepsize η in the range [0.01, 0.1]. For LEAD and Prox-LEAD, we simply fix α = 0.5 and γ = 1.0 in all experiments since they are very robust to the parameter settings. The parameters γ in Choco and θ in LessBit are tuned from {0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 1.0}.

Logistic regression. We consider a regularized logistic regression problem with a cross-entropy objective function
f(X) = −(1/m) Σ_{i=1}^m Σ_{j=1}^C y_{i,j} log(a_i^⊤ X_j) + λ_1 ∥X∥_1 + λ_2 ∥X∥_2^2,
where a_i ∈ R^p, y ∈ R^{m×C}, X ∈ R^{p×C}, and C is the number of classes. We use the MNIST dataset and distribute the samples equally to all machines in a non-iid way, sorted by their labels. Note that this is the heterogeneous data setting, where the data distributions of the agents are very different; it is more challenging than the homogeneous setting, where all agents share the same data distribution. In the smooth case, we set λ_1 = 0 and λ_2 = 0.005, and in the nonsmooth case, we set λ_1 = 0.005 and λ_2 = 0.005. In the stochastic case, the training data in each agent are evenly divided into 15 mini-batches. The performance is measured by the training suboptimality, i.e., ∥X^k − X^*∥_F^2, where X^* denotes the optimal solution.

[Figure 4.1 about here: training suboptimality in four panels — (a) full gradient versus epoch, (b) full gradient versus communicated bits, (c) stochastic gradient versus number of gradient evaluations, (d) stochastic gradient versus communicated bits — comparing DGD (32bit), NIDS (32bit), Choco (2bit), LessBit/LessBit-SGD/LessBit-LSVRG (2bit), and the 32bit and 2bit versions of LEAD, LEAD-SGD, LEAD-SAGA, and LEAD-LSVRG.]

Figure 4.1: Smooth logistic regression problem (λ_1 = 0). In the full gradient case ((a) and (b)), LEAD (2bit) and LessBit (2bit) converge similarly to NIDS (32bit) and LEAD (32bit) in terms of epochs/iterations, but they require much fewer bits in communication. In the stochastic case ((c) and (d)), the 2bit variants of LEAD match well with their 32bit variants in terms of the number of gradient evaluations, but they require much fewer bits. Note that LEAD-SAGA requires more memory and more iterations/communication than LEAD-LSVRG, but it computes only one gradient in each iteration, while LEAD-LSVRG computes at least two gradients in each iteration.
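For completeness, the communication and data setup described in Section 4.5.1 — a ring of eight agents with mixing weight 1/3 and a label-sorted non-iid split of the data — can be sketched as follows; the helper names are hypothetical and the snippet is only illustrative, not the code used to produce the figures.

import numpy as np

def ring_mixing_matrix(n=8, w=1.0 / 3.0):
    # Symmetric, doubly stochastic mixing matrix for a ring: each agent
    # averages itself and its two 1-hop neighbors with weight 1/3.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = w
        W[i, (i - 1) % n] = w
        W[i, (i + 1) % n] = w
    return W

def non_iid_split(labels, n_agents=8):
    # Sort sample indices by label and deal out equal contiguous chunks,
    # so each agent only sees a few classes (heterogeneous data setting).
    order = np.argsort(labels)
    return np.array_split(order, n_agents)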
[Figure 4.2 about here: training suboptimality in four panels — (a) full gradient versus epoch, (b) full gradient versus communicated bits, (c) stochastic gradient versus number of gradient evaluations, (d) stochastic gradient versus communicated bits — comparing P2D2 (32bit), NIDS (32bit), and the 32bit and 2bit versions of Prox-LEAD, Prox-LEAD-SGD, Prox-LEAD-SAGA, and Prox-LEAD-LSVRG.]

Figure 4.2: Nonsmooth logistic regression problem (λ_1 = 0.005). In the full gradient case ((a) and (b)), Prox-LEAD (2bit) converges similarly to NIDS and Prox-LEAD (32bit) in terms of epochs/iterations, but it requires much fewer bits than the other three algorithms. In the stochastic case ((c) and (d)), the 2bit variants match well with their 32bit variants in terms of the number of gradient evaluations, but they require much fewer bits. Note that Prox-LEAD-SAGA requires more memory and more iterations/communication than Prox-LEAD-LSVRG, but it computes only one gradient in each iteration, while Prox-LEAD-LSVRG computes at least two gradients in each iteration.

Smooth case. The experiments in the smooth case are shown in Fig. 4.1. From Fig. 4.1a, we observe that when full gradients are available, NIDS, LessBit and LEAD converge linearly to the optimal solution, while DGD and Choco suffer from the convergence bias. Fig. 4.1b demonstrates the benefit of communication compression when considering the suboptimality with respect to the communication bits. Note that the performance of LEAD (32bit) matches well with LEAD (2bit), which validates that the compression does not hurt the convergence of LEAD, while the communication bits are significantly reduced.

Fig. 4.1c and Fig. 4.1d show the performance of the algorithms with stochastic gradients. We make the following observations: 1) The performance of LEAD-SGD (2bit), LEAD-SAGA (2bit) and LEAD-LSVRG (2bit) matches well with that of LEAD-SGD (32bit), LEAD-SAGA (32bit) and LEAD-LSVRG (32bit), respectively, which indicates that the compression in LEAD does not hurt the convergence; 2) The linear convergence of the variance-reduction variants, such as LEAD-SAGA (2bit) and LEAD-LSVRG (2bit), verifies our theoretical analyses¹; 3) LEAD-SAGA (2bit) and LEAD-LSVRG (2bit) significantly outperform all baselines². The benefit of communication compression is clearly illustrated by Fig. 4.1d.

Nonsmooth case. The experiments in the nonsmooth case are shown in Fig. 4.2. Fig. 4.2a shows that Prox-LEAD (2bit) achieves linear convergence to the optimal solution with the full gradient, and its performance matches well with that of the non-compressed version Prox-LEAD (32bit). It also converges similarly to other non-compressed baselines such as P2D2 and NIDS. Fig. 4.2b demonstrates the tremendous advantage of communication compression in Prox-LEAD (2bit) when considering the communication bits.

Fig. 4.2c and Fig. 4.2d present the performance with stochastic gradients. It can be observed from Fig. 4.2c that: 1) Prox-LEAD-SAGA (2bit) and Prox-LEAD-LSVRG (2bit) maintain linear convergence with communication compression and stochastic gradients; 2) The compressed versions of Prox-LEAD all match well with the non-compressed versions. Fig. 4.2d shows that the advantages of communication compression in Prox-LEAD are also very significant in terms of communication bits.

¹ LEAD-SAGA outperforms LEAD-LSVRG in terms of the number of gradient computations since LEAD-SAGA computes fewer gradient evaluations in each iteration by sacrificing memory space. However, LEAD-LSVRG outperforms LEAD-SAGA in terms of communication bits since the extra gradient computation in LEAD-LSVRG does not increase the communication cost but improves the convergence speed. A similar phenomenon is also observed in the nonsmooth case in Fig. 4.2c and Fig. 4.2d.
² LEAD-SGD (2bit) and LEAD-LSVRG (2bit) outperform LessBit-SGD (2bit) and LessBit-LSVRG (2bit), respectively, which shows the advantage of the extra gradient descent step in LEAD, as discussed in Section B.2. Although LessBit-LSVRG (2bit) has the same communication cost in bits as LEAD-SAGA (2bit), LEAD-SAGA (2bit) requires about half as many gradient evaluations as LessBit-LSVRG (2bit).

To summarize, the experiments in this section verify the theoretical linear convergence of the proposed algorithm when a nonsmooth objective, stochastic gradients and communication compression are present. They also suggest state-of-the-art performance in comparison with strong baseline algorithms.

4.6 Conclusion

In this chapter, we consider the decentralized stochastic composite optimization problem. A decentralized proximal stochastic gradient algorithm with communication compression, Prox-LEAD, is proposed to improve the communication efficiency and convergence rates. We provide rigorous theoretical analyses and convergence complexities for the proposed algorithm in the general stochastic setting and the finite-sum setting. We establish the linear convergence rate with variance reduction schemes and well-controlled compression error. Both the theorems and the numerical experiments demonstrate the effectiveness of Prox-LEAD in reducing the communication cost and its advantages over existing algorithms. Moreover, our algorithmic framework builds bridges between many known algorithms, and it potentially enlightens the communication compression for other primal-dual algorithms.

APPENDICES

APPENDIX A

SUPPLEMENTARY OF CHAPTER 3

A.1 Compression Error

The property of the compression operator indicates that the compression error is proportional to the squared norm of the variable being compressed: E∥Q(x) − x∥^2 ≤ C∥x∥^2. We visualize the norm of the variables being compressed, i.e., the gradient residual (on the worker side) and the model residual (on the master side) for DORE, as well as the error-compensated gradient (on the worker side) and the averaged gradient (on the master side) for DoubleSqueeze. As shown in Figure A.1, the gradient and model residuals of DORE decrease exponentially, so the compression errors vanish. For DoubleSqueeze, however, these norms only decrease to a certain level, and the compression error does not vanish. This explains why algorithms without residual compression cannot converge linearly to the O(σ) neighborhood of the optimal solution in the strongly convex case.

[Figure A.1 about here: norm versus iteration on (a) the worker side and (b) the master side for DoubleSqueeze, DoubleSqueeze (topk), and DORE.]

Figure A.1: The norm of the compressed variable in linear regression.
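The bound E∥Q(x) − x∥^2 ≤ C∥x∥^2 above is also easy to check numerically for a concrete operator; the short sketch below (NumPy, with a hypothetical 2-bit ∞-norm quantizer in the spirit of (4.17)) estimates the ratio by Monte Carlo and is meant only as an illustration.

import numpy as np

def q_inf(x, b=2, rng=np.random.default_rng()):
    # Unbiased b-bit quantizer with infinity-norm scaling (cf. (4.17)).
    s = np.max(np.abs(x))
    levels = 2 ** (b - 1)
    return s / levels * np.sign(x) * np.floor(levels * np.abs(x) / s + rng.uniform(size=x.shape))

x = np.random.default_rng(1).standard_normal(10_000)
# Monte Carlo estimate of E||Q(x) - x||^2 / ||x||^2; the ratio stays bounded
# by a constant C that does not depend on the scale of x.
ratio = np.mean([np.sum((q_inf(x) - x) ** 2) for _ in range(200)]) / np.sum(x ** 2)
print(ratio)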
A.2 Parameter Sensitivity

Continuing the MNIST experiment in Section 3.5, we further conduct a parameter analysis on DORE. The basic settings for the block size, learning rate, α, β, and η are 256, 0.1, 0.1, 1, and 1, respectively. We change each parameter individually. Figures A.2, A.3, A.4, and A.5 demonstrate that DORE performs consistently well under different parameter settings.

[Figure A.2 about here: (a) training loss and (b) test loss versus epoch for block sizes 128, 256, and 512.]
Figure A.2: Training under different compression block sizes.

[Figure A.3 about here: (a) training loss and (b) test loss versus epoch for α ∈ {0.1, 0.2, 0.3}.]
Figure A.3: Training under different α.

[Figure A.4 about here: (a) training loss and (b) test loss versus epoch for β ∈ {0.8, 0.9, 1.0}.]
Figure A.4: Training under different β.

[Figure A.5 about here: (a) training loss and (b) test loss versus epoch for η ∈ {0.8, 0.9, 1.0}.]
Figure A.5: Training under different η.

A.3 DORE in the Smooth Case

Algorithm A.1: DORE with R(x) = 0
1: Input: stepsizes α, β, γ, η; initialize h^0 = h_i^0 = 0^d and x̂_i^0 = x̂^0 for all i ∈ {1, . . . , n}.
2: for k = 1, 2, · · · , K − 1 do
3:   For each worker i = 1, 2, · · · , n:
4:     Sample g_i^k such that E[g_i^k | x̂_i^k] = ∇f_i(x̂_i^k)
5:     Gradient residual: ∆_i^k = g_i^k − h_i^k
6:     Compression: ∆̂_i^k = Q(∆_i^k)
7:     h_i^{k+1} = h_i^k + α ∆̂_i^k
8:     {ĝ_i^k = h_i^k + ∆̂_i^k}
9:     Send ∆̂_i^k to the master
10:    Receive q̂^k from the master
11:    x̂_i^{k+1} = x̂_i^k + β q̂^k
12:   For the master:
13:    Receive the ∆̂_i^k's from the workers
14:    ∆̂^k = (1/n) Σ_{i=1}^n ∆̂_i^k
15:    ĝ^k = h^k + ∆̂^k {= (1/n) Σ_{i=1}^n ĝ_i^k}
16:    h^{k+1} = h^k + α ∆̂^k
17:    q^k = −γ ĝ^k + η e^k
18:    Compression: q̂^k = Q(q^k)
19:    e^{k+1} = q^k − q̂^k
20:    Broadcast q̂^k to the workers
21: end for
22: Output: any x̂_i^K

A.4 Proof of Theorem 3.4.1

We first provide two lemmas. We define E_Q, E_k, and E to be the expectation taken over the quantization, over the kth iteration conditioned on x̂^k, and the overall expectation, respectively.

Lemma A.4.1. For every i, we can estimate the first two moments of h_i^{k+1} as
E_Q h_i^{k+1} = (1 − α) h_i^k + α g_i^k,   (A.1)
E_Q ∥h_i^{k+1} − s_i∥^2 ≤ (1 − α) ∥h_i^k − s_i∥^2 + α ∥g_i^k − s_i∥^2 + α[(C_q + 1)α − 1] ∥∆_i^k∥^2.   (A.2)

Proof. The first equality follows from lines 5–7 of Algorithm 3.1 and Assumption 3.3.1. For the second inequality, we use the variance decomposition
E∥X∥^2 = ∥EX∥^2 + E∥X − EX∥^2   (A.3)
for any random vector X. By taking X = h_i^{k+1} − s_i, we get
E_Q ∥h_i^{k+1} − s_i∥^2 = ∥(1 − α)(h_i^k − s_i) + α(g_i^k − s_i)∥^2 + α^2 E_Q ∥∆̂_i^k − ∆_i^k∥^2.   (A.4)
Using the basic equality
∥λa + (1 − λ)b∥^2 + λ(1 − λ)∥a − b∥^2 = λ∥a∥^2 + (1 − λ)∥b∥^2   (A.5)
for all a, b ∈ R^d and λ ∈ [0, 1], as well as Assumption 3.3.1, we have
E_Q ∥h_i^{k+1} − s_i∥^2 ≤ (1 − α)∥h_i^k − s_i∥^2 + α∥g_i^k − s_i∥^2 − α(1 − α)∥∆_i^k∥^2 + α^2 C_q ∥∆_i^k∥^2,   (A.6)
which is the inequality (A.2).

Next, from the variance decomposition (A.3), we also derive Lemma A.4.2.

Lemma A.4.2. The following inequality holds:
E∥ĝ^k − h^*∥^2 ≤ E∥∇f(x̂^k) − h^*∥^2 + (C_q / n^2) Σ_{i=1}^n E∥∆_i^k∥^2 + σ^2 / n,   (A.7)
where h^* = ∇f(x^*) = (1/n) Σ_{i=1}^n h_i^* and σ^2 = (1/n) Σ_{i=1}^n σ_i^2.

Proof. By taking the expectation over the quantization of g, we have
E∥ĝ^k − h^*∥^2 = E∥g^k − h^*∥^2 + E∥ĝ^k − g^k∥^2 ≤ E∥g^k − h^*∥^2 + (C_q / n^2) Σ_{i=1}^n E∥∆_i^k∥^2,   (A.8)
where the inequality is from Assumption 3.3.1.
For ∥gk − h∗ ∥, we take the expectation over the sampling of gradients and derive E∥gk − h∗ ∥2 = E∥∇f (x̂k ) − h∗ ∥2 + E∥gk − ∇f (x̂k )∥2 σ2 ≤ E∥∇f (x̂k ) − h∗ ∥2 + (A.9) n by Assumption 3.4.1. Combining (A.8) with (A.9) gives (A.7). Proof of Theorem 3.4.1. We consider xk+1 − x∗ first. Since x∗ is the solution of (1.1), it satis- fies x∗ = proxγR (x∗ − γh∗ ). (A.10) 78 Hence E∥xk+1 − x∗ ∥2 =E∥proxγR (x̂k − γĝk ) − proxγR (x∗ − γh∗ )∥2 ≤E∥x̂k − x∗ − γ(ĝk − h∗ )∥2 =E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ĝk − h∗ ⟩ + γ 2 E∥ĝk − h∗ ∥2 =E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ + γ 2 E∥ĝk − h∗ ∥2 , (A.11) where the inequality comes from the non-expansiveness of the proximal operator and the last equality is derived by taking the expectation of the stochastic gradient ĝk . Combining (A.7) and (A.11), we have E∥xk+1 − x∗ ∥2 ≤E∥x̂k − x∗ ∥2 − 2γE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ n n γ2 X ∗ 2 Cq γ 2 X γ2 + k E∥∇fi (x̂ ) − hi ∥ + 2 E∥∆ki ∥2 + σ 2 . (A.12) n i=1 n i=1 n Then we consider E∥x̂k+1 − x∗ ∥2 . According to Algorithm 3.1, we have: EQ [x̂k+1 − x∗ ] = x̂k + βqk − x∗ = (1 − β)(x̂k − x∗ ) + β(xk+1 − x∗ + ηek ) (A.13) where the expectation is taken on the quantization of qk . By variance decomposition (A.3) and the basic equality (A.5), E∥x̂k+1 − x∗ ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + βE∥xk+1 + ηek − x∗ ∥2 − β(1 − β)E∥qk ∥2 + β 2 Cqm E∥qk ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + (1 + η 2 ϵ)βE∥xk+1 − x∗ ∥2 − β(1 − (Cqm + 1)β)E∥qk ∥2 1 + (η 2 + )βCqm E∥qk−1 ∥2 , (A.14) ϵ where ϵ is generated from Cauchy inequality of inner product. For convenience, we let ϵ = η1 . 79 1 Choose a β such that 0 < β ≤ 1+Cqm . Then we have β(1 − (Cqm + 1)β)E∥qk ∥2 + E∥x̂k+1 − x∗ ∥2 ≤(1 − β)E∥x̂k − x∗ ∥2 + (1 + η)βE∥xk+1 − x∗ ∥2 + (η 2 + η)βCqm E∥qk−1 ∥2 . (A.15) Letting si = h∗i in (A.2), we have n (1 + η)cβγ 2 X E∥hk+1 i − h∗i ∥2 n i=1 n n (1 + η)(1 − α)cβγ 2 X k ∗ 2 (1 + η)αcβγ 2 X k ≤ ∥hi − hi ∥ + ∥gi − h∗i ∥2 n i=1 n i=1 n (1 + η)α[(Cq + 1)α − 1]cβγ 2 X k 2 + ∥∆i ∥ . (A.16) n i=1 Then we let Rk = β(1 − (Cqm + 1)β)E∥qk ∥2 and define Vk = Rk−1 + E∥x̂k − x∗ ∥2 + (1+η)cβγ 2 Pn k ∗ 2 n i=1 E∥hi − hi ∥ . Thus, we obtain Vk+1 ≤(η 2 + η)βCqm E∥qk−1 ∥2 + (1 + ηβ)E∥x̂k − x∗ ∥2 n ∗ ∗ (1 + η)(1 − α)cβγ 2 X k k − 2(1 + η)βγE⟨x̂ − x , ∇f (x̂ ) − h ⟩ + E∥hki − h∗i ∥2 n i=1 2h n (1 + η)βγ 2 i X + nc(C q + 1)α − ncα + C q E∥∆ki ∥2 n2 i=1 n (1 + η)(1 + cα) 2 X (1 + η)(1 + ncα) 2 2 + βγ E∥∇fi (x̂k ) − h∗i ∥2 + βγ σ . (A.17) n i=1 n The E∥∆ki ∥2 -term can be ignored if nc(Cq + 1)α2 − ncα + Cq ≤ 0, which can be guaran- 4Cq (Cq +1) teed by c ≥ n and  q q  4Cq (Cq +1) 4Cq (Cq +1) 1− 1− nc 1+ 1− nc α∈ , . 2(Cq + 1) 2(Cq + 1) Given that each fi is L-Lipschitz differentiable and µ-strongly convex, we have n µL 1 1X E⟨∇f (x̂k ) − h∗ , x̂k − x∗ ⟩ ≥ E∥x̂k − x∗ ∥2 + E∥∇fi (x̂k ) − h∗i ∥2 . (A.18) µ+L µ + L n i=1 80 Hence Vk+1 ≤ρ1 Rk−1 + (1 + ηβ)E∥x̂k − x∗ ∥2 − 2(1 + η)βγE⟨x̂k − x∗ , ∇f (x̂k ) − h∗ ⟩ n n (1 + η)(1 − α)cβγ 2 X ∗ 2 (1 + η)(1 + cα) 2 X + k E∥hi − hi ∥ + βγ E∥∇fi (x̂k ) − h∗i ∥2 n i=1 n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ n h 2(1 + η)βγµL i ≤ρ1 Rk−1 + 1 + ηβ − E∥x̂k − x∗ ∥2 µ+L 2 X n (1 + η)(1 − α)cβγ + E∥hki − h∗i ∥2 n i=1 n h 2(1 + η)βγ i 1 X + (1 + η)(1 + cα)βγ − 2 E∥∇fi (x̂k ) − h∗i ∥2 µ+L n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ n n ∗ 2 (1 + η)(1 − α)cβγ 2 X ≤ρ1 R k−1 k + ρ2 E∥x̂ − x ∥ + E∥hki − h∗i ∥2 n i=1 (1 + η)(1 + ncα) 2 2 + βγ σ (A.19) n where (η 2 + η)Cqm ρ1 = , 1 − (Cqm + 1)β 2(1 + η)βγµL ρ2 =1 + ηβ − . µ+L Here we let γ ≤ 2 (1+cα)(µ+L) such that (1+η)(1+cα)βγ 2 − 2(1+η)βγ µ+L ≤ 0 and the last inequality holds. 
In order to get max(ρ1 , ρ2 , 1 − α) < 1, we have the following conditions 0 ≤ (η 2 + η)Cqm ≤1 − (Cqm + 1)β, 2(1 + η)γµL η< . µ+L Therefore, the condition for γ is η(µ + L) 2 ≤γ≤ , 2(1 + η)µL (1 + cα)(µ + L) 81 which implies an additional condition for η. Therefore, the condition for η is   q  −Cqm + (Cqm )2 + 4(1 − (Cqm + 1)β) 4µL η ∈ 0, min  m , 2  . 2Cq (µ + L) (1 + cα) − 4µL 4µL η(µ+L) 2 where η ≤ (µ+L)2 (1+cα)−4µL is to ensure 2(1+η)µL ≤ (1+cα)(µ+L) such that we don’t get an empty set for γ. If we define ρ = max{ρ1 , ρ2 , 1 − α}, we obtain (1 + η)(1 + ncα) 2 2 Vk+1 ≤ ρVk + βγ σ (A.20) n and the proof is completed by applying (A.20) recurrently. A.5 Proof of Theorem 3.4.2 Proof. In Algorithm A.1, we can show E∥x̂k+1 − x̂k ∥2 = β 2 E∥q̂k ∥2 = β 2 E∥Eq̂k ∥2 + β 2 E∥q̂k − Eq̂k ∥2 = β 2 E∥qk ∥2 + β 2 E∥q̂k − qk ∥2 (A.21) ≤ (1 + Cqm )β 2 E∥qk ∥2 . and E∥qk ∥2 = E∥ − γĝk + ηek ∥2 ≤ 2γ 2 E∥ĝk ∥2 + 2η 2 E∥ek ∥2 ≤ 2γ 2 E∥ĝk ∥2 + 2Cqm η 2 E∥qk−1 ∥2 . (A.22) 82 Using (A.21)(A.22) and the Lipschitz continuity of ∇f (x), we have Ef (x̂k+1 ) + (Cqm + 1)Lβ 2 E∥qk ∥2 L ≤Ef (x̂k ) + E⟨∇f (x̂k ), x̂k+1 − x̂k ⟩ + E∥x̂k+1 − x̂k ∥2 + (Cqm + 1)Lβ 2 E∥qk ∥2 2 (1 + Cqm )Lβ 2 =Ef (x̂k ) + βE⟨∇f (x̂k ), −γĝk + ηek ⟩ + E∥qk ∥2 + (Cqm + 1)Lβ 2 E∥qk ∥2 2 k k k k 3(Cqm + 1)Lβ 2 =Ef (x̂ ) + βE⟨∇f (x̂ ), −γ∇f (x̂ ) + ηe ⟩ + E∥qk ∥2 2 βη βη ≤Ef (x̂k ) − βγE∥∇f (x̂k )∥2 + E∥∇f (x̂k )∥2 + E∥ek ∥2 2 2 h i + 3(Cqm + 1)Lβ 2 γ 2 E∥ĝk ∥2 + Cqm η 2 E∥qk−1 ∥2 h βη i ≤Ef (x̂k ) − βγ − − 3(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 n 3Cq (Cq + 1)Lβ 2 γ 2 X m k 2 3(Cqm + 1)Lβ 2 γ 2 2 + E∥∆ i ∥ + σ n2 i=1 n h βηC m i q + + (3Cqm + 1)Cqm Lβ 2 η 2 E∥qk−1 ∥2 , (A.23) 2 where the last inequality is from (A.7) with h∗ = 0. Letting si = 0 in (A.2), we have EQ ∥hk+1 i ∥2 ≤(1 − α)∥hki ∥2 + α∥gik ∥2 + α[(Cq + 1)α − 1]∥∆ki ∥2 . (A.24) Due to the assumption that each worker samples the gradient from the full dataset, we have Egik = E∇f (x̂k ), E∥gik ∥2 ≤ E∥∇f (x̂k )∥2 + σi2 . (A.25) Pn Define Λk = (Cqm + 1)Lβ 2 ∥qk−1 ∥2 + f (x̂k ) − f ∗ + 3c(Cqm + 1)Lβ 2 γ 2 n1 i=1 E∥hki ∥2 , and 83 from (A.23), (A.24), and (A.25), we have n 1X EΛk+1 ≤Ef (x̂k ) − f ∗ + 3(1 − α)c(Cqm + 1)Lβ 2 γ 2 E∥hki ∥2 n i=1 h βη i − βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 n (Cq + 1)Lβ 2 γ 2 h m 2 iX + 3nc(Cq + 1)α − 3ncα + 3Cq E∥∆ki ∥2 n2 i=1 (Cqm + 1)Lβ 2 γ 2 σ 2 + 3(1 + ncα) n h βηC m i q + + 3(Cqm + 1)Cqm Lβ 2 η 2 E∥qk−1 ∥2 . (A.26) 2 4Cq (Cq +1) If we let c = n , then the condition of α in (3.5) gives 3nc(Cq + 1)α2 − 3ncα + 3Cq ≤ 0 and n 2 21 X k+1 k ∗ EΛ ≤Ef (x̂ ) − f + 3(1 − α)c(Cqm + 1)Lβ γ E∥hki ∥2 n i=1 h βη i − βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 2 (Cqm + 1)Lβ 2 γ 2 σ 2 + 3(1 + ncα) n βηCqm +[ + 3(Cqm + 1)Cqm Lβ 2 η 2 ]E∥qk−1 ∥2 . (A.27) 2 1 Let η = γ and βγ ≤ 6(1+cα)(Cqm +1)L , we have βη βγ βγ − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 = − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 ≥ 0. 2 2 r m +1)2 48L2 β 2 (Cq n −1+ 1+ m o Cq 1 Take γ ≤ min 12Lβ(Cqm +1) , 6Lβ(1+cα)(C m +1) will guarantee q h βηC m i q + 3(Cqm + 1)Cqm Lβ 2 η 2 ≤ (Cqm + 1)Lβ 2 . 2 Hence we obtain h βγ i (Cqm + 1)Lβ 2 γ 2 σ 2 EΛk+1 ≤ EΛk − − 3(1 + cα)(Cqm + 1)Lβ 2 γ 2 E∥∇f (x̂k )∥2 + 3(1 + ncα) . 2 n (A.28) Taking the telescoping sum and plugging the initial conditions, we derive (3.12). 84 A.6 Proof of Corollary 3.4.2 1 4Cq (Cq +1) Proof. With α = 2(Cq +1) and c = n , 1 + ncα = 1 + 2Cq is a constant. r 2 n −1+ 1+ 48L Cm o 1 q 1 √ We set β = Cqm +1 and γ = min 12L , . 
In general, Cqm is 12L(1+cα)(1+ K/n) 1 √ bounded which makes the first bound negligible, i.e., γ = when K is 12L(1+cα)(1+ K/n) large enough. Therefore, we have β 1 − 6(1 + cα)Lγ 1 − 3(1 + cα)(Cqm + 1)Lβ 2 γ = m ≤ m . (A.29) 2 2(Cq + 1) 4(Cq + 1) From Theorem 3.4.2, we derive K 1 X E∥∇f (x̂k )∥2 K k=1 4(Cqm + 1)(EΛ1 − EΛK+1 ) 12(1 + ncα)Lσ 2 γ ≤ + γK n 1 1 (1 + ncα)σ 2 1 ≤48L(Cqm + 1)(1 + cα)(EΛ1 − EΛK+1 )( + √ )+ √ , (A.30) K nK (1 + cα) nK which completes the proof. 85 APPENDIX B SUPPLEMENTARY OF CHAPTER 4 B.1 LEAD: Smooth Case of Prox-LEAD Algorithm B.1: LEAD Input: Stepsize η, parameterPn (α, γ), X0 , H1 , D1 = (I − W)Z for any Z K K Output: X or 1/n i=1 Xi 1: H1w = WH1 9: procedure C OMM(Y, H, Hw ) 2: X1 = X0 − η∇F(X0 ; ξ 0 ) 10: Q = C OMPRESSY − H 3: for k = 1, 2, · · · , K − 1 do 11: Ŷ = H + Q 4: Yk = Xk − η∇F(Xk ; ξ k ) − ηDk 12: Ŷw = Hw + WQ 5: Ŷk , Ŷwk , Hk+1 , Hk+1 w = C OMM(Yk , Hk , Hkw ) 13: H = (1 − α)H + αŶ γ 6: D k+1 = D + 2η (Ŷ − Ŷwk ) k k 14: Hw = (1 − α)Hw + αŶw 7: Xk+1 = Xk − η∇F(Xk ; ξ k ) − ηDk+1 15: Return: Ŷ, Ŷw , H, Hw 8: end for 16: end procedure B.2 Connection with Existing Algorithms In Section 4.2, we have discussed the motivation of LEAD and Prox-LEAD. We now turn to the relation between LEAD and some other existing algorithms from the perspective of the dual problem. We look at the problem (4.1) without the non-smooth regularizer, i.e., R(X) = 0. Consider the Fenchel conjugate of F defined as F∗ (Y) = supX∈Rn×p Y⊤ X − F(X), then we obtain the dual problem as ∗ √ minn×p F (− I − WS). S∈R If we apply the gradient descent method to the above problem, we need to evaluate the √ √ √ gradient − I − W∇F∗ (− I − WS). Let D = I − WS, then the iteration follows Dk+1 = Dk + θ(I − W)∇F∗ (−Dk ). When the gradient of the dual function is available, the communication proceeds after the gradient evaluation. If we compress the only communication step, the algorithm leads 86 to Option A in [81] with the single exception that the quantization procedure is slightly different. Since D belongs to the dual space of the original optimized variable, we derive the solution of the primal problem via the relation Xk+1 = ∇F∗ (−Dk ) (B.1) and the linear convergence is guaranteed under the strongly convex assumption on F. In most cases, it is difficult to evaluate the conjugate function, so we rewrite the rela- tion (B.1) into the following minimization problem Xk+1 = arg min F(X) + ⟨Dk , X⟩, X∈Rn×p and get an inexact estimate using the gradient method. Applying one step of gradient de- scent method to the subproblem and inserting the update into the original dual iterations, we get the following primal-dual iteration   k+1 X  = Xk − η∇F(Xk ) − ηDk ,  Dk+1 = Dk + θ(I − W)Xk+1 , where η, θ are stepsizes for primal and dual update respectively. It can be shown that this iteration is a special case of the incremental primal-dual gradient method (PDGM) in [92] and the linear convergence rate can be guaranteed. Furthermore, when the communi- cation of Xk+1 in the dual update is conducted by the compression procedure (and the stochastic gradient estimation is involved), the algorithm recovers Option B (Option C) of LessBit in [81]. If we proceed one more gradient descent step in the subproblem, we get   k+1 X  = Xk − η∇F(Xk ) − ηDk ,   k+1 X = Xk+1 − η∇F(Xk+1 ) − ηDk ,   Dk+1 = Dk + θ(I − W)Xk+1 . 
87 Algorithm Non-smooth R ∇F Compression Convergence complexity Dual Gradient Descent ✗ ✗ ✗ O(κ e f κg ) LessBit-Option A ✗ ✗ ✓ O(C e + κf κg + Cκf κ fg ) [81] PDGM ✗ ✓ ✗ O(κ e f + κf κg ) [92] LessBit-Option B ✗ ✓ ✓ O(C e + κf κg + Cκf κ fg ) [81] NIDS ✗ ✓ ✗ O(κ e f + κg ) [31] LEAD ✗ ✓ ✓ O((1 e + C)(κf + κg ) + Cκf κg ) [33] PUDA ✓ ✓ ✗ O(κ e f + κg ) [106] Prox-LEAD ✓ ✓ ✓ O(C e + κf + κg + Cκf κg ) Algorithm 4.1 L Table B.1: The comparison of algorithms mentioned in Section B.2; κf := µ , κg := λmax (I−W) max wij λmin (I−W) and κeg := λmin(i,j)∈E (I−W) for nonnegative W. The addition of the step does not increase the computation of the gradient ∇F because it can be reused in the next iteration. So, we switch the order of the iteration and derive   k+1 X  = Xk − η∇F(Xk ) − ηDk ,  = Dk + θ(I − W)Xk+1 ,  k+1 D   Xk+1 = Xk − η∇F(Xk ) − ηDk+1 . By setting θ = 1, the above algorithm recovers NIDS of the smooth problems in [2, 31]. It has been shown that the additional step in NIDS improves the linear convergence rate of the previous two algorithms in terms of the condition numbers of the objective function and the network. The detailed comparison is listed in Table B.1. Compared to LEAD, Prox-LEAD improves the complexity by reducing O(C(κ e f + κg )) to O(C). e LessBit-Option B has complexity O(C e + κf κg + Cκf κeg ), which has a better en- tangled term involving C since κeg is less than κg . However, for all C ∈ [0, κg /(κg − κeg )), the complexity of Prox-LEAD is always better than the one of LessBit-Option B. 88 B.3 Proof of Theorem 4.4.2 Proof. Linear convergence. The following proof is applicable to the general choice of {pij } 1 and for simplicity, we only consider the uniform sampling case with pij = m . Lemma 4.4.2 still holds with different Gk = [g1k , · · · , gnk ]⊤ in procedure SGO, then we focus on the following term E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 X n  ∇f (xk ) − ∇f (x̃k )  2 il il = E xki − x∗ − η i i + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 mpil n  k k  ∗ ∇fil (xi ) − ∇fil (x̃i ) X k ∗ 2 k k ∗ = ∥X − X ∥ − 2η E xi − x , + ∇fi (x̃i ) − ∇fi (x ) i=1 mpil n 2 X ∇fil (xki ) − ∇fil (x̃ki ) +η 2 E + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 mpil X n = ∥Xk − X∗ ∥2 − 2η ⟨xki − x∗ , ∇fi (xki ) − ∇fi (x∗ )⟩ i=1 n X m 2 X ∇fij (xki ) − ∇fij (x̃ki ) +η 2 pij + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 j=1 mpij X n X n ∗ 2 ∗ k = ∥X − X ∥ − 2η ⟨xki −x , ∇fi (xki )⟩ + 2η (fi (xki ) − fi (x∗ ) − Vfi (xki , x∗ )) i=1 i=1 n X m ∗ ∗ 2 X ∇fij (xki ) − ∇fij (x ) fij (x ) − ∇fij (x̃ki ) +η 2 pij + + ∇fi (x̃ki ) − ∇fi (x∗ ) i=1 j=1 mpij mpij n n X m ∗ X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) + 2η 2 ∥∇fij (xki ) − ∇fij (x∗ )∥2 i=1 i=1 j=1 m2 p ij n Xm 2 X ∇fij (x∗ ) − ∇fij (x̃ki ) + 2η 2 pij + ∇fi (x̃ki ) − ∇fi (x∗ ) , i=1 j=1 mpij where the inequality uses the strong convexity of fij in Assumption 4.3.2 and (a + b)2 ≤ 2a2 + 2b2 . n o Let ui be the random variable taking values in 1 mpil (∇fil (x̃ki ) − ∇fil (x∗ )) : l ∈ [m] with distribution Pi = {pil : l ∈ [m]}, then the last term is actually the summation of the 89 variance of ui s due to m X pij Eui = (∇fij (x̃∗i ) − ∇fij (x∗ )) = ∇fi (x̃ki ) − ∇fi (x∗ ). 
j=1 mp ij Applying the inequality E∥ui − Eui ∥2 ≤ E∥ui ∥2 to the last term, we get E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n ∗ X 4η 2 L X k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 mpmin i=1 i i n X m X 1 + 2η 2 ∥∇fij (x̃ki ) − ∇fij (x∗ )∥2 i=1 j=1 m2 p ij X n ∗ k ≤ (1 − µη)∥X − X ∥ − 2η Vfi (xki , x∗ ) i=1 2 n n 4η L X 4η 2 L X + Vfi (xki , x∗ ) + Vf (x̃k , x∗ ), mpmin i=1 mpmin i=1 i i where pmin := mini,j {pij } and the last inequality uses the Lipschitz smoothness of fij in Assumption 4.3.2. From the update of x̃k+1 i , we have Ex̃k+1 i = pxki + (1 − p)x̃ki . Hence EVfi (x̃k+1 i , x∗ ) = pVfi (xki , x∗ ) + (1 − p)Vfi (x̃ki , x∗ ). Combine them with the inequality (4.12) and (4.11), after taking the total expectation, we get X n EΦ k+1 + c̃ EVfi (x̃k+1i , x∗ ) i=1 n 4η 2 L  X ≤ (1 − ηµ)E∥Z − k Z∗ ∥2P − 2η − − c̃p EVfi (xki , x∗ ) mpmin i=1 n 4η 2 L X + EVfi (x̃ki , x∗ ) + (1 − α)CM E∥Hk − Z∗ ∥2 mpmin i=1 2η 2  γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 Xn k k 2 + (1 − ηµ)E∥Ẑ − Z ∥(I−P)2 + c̃(1 − p) EVfi (x̃ki , x∗ ), i=1 90 8η 2 L where c̃ = pmpmin . By the choice of η and {pij }, we have 4η 2 L 1 1 + c̃p = (2 + 4) = = 2η. mpmin 18L 3L Therefore, X n EΦk+1 + c̃ EVfi (x̃k+1 i , x∗ ) ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + (1 − α)CM E∥Hk − Z∗ ∥2 i=1 2η 2  γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 γλmax (I − W) + (1 − ηµ) E∥Ẑk − Zk ∥2I−P 2  n 4η 2 L X  + c̃ 1 − p + EVfi (x̃ki , x∗ ). c̃mpmin i=1 Note that 4η 2 L p 2 1−p+ =1− , c̃ = , c̃mpmin 2 9pL then we get n n 2 X ∗  2 X  EΦ k+1 + k+1 EVfi (x̃i , x ) ≤ ρ̃ EΦ + k EVfi (x̃ki , x∗ ) , 9pL i=1 9pL i=1 where   ηµ γ γλmax (I − W) p ρ̃ := max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) ,1 − . 2 − ηµ 2 2 2 Complexity. The linear convergence requires 1 α< , 1+C 1 2∆(α) γ≤ . λmax (I − W) ∆(α) + 24αCκf 1 α 1 Take α = 2(1+C) , then ∆(α) = 2 = 4(1+C) and 1 2 γ≤ . λmax (I − W) 1 + 48Cκf By taking γ equal to the upper bound and plugging η, α, γ into ρ̃, we get  −1 n 2o ρ̃ ≤ 1 − max 12κf − 1, κg + 48Cκf κg , 1 + C, 6κf , p and the proof is complete. 91 B.4 Proof of Theorem 4.4.3 Proof. Linear convergence. The following proof can be adapted to to show the conver- gence with the general {pij } while for simplicity, we focus on uniform sampling case. We start from the gradient term of the key inequality in Lemma 4.4.2 and replace ∇F(Xk , ξ k ) by Gk = [g1k , · · · , gnk ]⊤ , E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n  ∇f (xk ) − ∇f (x̃k ) Pm ∇fij (x̃k ) − ∇fij (x∗ )  2 ij X il il j=1 k ∗ i il = E xi − x − η + i=1 mp il m n Pm k ∗  j=1 ∇fij (x̃ij ) − ∇fij (x ) k k  k ∗ 2 X k ∗ ∇f il (x i ) − ∇f il (x̃ il ) = ∥X − X ∥ − 2η E xi − x , + i=1 mpil m n P m k ∗ 2 ∇fil (xki ) − ∇fil (x̃kil ) j=1 ∇fij (x̃ij ) − ∇fij (x ) X 2 +η E + i=1 mpil m Xn ∗ 2 k = ∥X − X ∥ − 2η ⟨xki − x∗ , ∇fi (xki ) − ∇fi (x∗ )⟩ i=1 n X m Pm 2 X ∇fij (xki ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 +η pij + i=1 j=1 mpij m Xn Xn ∗ 2 k = ∥X − X ∥ − 2η ⟨xki −x ∗ , ∇fi (xki )⟩ + 2η (fi (xki ) − fi (x∗ ) − Vfi (xki , x∗ ))+ i=1 i=1 n X m Pm 2 X ∇fij (xki ) ∗ ∗ − ∇fij (x ) fij (x ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 η pij + + . i=1 j=1 mpij mpij m Using the strong convexity of fij in Assumption 4.3.2 and applying (a + b)2 ≤ 2a2 + 2b2 to the last term, we have E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n X m X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + 2η 2 ∥∇fij (xki ) − ∇fij (x∗ )∥2 i=1 i=1 j=1 m2 p ij n X m Pm 2 X ∇fij (x∗ ) − ∇fij (x̃kij ) j=1 ∇fij (x̃kij ) − ∇fij (x∗ ) 2 + 2η pij + . 
i=1 j=1 mpij m n o Let ui be the random variable taking values in 1 mpil (∇fil (x̃kil ) − ∇fil (x̃∗ )) : l ∈ [m] with distribution Pi = {pil : l ∈ [m]}, then the last term of the above inequality can be 92 upper bounded by Xn X n 2η 2 E∥ui − Eui ∥2 ≤ 2η 2 E∥ui ∥2 i=1 i=1 n X m 2 X 1 = 2η 2 pij (∇fij (x̃kij ) − ∇fij (x̃∗ )) . i=1 j=1 mpij Combining the inequality with the upper bound, we get E∥Xk − X∗ − ηGk + η∇F(X∗ )∥2 n n m X 4η 2 L X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 m2 i=1 j=1 pij ij i n X m X 1 + 2η 2 ∥∇fij (x̃kij ) − ∇fij (x∗ )∥2 i=1 j=1 m2 p ij n n m X 4η 2 L X X 1 k ≤ (1 − µη)∥X − X ∥ − 2η ∗ Vfi (xki , x∗ ) + Vf (xk , x∗ ) i=1 m2 i=1 j=1 pij ij i n X m 4η 2 L X 1 + Vf (x̃k , x∗ ), m2 i=1 j=1 pij ij ij where the last inequality uses the Lipschitz smoothness of fij in Assumption 4.3.2. From the update of x̃ij , we have ∗ ∗ ∗ EVfij (x̃k+1 k k ij , x ) = pij Vfij (xi , x ) + (1 − pij )Vfij (x̃ij , x ). 93 Similar to the proof of Theorem 4.4.2, in total expectation, we can get X n X m ∗ EΦ k+1 + c̃ EVfij (x̃k+1 ij , x ) i=1 j=1 n X m X 1 k ≤ (1 − ηµ)E∥X − X ∥ − 2η ∗ 2 EVfij (xki , x∗ ) i=1 j=1 m n m n m 4η 2 L X X 1 ∗ XX + EV f ij (x k i , x ) + c̃ pij EVfij (xki , x∗ ) m2 i=1 j=1 pij i=1 j=1 n X m n m X 4η 2 L X X 1 + c̃ (1 − pij )EVfij (x̃kij , x∗ ) + EVfij (x̃kij , x∗ ) i=1 j=1 m2 i=1 j=1 pij 2  2η γ  + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q + (1 − α)CM E∥Hk − Z∗ ∥2 γ 2 2η 2  γ  ≤ (1 − ηµ)E∥Zk − Z∗ ∥2P + 1 − λmin (I − W) E∥Dk − D∗ ∥2Q γ 2 n m n m XX ∗ 4η 2 L X X 1 + c̃ k (1 − pij )EVfij (x̃ij , x ) + 2 EVfij (x̃kij , x∗ ) i=1 j=1 m i=1 j=1 p ij (1 − ηµ)γλmax (I − W) + (1 − α)CM E∥Hk − Z∗ ∥2 + E∥Ẑk − Z∗ ∥2I−P , 2 where the last inequality is guaranteed by 2η 4η 2 L − 2 − c̃pij ≥ 0. m m pij 8η 2 L Define pmin = mini,j {pij } and take c̃ = m2 p2min , then by the choice of {pij } and η, the above condition is satisfied due to 2η 4η 2 L 1 1 2 − 2 − c̃pij = − − = 0. m m pij 3Lm 9Lm 9Lm Note that 4η 2 L ! 4η 2 L m2 pij  1  c̃(1 − pij ) + 2 = c̃ 1 − pij + = c̃ 1 − . m pij c̃ 2m Therefore, n m n m 2 XX  2 XX  EΦk+1 + k+1 ∗ EVfij (x̃ij , x ) ≤ ρ̃ EΦ + k EVfij (x̃kij , x∗ ) , 9L i=1 j=1 9L i=1 j=1 where   ηµ γ γλmax (I − W) 1 ρ̃ := max 1 − , 1 − λmin (I − W), 1 − α, (1 − ηµ) ,1 − . 2 − ηµ 2 2 2m 94 Complexity. The complexity analysis is identical to that in Theorem 4.4.2 with the single exception that we replace p by m−1 so we omit it here. 95 BIBLIOGRAPHY 96 BIBLIOGRAPHY [1] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015. [2] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019. [3] Zhi Li and Ming Yan. New convergence analysis of a primal-dual algorithm with large stepsizes. Advances in Computational Mathematics, 47(1):1–20, 2021. [4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT’2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD. [5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manju- nath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 
TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI’16, pages 265–283, 2016. [6] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and ef- ficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015. [7] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 583–598, Berkeley, CA, USA, 2014. USENIX Association. [8] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Ima- geNet training in minutes. In Proceedings of the 47th International Conference on Par- allel Processing, ICPP 2018, pages 1:1–1:10, New York, NY, USA, 2018. ACM. [9] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1223–1231, USA, 2012. 97 [10] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2737–2745. Curran Associates, Inc., 2015. [11] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010. [12] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014. [13] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017. [14] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018. [15] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 4452–4463, USA, 2018. Curran Associates Inc. [16] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Urban Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Pro- ceedings of Machine Learning Research, pages 3252–3261. PMLR, 2019. [17] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019. [18] Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 
DoubleSqueeze: Par- allel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 6155–6165, 2019. [19] Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 133–143. PMLR, 2020. [20] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 7663–7673, 2018. 98 [21] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947, 2019. [22] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pages 8388–8399, 2019. [23] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. Deepsqueeze: Decentralization meets error-compensated compression. CoRR, abs/1907.07346, 2019. [24] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceed- ings of the 36th International Conference on Machine Learning, pages 3479–3487. PMLR, 2019. [25] Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Confer- ence on Learning Representations, 2020. [26] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi- agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009. [27] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. [28] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for de- centralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017. [29] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized lin- earized alternating direction method of multipliers. IEEE Transactions on Signal Pro- cessing, 63(15):4051–4064, 2015. [30] Kun Yuan, Wei Xu, and Qing Ling. Can primal methods outperform primal-dual methods in decentralized dynamic optimization? IEEE Transactions on Signal Pro- cessing, 68:4466–4480, 2020. [31] Yao Li and Ming Yan. On the linear convergence of two decentralized algorithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021. [32] Yao Li, Xiaorui Liu, Jiliang Tang, Ming Yan, and Kun Yuan. Decentralized compos- ite optimization with compression. arXiv preprint arXiv:2108.04448, 2021. [33] Xiaorui Liu, Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan. Linear convergent decentralized optimization with compression. In International Conference on Learning Representations, 2021. 99 [34] Hanlin Tang, Yao Li, Ji Liu, and Ming Yan. ErrorCompensatedX: Error compen- sation for variance reduced algorithms. 
Advances in Neural Information Processing Systems, 34:18102–18113, 2021. [35] Yao Li and Ming Yan. On the improved conditions for some primal-dual algo- rithms. arXiv preprint arXiv:2201.00139, 2022. [36] S Sundhar Ram, Angelia Nedić, and Venugopal V Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of optimization theory and applications, 147(3):516–545, 2010. [37] Angelia Nedic. Asynchronous broadcast-based convex optimization over a net- work. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2010. [38] D. Jakovetic, J. Xavier, and J. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59:1131–1146, 2014. [39] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Pro- cessing, 62(7):1750–1761, 2014. [40] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975. [41] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Dis- tributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011. [42] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In Global Conference on Signal and Infor- mation Processing (GlobalSIP), 2013 IEEE, pages 551–554. IEEE, 2013. [43] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015. [44] Mingyi Hong and Tsung-Hui Chang. Stochastic proximal gradient consensus over random networks. IEEE Transactions on Signal Processing, 65(11):2933–2948, 2017. [45] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017. [46] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimiza- tion. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017. 100 [47] Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. A decentralized second-order method for dynamic optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6036–6043. IEEE, 2016. [48] Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Au- tomatica, 46(2):322–329, 2010. [49] J. Xu, S. Zhu, Y. Soh, and L. Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060, 2015. [50] Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimiza- tion. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120– 136, 2016. [51] Angelia Nedić, Alex Olshevsky, Wei Shi, and César A Uribe. Geometrically conver- gent distributed optimization with uncoordinated step-sizes. In American Control Conference (ACC), 2017, pages 3950–3955. IEEE, 2017. [52] Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedić. A push-pull gradient method for distributed optimization in networks. 
In 2018 IEEE Conference on Decision and Control (CDC), pages 3385–3390. IEEE, 2018. [53] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for dis- tributed optimization and learning—part i: Algorithm development. IEEE Transac- tions on Signal Processing, 67(3):708–723, 2018. [54] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part ii: Convergence analysis. IEEE Trans- actions on Signal Processing, 67(3):724–739, 2018. [55] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning- Volume 70, pages 3027–3036. JMLR, 2017. [56] César A Uribe, Soomin Lee, Alexander Gasnikov, and Angelia Nedić. A dual ap- proach for optimal algorithms in distributed optimization over networks. In 2020 Information Theory and Applications Workshop (ITA), pages 1–37. IEEE, 2020. [57] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015. [58] Sulaiman Alghunaim, Kun Yuan, and Ali H Sayed. A linearly convergent proximal gradient algorithm for decentralized optimization. In Advances in Neural Information Processing Systems, pages 2848–2858, 2019. 101 [59] A. Chen and A. Ozdaglar. A fast distributed proximal-gradient method. In the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601–608, 2012. [60] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. In The 52nd IEEE Annual Conference on Decision and Control, pages 6855– 6860, 2013. [61] C. Xi and U. Khan. On the linear convergence of distributed optimization over directed graphs. arXiv preprint arXiv:1510.02149, 2015. [62] J. Zeng and W. Yin. ExtraPush for convex smooth decentralized optimization over directed networks. Journal of Computational Mathematics, Special Issue on Compressed Sensing, Optimization, and Structured Solutions, 35(4):381–394, 2017. [63] Ying Sun, Gesualdo Scutari, and Daniel Palomar. Distributed nonconvex multia- gent optimization over time-varying networks. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 788–794. IEEE, 2016. [64] Qing Ling and Alejandro Ribeiro. Decentralized dynamic optimization through the alternating direction method of multipliers. In 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 170–174. IEEE, 2013. [65] Angelia Nedić and Alex Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016. [66] Angelia Nedić. Distributed optimization over networks. In Multi-agent Optimiza- tion, pages 1–84. Springer, 2018. [67] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, vol- ume 87. Springer Science & Business Media, 2013. [68] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud com- puting. In INTERSPEECH, pages 1488–1492, 2015. [69] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. 
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Ma- chine Learning Research, pages 3636–3645, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. [70] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 32Nd Interna- tional Conference on Neural Information Processing Systems, NIPS’18, pages 1306–1316, USA, 2018. Curran Associates Inc. [71] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017. 102 [72] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compen- sated quantized SGD and its applications to large-scale distributed optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Con- ference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, 10–15 Jul 2018. [73] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gra- dient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, Copenhagen, Denmark, September 2017. Asso- ciation for Computational Linguistics. [74] Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681, 2019. [75] Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. SGD and Hogwild! Convergence without the bounded gradi- ents assumption. In International Conference on Machine Learning, pages 3750–3758. PMLR, 2018. [76] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. 36th International Conference on Machine Learning, ICML 2019, 2019-June:9090–9112, 2019. [77] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Pe- ter Richtárik. Stochastic distributed learning with gradient quantization and vari- ance reduction. arXiv preprint arXiv:1904.05115, 2019. [78] P. Elias. Universal codeword sets and representations of the integers. IEEE Transac- tions on Information Theory, 21(2):194–203, March 1975. [79] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. [80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 770–778, 2016. [81] Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtarik, and Sebastian Stich. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics, pages 4087–4095. PMLR, 2021. [82] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9):803–812, 1986. 103 [83] Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. 
IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013. [84] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2 : Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, pages 4848–4856, 2018. [85] Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176(1):497–544, 2019. [86] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Math- ematical Programming, pages 1–49, 2020. [87] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Accelerated primal-dual algo- rithms for distributed smooth convex optimization over networks. In International Conference on Artificial Intelligence and Statistics, pages 2381–2391. PMLR, 2020. [88] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Lee, and Laurent Massoulié. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1–31, 2019. [89] Ruggero Carli, Fabio Fagnani, Paolo Frasca, and Sandro Zampieri. Gossip consen- sus algorithms via quantized communication. Automatica, 46(1):70–80, 2010. [90] Yucheng Lu and Christopher De Sa. Moniqua: Modulo quantized communication in decentralized SGD. In Proceedings of the 37th International Conference on Machine Learning, 2020. [91] Sindri Magnússon, Hossein Shokri-Ghadikolaei, and Na Li. On maintaining linear convergence of distributed learning and optimization under limited communica- tion. IEEE Transactions on Signal Processing, 68:6101–6116, 2020. [92] Sulaiman A Alghunaim and Ali H Sayed. Linear convergence of primal–dual gradient methods and their performance in distributed optimization. Automatica, 117:109003, 2020. [93] Zhuorui Li, Yiwei Liao, Kun Huang, and Shi Pu. Compressed gradient track- ing for decentralized optimization with linear convergence. arXiv preprint arXiv:2103.13748, 2021. [94] Yongyang Xiong, Ligang Wu, Keyou You, and Lihua Xie. Quantized distributed gradient tracking algorithm with linear convergence in directed networks. arXiv preprint arXiv:2104.03649, 2021. [95] Zhuoqing Song, Lei Shi, Shi Pu, and Ming Yan. Compressed gradient tracking for decentralized optimization over general directed networks. IEEE Transactions on Signal Processing, 70:1775–1787, 2022. 104 [96] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017. [97] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using pre- dictive variance reduction. Advances in neural information processing systems, 26:315– 323, 2013. [98] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Ad- vances in neural information processing systems, 27, 2014. [99] Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don’t jump through hoops and remove those loops: Svrg and katyusha are better without the outer loop. In Algorithmic Learning Theory, pages 451–467. PMLR, 2020. [100] Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic aver- aging gradient algorithm. The Journal of Machine Learning Research, 17(1):2165–2199, 2016. [101] Kun Yuan, Bicheng Ying, Jiageng Liu, and Ali H Sayed. Variance-reduced stochas- tic learning by networked agents under random reshuffling. IEEE Transactions on Signal Processing, 67(2):351–366, 2018. 
[102] Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing, 68:6255–6271, 2020. [103] Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient dis- tributed optimization in networks with gradient tracking and variance reduction. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third In- ternational Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1662–1672. PMLR, 26–28 Aug 2020. [104] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Dual-free stochastic de- centralized optimization with variance reduction. Advances in Neural Information Processing Systems, 33:19455–19466, 2020. [105] Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022. [106] Sulaiman A Alghunaim, Ernest Ryu, Kun Yuan, and Ali H Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 2020. [107] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: unified framework and convergence analysis. IEEE Trans- actions on Signal Processing, 69:3555–3570, 2021. 105 [108] Haishan Ye, Ziang Zhou, Luo Luo, and Tong Zhang. Decentralized accelerated proximal gradient descent. Advances in Neural Information Processing Systems, 2020, 2020. [109] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013. [110] Bang Cong Vu. A splitting algorithm for dual monotone inclusions involving coco- ercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013. [111] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal-dual fixed point algo- rithm for minimization of the sum of three convex separable functions. Fixed Point Theory and Applications, 2016(1):1–18, 2016. [112] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76(3):1698–1717, 2018. 106