Compressed Gradient Tracking Methods for Decentralized Optimization with Linear Convergence

Communication compression techniques are of growing interest for solving decentralized optimization problems under limited communication, where the global objective is to minimize the average of local cost functions over a multi-agent network using only local computation and peer-to-peer communication. In this paper, we first propose a novel compressed gradient tracking algorithm (C-GT) that combines the gradient tracking technique with communication compression. In particular, C-GT is compatible with a general class of compression operators that unifies both unbiased and biased compressors. We show that C-GT inherits the advantages of gradient tracking-based algorithms and achieves a linear convergence rate for strongly convex and smooth objective functions. In the second part of the paper, we propose an error feedback based compressed gradient tracking algorithm (EF-C-GT) to further improve algorithm efficiency for biased compression operators. Numerical examples complement the theoretical findings and demonstrate the efficiency and flexibility of the proposed algorithms.


I. INTRODUCTION
In this paper, we study the problem of decentralized optimization over a multi-agent network that consists of n agents. The goal is to collaboratively solve the following optimization problem:

min_{x ∈ R^p} f(x) := (1/n) Σ_{i=1}^n f_i(x),   (1)

where x is the global decision variable, and each agent i has a local objective function f_i : R^p → R. The agents are connected through a communication network and can only exchange information with their immediate neighbors in the network. Through local computation and local information exchange, they seek a consensual and optimal solution that minimizes the average of all the local cost functions. Decentralized optimization is widely applicable when central controllers or servers are not available or preferable, when centralized communication involving a large amount of data exchange is prohibitively expensive due to limited communication resources, and when privacy preservation is desirable. Problem (1) has attracted much attention in recent years and has found a variety of applications in wireless networks, distributed control of robotic systems, and machine learning [1]-[3]. Early work considered the distributed subgradient descent (DGD) method with a diminishing step-size policy [4]. Under a constant step-size, EXTRA [5] first achieved a linear convergence rate for strongly convex and smooth cost functions by introducing an extra correction term to DGD. Distributed gradient tracking-based methods were later developed in [6]-[10], where the local gradient descent direction in DGD was replaced by an auxiliary variable that tracks the average gradient of the local objective functions. As a result, each agent's local iterate moves in the global descent direction and converges exponentially fast to the optimal solution for strongly convex and smooth objective functions [8], [9].
Compared with EXTRA, gradient tracking-based methods also accommodate uncoordinated step-sizes [6], [11] and possibly asymmetric weight matrices while preserving linear convergence rates. Variants have also been proposed to handle stochastic gradient information and time-varying or directed network topologies. For example, in [10], a distributed stochastic gradient tracking method was shown to exhibit comparable performance to a centralized stochastic gradient algorithm. Time-varying networks were considered in [8], [12]-[14], and more recent developments on directed graphs can be found in [12], [15]-[19] and the references therein. In particular, the Push-Pull/AB methods considered in [16] and [17] use both row-stochastic and column-stochastic weight matrices to achieve a linear convergence rate for strongly convex and smooth objective functions over general graphs.
In many application scenarios, limited communication bandwidth and power constraints make it vital to design communication-efficient protocols for distributed computation. To improve system scalability and communication efficiency, researchers have recently considered a variety of communication compression techniques, such as sparsification and quantization [20]-[31]. In the centralized setting, these methods were shown to maintain comparable convergence rates [20]-[27]. For decentralized optimization, several techniques were introduced to alleviate compression errors, including difference compression, extrapolation compression [32] and compression error compensation [26], [33]. A novel algorithm with communication compression, which builds on DGD and preserves the model average, was presented in [33], [34]; however, the method converges only sublinearly even when the objective functions are strongly convex. In [35], a linearly convergent decentralized optimization algorithm with compression (LEAD) was introduced for strongly convex and smooth objective functions; the method builds on NIDS [36], a sibling of EXTRA.
In light of the advantages of gradient tracking-based methods for decentralized optimization, it is natural to consider the marriage between gradient tracking and communication compression. The first such effort was made in [37], which considered a quantized gradient tracking method based on a special quantizer and showed a linear convergence rate for strongly convex and smooth objective functions. However, the algorithm design is rather complicated and relies on a specific quantizer, and the convergence conditions are not easy to verify.
In this paper, we first consider a novel gradient tracking-based method (C-GT) for decentralized optimization with communication compression. The algorithm compresses both the decision variables and the gradient trackers to provide a communication-efficient implementation. Unlike existing methods, which are mostly based on unbiased compressors or biased but contractive compressors, C-GT is provably efficient for a general class of compressors, including those that are neither unbiased nor biased but contractive, e.g., norm-sign compression methods. We show that C-GT achieves linear convergence for strongly convex and smooth objective functions under such a general class of communication compression techniques.
In the second part of the paper, we propose an error feedback based compressed gradient tracking algorithm (EF-C-GT) to further improve algorithm efficiency, particularly for biased compression operators. Compared with unbiased ones, biased compressors show advantages in their average capacity to preserve information [29] or in test accuracy [38]. However, simple distributed gradient descent may diverge at an exponential rate if the compression operators are allowed to be biased; a counter-example for the Top-1 compressor was provided in [29]. More discussion and comparison between biased and unbiased compression operators can be found in [29]. Error feedback is a well-known technique that can fix this issue and cope with the errors induced by biased, contractive compressors [23], [24], [28], [39]-[41]. For example, methods known as error compensation or error correction were developed earlier for a particular application in [20], [42], [43], and the performance of distributed stochastic gradient descent (SGD) with error feedback under heterogeneous data was analyzed in [28]. We show that EF-C-GT also achieves a linear convergence rate for strongly convex and smooth objective functions, and has superior performance over C-GT in numerical experiments.
The main contributions of the paper are summarized as follows:
• We propose a novel compressed gradient tracking algorithm (C-GT) for decentralized optimization, which inherits the advantages of gradient tracking-based methods and saves communication costs at the same time.
• The proposed C-GT algorithm is applicable to a general class of compression operators and works under arbitrary compression precision. In particular, the general condition on the compression operators unifies the commonly considered unbiased and biased but contractive compressors and also includes other compression methods such as norm-sign compressors.
• C-GT provably achieves linear convergence for minimizing strongly convex and smooth objective functions under the general condition on the compression operators.
• We propose EF-C-GT to improve algorithm efficiency for biased compression methods and prove its linear convergence for strongly convex and smooth objective functions.
• Simulation examples show that C-GT is efficient and widely applicable to various compressors, and that EF-C-GT outperforms C-GT for biased compression methods such as Top-k and Random-k.
The rest of this paper is organized as follows. Section II formulates the problem and describes the compression operators. We present the C-GT algorithm in Section III. In Section IV, we perform the convergence analysis for C-GT. We introduce EF-C-GT in Section V and its convergence result in Section V-A. Numerical examples are provided in Section VI. Finally, concluding remarks are given in Section VII.

A. Notation
Vectors are columns unless otherwise specified in this paper. Let each agent i hold a local copy x_i ∈ R^p of the decision variable and a gradient tracker (auxiliary variable) y_i ∈ R^p. At the k-th iteration, their values are denoted by x_i^k and y_i^k, respectively. For notational convenience, define

X := [x_1, x_2, ..., x_n]^T ∈ R^{n×p},  Y := [y_1, y_2, ..., y_n]^T ∈ R^{n×p},
X̄ := (1/n) 1 1^T X,  Ȳ := (1/n) 1 1^T Y,

where 1 is the column vector with each entry given by 1. At the k-th iteration, their values are denoted by X^k, Y^k, X̄^k and Ȳ^k, respectively. Auxiliary variables of the agents (in aggregate matrix form) H_x, H_y, Q_x, Q_y, X̂, Ŷ, Q̂_x and Q̂_y are defined similarly. An aggregate objective function of the local variables is defined as

F(X) := Σ_{i=1}^n f_i(x_i),   (2)

and its aggregate gradient is denoted by

∇F(X) := [∇f_1(x_1), ∇f_2(x_2), ..., ∇f_n(x_n)]^T ∈ R^{n×p}.   (3)

The inner product of vectors a, b ∈ R^p is written as ⟨a, b⟩. For matrices A, B ∈ R^{n×p}, we let ⟨A, B⟩ be the Frobenius inner product. We use ||·|| to denote the 2-norm of vectors and the Frobenius norm of matrices by default; for square matrices, ||·|| represents the spectral norm. The spectral radius of a square matrix M is denoted by ρ(M).

II. PROBLEM FORMULATION
In this section, we provide the assumptions on the communication graphs and the objective functions. Then, we discuss different kinds of compression methods and provide a general description for compression operators.

A. Preliminaries
We start by introducing the conditions on the communication network/graph and the objective functions. Assume the agents are connected over a directed network G = (V, E), where V = {1, 2, ..., n} is the set of vertices (nodes) and E ⊆ V × V consists of ordered pairs of vertices. The ordered pair (i, j) ∈ E indicates that there is a directed edge from agent i to agent j, so that the i-th agent can directly send information to the j-th agent. For an arbitrary agent i ∈ V, we define the set of its in-neighbors as N_i^in = {j : (j, i) ∈ E} and the set of its out-neighbors as N_i^out = {j : (i, j) ∈ E}. The cardinalities of N_i^in and N_i^out, denoted by Deg_i^in and Deg_i^out, are referred to as agent i's in-neighbor and out-neighbor degree, respectively. Regarding the network structure, we make the following standing assumption:
Assumption 1: The directed graph G is strongly connected and permits a nonnegative doubly stochastic weight matrix W = [w_ij] ∈ R^{n×n}. That is, agent i can receive information from agent j if and only if w_ij > 0, and W1 = 1 and 1^T W = 1^T.
Remark 1: It is possible to construct a doubly stochastic weight matrix for any strongly connected directed graph in theory [44]. In practice, it is easier to construct such a matrix for certain types of directed graphs, while in general an iterative computation process is needed. For instance, if Deg in i = Deg out i for all i (e.g., when G is undirected), then a doubly stochastic weight matrix can be easily constructed as W = I − aL, where I is an identity matrix, L is the graph Laplacian and a > 0 is a tuning parameter.
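The construction in Remark 1 can be checked numerically. The sketch below, assuming a 10-node undirected ring and the (assumed) tuning choice a = 1/(max degree + 1), builds W = I − aL and verifies double stochasticity.

```python
import numpy as np

# Sketch: a doubly stochastic weight matrix W = I - a*L for an undirected ring
# of n agents (so Deg_in = Deg_out), as in Remark 1. The choice
# a = 1/(max_degree + 1) is one assumed option that keeps all weights nonnegative.
n = 10
A = np.zeros((n, n))
for i in range(n):              # ring topology: each agent has two neighbors
    A[i, (i + 1) % n] = 1
    A[i, (i - 1) % n] = 1
deg = A.sum(axis=1)
L = np.diag(deg) - A            # graph Laplacian
a = 1.0 / (deg.max() + 1)       # ensures I - a*L is entrywise nonnegative
W = np.eye(n) - a * L

# rows and columns both sum to 1, i.e., W1 = 1 and 1^T W = 1^T
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
```

The same recipe works for any graph with Deg_in = Deg_out; for general strongly connected digraphs an iterative balancing procedure is needed, as the remark notes.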
We make the following assumption on the objective functions.
Assumption 2: The objective function f is µ-strongly convex, and each local cost function f_i has an L_i-Lipschitz continuous gradient, i.e., for any x, x' ∈ R^p,

⟨∇f(x) − ∇f(x'), x − x'⟩ ≥ µ ||x − x'||^2,
||∇f_i(x) − ∇f_i(x')|| ≤ L_i ||x − x'||.

From Assumption 2, the gradient of f is L-Lipschitz continuous, where L = max_i {L_i}. Note that under this assumption, problem (1) has a unique solution, denoted by x* ∈ R^{1×p} (viewed as a row vector to match the aggregate matrix notation).

B. Compression Methods
In this subsection, we introduce some common assumptions on the compression operators and then present a more general and unified assumption.
1) Unbiased compression operators: Denote by Compress(·) the compression function. We first consider a general class of unbiased compression methods, in which the variance of the compression error has an upper bound that is linearly proportional to the norm of the variable of interest [21], [25], [27], [35], [45].
Assumption 3: The compression operator Q : R^d → R^d associated with the function Compress satisfies E[Q(x)] = x, and there exists a constant C ≥ 0 such that, for all x ∈ R^d,

E ||Q(x) − x||^2 ≤ C ||x||^2.

Remark 2: The expectation is taken with respect to the random vector corresponding to the internal compression randomness of Q. Some instances of feasible stochastic compression operators satisfying Assumption 3 can be found in [27], [35] and the references therein. For example, we may consider the unbiased b-bits q-norm quantization compression method defined as

Q(x) = (||x||_q / 2^{b−1}) · sign(x) ⊙ ⌊ 2^{b−1} |x| / ||x||_q + u ⌋,   (6)

where sign(·) is the sign function, ⊙ is the Hadamard product, |x| is the element-wise absolute value of x, and u is a random perturbation vector uniformly distributed in [0, 1]^p. It has been shown that this compression operator is unbiased and that its compression variance has an upper bound linearly proportional to the norm of the variable [35]. Note that the agents only need to transmit the norm ||x||_q, the sign vector sign(x), and the integers inside the floor bracket.
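As an illustration, the following sketch implements the b-bits q-norm quantizer in the (assumed) form used by LEAD [35] and empirically checks its unbiasedness; the helper name `quantize` and the sample sizes are illustrative.

```python
import numpy as np

def quantize(x, b=2, q=np.inf):
    # Hedged sketch of the unbiased b-bits q-norm quantizer: only ||x||_q,
    # sign(x), and the integer levels would need to be transmitted.
    norm = np.linalg.norm(x, ord=q)
    if norm == 0:
        return x.copy()
    u = np.random.rand(*x.shape)                             # dithering noise
    levels = np.floor(2 ** (b - 1) * np.abs(x) / norm + u)   # transmitted integers
    return norm * np.sign(x) * levels / 2 ** (b - 1)

# Unbiasedness check: E[floor(z + u)] = z for u ~ Uniform[0,1], so averaging
# many independent compressions recovers x.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
avg = np.mean([quantize(x) for _ in range(20000)], axis=0)
```

The check relies on the identity E[⌊z + u⌋] = z for u uniform on [0, 1], which is what makes the dithered rounding unbiased.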
2) Biased compression operators: We also consider the following class of biased compression methods that are common in practice [28], [29], [33].
Assumption 4: The compression operator C_δ : R^d → R^d associated with the function Compress satisfies, for all x ∈ R^d,

E ||C_δ(x) − x||^2 ≤ (1 − δ) ||x||^2,

where δ ∈ (0, 1].
Remark 3: If δ = 1, there is no compression error, i.e., C_δ(x) = x. Below we give two examples of biased compression operators, Top-k and Random-k, for which δ = k/p (see, e.g., [29]).
• Top-k: The subset of x corresponding to the k largest absolute values of x is chosen, i.e., C_top(x) = x ⊙ e, where an element of e is set to 1 if the corresponding index is selected and to 0 otherwise. Specifically, reordering the elements of x so that |x_{i_1}| ≥ |x_{i_2}| ≥ ... ≥ |x_{i_p}|, the mask e satisfies e_{i_l} = 1 for l ≤ k and e_{i_l} = 0 for l > k.
• Random-k: A set of k randomly selected elements of x is transmitted, i.e., C_rnd(x) = x ⊙ e, where the elements of e satisfy e_i = 1 with probability k/p and e_i = 0 with probability 1 − k/p.
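A minimal sketch of the two compressors, with the caveat that our Random-k keeps a fixed-size random support (a common variant) rather than the i.i.d. Bernoulli mask described above:

```python
import numpy as np

def top_k(x, k):
    # C_top(x) = x ⊙ e keeps the k entries of largest magnitude
    # (Assumption 4 holds with delta = k/p).
    e = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    e[idx] = 1.0
    return x * e

def random_k(x, k, rng):
    # C_rnd(x) = x ⊙ e with a uniformly random size-k support; the same
    # delta = k/p applies in expectation.
    e = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    e[idx] = 1.0
    return x * e

rng = np.random.default_rng(1)
x = np.array([3.0, -0.5, 2.0, 0.1])
print(top_k(x, 2))        # keeps the two largest-magnitude entries, 3.0 and 2.0
```

Both operators are deterministic or random projections onto k coordinates, hence biased: repeatedly applying Top-k to the same vector never transmits the small entries, which is exactly the failure mode error feedback addresses later.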
3) General compression operators: We now present a general assumption on the compression operators, which includes Assumptions 3 and 4 as special cases.
Assumption 5: The compression operator C : R^d → R^d associated with the function Compress satisfies, for all x ∈ R^d,

E ||C(x) − x||^2 ≤ C ||x||^2,

and the r-scaling of C satisfies

E ||C(x)/r − x||^2 ≤ (1 − δ) ||x||^2,   (10)

for some constants C ≥ 0, δ ∈ (0, 1] and r > 0.
Remark 4: On one hand, if C < 1, Assumption 5 degenerates to Assumption 4 by setting r = 1 and δ = 1 − C. On the other hand, if C is unbiased, i.e., E[C(x)] = x, then Assumption 5 degenerates to Assumption 3 by setting r = C + 1 and δ = 1/(C + 1). In short, Assumption 5 gives a unified description of unbiased and biased compression operators, and Assumptions 3 and 4 can thus be regarded as its special cases.
However, there also exist compression operators that are biased and have C ≥ 1 in Assumption 5, i.e., they satisfy neither Assumption 3 nor Assumption 4. One instance is the norm-sign compression operator

C(x) = ||x||_q · sign(x),   (11)

for which it can be verified that Assumption 5 holds with r = p. To see why the r-scaling condition (10) holds, denote the r-scaling of C by C_r(x) = C(x)/r for notational convenience. Taking r = p and writing m = ||x||_q, for q ∈ {1, 2, ∞} we have

||C_p(x) − x||^2 = ||x||^2 − (2m/p) ||x||_1 + (m^2/p^2) ||sign(x)||_0 ≤ ||x||^2 − m^2/p ≤ (1 − 1/p^2) ||x||^2,

where the first inequality uses ||x||_1 ≥ ||x||_q and ||sign(x)||_0 ≤ p, and the second uses ||x||_q^2 ≥ ||x||^2/p. For q = 1, 2, ∞, we give the concrete values of C and δ in Table I.
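The contraction of the p-rescaled norm-sign operator can also be checked numerically. The sketch below assumes the form C(x) = ||x||_q sign(x) with q = ∞ and verifies that ||C(x)/p − x||^2 < ||x||^2 on random inputs:

```python
import numpy as np

# Sketch of the norm-sign compressor (only the scalar norm and the sign
# pattern are transmitted). C itself is biased and ||C(x) - x|| can exceed
# ||x||, so Assumptions 3-4 fail, yet the p-rescaled operator C(x)/p
# satisfies the contraction in Assumption 5.
def norm_sign(x, q=np.inf):
    return np.linalg.norm(x, ord=q) * np.sign(x)

p = 20
rng = np.random.default_rng(2)
worst = 0.0
for _ in range(1000):
    x = rng.standard_normal(p)
    cx = norm_sign(x)
    ratio = np.linalg.norm(cx / p - x) ** 2 / np.linalg.norm(x) ** 2
    worst = max(worst, ratio)
# worst stays strictly below 1: the p-rescaled operator is contractive,
# empirically matching the delta >= 1/p^2 bound derived above.
```

Rescaling by p restores contractiveness, but as Remark 5 notes, running the algorithm with the rescaled operator is not necessarily the best choice in practice.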

Remark 5: Although some compression operators (e.g., norm-sign) can be rescaled so that the new compression operator C_p satisfies the contractive condition in Assumption 4, applying the rescaled operator may hurt the performance of the algorithm when compared with directly using the original compression operator C (see the numerical example in Section VI-C). Considering Assumption 5 thus provides more flexibility in choosing the most suitable compression method.

III. A COMPRESSED GRADIENT TRACKING ALGORITHM
In this section, we introduce the compressed gradient tracking algorithm (C-GT) and its communication-efficient implementation. We also give some interpretations as well as how C-GT connects to existing works.

A. A Compressed Gradient Tracking Algorithm (C-GT)
The proposed compressed gradient tracking algorithm (C-GT) is presented in Algorithm 1. The function Compress(·) is the compression operator that independently compresses the variables for each agent at every iteration. In Line 2, the difference between X^k and the auxiliary variable H^k_x is compressed; the result is added back to H^k_x in Line 3 to obtain X̂^k. Then H^{k+1}_x is computed as the weighted average of its previous value H^k_x and X̂^k with mixing weight α_x. Similar updates are conducted for the gradient tracker Y^k in Lines 5-7.
C-GT performs an implicit error compensation operation that mitigates the impact of the compression error, as can be

Algorithm 1 A Compressed Gradient Tracking (C-GT) Algorithm
Input: stopping time K, step-size η, consensus step-size γ, scaling parameters α_x, α_y, and initial values X^0, H^0
seen from the following argument. The decision variable is updated as

X^{k+1} = X^k − γ(I − W)X̂^k − ηY^k = X^k − γ(I − W)X^k − ηY^k − γ(I − W)E^k,   (14)

where E^k := X̂^k − X^k measures the compression error for the decision variable. The additional term −γ(I − W)E^k implies that each agent i transmits in total the compression error −Σ_{j∈N_i^out∪{i}} w_ji e_i^k = −e_i^k to its neighboring agents and compensates this error locally by adding e_i^k back, where e_i^k ∈ R^{1×p} is the i-th row of E^k. Similarly, the compression errors for the gradient trackers are also mitigated.
Another key property of C-GT is that gradient tracking remains exact regardless of the compression errors: for all k ≥ 0, we have

1^T Y^{k+1} = 1^T (Y^k − γ(I − W)Ŷ^k + ∇F(X^{k+1}) − ∇F(X^k)) = 1^T Y^k + 1^T ∇F(X^{k+1}) − 1^T ∇F(X^k) = 1^T ∇F(X^{k+1}).

The second equality holds because 1^T(I − W) = 0, and the last equality is obtained by induction under the initial condition Y^0 = ∇F(X^0). Therefore, as long as the y_i^k reach (approximate) consensus among all the agents, each y_i^k is able to track the average gradient 1^T ∇F(X^k)/n. Moreover, multiplying both sides of Line 8 by 1^T and dividing by n, we obtain

x̄^{k+1} = x̄^k − (η/n) 1^T ∇F(X^k),   (15)

with ∇F(X^k) defined in (3). Hence the update of x̄^k does not contain any compression error. If all the individual state variables reach the consensual solution, i.e., x_i^k = x̄^k for all i, then (15) reduces to the exact gradient descent update.
Remark 6: If no communication compression is performed in the algorithm, i.e., X̂^k = X^k and Ŷ^k = Y^k, then C-GT recovers the typical gradient tracking algorithm in [6], [8], [9]. To see such a connection, note that C-GT then reads

X^{k+1} = ((1 − γ)I + γW) X^k − η Y^k,
Y^{k+1} = ((1 − γ)I + γW) Y^k + ∇F(X^{k+1}) − ∇F(X^k),

where we substitute X̂^k = X^k and Ŷ^k = Y^k in Line 8 and Line 9 of C-GT, respectively. Denoting W̃ := (1 − γ)I + γW, C-GT takes the same form as the typical gradient tracking method.
Remark 7: Compared with directly quantizing the decision variables [37], the proposed C-GT algorithm reduces the impact of compression errors through difference compression. It is also worth noting that in C-GT the agents may choose different, uncoordinated step-sizes, which differs from the LEAD algorithm in [35].
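The updates discussed above can be sketched end to end. The toy run below applies the C-GT recursions to simple quadratics with a Top-k stand-in for Compress; the step-size and mixing parameters are illustrative choices, not the tuned values from the analysis.

```python
import numpy as np

# Minimal C-GT sketch (Algorithm 1) on quadratics f_i(x) = 0.5*||x - b_i||^2,
# whose exact minimizer is the average of the b_i. All parameters are assumed.
def grad(X, B):                      # row i is ∇f_i(x_i) = x_i - b_i
    return X - B

def compress(M, k=2):                # Top-k applied to each agent's row
    out = np.zeros_like(M)
    for i, row in enumerate(M):
        idx = np.argsort(np.abs(row))[-k:]
        out[i, idx] = row[idx]
    return out

n, p = 10, 4
rng = np.random.default_rng(3)
B = rng.standard_normal((n, p))
x_star = B.mean(axis=0)

W = np.eye(n)                        # doubly stochastic ring weights
for i in range(n):
    W[i, i] -= 2 / 3
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1 / 3

eta, gamma, alpha = 0.05, 0.3, 0.5
X = rng.standard_normal((n, p))
Y = grad(X, B)                       # initialization Y^0 = ∇F(X^0)
Hx = np.zeros((n, p)); Hy = np.zeros((n, p))
I = np.eye(n)
for _ in range(3000):
    Xhat = Hx + compress(X - Hx)     # Lines 2-3: difference compression
    Yhat = Hy + compress(Y - Hy)
    Hx = (1 - alpha) * Hx + alpha * Xhat                      # Line 4
    Hy = (1 - alpha) * Hy + alpha * Yhat
    X_new = X - gamma * (I - W) @ Xhat - eta * Y              # Line 8
    Y = Y - gamma * (I - W) @ Yhat + grad(X_new, B) - grad(X, B)  # Line 9
    X = X_new
# every agent's row of X approaches the global minimizer x*, and the tracking
# identity 1^T Y^k = 1^T ∇F(X^k) holds at every iteration.
```

Note that the invariant 1^T Y = 1^T ∇F(X) holds exactly along the run, compression notwithstanding, which is the mechanism discussed above.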

B. A Communication Efficient Implementation
Note that the communication processes associated with computing (I − W) X k and (I − W) Y k in C-GT require transmitting the full-precision variables X k and Y k and thus do not enjoy the benefits of compression. In this subsection, we present an equivalent but communication efficient implementation of C-GT for practical consideration, while Algorithm 1 is mainly used for theoretical analysis and explanation.
The main idea of Algorithm 2 lies in the procedure COMM(Z, H, H_w), in which two new variables H_w and Z_w are introduced; this procedure is the same as the one considered in [35]. After the initialization H^0_{x,w} = WH^0_x, we ensure H^k_{x,w} = WH^k_x and X̂^k_w = WX̂^k by induction for all k through the simple communication of the compressed variable Q^k_x. Similarly, there hold H^k_{y,w} = WH^k_y and Ŷ^k_w = WŶ^k for all k. The procedure COMM(Z, H, H_w) works as follows. In Line 10, the difference between Z and the auxiliary variable H is compressed into Q. Then the estimate Ẑ is formed from H and the encoded low-bit representation Q in Line 11. Through the local communication in Line 12, we obtain Z_w; this is the only required communication step. Finally, given Ẑ and Z_w, the auxiliary variables H and H_w are updated in Lines 13 and 14, respectively.
To see why Algorithm 1 and Algorithm 2 are equivalent, note that after the compression and communication procedures in Lines 4 and 5, respectively, we have X̂^k_w = WX̂^k and Ŷ^k_w = WŶ^k. Thus, the state variable update in Line 6 becomes

X^{k+1} = X^k − γ(X̂^k − X̂^k_w) − ηY^k = X^k − γ(I − W)X̂^k − ηY^k,

matching Line 8 of Algorithm 1.

Algorithm 2 A communication-efficient version of C-GT
Input: stopping time K, step-size η, consensus step-size γ, scaling parameters α_x, α_y, and initial values X^0, H^0

15: Return: Ẑ, Z_w, H, H_w
16: end procedure
^a In Lines 4 and 5, α_z in the compression procedure is replaced by α_x for the decision difference compression and by α_y for the gradient difference compression, respectively.
The gradient tracker update in Line 7 is likewise given by

Y^{k+1} = Y^k − γ(Ŷ^k − Ŷ^k_w) + ∇F(X^{k+1}) − ∇F(X^k) = Y^k − γ(I − W)Ŷ^k + ∇F(X^{k+1}) − ∇F(X^k).

Remark 8: In Algorithm 2, computing X̂^k_w = WX̂^k and Ŷ^k_w = WŶ^k does not require the explicit transmission of the full-precision variables X̂^k and Ŷ^k, and thus communication efficiency is guaranteed.
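The induction behind Remark 8 is easy to verify in code. The sketch below implements the COMM procedure (Lines 10-14) with a toy deterministic compressor and checks that Z_w = WẐ is obtained although only Q is "communicated"; the compressor and matrix sizes are placeholders.

```python
import numpy as np

# Sketch of COMM(Z, H, H_w): only the compressed matrix Q is exchanged, yet
# the procedure returns Zw = W @ Zhat exactly, because H_w = W @ H is
# maintained by induction.
def comm(Z, H, Hw, W, alpha, compress):
    Q = compress(Z - H)                      # Line 10: compress the difference
    Zhat = H + Q                             # Line 11: local estimate of Z
    Zw = Hw + W @ Q                          # Line 12: the ONLY communication (send Q)
    H_new = (1 - alpha) * H + alpha * Zhat   # Line 13
    Hw_new = (1 - alpha) * Hw + alpha * Zw   # Line 14
    return Zhat, Zw, H_new, Hw_new

rng = np.random.default_rng(4)
n, p = 6, 3
W = np.full((n, n), 1 / n)                   # any doubly stochastic W works
compress = lambda M: np.round(M, 1)          # toy deterministic "compressor"
Z, H = rng.standard_normal((n, p)), rng.standard_normal((n, p))
Hw = W @ H                                   # initialization H^0_w = W @ H^0
Zhat, Zw, H, Hw = comm(Z, H, Hw, W, 0.5, compress)

assert np.allclose(Zw, W @ Zhat)             # W @ Zhat obtained without sending Zhat
assert np.allclose(Hw, W @ H)                # the invariant propagates to the next call
```

Since Zw = Hw + WQ = W(H + Q) = WẐ and both H-updates apply the same convex combination, the invariant H_w = WH survives every call, which is exactly the induction used in the text.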

IV. CONVERGENCE ANALYSIS FOR C-GT
In this section, we study the convergence properties of the proposed compressed gradient tracking algorithm for minimizing strongly convex and smooth cost functions. Our analysis relies on constructing a linear system of inequalities involving the optimization error, the consensus error, the gradient tracking error, and the compression errors, denoted below by Ω^k_o, Ω^k_c, Ω^k_g, Ω^k_cx and Ω^k_cy, respectively.

A. Supporting Lemmas
In order to derive the main results, we introduce some useful lemmas first.
Lemma 1: Under Assumption 2, for all k ≥ 0, there holds

||x̄^k − η∇f(x̄^k) − x*|| ≤ max{|1 − ηµ|, |1 − ηL|} ||x̄^k − x*||.

In addition, if η < 2/(µ + L), then we have

||x̄^k − η∇f(x̄^k) − x*|| ≤ (1 − ηµ) ||x̄^k − x*||.

Lemma 2: Suppose Assumption 1 holds, and let ρ_w denote the spectral norm of the matrix W − (1/n)11^T. Then ρ_w < 1, and for all ω ∈ R^{n×p},

||Wω − (1/n)11^T ω|| ≤ ρ_w ||ω − (1/n)11^T ω||.

Proofs of Lemma 1 and Lemma 2 can be found in Lemma 10 of [9].
Remark 9: Given any γ ≤ 1, let W̃ := (1 − γ)I + γW and denote s := 1 − ρ_w. We have for all ω ∈ R^{n×p} that

||W̃ω − (1/n)11^T ω|| ≤ (1 − γs) ||ω − (1/n)11^T ω||.

The following lemma presents a linear system of inequalities, which is central to our convergence analysis.
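Lemma 2 and Remark 9 can be verified numerically. The sketch below computes ρ_w for a 10-node ring (with assumed weights 1/3 on the diagonal and for each neighbor) and checks the (1 − γs) contraction:

```python
import numpy as np

# Numerical check: rho_w is the spectral norm of W - (1/n)*11^T, and the
# mixing matrix (1-gamma)I + gamma*W contracts the consensus error by a
# factor 1 - gamma*s per step, with s = 1 - rho_w.
n = 10
W = np.eye(n) * (1 / 3)
for i in range(n):                   # ring: self-weight 1/3, neighbors 1/3 each
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1 / 3
J = np.full((n, n), 1 / n)           # averaging matrix (1/n)*11^T
rho_w = np.linalg.norm(W - J, ord=2)
s = 1 - rho_w

gamma = 0.4
Wg = (1 - gamma) * np.eye(n) + gamma * W
rng = np.random.default_rng(5)
omega = rng.standard_normal((n, 3))
lhs = np.linalg.norm(Wg @ omega - J @ omega)
rhs = (1 - gamma * s) * np.linalg.norm(omega - J @ omega)
```

The bound follows by convexity: W̃ω − ω̄ = (1 − γ)(ω − ω̄) + γ(Wω − ω̄), and the second term shrinks by ρ_w.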
Lemma 3: Suppose Assumptions 1, 2 and 5 hold and η < min{2/(µ + L), 1/(3µ)}. Then the iterates of Algorithm 1 satisfy a set of linear inequalities coupling the errors Ω^k_o, Ω^k_c, Ω^k_g, Ω^k_cx and Ω^k_cy, where the positive constants c_1-c_8 and c_x, c_y, t_x, t_y are given in Appendix B-A.
Proof: See Appendix B-A.
Remark 10: When η and γ are chosen as in Lemma 3, we have the linear system of inequalities

ω^{k+1} ≤ A ω^k,  where ω^k := [Ω^k_o, Ω^k_c, Ω^k_g, Ω^k_cx, Ω^k_cy]^T,

the inequality is taken component-wise, and the elements of the transition matrix A = [a_ij] correspond to the constants in Lemma 3. In light of these inequalities, the errors Ω^k_o, Ω^k_c, Ω^k_g, Ω^k_cx and Ω^k_cy all converge to 0 at the linear rate O(ρ(A)^k) provided that the spectral radius of A satisfies ρ(A) < 1.
The main result below provides a sufficient condition that ensures ρ(A) < 1.

B. Main Results
We introduce the main convergence result for C-GT algorithm under Assumption 5 in the following theorem, which demonstrates the linearly convergent property of C-GT for the general compression operators.
Theorem 1: Under Assumption 5 (together with Assumptions 1 and 2), suppose the scaling parameters are given by α_x, α_y ∈ (0, 1/r] and the consensus step-size satisfies γ ≤ min{1, γ̄}, where the threshold γ̄ (given in Appendix B-B) depends on s = 1 − ρ_w and the constants c_x, c_y, m_1-m_4 from Appendices B-A and B-B, with auxiliary constants satisfying (71) (see Appendix B-B). Then, for a fixed step-size η satisfying the condition in Appendix B-B, the spectral radius of A satisfies ρ(A) ≤ 1 − ηµ/2, and hence the optimization error Ω^k_o and the consensus error Ω^k_c both converge to 0 at the linear rate O((1 − ηµ/2)^k).
Proof: See Appendix B-B.
Remark 11: Regardless of the compression constant C, we can find parameters α_x, α_y and γ such that C-GT converges linearly with a fixed step-size η. This implies that C-GT can handle communication compression of arbitrary precision, i.e., any C > 0.
Remark 13: Note that the parameter settings given in Theorem 1 are only sufficient and relatively conservative. For practical consideration, it is often possible to find better parameters to achieve faster convergence.

V. AN ERROR FEEDBACK BASED COMPRESSED GRADIENT TRACKING ALGORITHM
In this section, we propose an error feedback based compressed gradient tracking algorithm (EF-C-GT) to further improve upon the efficiency of C-GT, particularly for biased compression methods. The algorithm is presented in Algorithm 3.
Compared with C-GT, EF-C-GT performs difference compression with additional error feedback in Lines 3-4 and 8-9, and X̂^k, Ŷ^k are updated using Q^k_x and Q^k_y, respectively. The new variables E^k_x, E^k_y accumulate the compression errors at each node in the network. Each agent can then use its history to correct, on the cumulative compression error, the bias induced by biased compressors. The price paid for error feedback is that the number of compression operations per iteration doubles.
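The error-feedback bookkeeping of Lines 3-4 can be illustrated in isolation. In the sketch below, a single agent repeatedly compresses a fixed vector with Top-1; the signal and helper are toy placeholders, but the telescoping identity Σ_t Q^t = K·v − E^K shows that no coordinate is lost permanently.

```python
import numpy as np

# Sketch of error feedback with a biased compressor: the residual left behind
# by Top-1 is stored in E and re-injected before the next compression, so
# every coordinate is eventually transmitted.
def top1(v):
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

v = np.array([1.0, 0.6, 0.3])   # a fixed "difference" the agent keeps compressing
E = np.zeros_like(v)            # accumulated compression error (E^0 = 0)
sent = np.zeros_like(v)         # running total of what was actually transmitted
K = 100
for _ in range(K):
    corrected = v + E           # Line 3: feed the accumulated error back in
    Q = top1(corrected)         # biased Top-1 compression
    E = corrected - Q           # Line 4: store what was not transmitted
    sent += Q
# telescoping: sent + E == K * v, so sent/K -> v as K grows; plain Top-1
# without feedback would transmit only the first coordinate forever.
```

Since E stays bounded for a contractive compressor, sent/K converges to v, which is the sense in which error feedback "corrects the bias" of the compressor.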
Remark 14: For unbiased compressors, there is no need to consider error feedback since E[E k+1 x ] = 0, ∀k ≥ 0 from Lines 3-4, and the performance of C-GT and EF-C-GT are comparable. In [41], it was also shown that error

Algorithm 3 An Error Feedback Based C-GT Algorithm (EF-C-GT)
Input: stopping time K, step-size η, consensus step-size γ, scaling parameters α_x, α_y, and initial values X^0, H^0
feedback does not provide better performance for unbiased compressors in the distributed setting. Indeed, the main idea of using error feedback in this paper is similar to its usage in [28].
For the sake of completeness, we also present the communication efficient implementation of EF-C-GT in Algorithm 4.

A. Convergence Analysis for EF-C-GT
We now perform the convergence analysis for EF-C-GT under the same cost functions and Assumption 4, that is, we assume the compression operator is biased but contractive. It is worth noting that EF-C-GT also works under unbiased compressors satisfying Assumption 3, and with a slight modification on the updates, it can deal with biased compression operators that are not contractive.
Similar to the analysis of C-GT, we construct a linear system of inequalities and introduce two extra terms, namely the error feedback terms for the decision variable and the gradient tracker compression, Ω^k_ex := E||E^k_x||^2 and Ω^k_ey := E||E^k_y||^2, respectively. To derive the main convergence results for EF-C-GT, we first provide a key lemma as follows.

Algorithm 4 A communication-efficient version of EF-C-GT
Input: stopping time K, step-size η, consensus step-size γ, scaling parameters α_x, α_y, and initial values X^0, H^0
H ← H + α_z Q
The key lemma establishes a linear system of inequalities for Algorithm 3 analogous to Lemma 3, where the positive constants d_x, d_y, d_1-d_4, and t_x, t_y are given in Appendices C-A and C-B.
Proof: See Appendix C-A. Based on the above lemma, we present the main convergence result for EF-C-GT in the following theorem, which demonstrates the linearly convergent property of EF-C-GT.
Theorem 2: Under Assumption 4 (together with Assumptions 1 and 2), suppose the scaling parameters are given by α_x, α_y ∈ (0, 1] and the consensus step-size γ satisfies the bound given in Appendix C-B, which involves the positive constants d_x, d_y, m_3, m_4 and auxiliary constants from Appendices C-A and C-B. Then, for a fixed step-size η satisfying the condition in Appendix C-B, the spectral radius of the transition matrix B (see (91) in Appendix C-B) satisfies ρ(B) ≤ 1 − ηµ/2, and hence the optimization error and the consensus error both converge to 0 at the linear rate O((1 − ηµ/2)^k).
Proof: See Appendix C-B.

VI. NUMERICAL EXAMPLES
In this part, we provide numerical examples to confirm the theoretical results and compare the proposed methods with different algorithms under various network settings. Consider the ridge regression problem

min_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n f_i(x),  f_i(x) = (u_i^T x − v_i)^2 + ρ||x||^2,

where ρ > 0 is a penalty parameter. The pair (u_i, v_i) is a sample belonging to the i-th agent, where u_i ∈ R^p represents the features and v_i ∈ R represents the observation or output. In the simulations, the pairs (u_i, v_i) are pre-generated: the input u_i ∈ [−1, 1]^p is uniformly distributed, and the output v_i satisfies v_i = u_i^T x̃_i + ε_i, where the ε_i are independent Gaussian noises with mean 0 and variance 25, and the x̃_i are predefined parameters evenly located in [0, 1]^p. The i-th agent can then calculate the gradient of its local objective function as ∇f_i(x) = 2u_i(u_i^T x − v_i) + 2ρx. The unique optimal solution of the problem is x* = (Σ_{i=1}^n u_i u_i^T + nρI)^{−1} Σ_{i=1}^n u_i v_i. In our experimental settings, the penalty parameter is ρ = 0.01, the number of nodes is n = 10, and the dimension of the variables is p = 20. Meanwhile, x^0_i is randomly generated in [0, 1]^p, and the other initial values (H^0, etc.) are initialized accordingly. The weights of the doubly stochastic matrix W are defined according to the network topology.
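The setup above admits a direct numerical sketch: generate the synthetic data, compute x* in closed form, and check the first-order optimality condition. The placement of the x̃_i (evenly spaced constant vectors) is our assumption; the text only says they are evenly located in [0, 1]^p.

```python
import numpy as np

# Sketch of the experimental setup: synthetic ridge-regression data per agent
# and the closed-form optimum x* = (sum u_i u_i^T + n*rho*I)^{-1} sum u_i v_i
# used to measure the optimization error. Sizes follow the text.
n, p, rho = 10, 20, 0.01
rng = np.random.default_rng(6)
U = rng.uniform(-1, 1, size=(n, p))                       # features u_i
x_tilde = np.linspace(0, 1, n)[:, None] * np.ones(p)      # assumed x̃_i placement
V = np.sum(U * x_tilde, axis=1) + rng.normal(0, 5, size=n)  # v_i = u_i^T x̃_i + eps_i

x_star = np.linalg.solve(U.T @ U + n * rho * np.eye(p), U.T @ V)

def local_grad(i, x):
    # ∇f_i(x) = 2 u_i (u_i^T x - v_i) + 2 rho x
    return 2 * U[i] * (U[i] @ x - V[i]) + 2 * rho * x

# first-order optimality: the average of the local gradients vanishes at x*
avg_grad = np.mean([local_grad(i, x_star) for i in range(n)], axis=0)
assert np.linalg.norm(avg_grad) < 1e-8
```

Any of the algorithms above can then be run on (U, V) and their distance to x_star logged per iteration to reproduce plots of the optimization error.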

A. Unbiased Compression Operators
In this case, we consider the unbiased b-bits q-norm quantization method with b = 2 and q = ∞ in (6), since the ∞-norm provides the smallest upper bound on the compression variance [35]. The scaling parameters and the consensus step-size are all set to 1.
We first compare C-GT with the known linearly convergent algorithm with communication compression, LEAD [35], for decentralized optimization over fixed undirected networks.
Note that undirected graphs are special cases of directed graphs satisfying (i, j) ∈ E if and only if (j, i) ∈ E. In particular, we consider a ring network topology. The step-sizes are set to η = 0.09 for C-GT and η = 0.12 for LEAD, which are individually optimal. It can be seen from Fig. 1 that the optimization errors of both algorithms decrease exponentially fast, which verifies the linear convergence of the competing algorithms. Meanwhile, C-GT converges slightly slower than LEAD but achieves a smaller final error; the two algorithms can be considered comparable in this case. We then consider a directed ring network, for which C-GT is the only known applicable algorithm that achieves linear convergence with communication compression. Setting the step-size to η = 0.0047 for the directed ring graph, we show in Fig. 2 that C-GT still converges linearly to the optimal solution and achieves almost the same performance as the gradient tracking algorithm without communication compression (GT). These results further demonstrate the effectiveness and flexibility of C-GT.

B. Biased, Contractive Compression Operators
In this subsection, we consider two sparsification compression methods, i.e., Top-k and Random-k. We provide simulation results for an undirected network and a directed ring network, respectively.
For an undirected ring network, the parameter settings are presented in Table II. From Fig. 3(a) and Fig. 4(a), we can see that C-GT and EF-C-GT both outperform LEAD for both compressors. In addition, EF-C-GT performs slightly better than C-GT for Top-1 compression and similarly to C-GT for Random-1 compression. For a directed ring network, whose parameter settings are given in Table III, Fig. 3(b) and Fig. 4(b) show that EF-C-GT greatly outperforms C-GT for both compressors, which demonstrates the advantage of using error feedback under biased compression operators.
To understand why C-GT performs as well as EF-C-GT on undirected networks, recall the discussion in Section III-A, where we noted that each agent transmits its total compression error to its neighboring agents and compensates the error locally. Due to the symmetry of undirected networks, such a "network-based" error compensation is effective, and an additional error feedback term may not be necessary. By contrast, for a highly asymmetric directed graph, the error feedback term yields a significant improvement in algorithm performance. It is worth noting that the Random-1 compression method amounts to randomized coordinate descent (RCD), and therefore RCD is compatible with the proposed methods; likewise, the Top-1 compression method results in a greedy coordinate algorithm, which is also suitable under our setting.

C. Biased, Non-Contractive Compression Operators
We now consider the biased norm-sign compression operator (11) and the rescaled norm-sign compressor (12) with q = ∞. Note that the norm-sign compression operator only satisfies Assumption 5, while its rescaled version satisfies Assumption 4 by taking r = p. To make EF-C-GT suitable for compressors that are not contractive, we make a slight modification to the compression updates: the decision variable compression update is adjusted by multiplying E_x^k with β_x, and similarly, we modify the gradient tracker compression update by multiplying E_y^k with β_y. The constants β_x and β_y both belong to (0, 1] and should be adjusted according to the constant C in Assumption 5. Indeed, the linear convergence of the modified algorithm can be demonstrated similarly to EF-C-GT.
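For concreteness, a sketch of the two compressors is given below, assuming the forms C(x) = ‖x‖_q · sign(x) for (11) and its rescaled version (1/r)‖x‖_q · sign(x) with r = p for (12); the exact definitions are those in the main text, and sign(0) = 0 is a convention adopted here:

```python
def norm_sign(v, q=None):
    """Assumed norm-sign compressor: ||v||_q * sign(v) componentwise;
    q=None selects the infinity norm (the q = ∞ case used in the text)."""
    nrm = max(abs(x) for x in v) if q is None else sum(abs(x) ** q for x in v) ** (1.0 / q)
    sgn = lambda x: (x > 0) - (x < 0)  # sign with sign(0) = 0 (a convention here)
    return [nrm * sgn(x) for x in v]

def rescaled_norm_sign(v, q=None, r=None):
    """Rescaled variant: divide the output by r; r = p (the dimension)
    is the choice discussed for Assumption 4."""
    r = len(v) if r is None else r
    return [x / r for x in norm_sign(v, q)]
```

Note that the unrescaled compressor can have norm larger than the input (non-contractive), while dividing by p shrinks every output entry, which is consistent with the slower convergence observed for the rescaled version.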
We let β x = 0.01, β y = 0.01, and the other parameters are provided in Table IV. In the simulation, we use "N" and "R" to represent the norm-sign compressor and its rescaled version, respectively.
From Fig. 5, we find that C-GT and EF-C-GT both work well for the norm-sign compression operator, and EF-C-GT outperforms C-GT. In comparison, using the rescaled compressor leads to slower convergence. These results suggest that rescaling the compression operator to satisfy the typical contractive requirement (i.e., Assumption 4) may harm the algorithm performance, and considering Assumption 5 provides us with more freedom in choosing the best compression method.

VII. CONCLUSIONS
In this paper, we consider the problem of decentralized optimization with communication compression over a multi-agent network. Specifically, we first propose a compressed gradient tracking algorithm, termed C-GT, and show that the algorithm converges linearly for strongly convex and smooth objective functions. C-GT not only inherits the advantages of gradient tracking-based methods, but also works with a wide class of compression operators. To further improve the algorithm efficiency for biased compression methods, we present an error feedback based compressed gradient tracking algorithm (EF-C-GT) and establish its linear convergence. Simulation examples demonstrate the effectiveness of C-GT for undirected networks and balanced directed networks, and show that EF-C-GT outperforms C-GT for some biased compressors such as Top-1 and Random-1. Future work will consider decentralized compression algorithms that work on more general network topologies. We will also consider equipping C-GT with acceleration techniques such as Nesterov's acceleration and momentum methods. Finally, non-convex objective functions are also of future interest.

A. Vector and Matrix Inequalities
The following results are often invoked.
Lemma 6: Suppose U, V ∈ R^{n×p}. Then for any constant τ > 0, we have the following inequality: ‖U + V‖² ≤ (1 + τ)‖U‖² + (1 + 1/τ)‖V‖². In particular, taking τ = δ yields the corresponding special case used in the analysis.
Lemma 7: For any U, V ∈ R^{n×p} and τ > 1, the following inequality is satisfied: ‖U + V‖² ≤ τ‖U‖² + (τ/(τ − 1))‖V‖². In addition, for any U₁, U₂, U₃, we have ‖U₁ + U₂ + U₃‖² ≤ 3‖U₁‖² + 3‖U₂‖² + 3‖U₃‖².
Lemma 8: For any U, V ∈ R^{n×p} and α ∈ [0, 1], we have ‖αU + (1 − α)V‖² ≤ α‖U‖² + (1 − α)‖V‖².
Throughout the appendices, E[· | F^k] denotes the conditional expectation with respect to the compression operator given F^k.
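As a quick sanity check, the inequalities of Lemmas 7 and 8, namely ‖U + V‖² ≤ τ‖U‖² + (τ/(τ − 1))‖V‖² for τ > 1 and the convexity bound ‖αU + (1 − α)V‖² ≤ α‖U‖² + (1 − α)‖V‖², can be verified numerically on random matrices (flattened to vectors, since only the Frobenius norm is involved); this sketch is illustrative only:

```python
import random

def sq_norm(U):
    """Squared Frobenius norm of a matrix stored as a flat list of entries."""
    return sum(u * u for u in U)

def check_lemmas(seed, d=12, tau=2.0, alpha=0.3):
    """Check the Lemma 7 and Lemma 8 inequalities on random d-entry matrices."""
    rng = random.Random(seed)
    U = [rng.uniform(-1, 1) for _ in range(d)]
    V = [rng.uniform(-1, 1) for _ in range(d)]
    # Lemma 7: ||U + V||^2 <= tau ||U||^2 + tau/(tau - 1) ||V||^2, tau > 1
    ok7 = sq_norm([u + v for u, v in zip(U, V)]) <= tau * sq_norm(U) + tau / (tau - 1) * sq_norm(V)
    # Lemma 8: ||alpha U + (1 - alpha) V||^2 <= alpha ||U||^2 + (1 - alpha) ||V||^2
    ok8 = sq_norm([alpha * u + (1 - alpha) * v for u, v in zip(U, V)]) \
        <= alpha * sq_norm(U) + (1 - alpha) * sq_norm(V)
    return ok7 and ok8
```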

B. Proof of Theorem 1
In light of Lemma 3, we consider a linear system of inequalities of the form s^{k+1} ≤ A s^k, where the state vector s^k stacks the five error quantities from Lemma 3 (with the third and fifth entries scaled by L²), and the entries of A are given in (59).
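When the entries of A are nonnegative and ρ(A) < 1, iterating s^{k+1} ≤ A s^k entrywise drives every component of s^k to zero at a linear rate. The spectral radius of a small nonnegative matrix can be estimated by power iteration, as sketched below; the 2×2 matrix in the usage note is made up for illustration and is unrelated to (59):

```python
def spectral_radius(A, iters=500):
    """Estimate rho(A) for a nonnegative square matrix via power iteration
    (assumes A is nonnegative with a positive Perron vector, e.g. irreducible)."""
    n = len(A)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)  # Rayleigh-type estimate: v is kept max-norm normalized
        if rho == 0.0:
            return 0.0
        v = [x / rho for x in w]
    return rho
```

For example, A = [[0.5, 0.1], [0.2, 0.4]] has eigenvalues 0.6 and 0.3, so ρ(A) = 0.6 < 1 and the corresponding iterates contract linearly.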

4) Fourth inequality in (59): The condition is verified by an equivalent reformulation, analogously to the previous cases.

A. Proof of Lemma 5
Before deriving the linear system of inequalities, we bound E[‖X^k − X̂^k‖² | F^k] and E[‖Y^k − Ŷ^k‖² | F^k], respectively. From Lines 4 and 5 in Algorithm 3, taking τ₁ = 1/(1 − δ/2), we obtain the bound on E[‖X^k − X̂^k‖² | F^k]; the bound on E[‖Y^k − Ŷ^k‖² | F^k] follows similarly.
1) Optimality error: The bound follows from the same computation as before.
2) Consensus error: Based on the same derivation as (44), and recalling (76), we obtain the desired bound.
3) Gradient tracker error: Following the same computation as (45), and then the same derivation as (46)-(48), where the last inequality follows from (76), we obtain the intermediate bounds. Plugging (81) and (77) into (78) concludes the proof.

APPENDIX D
SUPPLEMENTARY ALGORITHMS
A. Agents' View
In the main text, for the sake of simplicity, we describe the algorithms only in matrix form. Here we further present the C-GT and EF-C-GT algorithms from each agent's perspective.
Algorithm 5 C-GT algorithm from agents' view
Input: stopping time K, step-size η, consensus step-size γ, scaling parameters α_x, α_y, and initial values
...
Send q_{xi}^k, q_{yi}^k, q̂_{xi}^k, and q̂_{yi}^k to agents l ∈ N_i^out and receive q_{xj}^k, q_{yj}^k, q̂_{xj}^k, and q̂_{yj}^k from agents j ∈ N_i^in (Communication)
17: x_{i,w}^k = h_{xi,w}^k + Σ_{j ∈ N_i^in ∪ {i}} w_{ij} q_{xj}^k
18: y_{i,w}^k = h_{yi,w}^k + Σ_{j ∈ N_i^in ∪ {i}} w_{ij} q_{yj}^k
19: h_{xi,w}^{k+1} = h_{xi,w}^k + α_x Σ_{j ∈ N_i^in ∪ {i}} w_{ij} q_{xj}^k
20: h_{yi,w}^{k+1} = h_{yi,w}^k + α_y Σ_{j ∈ N_i^in ∪ {i}} w_{ij} q_{yj}^k
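To connect the agent-wise description with the matrix form, the following is a minimal pure-Python sketch of C-GT for scalar decision variables, assuming the matrix-form updates take the form X̂ = H_x + COMPRESS(X − H_x), X⁺ = X − γ(I − W)X̂ − ηY, H_x⁺ = (1 − α_x)H_x + α_x X̂, and analogously for the trackers Y with the gradient-difference correction; the parameter values in the usage note are illustrative, not those from the experiments:

```python
def mat_vec(W, v):
    """Multiply mixing matrix W (list of rows) by vector v."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def cgt(W, grad, x0, eta, gamma, alpha_x, alpha_y, compress, K):
    """Sketch of C-GT with scalar decision variables;
    compress acts componentwise on scalars."""
    n = len(x0)
    x = list(x0)
    g = grad(x)
    y = list(g)              # gradient trackers start from the local gradients
    hx = [0.0] * n           # reference points for the decision variables
    hy = [0.0] * n           # reference points for the gradient trackers
    for _ in range(K):
        xh = [hx[i] + compress(x[i] - hx[i]) for i in range(n)]  # compressed surrogate of x
        yh = [hy[i] + compress(y[i] - hy[i]) for i in range(n)]  # compressed surrogate of y
        wx, wy = mat_vec(W, xh), mat_vec(W, yh)
        xn = [x[i] - gamma * (xh[i] - wx[i]) - eta * y[i] for i in range(n)]
        gn = grad(xn)
        y = [y[i] - gamma * (yh[i] - wy[i]) + gn[i] - g[i] for i in range(n)]
        hx = [(1 - alpha_x) * hx[i] + alpha_x * xh[i] for i in range(n)]
        hy = [(1 - alpha_y) * hy[i] + alpha_y * yh[i] for i in range(n)]
        x, g = xn, gn
    return x
```

With exact (identity) compression and γ = 1, the updates reduce to standard gradient tracking; for instance, on f_i(x) = (x − b_i)²/2 over a 4-agent ring, every local iterate converges to the average of the b_i.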