Distributed Stochastic Subgradient Projection Algorithms Based on Weight-Balancing over Time-Varying Directed Graphs

We consider a distributed constrained optimization problem over graphs, where the cost function of each agent is private. Moreover, we assume that the graphs are time-varying and directed. To address such a problem, a fully decentralized stochastic subgradient projection algorithm is proposed over time-varying directed graphs. However, since the graphs are directed, the weight matrix may not be doubly stochastic. We overcome this difficulty by using a weight-balancing technique. By choosing appropriate step-sizes, we show that the iterations of all agents asymptotically converge to some optimal solution. Further, our analysis shows that the convergence rate of the proposed algorithm is O(ln Γ/Γ) under local strong convexity, where Γ is the number of iterations. Under local convexity, we prove that the proposed algorithm converges with rate O(ln Γ/√Γ). Finally, we verify the theoretical results through simulations.


Introduction
In this paper, we focus on distributed constrained optimization problems, which arise in many applications, such as large-scale machine learning [1][2][3][4], resource allocation [5,6], sensor networks [7][8][9][10], and multiagent systems [11]. To address such problems, the design of distributed optimization algorithms is necessary. The goal is to minimize the sum of the cost functions of all agents over a network, where each agent knows only its own information and can receive information from its neighbors.
Distributed optimization algorithms were originally introduced in the seminal work [12], and a vast literature has been devoted to them in recent years [13][14][15][16][17][18][19]. In this literature, distributed (sub)gradient methods are used over networks, and the performance limitations and convergence rates of these algorithms are well understood. In addition, distributed Newton methods and other descent methods have been used to solve distributed optimization problems [20,21], and their rates of convergence have also been analyzed.
However, the methods cited above assume that information exchange among agents takes place over either fixed or undirected graphs. Nevertheless, in some communication networks, such as mobile sensor networks, communication between agents is unidirectional: different agents have different interference and noise patterns, information is broadcast at different power levels, and the communication links between agents are therefore directed [22][23][24]. Hence, a directed network topology is a natural assumption. In addition, a time-varying communication topology is also a valid assumption in wireless networks, where agents can move or communication links may fail randomly. For these reasons, we model the network topology as a time-varying directed graph. In this paper, we assume that each agent knows its own out-degree at each round, an assumption that cannot be removed [25]. To obtain knowledge of the out-degree, agents can bidirectionally exchange "Hello" messages in a single communication process.
In fact, the recent works [22,23] provide subgradient-push distributed algorithms to minimize the cost function over time-varying directed networks in discrete time. Similarly, [24] proposes a distributed subgradient algorithm that uses weight-balancing in discrete time. In addition, [26] considers a continuous-time optimization algorithm over underlying time-varying directed networks. Moreover, the distributed algorithms in [22,24] converge with rate O(ln Γ/√Γ), whereas the algorithm in [23] converges with rate O(ln Γ/Γ) under the assumption that the local cost functions are differentiable and strongly convex. Nevertheless, these works consider unconstrained optimization problems. In [27], the authors propose the D-DPS algorithm over a directed network with convergence rate O(ln Γ/√Γ). In this paper, we assume that the graph is both time-varying and directed. To overcome the asymmetry caused by directed graphs, we employ the weight-balancing technique; based on it, we propose a distributed stochastic subgradient projection algorithm. Assuming that each local cost function is strongly convex, even if every agent has access only to a noisy subgradient of its own cost function, our proposed algorithm asymptotically converges with rate O(ln Γ/Γ); for generally convex cost functions, it asymptotically converges with rate O(ln Γ/√Γ). The best previously known convergence rate, O(1/Γ), is achieved in a centralized way when the local functions are strongly convex and differentiable and the variance of the noisy subgradient is bounded [28,29]; our convergence results are quite close to it. However, we need not assume that the local cost functions are differentiable, nor do we assume that they have Lipschitz continuous gradients.
Our goal is to design a distributed optimization algorithm based on weight-balancing over time-varying directed networks and to analyze its properties. This work has the following contributions: (i) We propose a distributed stochastic subgradient projection algorithm based on weight-balancing over time-varying directed networks. Furthermore, each local cost function is private information hidden from the other agents; hence, each agent utilizes only its own private information. Moreover, each agent i knows only a noisy subgradient s_i of its local cost function f_i, for i = 1, . . . , n. In addition, the algorithm is implemented without any centralized control; every agent need not know the network topology and is only required to know its own out-degree at each round.
(ii) Under some standard assumptions, we show that our proposed algorithm asymptotically converges to some optimal solution.
(iii) For strongly convex cost functions, we prove that a convergence rate of O(ln Γ/Γ) is achieved. In addition, we also show that the convergence rate is O(ln Γ/√Γ) for locally convex cost functions.
Organization. In Section 2, the constrained optimization problem is described, some assumptions are given, and a distributed stochastic subgradient projection algorithm over time-varying directed networks is proposed. We state the main results of this paper in Section 3 and provide their proofs in Section 4. Simulations are presented in Section 5. The conclusion of the paper is provided in Section 6.
Notation. We use lowercase boldface to denote vectors in R^d and lowercase normal font to denote scalars or vectors that are not d-dimensional. For instance, u_i(k) denotes a vector in R^d at agent i at round k, while x_i(k) denotes a scalar in R. A vector such as x(k) in R^n is obtained by stacking the scalars x_i(k) for i = 1, . . . , n. In addition, we use the notation "T" to denote the transpose operation. ‖u‖ is the Euclidean norm of u, and ‖u‖_1 is the 1-norm of u. The notation 1_n denotes the n-dimensional vector whose elements are all 1, and I_d is the d × d identity matrix. E[·] denotes the expectation operator. Besides, the notation Π_U[u] means that the vector u is projected onto the constraint set U. The notation ⊗ denotes the Kronecker product operator.

Problem Setup, Algorithm Description, and Assumptions
We use a graph to model a network consisting of n agents, where each agent represents a node. Further, we consider the case in which the network topology is time-varying and directed. Hence, we use the notation G(k) = (V, E(k)) to denote the network topology at each round k, where V denotes the agent set and E(k) denotes the directed edge set. (i, j) ∈ E(k) means that agent i can send a message to agent j at round k. If two agents can directly exchange information, then we say the agents are neighboring. Furthermore, we use N_i^out(k) and N_i^in(k) to represent the set of out-neighbors and the set of in-neighbors of agent i at round k, respectively. Formally, N_i^out(k) = {j | (i, j) ∈ E(k), j ∈ V} and N_i^in(k) = {j | (j, i) ∈ E(k), j ∈ V}, respectively. Besides, we denote the out-degree and in-degree by d_i^out(k) = |N_i^out(k)| and d_i^in(k) = |N_i^in(k)|, respectively.
In this paper, the constrained optimization problem is described as follows:

min_{u∈U} F(u) ≜ ∑_{i=1}^{n} f_i(u),    (1)

where f_i : R^d → R denotes the local cost function of agent i and U ⊆ R^d denotes the constraint set.
Our goal is to solve problem (1) in a cooperative and fully decentralized way over time-varying and directed networks. Further, each local cost function f_i is known only to agent i, while all agents know the constraint set U. Moreover, each agent can share its own iterate with its out-neighbors.
In this paper, we assume that the network topology is time-varying and directed. Note that directed graphs introduce asymmetry; to overcome this asymmetry, we employ the weight-balancing technique. Following [24], we give the definition of balancing weights over time-varying directed networks as follows.
Definition 1 (balancing weights). The weights w_i(k) of the agents i ∈ V balance a time-varying directed network at round k if, for any agent i,

∑_{j∈N_i^in(k)} w_j(k) = w_i(k) d_i^out(k),

where N_i^in(k) is the in-neighbor set of agent i at round k.
From Definition 1, the total weight incoming to agent i (which is ∑_{j∈N_i^in(k)} w_j(k)) is equal to the total outgoing weight of agent i (which is w_i(k) d_i^out(k)) at round k over a time-varying directed network.
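To make the balance condition concrete, the following minimal Python sketch (all function and variable names are hypothetical) checks Definition 1 for a given round; an edge (i, j) means that agent i can send to agent j:

import math

def is_balanced(n, edges, w, tol=1e-9):
    # edges: set of directed pairs (i, j); w: list of agent weights w_i(k).
    out_deg = [sum(1 for (a, _) in edges if a == i) for i in range(n)]
    in_nbrs = [[a for (a, b) in edges if b == i] for i in range(n)]
    # Definition 1: sum of in-neighbor weights equals w_i * d_i^out for all i.
    return all(
        math.isclose(sum(w[j] for j in in_nbrs[i]), w[i] * out_deg[i],
                     abs_tol=tol)
        for i in range(n)
    )

# A directed 3-cycle is balanced by uniform weights:
print(is_balanced(3, {(0, 1), (1, 2), (2, 0)}, [1.0, 1.0, 1.0]))  # True

For a non-balanced example, adding the extra edge (0, 2) to the cycle breaks the condition at agents 0 and 2 unless the weights are adjusted.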
In order to solve problem (1) over a time-varying directed graph G(k), we first give some standard assumptions as follows.
Assumption 2. Assume that the time-varying directed network sequence G(k) is strongly connected at each round k.
Assumption 2 ensures that each agent can receive information from all other agents over the network G(k).
Assumption 3. Let the constraint set U ⊆ R^d be closed and convex. In addition, assume that each local cost function f_i(u) is convex with modulus μ_i ≥ 0, for all i ∈ V; i.e., for all u, v ∈ U,

f_i(u) ≥ f_i(v) + ∇f_i(v)^T(u − v) + (μ_i/2)‖u − v‖²,    (3)

where ∇f_i(v) denotes a (sub)gradient of f_i(u) at u = v. If μ_i > 0, f_i is μ_i-strongly convex; otherwise, f_i is convex.
Assumption 3 ensures that the set of subgradients is nonempty. Besides, an assumption on the boundedness of the subgradients is needed; a similar assumption can be found in [13,24,27].

Assumption 4. There exists a constant G > 0 such that ‖∇f_i(u)‖ ≤ G for all u ∈ U and all i ∈ V.
Next, we describe our proposed optimization algorithm, which is executed over a time-varying directed network. Let u_i(k) ∈ U be the iterate of agent i at round k, and let w_i(k) be the balancing weight of agent i at round k. The iterate u_i(k) is updated as follows:

v_i(k) = u_i(k) + ε( ∑_{j∈N_i^in(k)} w_j(k) u_j(k) − d_i^out(k) w_i(k) u_i(k) ),    (4)

u_i(k+1) = Π_U[ v_i(k) − α(k) s_i(k) ],    (5)

for all i ∈ V, where ε > 0 is a fusion parameter, α(k) denotes a step-size sequence, and we use s_i(k) to abbreviate the notation s_i(u_i(k)), which represents a noisy subgradient of f_i(u) at u = u_i(k). Following (4)-(5), each agent first linearly fuses its own estimate with the estimates of its in-neighbors and then moves the fused estimate in the opposite direction of its own noisy subgradient. Finally, the new estimate of agent i is obtained by projecting the updated estimate onto the constraint set U. Moreover, the above update equations can be executed by simple broadcast communication.
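To illustrate how (4)-(5) and the weight update interact in one round, the following Python sketch updates all agents simultaneously. It is a minimal sketch under assumptions made purely for concreteness: the constraint set is a Euclidean ball (so the projection is a simple rescaling), the noisy subgradients are supplied by the caller, and all names are hypothetical:

import numpy as np

def project_ball(u, radius):
    # Euclidean projection onto U = {u : ||u|| <= radius}.
    nrm = np.linalg.norm(u)
    return u if nrm <= radius else (radius / nrm) * u

def one_round(U, w, in_nbrs, out_deg, subgrads, alpha, eps, radius):
    # U: (n, d) array of iterates u_i(k); w: length-n array of weights w_i(k);
    # in_nbrs: in-neighbor lists; out_deg: out-degrees d_i^out(k);
    # subgrads: (n, d) array of noisy subgradients s_i(k);
    # alpha: step-size alpha(k); eps: fusion parameter.
    n, _ = U.shape
    U_next = np.empty_like(U)
    for i in range(n):
        # (4): linear fusion of own estimate and in-neighbors' estimates.
        fuse = sum(w[j] * U[j] for j in in_nbrs[i]) - out_deg[i] * w[i] * U[i]
        v_i = U[i] + eps * fuse
        # (5): noisy subgradient step, then projection onto U.
        U_next[i] = project_ball(v_i - alpha * subgrads[i], radius)
    # Weight-balancing update, as described after Assumption 5 below.
    w_next = w + eps * (np.array([sum(w[j] for j in in_nbrs[i])
                                  for i in range(n)]) - np.array(out_deg) * w)
    return U_next, w_next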
From (5), we need to make some assumptions about the noisy subgradient. Specifically, we model the noisy subgradient s_i(u_i(k)) as

s_i(u_i(k)) = ∇f_i(u_i(k)) + ξ_i(k),    (6)

where ξ_i(k) ∈ R^d denotes a stochastic subgradient error and ∇f_i(u_i(k)) denotes a subgradient of f_i(u) at u = u_i(k). Let F_k denote all the information generated by the distributed stochastic subgradient projection algorithm (4)-(5) up to round k, for all k ≥ 0. The assumption on the stochastic subgradient error ξ_i(k) is as follows.
Assumption 5. For every i = 1, . . . , n, we assume that the stochastic subgradient error ξ_i(k) satisfies, with probability 1, E[ξ_i(k) | F_{k−1}] = 0 and E[‖ξ_i(k)‖² | F_{k−1}] ≤ ν² for all k, where ν > 0 is a constant.

In our proposed algorithm, w_i(k) is the weight of agent i at round k. As in [24], every agent updates its weight w_i(k) over the time-varying directed network as follows:

w_i(k+1) = w_i(k) + ε( ∑_{j∈N_i^in(k)} w_j(k) − d_i^out(k) w_i(k) ),

with w_i(0) = 1 for all i ∈ V, so that the weights asymptotically balance the network in the sense of Definition 1.

In this paper, we let the optimal set of problem (1) be nonempty; it is defined as

U* ≜ { u* ∈ U | F(u*) = F* },

where F* ≜ F(u*) ≜ min_{u∈U} F(u). To summarize, in this section we formally introduced the constrained optimization problem and gave some valid assumptions; simultaneously, a distributed stochastic subgradient projection algorithm was proposed to solve constrained optimization problem (1) over time-varying directed networks. The main results of the paper are presented in the next section.

Main Results

By Theorem 6, the iterates asymptotically converge to some optimal solution over time-varying directed networks. Namely, by our proposed algorithm, we can obtain some optimal solution with probability 1.
We now state the convergence rate of our proposed algorithm. Under different assumptions on the local cost functions f_i, we establish different convergence rates. To this end, we first introduce the weighted average of the estimate sequence {u_i(k)}_{k≥0}, which is defined as [23]

û_i(Γ) = (1/S(Γ)) ∑_{k=1}^{Γ−1} k u_i(k)

for all Γ ≥ 2, where S(k) = k(k − 1)/2 for all k ≥ 2. Hence, for all k ≥ 1, we have the following recursive relation:

û_i(k+1) = ( S(k) û_i(k) + k u_i(k) ) / S(k+1)

for all i ∈ V, with û_i(1) = u_i(0).
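The equivalence between the weighted-average definition and the recursion can be checked numerically; the following short Python sketch (names hypothetical, assuming the forms stated above) confirms that iterating the recursion reproduces the directly computed weighted average:

import numpy as np

rng = np.random.default_rng(0)
Gamma = 50
u = rng.normal(size=(Gamma, 3))      # u[k] plays the role of u_i(k)
S = lambda k: k * (k - 1) / 2.0

# Direct definition: hat_u(Gamma) = (1/S(Gamma)) * sum_{k=1}^{Gamma-1} k u(k).
direct = sum(k * u[k] for k in range(1, Gamma)) / S(Gamma)

# Recursion: hat_u(k+1) = (S(k) hat_u(k) + k u(k)) / S(k+1), hat_u(1) = u(0).
hat_u = u[0].copy()
for k in range(1, Gamma):
    hat_u = (S(k) * hat_u + k * u[k]) / S(k + 1)

print(np.allclose(hat_u, direct))    # True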
From Theorem 7, we establish the convergence rate when the local cost functions are strongly convex. Moreover, F(û_i(Γ)) converges to F(u*) for every agent i with probability 1. Further, in the bound (12), the numerator contains a ln Γ term or a constant and the denominator contains a Γ. Therefore, our algorithm converges to some optimal solution with probability 1 at a rate of O(ln Γ/Γ).
Theorem 8 gives the convergence rate of our proposed algorithm for convex local cost functions, which is O(ln Γ/√Γ). Further, from Theorems 7 and 8, we can see that the time-varying and directed network topology does not affect the convergence rate.
In this section, we showed that our proposed algorithm is asymptotically convergent. For strongly convex functions, we derived a convergence rate of O(ln Γ/Γ) for our algorithm over time-varying directed networks. Furthermore, we also showed that our algorithm converges with rate O(ln Γ/√Γ) under general convexity. In the next section, we give the detailed proofs of the main results.

Analysis of Convergence Results
We now give the proofs of our main results, namely, Theorems 6, 7, and 8. For this purpose, a basic iterate relation is established first; we then use this basic relation to prove the main results.
For the sake of analysis, we first describe the scalar version of (4)-(5), where x_i(k) and v_i(k) are scalar variables, for all i ∈ V. Thus, the estimate x_i(k) of agent i is updated by

v_i(k) = x_i(k) + ε( ∑_{j∈N_i^in(k)} w_j(k) x_j(k) − d_i^out(k) w_i(k) x_i(k) ),    (16)

x_i(k+1) = Π_U[ v_i(k) − α(k) s_i(k) ],    (17)

for k ≥ 0 and all i ∈ V, where s_i(k) = ∇f_i(x_i(k)) + ξ_i(k) and U ⊆ R denotes the (scalar) constraint set.

Lemma 9. Suppose that Assumption 2 holds. Then, for any k ≥ 0, the fusion matrix A(k) is column stochastic. Moreover, there exist positive constants C and λ ∈ (0, 1) such that the matrix Δ(k : s) ≜ A(k)A(k − 1) ⋯ A(s) − (1/n)1_n 1_n^T satisfies, for all i, j ∈ V,

|[Δ(k : s)]_{ij}| ≤ C λ^{k−s}

for any k and s with k ≥ s.
This lemma follows immediately from [24] (the proofs of Lemma 3 and Proposition 1 therein).
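As an empirical illustration of Lemma 9 (a sketch: the fusion matrix is formed as in (4) for a fixed directed cycle, where uniform weights already balance the graph; all names are hypothetical), the product of the column-stochastic matrices approaches (1/n)1 1^T geometrically:

import numpy as np

n, eps = 5, 0.3
w = np.ones(n)                 # uniform weights balance a directed cycle
A = np.eye(n)
for i in range(n):
    j = (i - 1) % n            # the single in-neighbor of agent i
    A[i, i] -= eps * w[i]      # every out-degree equals 1 here
    A[i, j] += eps * w[j]

assert np.allclose(A.sum(axis=0), 1.0)    # A is column stochastic

P = np.eye(n)
for k in range(1, 61):
    P = A @ P
    if k % 20 == 0:
        # max deviation from (1/n) * ones; decays geometrically in k
        print(k, np.abs(P - np.ones((n, n)) / n).max())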
In addition, we also give the properties of the projection operator Π U [⋅]. Following from [14], we have the following.

Lemma 10. Assume that the constraint set U ⊆ R^d is nonempty, closed, and convex. Then, for any u ∈ R^d:
(1) (Π_U[u] − u)^T(w − Π_U[u]) ≥ 0 for all w ∈ U;
(2) ‖Π_U[u] − w‖² ≤ ‖u − w‖² − ‖Π_U[u] − u‖² for all w ∈ U;
(3) ‖Π_U[u] − Π_U[v]‖ ≤ ‖u − v‖ for all v ∈ R^d;
(4) ‖Π_U[u] − u‖ ≤ ‖u − w‖ for all w ∈ U.

Besides, we present some auxiliary results as follows, which are used to prove the related conclusions.
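The four properties above are standard facts about Euclidean projections; as a quick numerical illustration (a sketch, taking U to be a ball and drawing random points), they can be verified directly:

import numpy as np

rng = np.random.default_rng(1)

def proj(u, radius=1.0):
    # Euclidean projection onto the ball U = {u : ||u|| <= radius}.
    nrm = np.linalg.norm(u)
    return u if nrm <= radius else (radius / nrm) * u

for _ in range(1000):
    u, v = rng.normal(size=3) * 3, rng.normal(size=3) * 3
    w = proj(rng.normal(size=3) * 3)          # an arbitrary point of U
    pu, pv = proj(u), proj(v)
    assert np.dot(pu - u, w - pu) >= -1e-9                               # (1)
    assert (np.linalg.norm(pu - w) ** 2
            <= np.linalg.norm(u - w) ** 2
            - np.linalg.norm(pu - u) ** 2 + 1e-9)                        # (2)
    assert np.linalg.norm(pu - pv) <= np.linalg.norm(u - v) + 1e-9       # (3)
    assert np.linalg.norm(pu - u) <= np.linalg.norm(u - w) + 1e-9        # (4)
print("all four projection properties hold on random samples")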
Lemma 11 (see [30]). Assume that γ_k is a scalar for every positive integer k.
In order to obtain a convergence result of algorithms (16)-(17), we first establish the following lemma.
Informally, the distributed stochastic subgradient algorithm based on weight-balancing ensures that all x_i(k) track the running average 1^T x(k)/n at a geometric rate λ.

Corollary 13. For every agent i ∈ V, let the vector variables u_i(k) and s_i(k) in R^d be given by (4)-(5); then the tracking bound of Lemma 12 extends to the vector iterates.

For any vector, the 1-norm is greater than or equal to the Euclidean norm. Therefore, applying Lemma 12(a) to each coordinate of R^d, Corollary 13 holds immediately.
In Lemma 12 and Corollary 13, the perturbation term plays a crucial role. Therefore, we bound the perturbation term in the following lemma.

Lemma 14. Under Assumptions 2-5, the perturbation vectors r_i(k) are bounded in expectation, with probability 1, by a constant multiple of the step-size α(k).

Proof. Following from Lemma 10(4), the projection error in (5) is bounded by the length of the subgradient step, that is, by α(k)‖s_i(k)‖. Furthermore, by Assumptions 4 and 5, we have E[‖s_i(k)‖ | F_{k−1}] ≤ G + ν for all i ∈ V. Hence, we obtain, with probability 1, E[‖r_i(k)‖ | F_{k−1}] ≤ α(k)(G + ν). According to the definition of the perturbation vector r_i(k), the lemma is proved completely.
Since the matrix A(k) is column stochastic, the fusion step preserves the coordinatewise sums; that is, for all ℓ ∈ D, the quantity 1^T applied to the ℓ-th coordinates of the iterates is unchanged, which follows from the definition of the ℓ-th coordinate sequence. We also establish the following lemma, which is crucial in our proofs.
Proof. By (48), we have (50). Taking expectations on both sides of (50) with respect to F_k, and using (6) together with the zero conditional mean of the subgradient errors, we obtain (52). Next, we bound the term E[‖∑_{i=1}^{n} r_i(k)‖²] in (52) as in (53), where we use the inequality (∑_{i=1}^{n} a_i)² ≤ n ∑_{i=1}^{n} a_i². In order to obtain the upper bound of E[‖∑_{i=1}^{n} r_i(k)‖²], we need to bound the term E[‖r_i(k)‖²] in (53). Following similar arguments as in Lemma 14, we obtain (54) for all i ∈ V. Hence, combining (53) and (54), we obtain (55). Substituting (55) into (52), we have (56). We next estimate the term ∇f_i(u_i(k))^T(ū(k) − w) in (56). By (3) in Assumption 3, we have (57); further, we also obtain (58), where we use the Cauchy-Schwarz inequality to obtain the last inequality. Moreover, the term f_i(u_i(k)) − f_i(w) in (57) can be rewritten as (60). Hence, combining (57), (58), and (60), we obtain (61). Substituting (61) into (56), we have (62). Following from the definition of the global cost function F(u) and then using (62), the lemma is proved completely.
To prove Theorem 6, we employ the following lemma, which is given in [31].

Lemma 16. Assume that {v_k}, {a_k}, {b_k}, and {c_k} are random sequences whose elements are nonnegative random variables and which, with probability 1, satisfy the relation

E[v_{k+1} | F_k] ≤ v_k − a_k b_k + c_k

for all points w* ∈ S* (with v_k the distance measure to w*) and all k ≥ 0, where c_k ≥ 0 for all k ≥ 0. Further, we assume that ∑_{k=0}^{∞} c_k < ∞ and ∑_{k=0}^{∞} a_k = ∞ hold. Then, {v_k} converges with probability 1, and the underlying iterate sequence converges to some optimal solution w* ∈ S* with probability 1.
We now start to prove Theorem 6 by using the conclusions of Lemmas 15 and 17.
Proof of Theorem 6. Since α(k) → 0 as k → ∞, we obtain (70), which follows from Lemma 10(2). Therefore, using (70), we have (71). Using Assumptions 2 and 3, together with the decay conditions on α(k), the corresponding error terms are summable for all i ∈ V. Following from Lemma 10(3), and then combining the resulting bounds, we obtain (75). Let w = u* for u* ∈ U* in Lemma 15. According to the conclusion of Lemma 10(1), we then have, with probability 1, the basic supermartingale relation. By using (75), and since ∑_{k=1}^{∞} α(k) = ∞ and ∑_{k=1}^{∞} α²(k) < ∞ hold by the assumptions on α(k), the conditions of Lemma 17 hold. Following from Lemma 17, we can see that {ū(k)} asymptotically converges to u*. Further, following from (71), {u_i(k)} also asymptotically converges to u* for all i ∈ V with probability 1. Therefore, Theorem 6 is proved completely. Now, we establish the following lemma, which provides an important relation for the proof of Theorem 7.
Next, we start to prove Theorem 7 by using the results of Lemmas 15 and 18.
Proof of Theorem 7. We assume that μ_i is a positive constant for every i ∈ V; hence, each local cost function f_i is μ_i-strongly convex, and the function F(u) = ∑_{i=1}^{n} f_i(u) has a unique global minimizer u*. Let w = u* in Lemma 15; then we have (86), where we employ the first-order optimality condition of u* over U. Next, we first bound the term F(ū(k)) − F(u*) in (86). Using the fact that the function F is μ-strongly convex and ∇F(u*) = 0, we have (87). In addition, since F is strongly convex, using the boundedness of the subgradients ∇f_i(u) and the decomposition F(u_i(k)) − F(u*) = F(u_i(k)) − F(ū(k)) + F(ū(k)) − F(u*), we also have (88), where μ ≜ ∑_{i=1}^{n} μ_i. Therefore, following from (87) and (88), we obtain (89). Substituting (89) into (86), we obtain, with probability 1, (90). Following from the expression α(k) = a/(k + 1), where a is given by (11), we have (91). Multiplying both sides of (91) by k(k + 1), we have (92). Iterating the expression (92), we obtain (93) for all i ∈ V and Γ ≥ 2. Since k ≤ Γ − 1, the intermediate relations hold, and, following from the conclusion of Lemma 18, we obtain (96). Dividing both sides of (96) by Γ(Γ − 1), we then have (97), where we use the relation ∑_{k=1}^{Γ−1} (k/(k + 1)) ≤ Γ − 1 for Γ ≥ 2. Hence, after a few algebraic operations, we obtain (98). By convexity of the function F, using the definition of û_i(Γ) and the convexity of the norm function, we obtain (99). Taking expectations on both sides of (99) yields (100). Therefore, substituting (98) into (100) and performing a few algebraic operations, we derive the desired result.
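To make the weighting-and-telescoping step explicit, the following generic computation is a sketch in which v_k stands for E[‖ū(k) − u*‖²], the step-size is α(k) = a/(k + 1) with a tuned so that the contraction factor becomes (k − 1)/(k + 1), and C absorbs the subgradient, variance, and network constants; all of these are assumptions made purely for illustration:

\[
\begin{aligned}
v_{k+1} &\le \frac{k-1}{k+1}\, v_k + \frac{C}{(k+1)^2},\\
k(k+1)\, v_{k+1} &\le k(k-1)\, v_k + \frac{C\,k}{k+1} \le k(k-1)\, v_k + C,\\
\Gamma(\Gamma-1)\, v_{\Gamma} &\le 1\cdot 0\cdot v_{1} + C\,(\Gamma-1) = C\,(\Gamma-1),
\end{aligned}
\]

so that v_Γ ≤ C/Γ after dividing by Γ(Γ − 1). In the actual proof, the telescoped terms also carry the weighted gaps k(F(û_i(k)) − F(u*)), and the consensus-error terms behave like ∑_{k=1}^{Γ−1} k α²(k) = O(ln Γ), which is the source of the ln Γ factor in the O(ln Γ/Γ) rate.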
Next, we give the proof of Theorem 8, where we assume that the functions f_i are generally convex for all i ∈ V.

Proof of Theorem 8. Let μ_i = 0 for all i = 1, . . . , n in Assumption 3. Moreover, we introduce the following variable, which is the ergodic average of the estimates u_i(k):

ũ_i(Γ) = (1/Γ) ∑_{k=1}^{Γ} u_i(k).    (101)

According to (86), we have (102). Combining (88) with (102), and after some elementary algebra operations, we sum the resulting inequality over k and divide by Γ. Since α(k) = 1/√(k + 1), the resulting bounds hold for all Γ ≥ 1. In addition, using the result of Corollary 13, we have (107), where the inequality follows from Lemma 14. Since α(k) = 1/√(k + 1) ≤ 1, we obtain (108); further, we also have (109). Combining (107), (108), and (109), and following from the resulting relation, we obtain (112). Besides, by convexity of the function F and the definition of ũ_i(Γ) in (101), we conclude that (113) holds for all i ∈ V and Γ ≥ 1. Therefore, combining (112) with (113), the conclusion of this theorem is obtained completely.
In this section, we proved our main results, which were presented in Section 3. If the step-sizes α(k) are positive and satisfy the decay conditions (26), our proposed algorithm is asymptotically convergent by appropriately choosing the step-sizes α(k). Further, our algorithm converges with rate O(ln Γ/Γ) under strong convexity and with rate O(ln Γ/√Γ) under general convexity.

Simulations
In this section, we consider the multiclass classification problem in machine learning. In this problem, each agent i ∈ V maintains a decision matrix X_i(k) = [u_1^T; . . . ; u_m^T]^T ∈ R^{m×d} and receives a data example e_i(k) ∈ R^d belonging to one of the classes C = {1, . . . , m}. The decision is used to predict the class label via arg max_{ℓ∈C} u_ℓ^T e_i(k). The loss function of each agent i ∈ V is the multiclass hinge loss

f_i(X) = max_{ℓ∈C} { u_ℓ^T e_i(k) − u_{y_i(k)}^T e_i(k) + 1[ℓ ≠ y_i(k)] },

where y_i(k) denotes the true class label. In this problem, the domain U is given by U = {X ∈ R^{m×d} | ‖X‖_tr ≤ σ}, in which ‖X‖_tr represents the nuclear norm of X. In our simulations, we set σ = 10, and the step-size is set as α(k) = 1/√k. Moreover, we use the multiclass dataset news20 (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). This dataset has 62,061 features, 20 classes, and 15,935 instances; the testing size of news20 is 3,993. In addition, the time-varying graph at round k is shown in Figure 1. We use the average optimality gap (1/n) ∑_{i=1}^{n} (F(u_i(k)) − F(u*)) to measure the performance of the algorithms.
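For concreteness, the following Python sketch implements the two problem-specific ingredients of this experiment under the assumptions stated above (the multiclass hinge loss and the nuclear-norm ball; all function names and shapes are hypothetical): a subgradient of the loss, and the projection onto U computed by projecting the singular values onto the ℓ1-ball:

import numpy as np

def hinge_subgrad(X, e, y):
    # Subgradient of f(X) = max_l { x_l^T e - x_y^T e + 1[l != y] },
    # with X in R^{m x d}, feature vector e in R^d, true label y.
    scores = X @ e + 1.0
    scores[y] -= 1.0                 # no margin term for the true class
    l_star = int(np.argmax(scores))
    G = np.zeros_like(X)
    if l_star != y:
        G[l_star] += e
        G[y] -= e
    return G

def project_nuclear_ball(X, sigma):
    # Euclidean projection of X onto U = {X : ||X||_tr <= sigma}:
    # SVD, then project the singular values onto the l1-ball of radius sigma.
    U_, s, Vt = np.linalg.svd(X, full_matrices=False)
    if s.sum() <= sigma:
        return X
    srt = np.sort(s)[::-1]
    css = np.cumsum(srt)
    idx = np.arange(1, len(srt) + 1)
    rho = np.nonzero(srt - (css - sigma) / idx > 0)[0][-1]
    theta = (css[rho] - sigma) / (rho + 1.0)
    return (U_ * np.maximum(s - theta, 0.0)) @ Vt

# One projected (sub)gradient step for a single agent:
m, d, sigma, alpha = 20, 50, 10.0, 0.1
rng = np.random.default_rng(2)
X = project_nuclear_ball(rng.normal(size=(m, d)), sigma)
e, y = rng.normal(size=d), 3
X = project_nuclear_ball(X - alpha * hinge_subgrad(X, e, y), sigma)
print(np.linalg.svd(X, compute_uv=False).sum() <= sigma + 1e-8)  # True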
In the first simulation experiment, we compare the performance of the algorithm with and without communication. As shown in Figure 2, the performance with communication is better than that without communication.
In the second simulation experiment, we compare the convergence performance with D-DPS [27]. As shown in Figure 3, under a similar setting, the two algorithms have similar convergence rates. However, our algorithm handles noisy subgradients, allows the cost functions to be nondifferentiable, and considers time-varying graphs. Therefore, our algorithm has better applicability to real environments.

Conclusion
We have proposed a fully decentralized stochastic subgradient projection algorithm to solve a distributed constrained optimization problem over time-varying directed networks. We employ the weight-balancing technique to overcome the influence of directed graphs and assume that every agent knows its own out-degree at each time epoch. Moreover, all agents have access only to noisy subgradients of their own cost functions. We have proved that our proposed algorithm is asymptotically convergent with suitably chosen step-sizes α(k) over time-varying directed networks. Furthermore, for locally strongly convex functions, we have proved that our algorithm converges with rate O(ln Γ/Γ); meanwhile, for generally convex functions, we have also shown that a convergence rate of O(ln Γ/√Γ) is achieved. Additionally, the performance of the proposed algorithm has been evaluated by simulations.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.