A simplified convergence theory for Byzantine resilient stochastic gradient descent

In distributed learning, a central server trains a model according to updates provided by nodes holding local data samples. In the presence of one or more malicious nodes sending incorrect information (a Byzantine adversary), standard algorithms for model training such as stochastic gradient descent (SGD) fail to converge. In this paper, we present a simplified convergence theory for the generic Byzantine Resilient SGD method originally proposed by Blanchard et al. [NeurIPS 2017]. Compared to the existing analysis, we show convergence to a stationary point in expectation under standard assumptions on the (possibly nonconvex) objective function and flexible assumptions on the stochastic gradients.


Introduction
Distributed learning is a branch of machine learning in which large datasets are stored on several client devices, and a central server controls the learning process [32,15]. In this setting, the local data is never transferred across clients or shared with the central server, and the actual optimization can be performed either by the central server or separately on each client [9,6]. In the first case, learning is performed by the central server requesting (stochastic) gradient information from each client and aggregating this information within its optimization procedure [23,26].
From a mathematical optimization perspective, distributed learning presents several challenges including communication overheads [20,10] and asynchronicity [23]. This may be further complicated in the case of federated learning, where the client datasets are assumed to be heterogeneous [27]. Here, we consider the case of distributed learning in the presence of adversarial attacks; the reliance on many clients and network communication makes this a more relevant concern than in traditional learning. Specifically, we consider the case of Byzantine adversaries, which alter the gradient information sent from (potentially multiple) clients to the server, for example through data poisoning [2,7], where individual clients' datasets are altered, or model update poisoning [1], where the information sent to the central server is directly corrupted. The goal of adversarial attacks can be untargeted, attempting to decrease the model's accuracy overall [2], or targeted, aiming to change the model's behavior on only a specific subset of inputs [7,1]. A comprehensive taxonomy of adversarial attacks is given in [16,Section 5].
In the presence of adversarial attacks, standard optimization methods such as stochastic gradient descent (SGD) can fail [3], and so the development of optimization algorithms which are Byzantine resilient is of critical importance to the success of distributed learning. When the individual clients' datasets are i.i.d., algorithms such as SGD can be made Byzantine resilient using robust averaging techniques. In this paper, we introduce a simplified convergence analysis of Byzantine Resilient SGD (BRSGD) [3] for nonconvex learning problems. Our analysis covers the same range of approaches for aggregating gradient information from i.i.d. client datasets, but uses more standard smoothness assumptions on the objective function and the non-corrupted stochastic gradients. As a trade-off, we show convergence to first-order optimality in expectation, rather than almost surely as in [3]. We also show a convergence rate of O(1/K^{(1−p)/2}) when the learning rate sequence is chosen as α_k = O(k^{−p}) for p ∈ (1/2, 1), a rate which was not provided in [3].
Related Work A general analysis of SGD in a federated learning setting (which generalizes distributed learning) for nonconvex objectives can be found in [19]. In the case of Byzantine resilient SGD for i.i.d. client datasets, several different robust averaging techniques have been proposed. These include the geometric median [8,28,22], coordinate-wise medians and trimmed means [30,31], neighborhood-based averaging [3], iterative filtering [25,31] and combinations of multiple such approaches [14].
The type of convergence theory available for these methods varies. Convergence only for strongly convex objectives is considered in [8,22,30], and in [25] for the full population loss (rather than the empirical loss). For nonconvex objectives, high-probability convergence to approximate first- and second-order optimal points is given in [30] and [31] respectively. Alternatively, [3] considers BRSGD applied to nonconvex objectives with decreasing learning rates, proving almost sure convergence to stationary points rather than to a neighbourhood, similar to the standard SGD setting [5]. This analysis applies to a generic robust aggregator satisfying specific assumptions; aggregation methods satisfying these assumptions have been proposed in [3,28,14].
We conclude by noting that [29,11,12] consider the case of Byzantine resilient federated learning with non-i.i.d. datasets and have developed specific robust aggregators suited to this setting (with associated convergence theory). This is a more difficult problem, evident for instance in [11] requiring that at most 25% of clients be corrupted (rather than approximately 50% for the i.i.d. case).
Contributions In this paper, we present a simplified convergence result for BRSGD. Unlike the original analysis in [3] (which was based on the analysis of SGD in [4]), we use standard smoothness assumptions on the objective, closer in spirit to the standard analysis of SGD [5]. For the (non-corrupted) stochastic gradient estimates, we use a general expected smoothness assumption based on [17]. Under these conditions, we prove the convergence of BRSGD to stationary points in expectation; we note that this is weaker than the almost-sure result from [3], a consequence of our more standard problem assumptions and simpler analysis. Our result and proof technique have some similarities to [24, Lemma 4.3], which shows almost sure convergence of standard SGD under general expected smoothness conditions on stochastic gradients.
As described above, since BRSGD in [3] is a generic framework, our results are applicable to any aggregation function satisfying the same assumptions, including all those in [3,28,14].
Structure We begin by describing the general model of distributed learning with Byzantine adversaries and describe the BRSGD method in Section 2. Our new convergence analysis is given in Section 3. The corresponding convergence rates are shown in Section 4, and we conclude in Section 5.
Notation We use ‖·‖ to denote the Euclidean norm and ⟨·, ·⟩ the corresponding inner product on R^d, and let [m] := {1, . . . , m}.

Problem & Byzantine Resilient SGD
We begin by describing the Byzantine adversarial model problem and the Byzantine resilient SGD algorithm from [3].

Byzantine Adversarial Model Problem
In the distributed learning problem, our data is split across m nodes (or devices) while our model is centralized [16]. The i-th node holds the dataset {((x_i)_j, (y_i)_j)}_{j=1}^{n_i}, where n_i is the number of data points stored at that node. We assume that each element is drawn from a distribution that is common across all nodes, ((x_i)_j, (y_i)_j) ∼ Ω. As this distribution is unknown, we replace it with the known empirical distribution Ω'_i, which selects each element of the i-th node's dataset with probability 1/n_i. Given this data, from a space of model functions parametrized by w ∈ R^d, we wish to find a model f for which f(x) ≈ y for all i ∈ [m] and (x, y) ∼ Ω'_i. Hence, we look for a function f(·; w) that minimizes the average empirical risk across our m nodes:

min_{w ∈ R^d} F(w) := (1/m) Σ_{i=1}^{m} F_i(w),   (2.1)

where F_i is the empirical risk of the model f(·; w) on the i-th node. Specifically,

F_i(w) := (1/n_i) Σ_{j=1}^{n_i} l(f((x_i)_j; w), (y_i)_j),   (2.2)

where l(·, ·) is a loss function that quantifies the difference between the predicted and true outputs. Note that both the loss function and the model remain constant across the nodes. We solve (2.1) with iterative methods converging to a neighbourhood of stationary points of our problem. These iterative methods involve each node sending an estimate of the gradient at the current point to the central server. We model

component failure or corruption by setting some of our nodes to be Byzantine adversaries [3], who may send arbitrary values to the central server. Of our m nodes, q will be Byzantine adversaries, as defined below. Typically we require q < m/2, though the exact threshold differs slightly between methods.
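As a concrete illustration, the average empirical risk objective (2.1) can be sketched in a few lines of code. The squared loss and linear model below are hypothetical choices for l and f(·; w), made only for this example.

```python
# A minimal sketch of the objective (2.1)-(2.2), assuming a squared loss
# l(y_hat, y) = (y_hat - y)^2 and a linear model f(x; w) = <x, w>.
import numpy as np

def empirical_risk_node(w, X_i, y_i):
    """F_i(w): empirical risk of the model on the i-th node's dataset."""
    preds = X_i @ w                      # f(x; w) for each of the n_i local samples
    return np.mean((preds - y_i) ** 2)   # average loss over the node's data

def average_empirical_risk(w, datasets):
    """F(w) = (1/m) * sum_i F_i(w), the objective in (2.1)."""
    return np.mean([empirical_risk_node(w, X, y) for X, y in datasets])

# Toy data: m = 4 nodes, each holding 20 samples in R^3.
rng = np.random.default_rng(0)
datasets = [(rng.standard_normal((20, 3)), rng.standard_normal(20)) for _ in range(4)]
w = np.zeros(3)
print(average_empirical_risk(w, datasets))
```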
Definition 2.1. Let {(g̃_1)_k, (g̃_2)_k, ..., (g̃_m)_k} be the set of correct local gradient estimators calculated by each node at the k-th iteration of an algorithm. If q out of the m nodes are Byzantine adversaries, then the set of correct vectors at every iteration is partially replaced by the vectors {(g_1)_k, (g_2)_k, ..., (g_m)_k}, according to

(g_i)_k = (B_i)_k if the i-th node is Byzantine, and (g_i)_k = (g̃_i)_k otherwise.   (2.3)

The indices of the adversaries may change across iterations, and the value of the Byzantine gradient (B_i)_k may be a function of {(g̃_1)_k, (g̃_2)_k, ..., (g̃_m)_k}, the current value of the model w_k, the current learning rate α_k, or any previous information.
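A minimal sketch of the adversarial model in Definition 2.1: q of the m correct gradient estimates are overwritten before reaching the server. The particular attack below (a scaled, negated mean, which may depend on all the correct gradients) is an illustrative choice, not one from [3].

```python
# Sketch of Definition 2.1: replace the gradients of Byzantine nodes with
# adversarially chosen vectors. The attack shown here is hypothetical.
import numpy as np

def corrupt(correct_grads, byzantine_idx):
    """Return {(g_i)_k}: correct gradients, with Byzantine entries replaced."""
    grads = [g.copy() for g in correct_grads]
    # (B_i)_k may be a function of all the correct gradients, e.g. their negated mean.
    attack = -10.0 * np.mean(correct_grads, axis=0)
    for i in byzantine_idx:
        grads[i] = attack
    return grads
```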
In the taxonomy of [16,Section 5], this framework corresponds to dynamic, white box adversaries using within-update collusion, where the adversary functions via data poisoning or model update poisoning.

BRSGD Algorithm
We now outline the Byzantine Resilient SGD (BRSGD) algorithm framework, originally from [3], for solving (2.1) in the presence of Byzantine adversaries. In iteration k of this method, the central server first collects the (possibly corrupted) local gradient information (g_1)_k, ..., (g_m)_k. It then aggregates these m vectors to produce a final gradient estimate A_k ∈ R^d using some aggregation function Agg((g_1)_k, (g_2)_k, ..., (g_m)_k); we will later give specific requirements on Agg. Finally, it takes a gradient descent-type step w_{k+1} = w_k − α_k A_k with a pre-specified learning rate α_k > 0. The full BRSGD method is given in Algorithm 1.
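The loop just described can be sketched as follows; `grad_fns` and `agg` are placeholder callables standing in for the (possibly corrupted) node gradient estimators and the aggregation function Agg, and are not part of [3]'s notation.

```python
# Minimal sketch of the BRSGD framework (Algorithm 1), under the assumption
# that each node's gradient oracle and the aggregator are supplied as callables.
import numpy as np

def brsgd(w0, grad_fns, agg, learning_rates, num_iters):
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        local_grads = [g(w) for g in grad_fns]   # collect (g_1)_k, ..., (g_m)_k
        A_k = agg(local_grads)                   # aggregate into a single estimate A_k
        w = w - learning_rates[k] * A_k          # step: w_{k+1} = w_k - alpha_k * A_k
    return w
```

For instance, with `grad_fns` returning the exact gradient of F(w) = ‖w‖²/2 and `agg` the plain mean, the iterates contract towards the minimizer at 0.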
Since the non-corrupted local gradient estimators (g_i)_k can be stochastic gradient estimates, we also outline our formal stochastic model of this problem. We introduce a probability space (Ω, F, P) with an associated filtration (F_k)_{k∈N}, so that our gradient information forms an adapted process, based on the framework from [21]. Let the random variable modelling the sample drawn at the i-th node for the k-th iterate be (ζ_k)_i := (ζ_k)_i(ω). The value of the gradient estimator at the i-th node for the k-th iterate is then modelled by G(w_k, (ζ_k)_i), where G(w, ζ) is the random variable modelling gradient estimates at the point w, which we abbreviate as G(w). Since our datasets are assumed to be i.i.d., the distribution of G(w_k, ζ) does not depend on the node, and hence the estimators (g_1)_k, ..., (g_m)_k are identically distributed (2.4). Our filtration F_k will be the σ-algebra generated by our previous random variables, i.e.
Therefore, both our aggregated and individual gradient estimates are adapted processes on the filtration. Furthermore, the iterate w_k is F_k-measurable, and we denote the conditional expectation of a random variable with respect to the filtration F_k by E_k[·] := E[· | F_k].

Choice of aggregation function For our convergence theory to hold, the aggregation function Agg must satisfy the following assumption, from [3].
Assumption 2.2. There exists a constant α ∈ [0, π/2) such that for all k we have:
1. ⟨E_k[A_k], ∇F(w_k)⟩ ≥ (1 − sin α) ‖∇F(w_k)‖²; and
2. for r = 2, 3, 4, the moment E_k[‖A_k‖^r] is bounded above by a linear combination of terms E_k[‖G(w_k)‖^{r_1}] ⋯ E_k[‖G(w_k)‖^{r_{n−1}}] with r_1 + ⋯ + r_{n−1} = r.

Remark 2.3. We note that while [3] requires r = 2, 3, 4 in Assumption 2.2(2), our convergence result only requires r = 2. In fact, our result requires only the weaker condition

E_k[‖A_k‖²] ≤ E · E_k[‖G(w_k)‖²],   (2.6)

for some constant E. The r = 2 case of Assumption 2.2(2) implies (2.6) via

E_k[‖A_k‖²] ≤ C_1 E_k[‖G(w_k)‖²] + C_2 (E_k[‖G(w_k)‖])² ≤ (C_1 + C_2) E_k[‖G(w_k)‖²],

for some constants C_1, C_2 > 0, where the second inequality follows from Jensen's inequality for conditional expectations.
The works [3,28,14] give multiple examples of suitable functions Agg which satisfy Assumption 2.2, where α typically depends on Agg, the number of clients m and the number of corrupted clients q. Specifically, the Agg functions proposed in these works are:
• Krum [3], which returns the vector from {(g_1)_k, (g_2)_k, ..., (g_m)_k} that minimizes the total distance between it and its nearest neighbours;
• Variations of the median [28], including the marginal (i.e. component-wise) median, the geometric median, and the 'mean-around-median', the (component-wise) mean of the closest vectors to the marginal median;
• Bulyan [14], which uses an existing Agg function such as the above to generate a set of candidate gradients and then applies a 'mean-around-median' aggregation to that set.
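Two of the aggregation rules listed above admit short sketches. The Krum neighbourhood size below follows the usual m − q − 2 convention from [3], but the implementations are illustrative rather than definitive.

```python
# Illustrative sketches of two aggregation functions Agg: Krum and the
# marginal (component-wise) median. Conventions (e.g. neighbourhood size)
# are assumptions based on [3] and [28], not verified against their code.
import numpy as np

def krum(grads, q):
    """Return the received vector with the smallest summed squared distance
    to its m - q - 2 nearest neighbours."""
    G = np.stack(grads)
    m = len(G)
    # Pairwise squared Euclidean distances between all received gradients.
    dists = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(m):
        d = np.sort(dists[i])                  # d[0] == 0 is the distance to itself
        scores.append(np.sum(d[1:m - q - 1]))  # sum over m - q - 2 nearest neighbours
    return G[int(np.argmin(scores))]

def marginal_median(grads):
    """Component-wise median of the received gradients."""
    return np.median(np.stack(grads), axis=0)
```

With three clustered gradients and one outlier, Krum discards the outlier even though it would dominate a plain average.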

Existing Convergence Theory
We now describe the underlying assumptions and state the existing convergence theory for BRSGD (Algorithm 1) as given in [3]. The requirements on the objective (2.1) are based on [4]:

Assumption 2.4. The objective function F is C³, bounded below, and there exist constants D, ε ≥ 0 and 0 ≤ β < π/2 such that, for all w ∈ R^d with ‖w‖² ≥ D, we have

‖∇F(w)‖ ≥ ε > 0 and ⟨w, ∇F(w)⟩ ≥ cos(β) ‖w‖ ‖∇F(w)‖.   (2.9)

Next, the requirement on the stochastic gradients is given by the following two assumptions.
Assumption 2.5. The gradient estimator G(w_k) is unbiased, E_k[G(w_k)] = ∇F(w_k), and for all r ∈ {2, 3, 4} there exist non-negative constants A_r and B_r such that

E_k[‖G(w_k)‖^r] ≤ A_r + B_r ‖w_k‖^r.

We are now able to state the main convergence result from [3].

Theorem 2.6 ([3]). Suppose that Assumptions 2.2, 2.4 and 2.5 hold with α + β < π/2 and condition (2.10), and that the learning rates satisfy Σ_k α_k = ∞ and Σ_k α_k² < ∞. Then the sequence of gradients ∇F(w_k) converges to zero almost surely.
We conclude by noting the technical condition α + β < π/2, relating the errors in the stochastic gradients and the smoothness of the objective. We do not need a condition like this in our analysis. The condition (2.10) relates the stochastic gradients to the aggregation function (provided there are not too many adversaries). Although our analysis does not require this condition explicitly, similar conditions are required to show Byzantine resilience of aggregation functions such as Krum [3] and the geometric median [28].

New Convergence Analysis
We now present our new, simplified analysis of BRSGD. Compared to Theorem 2.6, our result uses simpler and more common assumptions on both the objective F and the stochastic gradient estimator G(w, ζ), and a specific choice of learning rate sequence. As a result, the conclusion is weaker: we obtain convergence to stationary points in expectation rather than almost surely. Our result has the same requirements on the aggregation function Agg.
In particular, we note that Theorem 2.6 requires (2.9), essentially that F is 'convex enough' outside a certain bounded region. This has the downside, for example, of excluding the model space of neural networks with soft-max activation functions, a common model space in distributed learning, as the activation function has flat asymptotes [4].
Assumption 3.1. Each term in the objective function F is L-smooth (i.e. continuously differentiable and ∇F is L-Lipschitz continuous) and bounded below by some F low .
The key implication of F being L-smooth is the upper bound

F(w′) ≤ F(w) + ⟨∇F(w), w′ − w⟩ + (L/2) ‖w′ − w‖²  for all w, w′ ∈ R^d.   (3.1)

For our stochastic gradient estimator, we will use a version of the expected smoothness property [17, Assumption 3.2]. We note that [17] shows that many variants of SGD, including minibatching, importance sampling, and their combinations, satisfy expected smoothness.

Assumption 3.2 (Expected Smoothness). For all k, the gradient estimator G(w_k) is unbiased, E_k[G(w_k)] = ∇F(w_k), and there exist non-negative constants A, B and C (independent of k) such that

E_k[‖G(w_k)‖²] ≤ 2A (F(w_k) − F_low) + B ‖∇F(w_k)‖² + C.

To prove our result, we will need the below technical lemma, a generalization of [17, Lemma 2], which corresponds to the case a_k ≡ a for all k.
Then the following bound holds, where W_k is a weighting sequence defined below.

Proof. We begin by taking (3.4) and multiplying through by W_k a_k for each k ∈ N. Noting the relationship between W_k and W_{k−1} via a_{k−1}, we simplify, which allows us to apply a telescoping sum: adding together the first K − 1 inequalities, we obtain a bound on the weighted sum. Let Ŵ := Σ_{k=0}^{K−1} W_k and divide through by Ŵ on both sides. To simplify further, we note that (3.11) holds as W_{K−1}, d_K, a_{K−1} and Ŵ are all strictly positive; hence (3.12) follows. We now note a_k / a_{k−1} ≤ 1, as {a_k}_{k ∈ N ∪ {−1}} is non-increasing. Furthermore, 1/(1 + L A a_k²) ≤ 1, as L, A and a_k are all positive. Therefore (3.13) holds, and hence {W_k}_{k ∈ N ∪ {−1}} is non-increasing. Using this, we obtain (3.14). Substituting (3.14) into the right-hand side of (3.12) and multiplying by 2 gives our result.
We are now ready to prove our main result.

Theorem 3.4. Suppose that Assumptions 2.2, 3.1 and 3.2 hold, and that the learning rate sequence {α_k} is non-increasing with Σ_{k=0}^∞ α_k = ∞ and Σ_{k=0}^∞ α_k² < ∞. Then lim_{K→∞} min_{0≤k≤K−1} E[‖∇F(w_k)‖²] = 0.

Proof. We begin by applying (3.1) and simplifying using w_{k+1} = w_k − α_k A_k. We then take the expectation of the above conditioned on F_k from our filtration. To simplify, we apply Assumption 2.2 to get (3.19) and, for some constant E, (3.20). Applying (3.19) and (3.20), and subtracting F_low from both sides, we simplify, and then simplify further using Assumption 3.2. Collecting terms, we rewrite the upper bound for our final term. Taking total expectations and using the tower property yields (3.29). Defining δ_k := E[F(w_k) − F_low], (3.29) becomes a recursion in δ_k. Furthermore, our requirements on the learning rate sequence α_k guarantee that 1 − sin(α) minus the remaining α_k-dependent terms stays positive; hence we define r_k := (1 − sin(α)) E[‖∇F(w_k)‖²] and obtain a bound of the form required by Lemma 3.3. We now apply Lemma 3.3 to our problem using the weighting sequence defined in (3.32). Hence (3.33) holds, where we define α_{−1} := α_0 for convenience. We now show lim_{K→∞} (min_{0≤k≤K−1} r_k) = 0. To do this, we return to the definition of our weighting sequence and note (3.34), and hence (3.35). In order to simplify, we bound a pair of infinite products and a summation. Recall that, for a sequence of positive real numbers {y_k}_{k∈N}, the infinite product Π_{k=0}^∞ y_k converges (to a nonzero limit) if and only if Σ_{k=0}^∞ log(y_k) converges. From our specification of α_k, we know Σ_{k=0}^∞ α_k² < ∞, and since log(1 + y) ≤ y for y > −1, the summation Σ_{k=0}^∞ log(1 + L A α_k²) converges, and hence so does the product. Secondly, we note that the corresponding bound holds for all k ∈ N, since L A′ α_k² > 0 for all k ∈ N. This, in turn, allows us to simplify another summation, giving (3.40). We now take the limit of both sides of (3.35) and recover our result.
where P is defined after (3.38). By assumption, we have lim_{K→∞} Σ_{k=0}^{K−1} α_k = ∞, and the result follows. We note that our assumptions on the learning rate sequence {α_k}_k allow for sequences which decay as α_k ∼ k^{−p} for any p ∈ (1/2, 1).
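The admissible decay α_k ∼ k^{−p} with p ∈ (1/2, 1) can be checked numerically: the partial sums of α_k grow without bound while those of α_k² stay bounded. The snippet below is a numerical illustration of these step-size conditions, not a proof.

```python
# Numerical illustration that alpha_k = k^{-p} with p in (1/2, 1) satisfies
# sum alpha_k = infinity and sum alpha_k^2 < infinity (illustration only).
import numpy as np

p = 0.75                       # any p in (1/2, 1)
k = np.arange(1, 10**6 + 1)
alpha = k ** (-p)              # alpha_k ~ k^{-p}
print(alpha.sum())             # partial sums of alpha_k keep growing (divergent)
print((alpha ** 2).sum())      # partial sums of alpha_k^2 level off (convergent)
```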

Iterate Selection
The above result shows that there is a subsequence of iterates which converges in expectation. We now give a probabilistic procedure to select a single iterate with small expected gradient, inspired by the analysis in [18,Theorem 6.1]. In this context, we assume that Algorithm 1 has been run for K iterations, and we randomly select an iterate from w 0 , . . . , w K−1 according to a specific probability distribution depending on the learning rate sequence α k . However, compared to [18, Corollary 6.1], the choice of α k does not depend on K, and so Algorithm 1 can always be continued from its previous endpoint if the desired accuracy is not achieved.
This analysis requires only small modifications of Lemma 3.3 and Theorem 3.4, which we present here.

Lemma 3.5. Suppose the assumptions of Lemma 3.3 hold, including the definition of the weighting sequence W_k, and that, for any K ≥ 0, we define the random variable R_K ∈ {0, . . . , K − 1} by (3.46). Then (3.47) holds.

Proof. The proof of this result is identical to that of Lemma 3.3, except that (3.11) is replaced by its analogue for the distribution of R_K, from which we conclude (3.47) in place of (3.12).
Corollary 3.6. Suppose that the assumptions of Theorem 3.4 hold, and for any K ≥ 0 we define the random variable R_K as per (3.46). Then lim_{K→∞} E[‖∇F(w_{R_K})‖] = 0.

Proof. The proof of this result is identical to that of Theorem 3.4, but we replace (3.33) with the analogous bound which follows from Lemma 3.5. Hence, instead of (3.44), we reach the corresponding bound on E[‖∇F(w_{R_K})‖²], from which the result follows by Jensen's inequality.
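The iterate-selection rule can be sketched as follows, under the assumption that R_K is sampled with probability proportional to the weights W_k α_k from Lemma 3.3; `select_iterate` is a hypothetical helper name, not notation from the paper.

```python
# Sketch of the probabilistic iterate selection: after K iterations, return
# w_{R_K}, where R_K is drawn with probabilities proportional to the supplied
# weights (assumed here to be W_k * alpha_k, following our reading of (3.46)).
import numpy as np

def select_iterate(iterates, weights, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                        # normalize weights into a distribution
    R_K = rng.choice(len(iterates), p=p)   # sample the random index R_K
    return iterates[R_K]
```

Note that, as discussed above, the weights do not depend on the total iteration budget K, so the algorithm can be continued and a fresh iterate selected later without restarting.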

Convergence Rate
The previous section gives a convergence analysis for Algorithm 1 under general assumptions on the decreasing learning rate sequence α k . We now specialize these results to give a convergence rate for the case α k ∼ k −p for p ∈ (1/2, 1).
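As a quick sanity check on this rate (an illustration, not a proof): with α_k = k^{−p}, the partial sum S_K = Σ_{k<K} α_k grows like K^{1−p}/(1−p), so a bound of order 1/S_K on the minimum expected squared gradient norm decays like K^{−(1−p)}, matching the O(1/K^{(1−p)/2}) rate for the gradient norm quoted in the introduction.

```python
# Check numerically that S_K = sum_{k=1}^{K} k^{-p} grows like K^{1-p}/(1-p),
# so a 1/S_K bound on min_k E||grad F(w_k)||^2 decays like K^{-(1-p)}.
import numpy as np

p = 0.75
for K in (10**4, 10**5, 10**6):
    S_K = np.sum(np.arange(1, K + 1) ** (-p))
    print(K, S_K / K ** (1 - p))   # ratio approaches 1/(1 - p) = 4 from below
```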

Conclusion
Having algorithms for distributed learning in the presence of Byzantine adversaries is an important part of improving the utility of distributed learning. In this work we presented a simplified analysis of Byzantine resilient SGD (BRSGD) as developed in [3], proving convergence in expectation and corresponding convergence rates under more realistic assumptions on the objective function and the (non-corrupted) stochastic gradient estimators. Since BRSGD is a generic algorithm, allowing the use of any aggregation function satisfying Assumption 2.2, our analysis applies to all the specific choices given in [3,28,14]. Directions for future work include extending our analysis to more flexible learning rate regimes and other Byzantine resilient learning algorithms not based on the framework from [3], such as those in [30,31].