ZO-JADE: Zeroth-order Curvature-Aware Multi-Agent Convex Optimization

In this work we address the problem of convex optimization in a multi-agent setting, where the objective is to minimize the mean of local cost functions whose derivatives are not available (e.g. black-box models) and agents can only communicate with local neighbors according to a connected network topology. Zeroth-order (ZO) optimization has recently gained increasing attention in federated learning and multi-agent scenarios; it exploits finite-difference approximations of the gradient that require from $2$ (directional gradient) to $2d$ (central-difference full gradient) evaluations of the cost functions, where $d$ is the dimension of the problem. The contribution of this work is to extend ZO distributed optimization by estimating the curvature of the local cost functions via finite-difference approximations. In particular, we propose a novel algorithm named ZO-JADE which, by adding just one extra point, i.e. $2d+1$ in total, simultaneously estimates the gradient and the diagonal of the local Hessian; these are then combined via average tracking consensus to obtain an approximate Jacobi descent. Guarantees of semi-global exponential stability are established via separation of time scales. Extensive numerical experiments on real-world data confirm the efficiency of our algorithm and its superiority over several other distributed zeroth-order methods from the literature that rely on gradient estimates only.


Problem Formulation
In this paper we consider the distributed optimization scenario, in which a set of agents N = {1, . . . , n} are connected in a network topology and can communicate with each other through a set E of links. Each agent i has access to a local cost function f_i : R^d → R, and the goal of the network is to cooperatively seek the optimal solution of the unconstrained minimization problem

$$\min_{x \in \mathbb{R}^d} f(x), \qquad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x). \tag{1}$$

We consider the setting in which only zeroth-order information is available, namely it is only possible to evaluate the objective functions at the desired points, without access to gradients or higher-order derivatives. In particular, we are interested in consensus-based algorithms, in which all the agents converge to the same optimal value of the variable to be optimized by communicating only with their neighbors. To do so, each agent updates its local variables using a weighted average of its neighbors' information, where the weights are provided by a mixing matrix P. Throughout the rest of the paper we rely on the following simplifying assumptions:

Assumption 1 (Strong Convexity and Smoothness)
Let the global cost f be three times continuously differentiable with Lipschitz continuous derivatives and m-strongly convex, i.e. there exist positive constants L_1, L_2, L_3 and m such that, for all x, y ∈ R^d,

$$\|\nabla f(x) - \nabla f(y)\| \le L_1 \|x - y\|, \qquad \|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_2 \|x - y\|,$$

$$\|\nabla^3 f(x) - \nabla^3 f(y)\| \le L_3 \|x - y\|, \qquad \nabla^2 f(x) \succeq m I.$$

Assumption 2 (Communication between agents)
Let the communication network be represented by the time-invariant graph G = (N, E), whose vertices N and edges E are the set of agents and the available communication links, respectively. Assume that the graph is undirected and connected, and that each node can only communicate with its single-hop neighbors. Let the network be associated with the consensus matrix P ∈ R^{n×n}, whose generic entry p_ij is positive if edge (i, j) ∈ E and zero otherwise. Choose P to be symmetric and doubly stochastic, i.e. such that P1 = 1 and 1ᵀP = 1ᵀ. A matrix that satisfies these requirements can be constructed using e.g. the Metropolis-Hastings weights [12].
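As an illustration, the sketch below builds such a matrix from an undirected adjacency structure using the classical Metropolis-Hastings rule p_ij = 1/(1 + max{deg_i, deg_j}); the helper name is ours and the weight formula is the standard textbook choice, assumed here rather than quoted verbatim from [12].

import numpy as np

def metropolis_hastings_weights(adjacency: np.ndarray) -> np.ndarray:
    """Build a symmetric doubly stochastic consensus matrix P from a 0/1
    adjacency matrix of an undirected graph (standard construction)."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)            # node degrees
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                # Metropolis-Hastings weight for edge (i, j)
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i].sum()         # self-weight makes each row sum to 1
    return P

# Example: a ring of 4 agents; P is symmetric and doubly stochastic.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
P = metropolis_hastings_weights(A)
assert np.allclose(P, P.T) and np.allclose(P.sum(axis=0), 1.0)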

Zeroth-order Oracles
Our algorithm requires the first and second derivatives of the local cost functions at the current point. Since in the zeroth-order setting this information is not available, we estimate it using central finite-differencing schemes along orthogonal directions. Given a generic function f and a small positive scalar µ, we adopt the following approximations of the gradient and of the Hessian diagonal:

$$\hat\nabla f(\mu, x) := \sum_{j=1}^{d} \frac{f(x + \mu e_j) - f(x - \mu e_j)}{2\mu}\, e_j, \tag{3}$$

$$\hat\nabla^2 f(\mu, x) := \sum_{j=1}^{d} \frac{f(x + \mu e_j) - 2 f(x) + f(x - \mu e_j)}{\mu^2}\, e_j, \tag{4}$$

where e_j denotes the j-th canonical basis vector. Obtaining these estimates involves 2d + 1 function queries, which can be computationally demanding for high-dimensional problems. However, numerical simulations suggest that knowing the slope and curvature information along all the coordinates may significantly improve the convergence rate, resulting in an overall comparable or lower number of function evaluations with respect to randomized schemes. We note that many algorithms, such as [13] and [14], already require 2d function queries for the gradient estimation alone. In this work, by querying the function also at the current iterate, which is in any case necessary to evaluate the progress, we estimate the Hessian diagonal almost for free. Using Taylor expansions, it is easily seen that in case of noiseless function evaluations the approximation errors of the estimators are bounded by terms of order µ² (explicit bounds are derived in Appendix B), and that the Jacobian of the gradient estimator takes the form

$$D_x \hat\nabla f(\mu, x) = \nabla^2 f(x) + R(\mu, x),$$

where R(µ, x) collects the finite-difference remainder terms. Remark 1: The approximation errors of the estimators and the absolute values of the elements of the matrix R(µ, x) are bounded by functions of µ². This is due to the symmetry of the central-difference scheme, and shows that the estimates are very reliable.
Since for a quadratic function L_2 = L_3 = 0, we have the following result:
Lemma 1. In case of quadratic functions the gradient estimator and the Hessian estimator are exact, i.e. they coincide with the true gradient and with the diagonal of the true Hessian matrix, respectively.
As shown below, both the estimators are Lipschitz continuous, and we will use this fact to prove the convergence of the algorithm.
Let x* be the unique global minimizer of f(x). The following lemma states that for any sufficiently small µ there exists a unique point Γ(µ) near x* at which the gradient estimator vanishes. Moreover, the distance between Γ(µ) and the optimal solution scales with µ² and can therefore be made arbitrarily small.
Lemma 2. Let f(x) be a function which satisfies Assumption 1 and let ∇̂f(µ, x) be defined by (3). Then there exist μ̄, δ > 0 and a continuously differentiable map Γ : R → R^d such that, for every µ ≤ μ̄, x = Γ(µ) is the unique point satisfying ‖Γ(µ) − x*‖ ≤ δ and ∇̂f(µ, Γ(µ)) = 0.
For reasons of space, the proof of the lemma, as well as the derivations of the bounds presented so far, can be found in Appendix B. To lighten the notation, from now on we will denote the gradient and Hessian estimators by ∇̂f(x) and ∇̂²f(x), keeping the dependence on the parameter µ implicit.
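To make the oracles concrete, here is a minimal sketch of the (2d + 1)-query estimators described above; the function and variable names are ours, and the code follows the standard central-difference formulas (3)-(4).

import numpy as np

def zo_oracles(f, x: np.ndarray, mu: float):
    """Central finite-difference estimates of the gradient and of the
    Hessian diagonal, sharing the same 2d + 1 function queries."""
    d = x.size
    f0 = f(x)                                  # query at the current iterate
    grad = np.empty(d)
    hess_diag = np.empty(d)
    for j in range(d):
        e = np.zeros(d); e[j] = mu
        fp, fm = f(x + e), f(x - e)            # 2 queries per coordinate
        grad[j] = (fp - fm) / (2.0 * mu)       # central difference, O(mu^2) error
        hess_diag[j] = (fp - 2.0 * f0 + fm) / mu**2
    return grad, hess_diag

# On a quadratic, the estimates are exact up to rounding (Lemma 1):
A = np.diag([1.0, 3.0]); b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x + b @ x
g, h = zo_oracles(f, np.array([0.5, -1.0]), mu=1e-3)
print(g, h)   # ~ A @ x + b = [1.5, -5.0] and diag(A) = [1.0, 3.0]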

Proposed algorithm
In this subsection we present ZO-JADE, an algorithm that efficiently solves the distributed convex optimization problem (1). Our method can be seen as the ZO counterpart of a special case of [15], with the additional difference that consensus is also performed directly on the variable to be optimized. In particular, we employ a Newton-type update which only requires the diagonal of the Hessian, sometimes referred to as Jacobi descent [16], whence the name ZO-JADE (JAcobi DEcentralized). This allows us to benefit from the accelerated convergence rate typical of second-order methods while estimating the curvature only along the canonical basis. Moreover, since the Hessian estimate is a vector, the storage requirement and the communication overhead are linear in d, and inverting the (diagonal) Hessian estimate is straightforward.
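For concreteness, on a single smooth function a Jacobi (i.e. diagonal Newton) step with step size ε takes the following textbook form, which is not yet the distributed rule of ZO-JADE:

$$x^{+} = x - \varepsilon\, \operatorname{diag}\big(\hat\nabla^2 f(x)\big)^{-1}\, \hat\nabla f(x).$$

Compared to a full Newton step, only the d diagonal curvature entries are estimated, stored and inverted.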
We now provide an intuitive explanation of the algorithm. The idea is to approximate each local function with a d-dimensional paraboloid with suitable coefficients, matching at the current iterates {x_i}_{i=1}^n the first and second derivatives of the model with the estimates provided by the ZO oracles. We then want to move towards the minimum of the resulting approximation of the global function, namely the point (6), which takes the form of an elementwise ratio between aggregated curvature and gradient information (see the reconstruction below). To do so, we define local variables y_i, z_i ∈ R^d and estimate the sums in (6) using gradient and Hessian tracking. As the name suggests, this technique introduces two additional consensus variables which track the network averages of the derivative estimates (8). Indeed, using recursion it is easy to verify that, by initializing the tracking variables to the corresponding local quantities, the network averages of the tracking variables coincide with the averages of the quantities being tracked at every iteration, which is property (9). Assuming now that the mean of {x_i}_{i=1}^n is almost constant, the local variables {y_i}_{i=1}^n and {z_i}_{i=1}^n will eventually reach consensus, becoming equal to the numerator and denominator of (6). The near-stationarity of the mean of {x_i}_{i=1}^n is justified within the time-scale separation framework, under which a sub-system can be regarded as being at steady state provided that it evolves sufficiently fast compared to the rest of the system; equivalently, a fast system can treat a slow one as a fixed constraint. In our case, the consensus (8) must be part of the fast dynamics, while the dynamics of the iterates must be sufficiently slow. We will prove that there exists a positive scalar ε̄ such that for any ε ≤ ε̄ the time-scale separation holds and the algorithm converges exponentially fast. Moreover, property (9) guarantees that we are moving towards the minimum (6) of the global objective, allowing the convex combination coefficient ε to be constant.
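Since the displays defining the local model and the target (6) did not survive extraction, the following is a hedged reconstruction of their plausible form: each agent builds a diagonal quadratic surrogate q_i, and the minimizer of the average of the surrogates is an elementwise ratio of aggregated second- and first-order information.

$$q_i(x) = f_i(x_i) + \hat\nabla f_i(x_i)^\top (x - x_i) + \tfrac{1}{2}\,(x - x_i)^\top \operatorname{diag}\big(\hat\nabla^2 f_i(x_i)\big)\,(x - x_i),$$

$$x^\star = \Big(\sum_{i=1}^{n} \operatorname{diag}\big(\hat\nabla^2 f_i(x_i)\big)\Big)^{-1} \sum_{i=1}^{n} \Big(\operatorname{diag}\big(\hat\nabla^2 f_i(x_i)\big)\, x_i - \hat\nabla f_i(x_i)\Big).$$

Setting the gradient of the average of the q_i to zero immediately yields the ratio above, whose numerator and denominator are exactly the quantities tracked by y_i and z_i.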
Numerical simulations show that the consensus on the current iterates in the update rule of x_i(t), which is not present in [15], brings significant performance improvements, at the price of one additional communication variable. However, the main concern in the zeroth-order setting is minimizing the number of function queries, which in our case is directly proportional to the number of iterations. Accordingly, the cost of the communication overhead becomes negligible when compared to the gain in terms of convergence rate.
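Since the algorithm's update displays are not reproduced above, the following Python sketch shows one plausible shape of the iteration, combining the tracking recursions (8) with consensus on the iterates; all names, and the exact placement of the coefficient ε, are our assumptions rather than the paper's verbatim updates.

import numpy as np

def zo_jade_step(X, Y, Z, G, H, P, oracles, eps):
    """One schematic iteration of a ZO-JADE-style method (a sketch, not the
    paper's exact rule). X: (n, d) local iterates; Y, Z: tracking variables
    for the numerator and denominator of (6); G, H: oracle outputs from the
    previous iteration (Y, Z initialized to G, H at t = 0); P: (n, n)
    consensus matrix; oracles[i]: ZO oracle of agent i."""
    G_new, H_new = np.empty_like(G), np.empty_like(H)
    for i in range(X.shape[0]):
        g, h = oracles[i](X[i])          # 2d + 1 queries per agent
        G_new[i] = h * X[i] - g          # local numerator  h_i ⊙ x_i - g_i
        H_new[i] = h                     # local denominator (Hessian diagonal)
    # Average tracking: consensus mixing plus the local innovation.
    Y = P @ Y + (G_new - G)
    Z = P @ Z + (H_new - H)
    # Jacobi step towards (6), blended with consensus on the iterates.
    X = (1.0 - eps) * (P @ X) + eps * (Y / Z)
    return X, Y, Z, G_new, H_new

Under strong convexity the tracked curvature stays positive, so the elementwise division is well defined; the tracking identity guarantees that the network averages of Y and Z equal those of the tracked quantities at every iteration, in line with property (9).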

Main result
The proof of convergence of our algorithm relies on the following Theorem 1, whose proof is given in Appendix A.

Theorem 1 (Semi-global exponential stability). Consider the system

$$x(k+1) = x(k) + \varepsilon\, \varphi\big(x(k), y(k)\big), \qquad y(k+1) = \phi\big(y(k), x(k)\big) + \varepsilon\, \Psi\big(y(k), x(k)\big),$$

and assume that it is well defined for any x ∈ R^n, y ∈ R^n and k ∈ N. Also assume that the following conditions are satisfied for all x ∈ R^n, y ∈ R^n, k ∈ N:
(i) The map ϕ(y, x) satisfies all the conditions of Theorem 6.3 in [17]; in particular, there exists an equilibrium map y*(x) such that y*(x) = ϕ(y*(x), x).
(ii) The map φ is locally uniformly Lipschitz and φ(0, y*(0)) = 0.
(iii) There exists a twice differentiable function V(x) certifying the exponential stability at the origin of the reduced system x(k+1) = x(k) + ε φ(x(k), y*(x(k))).
(iv) The map Ψ is locally uniformly Lipschitz and Ψ(x, y*(x)) = 0 for all x.
Then for each r > 0 there exist ε_r, C_r, γ_r, possibly depending on r, such that for all initial conditions satisfying ‖y_0 − y*(0)‖² + ‖x_0‖² < r² and all ε ∈ (0, ε_r), the state converges exponentially fast, with constants C_r and γ_r, to the equilibrium (0, y*(0)).
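Assuming the reconstruction of the system above is accurate, the two time scales can be read off directly: setting ε = 0 freezes the slow variable and yields the boundary-layer (fast) dynamics, while substituting the equilibrium map y*(x) into the slow update yields the reduced dynamics:

$$\text{boundary layer:}\quad y(k+1) = \phi\big(y(k), x\big) \ \ (x \text{ fixed}), \qquad \text{reduced:}\quad x(k+1) = x(k) + \varepsilon\, \varphi\big(x(k), y^*(x(k))\big).$$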
Below we present the main result, whose proof is based on a reformulation of the algorithm as a discrete-time system compatible with Theorem 1.

Theorem 2 (Convergence of ZO-JADE). Under Assumption 1, there exists a positive ε̄ such that for any ε ≤ ε̄ ZO-JADE is guaranteed to converge semi-globally and exponentially fast to Γ(µ).
Proof. To simplify the notation, let us stack the local variables and define

$$\bar x(t) := \frac{1}{n} \sum_{i=1}^{n} x_i(t), \qquad \tilde x(t) := x(t) - \mathbf{1}\, \bar x(t),$$

which are the mean of x(t) and the displacement from it. This allows us to rewrite the algorithm as a dynamical system and to decouple the fast dynamics (consensus and tracking) from the slow dynamics of the mean. Since x(t) = 1 x̄(t) + x̃(t) and g(t), h(t), y(t), z(t) ultimately depend only on quantities at time t − 1, the system is autonomous and can be further rewritten in the form required by Theorem 1. We are thus in the position to apply Theorem 1, which guarantees the exponential stability of the system for a suitable choice of the parameter ε. We now show that all the assumptions of that theorem are verified. The fact that the gradient and Hessian estimators are differentiable and Lipschitz continuous guarantees that the functions ϕ, φ and Ψ are continuously differentiable and globally uniformly Lipschitz. Considering the boundary-layer system ξ(k) = ϕ(ξ(k − 1), x̄), obtained by setting ε = 0, we note that x̃ vanishes and all the other components of ξ reach consensus; in particular, the steady-state values of y and z are the network averages of the tracked quantities. This verifies the assumption that there exists ξ* such that ξ*(x̄) = ϕ(ξ*(x̄), x̄) for any choice of x̄. For the same reason, it is easy to see that Ψ(ξ*(x̄), x̄) = 0 for all x̄. Recalling Lemma 2, we also have φ(Γ(µ), ξ*(Γ(µ))) = 0. To satisfy assumption (iii) of Theorem 1 we assume without loss of generality that the target solution Γ(µ), which is arbitrarily close to the global minimum of f(x), coincides with the origin. This corresponds to solving an equivalent problem in which the objective function is translated by an offset.
Finally, it is sufficient to choose as Lyapunov function the squared norm of the gradient estimator, V(x̄) = ‖∇̂f(x̄)‖² (analyzed in Appendix B.5), which is strictly positive except at the origin, where V(0) = 0. Indeed, it is possible to derive suitable upper and lower bounds on V(x̄), whose proofs are contained in Appendix B. To ensure that the lower bound on V(x̄) is non-trivial and that we are moving along a descent direction, when L_3 ≠ 0 we require that µ ≤ min{µ_1, µ_2}, where µ_1 and µ_2 are explicit constants depending on d, m and the Lipschitz constants (see Appendix B.5.2). Since all the required conditions are verified, the proof is complete.

Remark 2:
We recall that while the algorithm converges to Γ(µ), by virtue of Lemma 2 it is possible to make such a point arbitrarily close to the actual global minimum of f(x) by reducing µ.

Numerical results
We now present numerical results which validate the effectiveness of ZO-JADE and show its superiority over several state-of-the-art ZO algorithms for multi-agent optimization.
In particular, we compare our method with the following distributed ZO algorithms: PSGF [18], which employs a push-sum strategy; ZOOM-PB [19], the accelerated version of ZOOM, which requires the Laplacian matrix of the network and uses a powerball function; ZONE-M [20], a primal-only version of the method of multipliers; the primal-dual method proposed in [21], dubbed here ZOASDNO from the initials of its title, based on the Laplacian framework; and the algorithm of [13], which uses a 2d-point gradient estimator and gradient tracking, denoted here by the acronym DZOANMO. For all these algorithms we tune the parameters as suggested in the respective papers. For randomized gradient estimators which consider only a subset of coordinates, we set the number of search directions to d, which is also the value that provided the best results in our experiments. We average over multiple runs, starting from different randomly generated initial points x_i(0), which are the same for all the algorithms.

We consider a network of n = 10 agents satisfying Assumption 2, where the consensus matrix P is built using the Metropolis-Hastings weights, and we test the algorithms on two types of cost functions. In the first scenario we consider quadratic objectives: specifically, we perform ridge regression on the condition-based maintenance dataset [22] to predict the health state of a machine. In the second test we consider more general convex functions, addressing a binary classification problem via logistic regression. In particular, we run one-vs-all classification, in which one target class must be distinguished from all the others, on the well-known MNIST dataset [23]. Following the approach of [24] on a similar dataset, we apply PCA to reduce the size of the problem to d = 20 and provide each agent with a balanced set D_i = {s_1, . . . , s_{|D_i|}}, where half of the samples belong to the target class and the other half to the remaining classes, equally mixed. The objective functions are the regularized log-losses

$$f_i(x) = \frac{1}{|D_i|} \sum_{k=1}^{|D_i|} \log\!\big(1 + e^{-l_k\, x^\top \tilde s_k}\big) + w\, \|x\|^2,$$

where l_k ∈ {−1, 1} is the label corresponding to the predictor sample s_k ∈ R^{d−1}, s̃_k denotes s_k augmented with a constant entry accounting for the bias, and the quadratic term with w > 0 ensures strong convexity.

To take into account the possible differences between the local solutions x_i, we evaluate the performance of the algorithms using a loss that measures the aggregate suboptimality of the local iterates with respect to x*, the unique minimizer of f. The speed of convergence is measured in terms of the number of function evaluations, which in the zeroth-order optimization field are assumed to be the most expensive computations. The numerical simulations clearly show that ZO-JADE outperforms the other algorithms in the considered scenarios, as it achieves an equivalent or lower loss value using a smaller number of function evaluations. This suggests that our method is suitable both when a very precise solution is desired and when only a limited budget of function queries is available.
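As an illustration, a minimal sketch of one agent's objective follows; the helper name, the bias handling via an appended constant feature, and the exact regularization scaling are our assumptions.

import numpy as np

def local_logloss(x, S, l, w):
    """Regularized logistic loss of one agent, as described above.
    S: (m, d-1) predictor samples; l: (m,) labels in {-1, +1};
    w: regularization weight ensuring strong convexity."""
    Sb = np.hstack([S, np.ones((S.shape[0], 1))])   # append bias feature -> (m, d)
    margins = l * (Sb @ x)                          # l_k * <x, s_k~>
    # log(1 + exp(-t)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margins)) + w * np.dot(x, x)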

Conclusion
We presented a novel zeroth-order algorithm for distributed convex optimization, which is the first of its kind to exploit second-order derivative information. We provided an intuitive explanation of our method and formally proved its semi-global stability and exponentially fast convergence. The theoretical results are validated by numerical simulations on publicly available real-world datasets, in which ZO-JADE is shown to outperform several zeroth-order distributed algorithms from the literature. We foresee interesting possibilities for future work, such as extensions to the online and stochastic settings.

A Separation of the time-scales
In this appendix we use Lyapunov theory to analyze the stability of a particular class of discrete-time systems. In particular, we provide a proof of Theorem 1, which extends Proposition 8.1 in [17] to the case in which the fast dynamics also depends on the step-size parameter ε, allowing the application of the time-scale separation principle to a broader class of problems. We use the notation and terminology introduced in [17] and refer to results contained therein. For an equivalent treatment of the continuous-time case we refer the reader to the chapter 'Singular Perturbations' in [25].
Proof. Assume that all the assumptions in the statement of Theorem 1 are verified. Let ỹ(k) be defined as ỹ(k) := y(k) − y*(x(k)). We can write

$$\tilde y(k+1) = y(k+1) - y^*(x(k+1)) = \phi\big(y(k), x(k)\big) + \varepsilon\, \Psi\big(y(k), x(k)\big) - y^*(x(k+1))$$

$$= \phi\big(\tilde y(k) + y^*(x(k)),\, x(k)\big) + \varepsilon\, \Psi\big(\tilde y(k) + y^*(x(k)),\, x(k)\big) - y^*\Big(x(k) + \varepsilon\, \varphi\big(x(k),\, \tilde y(k) + y^*(x(k))\big)\Big).$$

Then the dynamics of the original system can be written in this new coordinate system. By assumption (i), there exists a Lyapunov function W(ỹ, x) satisfying quadratic bounds b_1‖ỹ‖² ≤ W(ỹ, x) ≤ b_2‖ỹ‖² and decreasing along the boundary-layer dynamics. By assumption (iii), Proposition 7.3 in [17] guarantees the existence of a Lyapunov function V(x) and of a constant c ∈ (0, 1] such that a_1‖x‖² ≤ V(x) ≤ a_2‖x‖² and V decreases along the reduced dynamics. Let us define the extended state vector z = [xᵀ ỹᵀ]ᵀ and consider the global Lyapunov function U(x, ỹ) := V(x) + W(ỹ, x), which has the property

$$\alpha \|z\|^2 \le U(x, \tilde y) \le \beta \|z\|^2,$$

where α := min{a_1, b_1} and β := max{a_2, b_2}. Consider now the sublevel sets of U, which are closed and compact by continuity, and note that the initial condition belongs to one such set B_{r_0}, where we used the fact that ỹ_0 = y_0 − y*(0). A number of basic bounds then follow immediately from the Lipschitz and vanishing properties of the various functions when the state is restricted to the compact set (x, ỹ) ∈ B_{r_0}.

To simplify the notation, let us write x = x(k), x⁺ = x(k+1) and χ = x + ε φ(x, y*(x)), as well as ỹ = ỹ(k) and ỹ⁺ = ỹ(k+1). We first derive an upper bound on the increment of the Lyapunov function V along the slow dynamics. Then, using the properties of the Lyapunov function W and the several bounds obtained before, we bound the increment of W. Considering that ‖x‖ and ‖ỹ‖ are bounded on B_{r_0} and that y* is a twice differentiable function of x, y* itself is bounded there, which yields constants B_{W2} and C_{W2} depending on r and on the Lipschitz bounds. Using also assumption (iv) and summing up all the previous inequalities, we obtain, for suitable positive constants w_{A1}, w_{B1}, w_{B2}, w_{C1}, w_{C2}, w_{C3}, bounds on the increments of V and W. Now it is possible to evaluate ∆U(x, ỹ) = U(x⁺, ỹ⁺) − U(x, ỹ). It holds that ∆U(x, ỹ) = ∆W(ỹ, x) + ∆V(ỹ, x), and a bound in quadratic form easily follows. The aim is now to show that ∆U(x, ỹ) is always negative for a small enough choice of the parameter ε; it is therefore necessary to study whether the matrix Q_U associated with the quadratic form is negative definite. To this end it is enough to verify that the first principal minor is negative and the second (the determinant) is positive for some choice of ε. In Landau notation, these quantities are of order −ε and ε², respectively, and so there exists ε_r such that for ε < ε_r the matrix Q_U is negative definite. As a consequence, there exists a positive constant ℓ_U such that ∆U(x, ỹ) ≤ −ε ℓ_U ‖z‖². From the latter inequality a geometric decrease of U follows, and since ‖x(0)‖² + ‖ỹ(0)‖² ≤ r², there exists C_r > 0 such that the exponential bound of the statement holds.

B Properties of the derivative estimators
This appendix contains the derivations of the properties of the two derivative estimators, for a function f : R^d → R that satisfies Assumption 1. For convenience, we rewrite here the expressions of the estimators:

$$\hat\nabla f(\mu, x) = \sum_{j=1}^{d} \frac{f(x + \mu e_j) - f(x - \mu e_j)}{2\mu}\, e_j, \qquad \hat\nabla^2 f(\mu, x) = \sum_{j=1}^{d} \frac{f(x + \mu e_j) - 2 f(x) + f(x - \mu e_j)}{\mu^2}\, e_j.$$

We also define ∇̂f(0, x) := lim_{µ→0} ∇̂f(µ, x) = ∇f(x), and similarly for its derivatives. Given two matrices A, B ∈ R^{n×n}, we will often use the chain of inequalities

$$\sigma_{\min}(A)\, \|B\| \le \|AB\| \le \sigma_{\max}(A)\, \|B\|,$$

where σ_min(·) and σ_max(·) are the minimum and maximum singular values of the argument.

B.1 Gradient estimator
Using a Taylor expansion with integral remainder on a generic element of the gradient estimator, we obtain

$$\hat\nabla f_j(\mu, x) = \frac{f(x + \mu e_j) - f(x - \mu e_j)}{2\mu} = \frac{1}{2\mu} \int_{-\mu}^{\mu} \frac{\partial f}{\partial x_j}(x + t\, e_j)\, dt,$$

from which we can see that the estimator is Lipschitz continuous:

$$\|\hat\nabla f(\mu, x) - \hat\nabla f(\mu, y)\| \le \sqrt{d}\, L_1\, \|x - y\|,$$

since each component is an average of values of the corresponding partial derivative, which is L_1-Lipschitz. The approximation error is bounded by

$$\|\hat\nabla f(\mu, x) - \nabla f(x)\| \le \frac{\sqrt{d}\, L_2}{6}\, \mu^2.$$

B.2 Hessian estimator
Using a Taylor expansion with integral remainder on a generic element of the Hessian estimator, we obtain

$$\hat\nabla^2 f_j(\mu, x) = \frac{f(x + \mu e_j) - 2 f(x) + f(x - \mu e_j)}{\mu^2} = \frac{1}{\mu^2} \int_{0}^{\mu} (\mu - t) \left[ \frac{\partial^2 f}{\partial x_j^2}(x + t\, e_j) + \frac{\partial^2 f}{\partial x_j^2}(x - t\, e_j) \right] dt,$$

from which we can see that the estimator is Lipschitz continuous:

$$\|\hat\nabla^2 f(\mu, x) - \hat\nabla^2 f(\mu, y)\| \le \sqrt{d}\, L_2\, \|x - y\|.$$

The approximation error is bounded by

$$\|\hat\nabla^2 f(\mu, x) - \operatorname{diag}(\nabla^2 f(x))\| \le \frac{\sqrt{d}\, L_3}{12}\, \mu^2.$$

We can now prove Lemma 1.
Proof. Consider a generic quadratic function f(x) = ½ xᵀAx + bᵀx + c. A direct computation gives f(x + µe_j) − f(x − µe_j) = 2µ (Ax + b)_j and f(x + µe_j) − 2f(x) + f(x − µe_j) = µ² A_{jj}, so that ∇̂f(µ, x) = Ax + b = ∇f(x) and ∇̂²f(µ, x) = diag(A) = diag(∇²f(x)) for every µ > 0.
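As a quick numerical sanity check of the O(µ²) error bounds derived above, one can run the following snippet; it assumes the hypothetical zo_oracles helper sketched in the oracle section.

import numpy as np

# Assumes the hypothetical zo_oracles(f, x, mu) helper sketched earlier.
f = lambda x: np.exp(x[0]) + x[1] ** 4            # smooth, non-quadratic test function
x0 = np.array([0.3, 0.7])
true_grad = np.array([np.exp(0.3), 4 * 0.7 ** 3])
true_hdiag = np.array([np.exp(0.3), 12 * 0.7 ** 2])
for mu in (1e-1, 1e-2, 1e-3):
    g, h = zo_oracles(f, x0, mu)
    # Both errors should shrink by roughly 100x per 10x decrease of mu.
    print(mu, np.linalg.norm(g - true_grad), np.linalg.norm(h - true_hdiag))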

B.3 Jacobian of the gradient estimator
A generic element of the Jacobian of the gradient estimator satisfies

$$\frac{\partial \hat\nabla f_j(\mu, x)}{\partial x_i} = \frac{1}{2\mu} \left[ \frac{\partial f}{\partial x_i}(x + \mu e_j) - \frac{\partial f}{\partial x_i}(x - \mu e_j) \right] = \frac{\partial^2 f(x)}{\partial x_i\, \partial x_j} + \frac{1}{2\mu} \int_{0}^{\mu} (\mu - t) \left[ \frac{\partial^3 f}{\partial x_i\, \partial x_j^2}(x + t\, e_j) - \frac{\partial^3 f}{\partial x_i\, \partial x_j^2}(x - t\, e_j) \right] dt.$$

Define R(µ, x) as the d × d matrix which contains the integral terms above, so that

$$D_x \hat\nabla f(\mu, x) = \nabla^2 f(x) + R(\mu, x), \qquad |R_{ij}(\mu, x)| \le \frac{L_3}{6}\, \mu^2.$$

B.4 Zeros of the gradient estimator
Here we provide a proof of Lemma 2.
Proof. Let x* be the unique global minimizer of the strongly convex function f(x), for which ∇f(x*) = 0. Given µ, we want to characterize the set of solutions of the equation ∇̂f(µ, x) = 0. Note that ∇̂f(µ, x) is twice differentiable in both its arguments x and µ. Recall that ∇̂f(0, x) = ∇f(x), and similarly for its derivatives, so that ∇̂f(0, x*) = 0. By Appendix B.3, the Jacobian matrix of ∇̂f(µ, x) with respect to x is D_x∇̂f(µ, x) = ∇²f(x) + R(µ, x). We want to apply the Implicit Function theorem, and to do so we need D_x∇̂f(0, x*) to be invertible. Indeed, R(0, x) = 0, and therefore D_x∇̂f(0, x*) = ∇²f(x*) ≻ 0 is invertible. We also note that the partial derivative of ∇̂f(µ, x) with respect to µ is well defined and continuous, where we applied the Leibniz integral rule and used the Lipschitz continuity of the Hessian. By the Implicit Function theorem, there exist μ̄, δ > 0 and a continuously differentiable function Γ : R → R^d such that for all µ ≤ μ̄ there is a unique x = Γ(µ) ∈ R^d such that ‖Γ(µ) − x*‖ ≤ δ and ∇̂f(µ, Γ(µ)) = 0.

B.5 Square norm of the gradient estimator
In the proof of Theorem 2 we use the squared norm of the gradient estimator as a Lyapunov function; to do so, we need to analyze some properties of this quantity. In the following we use the shorthand Γ_µ = Γ(µ).

B.5.1 Upper bound
Using Lipschitz continuity and the fact that ∇̂f(Γ_µ) = 0, the norm of the gradient estimator can be bounded by

$$\|\hat\nabla f(x)\| = \|\hat\nabla f(x) - \hat\nabla f(\Gamma_\mu)\| \le \sqrt{d}\, L_1\, \|x - \Gamma_\mu\|.$$

B.5.2 Lower bound
We consider the Taylor expansion of ∇̂f(x) around Γ_µ. Recalling that ∇̂f(Γ_µ) = 0 and that ∂∇̂f(x)/∂x = ∇²f(x) + R(µ, x), and plugging this result into the quantity of interest, we obtain a lower bound on ‖∇̂f(x)‖² in terms of ‖x − Γ_µ‖², which, when L_3 ≠ 0, is a positive number provided that

$$\mu < \sqrt{\frac{6\left(\sqrt{L_1^2 + m^2} - L_1\right)}{d\, L_3}}.$$

B.5.3 Norm of the partial derivative
We want to obtain an upper bound on the norm of the partial derivative of the gradient estimator with respect to µ, i.e. on ‖∂∇̂f(µ, x)/∂µ‖.