SHED: A Newton-type algorithm for federated learning based on incremental Hessian eigenvector sharing

There is a growing interest in the distributed optimization framework that goes under the name of Federated Learning (FL). In particular, much attention is being turned to FL scenarios where the network is strongly heterogeneous in terms of communication resources (e.g., bandwidth) and data distribution. In these cases, communication between local machines (agents) and the central server (Master) is a main consideration. In this work, we present SHED, an original communication-constrained Newton-type (NT) algorithm designed to accelerate FL in such heterogeneous scenarios. SHED is by design robust to non i.i.d. data distributions, handles heterogeneity of agents' communication resources (CRs), only requires sporadic Hessian computations, and achieves super-linear convergence. This is possible thanks to an incremental strategy, based on eigendecomposition of the local Hessian matrices, which exploits (possibly) outdated second-order information. The proposed solution is thoroughly validated on real datasets by assessing (i) the number of communication rounds required for convergence, (ii) the overall amount of data transmitted and (iii) the number of local Hessian computations. For all these metrics, the proposed approach shows superior performance against state-of-the art techniques like GIANT and FedNL.


Introduction
With the growing computational power of edge devices and the booming increase of data produced and collected by users worldwide, solving machine learning problems without having to collect data at a central server is becoming very appealing (Shi et al., 2020).
One of the main reasons for not transferring users' data to cloud central servers is due to privacy concerns.Indeed, users, such as individuals or companies, may not want to share their private data with other network entities, while training their machine learning algorithms.In addition to privacy, distributed processes are by nature more resilient to node/link failures and can be directly implemented on the servers at network edge, i.e., within multi-access edge computing (MEC) scenarios (Pham et al., 2020).
This distributed training framework goes under the name of federated learning (FL) (McMahan et al., 2017;Li et al., 2020a), and it has attracted much research interest in recent years.Direct applications of FL can be found, for example, in the field of healthcare systems (Rieke et al., 2020;Huang et al., 2020) or of smartphone utilities (Hard et al., 2018).
One of the most popular FL algorithms is federated averaging (FedAvg) (McMahan et al., 2017), which achieves good results, but which also comes without convergence guarantees and has been shown to diverge under heterogeneous data distributions (Li et al., 2020b).
Indeed, among the open challenges of FL, a key research question is how to provide efficient decentralized optimization algorithms in scenarios with heterogeneous communication links (different bandwidth) and non i.i.d.data distributions (Kairouz et al., 2021;Zhu et al., 2021;Smith et al., 2017;Zhao et al., 2018).These aspects are found in a variety of applications, such as the so-called federated edge learning framework, where learning is moved to the network edge, and which often involves unstable and heterogeneous wireless connections (Shi et al., 2020;T. Dinh et al., 2022;Nguyen et al., 2021;Chen et al., 2020).In addition, the MEC and FL paradigms are of high interest to IoT scenarios such as data retrieval and processing within smart cities, which naturally entail non i.i.d.data distributions due to inherent statistical differences in the underlying spatial processes (e.g., vehicular mobility, user density, etc.) (Liu et al., 2020).
The bottleneck represented by communication overhead is one of the most critical aspects of FL.In fact, in scenarios with massive number of devices involved, inter-agent communication can be much slower than the local computations performed by the FL agents themselves (e.g., the edge devices), by many orders of magnitude (Li et al., 2020a).The problem of reducing the communication overhead of FL becomes even more critical when the system is characterized by non i.i.d.data distributions and heterogeneous communication resources (CRs) (Li et al., 2020b).In this setting, where the most critical aspect is inter-node communications, FL agents are usually assumed to have good computing capabilities, and wisely increasing the computation effort at the agents is a good strategy to obtain a faster convergence.For this reason, Newton-type approaches, characterized by strong robustness and fast convergence rates, even if computationally demanding, have been recently advocated (Gupta et al., 2021;Safaryan et al., 2021).
In the present work, we propose a Newton-type FL framework that is naturally robust to non i.i.d.data distributions and that intelligently allocates the (per-iteration) communication resources of the involved FL agents, allowing those with more CRs to contribute more towards improving the global convergence rate.Differently from prior art, the proposed framework requires FL agents to locally compute the Hessian matrix sporadically.Its super-linear convergence is here proven by studying the CRs-dependent convergence rate of the algorithm by analyzing the dominant Lypaunov exponent of the estimation error.With respect to other Newton-type methods for FL, our empirical results demonstrate that the proposed framework is (i) competitive with state-of-the-art approaches in i.i.d.scenarios and (ii) robust to non i.i.d.data distributions, for which it outperforms competing solutions.

Related work
Next, to put our contribution into context, we review previous related works on first and second order methods for decentralized learning.
First-order methods.Among the FL contributions considering first-order methods, some work has been recently proposed to face the problem of non i.i.d.data and heterogeneous networks.SCAFFOLD (Karimireddy et al., 2020), FedProx (Li et al., 2020b) and the work in Yu et al. (2019) provide modification to the FedAvg algorithm to face system's heterogeneity.To deal with non i.i.d.datasets, strategies like the ones by Zhao et al. (2018) and Jeong et al. (2018) have been proposed, although these approaches require that some data is shared with the central server, which does not fit with the key privacy requirements of FL.To reduce the FL communication overhead, the work by Chen et al. (2018) leverages the usage of outdated first-order information by designing simple rules to detect slowly varying local gradients.With respect to the heterogeneous time-varying CRs problem, Amiri and Gündüz (2020) presented techniques based on gradient quantization and on analog communication strategies related to the additive nature of the wireless channel, while Chen et al. (2020) studied a framework to jointly optimize learning and communication for FL in wireless networks.
Second-order methods.Newton-type (NT) methods exploit second-order information of the cost function to provide accelerated optimization, and are therefore appealing candidates to accelerate FL.NT methods have been widely studied in distributed learning: GIANT (Wang et al., 2018) is an NT approach that exploits the harmonic mean of local Hessian matrices in decentralized settings.Other related approaches were proposed in LocalNewton (Gupta et al., 2021), DANE (Shamir et al., 2014), AIDE (Reddi et al., 2016), DiSCO (Zhang and Lin, 2015), DINGO (Crane and Roosta, 2019).These algorithms, however, were all designed for i.i.d.data distributions.The work in T. Dinh et al. (2022), DONE, has proposed an NT approach for FL, similar to GIANT, to be applied in federated edge learning scenarios.Communication efficient NT methods like GIANT and DONE exploit an extra communication round to get an estimation of the global Hessian as the harmonic mean of the local Hessian matrices.However, as we show empirically in this contribution, these approaches are sensitive to non i.i.d.data configurations.A recent study, FedNL (Safaryan et al., 2021), proposed algorithms based on matrix compression that exploit theory developed in Islamov et al. (2021) to perform decentralized optimization by iteratively learn the Hessian matrix at the optimum.A similar approach based on matrix compression techniques has been studied in Basis Learn (Qian et al., 2021), proposing variants of FedNL exploiting change of basis in the space of matrices to improve the quality of the compressed Hessian.However, these approaches do not consider heterogeneity in the CRs, and require the computation of the local Hessian at each iteration.Another NT approach has been recently proposed in Liu et al. (2021), based on the popular L-BFGS (Liu and Nocedal, 1989) quasi-Newton method.Although some preliminary NT approaches have been proposed for the FL framework, there is still large space for improvements with respect to both non i.i.d.data distributions and system and CRs heterogeneity.

Contribution
In this work, we design a communication-constrained NT optimization algorithm for the FL framework that is suitable for networks with heterogeneous CRs and non i.i.d.data.The proposed approach requires only sporadic Hessian computation and it is shown to be an effective enabler for accelerated convergence in FL convex problems, reducing communication rounds and communication load in networks with heterogeneous CRs and non i.i.d.data distributions.
In the proposed algorithm, local machines, the agents, compute the local Hessian and its singular value decomposition (SVD), equivalent to eigendecomposition for positive definite matrices.Then, according to their available communication resources (CRs), agents share some of the most relevant Hessian eigenvectors with the server (the Master).Together with the eigenvectors agents send to the Master a scalar quantity -computed locally -that allows a full-rank Hessian approximation.The obtained approximations are averaged at the Master in order to get an approximated global Hessian, to be used to perform a global NT descent step.The approximation of the Hessian matrix at the Master is then improved by incrementally sending further eigenvector-eigenvalue pairs (EEPs).In particular, for a general convex problem, the proposed algorithm is designed to make use of outdated Hessian approximations.Thanks to this, local Hessian computation is required only sporadically.We analyze the convergence rate via the analysis of the dominant Lyapunov exponent of the objective parameter estimation error.In this novel framework, we show that the usage of outdated Hessian approximations allows improving the global convergence rate.Furthermore, we show that the EEPs of the outdated local Hessian approximations can be shared heterogeneously with the Master, so the algorithm is versatile with respect to per-iteration heterogeneous CRs.We also show that, thanks to the incremental strategy, the algorithm enjoys a super-linear convergence rate.
Differently from existing approaches, the proposed algorithm allows devices with more CRs (such as larger bandwidth or better channel quality), to provide a per-iteration more significant contribution (by sharing more EEPs) to the optimization acceleration with respect to devices whose CRs are limited.This latter characteristic is particularly significant in heterogeneous networks, where CRs are not uniformly distributed among the agents and change dynamically through time.
With respect to the problem of non i.i.d.data distributions, our approach is supported by theoretical results that do not require i.i.d.assumptions.Indeed, our approach, differently, for example, from methods based on the harmonic mean, is not impacted at all by the considered non i.i.d.data distributions.
In order to be applied, the proposed algorithm requires agents to be able to locally compute the Hessian matrix and its SVD, that is the same requirement of competing approaches like FedNL, see Safaryan et al. (2021).

Organization of the paper
The rest of the paper is organized as follows: in Section 2 we describe the problem formulation and the idea of the algorithm.In Section 3 we present the algorithm and the theoretical results in the linear regression (least squares) case, describing the convergence rate with respect to the algorithm parameters.Section 4 is instrumental to extend the algorithm to a general strongly convex problem, that is done in Section 5.At the end of Section 5, based on the theoretical results, we propose some heuristic choices for tuning the algorithm parameters.Finally, in Section 6 we show the empirical performance of the proposed approach on real datasets.In particular, we validate the theoretical results and compare against state-of-the art approaches.In Appendices, we include the longer proofs and some additional experiments.
Notation: Vectors and matrices are written as lower and upper case bold letters, respectively (e.g., vector v and matrix V).The operator • denotes the 2-norm for vectors and the spectral norm for matrices.diag(v) denotes a diagonal matrix with the components of vector v as diagonal entries.I denotes the identity matrix.

Problem formulation
In Fig. 1 we illustrate the typical framework of a federated learning scenario.Local machines (agents), A 1 , ..., A M , hold a fraction of the global dataset and cooperate -communicating with a Master through communication links, L 1 , ..., L M -to solve the optimization problem.
In the rest of the paper, we denote the optimization parameter by θ ∈ Θ, and a generic cost function by f : Θ → R. In what follows, Θ = R n , where n is the dimension of the feature data vector x ∈ R n .Let us denote a dataset by D = {x j , y j } N j=1 , where N is the global number of data samples.x j ∈ R n denotes the j-th data sample and y j ∈ R the response of sample j.In the case of classification problems, y j would be an integer specifying the class to which sample x j belongs.We consider the problem of regularized empirical risk minimization of the form: with where each l j (θ) is a convex function related to the i-th element of D and µ a regularization parameter.In particular, l j (θ) = l(x T j θ, y j ), where l : R → R is a convex function.Examples of convex cost functions used in this work are: • linear regression with quadratic cost (least squares):

Master
Figure 1: The typical decentralized optimization framework: agents A 1 , ..., A M cooperate to solve a common learning problem.After receiving the global parameter θ t , at the current iteration t, they share their optimization sets U (1) t , ..., U (M ) t so that a new global update can be performed at the Master.
• logistic regression: Logistic regression belongs to the class of generalized linear models.In this paper we consider optimization problems in which the following assumption on the cost function f holds: Assumption 1 Let H(θ) := ∇ 2 f (θ).f is twice continuously differentiable, K-smooth, κ-strongly convex and L-Lipschitz continuous, i.e.: In the paper, we denote the data matrix by X = [x 1 , ..., x N ] ∈ R n×N , and the label vector as Y = [y 1 , ..., y N ] ∈ R 1×N .We refer to the dataset D as the tuple D = (X, Y).We denote by M the number of agents involved in the optimization algorithm, and we can write , where N i is the number of data samples of the i-th agent.We denote the local dataset of agent i by The following assumption is needed to prove some of the theoretical results related to the proposed algorithm.
As a consequence of the above assumption, letting L = max i∈{1,...,M } L i , we have that

A Newton-type method based on SVD
The Newton method to solve (1) works as follows (Newton update): where t denotes the t-th iteration, g t = g(θ t ) = ∇f (θ t ) is the gradient at iteration t and η t is the step size (also called learning rate) at iteration t.H t = ∇ 2 f (θ t ) denotes the Hessian matrix at iteration t.Compared to gradient descent, the Newton method exploits the curvature information provided by the Hessian matrix to improve the descent direction.We define p t := H −1 t g t .In general, Newton-type (NT) methods try to get an approximation of p t .In a decentralized scenario, assuming for simplicity that all M agents have the same amount of data, we have that: where = ∇f (i) (θ t ) denote local Hessian and gradient of the local cost f (i) (θ t ) of agent i, respectively.To get a Newton update at the master, in a decentralized setting one would need each agent to transfer the whole matrix H (i) t of size O(n 2 ) to the master at each iteration, that is considered a prohibitive communication complexity in a federated learning setting, especially when the size of the feature data vectors, n, increases.Because of this, in this work we propose an algorithm in which the Newton-type update takes the form: where Ĥt is an approximation of H t .In some previous contributions, like Wang et al. (2018); T. Dinh et al. (2022), Ĥt is the harmonic mean of local Hessians, obtained at the master through an intermediate additional communication round to provide each agent with the global gradient g t .In the recent FedNL algorithm (Safaryan et al., 2021), each agent at each iteration sends a compressed version of a Hessian-related matrix.In the best performing version of the algorithm, the rank-1 approximation using the first eigenvector of the matrix is sent.Authors show that thanks to this procedure they can eventually get the Hessian matrix at the optimum, which provides super-linear convergence rate.Instead, our proposal is to incrementally obtain at the master an average of full-rank approximations of local Hessians through the communication of the local Hessian most relevant eigenvectors together with a carefully computed local approximation parameter.In particular, we approximate the Hessian matrix H t exploiting singular value decomposition (SVD) in the following way: the symmetric positive definite Hessian H t can be diagonalized as H t = V t Λ t V T t , with Λ t = diag(λ 1,t , ..., λ n,t ), where λ k,t is the eigenvalue corresponding to the k-th eigenvector, v k,t .H t can be approximated as with Λt := Λt (ρ t , q t ) = diag(λ 1,t , ..., λ qt , ρ t , ..., ρ t ).The scalar ρ t > 0 is the approximation parameter (if ρ t = 0 this becomes a low-rank approximation).The integer q t = 1, ..., n denotes the number of eigenvalue-eigenvector pairs (EEPs) {λ k,t , v k,t } qt k=1 being used to approximate the Hessian matrix.We always consider eigenvalues ordered so that λ 1,t ≥ λ 2,t ≥ ... ≥ λ n,t .The approximation of Eqn.(4) was used in Erdogdu and Montanari (2015) for a subsampled centralized optimization problem.There, the parameter ρ t was chosen to be equal to λ qt+1 .In the decentralized setting of FL, we use approximation shown in Eqn.(4) to approximate the local Hessian matrices of the agents.In particular, letting Ĥ(i) t be the approximated local Hessian of agent i, the approximated global Hessian is the average of the local approximated Hessian matrices: where Ĥ(i) t := Ĥt (ρ t and of the number of eigenvalue-eigenvector pairs q (i) t shared by agent i, which are denoted by

Idea of the algorithm
The idea of the algorithm is that agents share with the Master, together with the gradient, some of their local Hessian EEPs, according to the available CRs.They share the EEPs in a decreasing order dictated by the value of the positive eigenvalues corresponding to the eigenvectors.At each iteration, they incrementally add new EEPs to the information they have sent to the Master.In a linear regression problem, in which the Hessian does not depend on the current parameter, agents would share their EEPs incrementally up to the n-th.When the n-th EEP is shared, the Master has the full Hessian available and no further second order information needs to be transmitted.In a general convex problem, in which the Hessian matrix changes at each iteration, as it is a function of the current parameter, our algorithm is designed in a way in which agents perform a renewal operation at certain iterations, i.e., they re-compute the Hessian matrix and re-start sharing the EEPs from the most relevant ones of the new matrix.

Linear regression (least squares)
In this section we illustrate our algorithm and present the convergence analysis considering the problem of solving (1) via Newton-type updates (3) in the least squares (LSs) case, i.e., in the case of linear regression with quadratic cost.Before moving to the decentralized case, we prove some results for the convergence rate of the centralized iterative least squares problem.First, we provide some important definitions.In the case of linear regression with quadratic cost, the Hessian is such that H LS := H(θ), ∀θ, i.e., the Hessian does not depend on the parameter θ.Because of this, when considering LSs, we write the SVD as H LS = VΛV T , with Λ = diag(λ 1 , ..., λ n ) without specifying the iteration t when not needed.Let θ * denote the solution to (1).In the following, we use the fact that the cost for parameter θ t can be written as f Similarly, the gradient can be written as g t = H LS (θ t − θ * ).
In this setup, the update rule of Eqn.(3) can be written as a time-varying linear discretetime system: where

Centralized iterative least squares
In this sub-section, we study the optimization problem in the centralized case, so when all the data is kept in a single machine.We provide a range of choices for the approximation parameter ρ t that are optimal in the convergence rate sense.We denote the convergence factor of the descent algorithm described by Eqn.
(3) by making its dependence on the tuple (ρ t , η t , q t ) explicit.
Theorem 1 Consider solving problem (1) via Newton-type updates (3) in the least squares case.At iteration t, let the Hessian matrix H LS be approximated as in Eqn.(4) (centralized case).The convergence rate is described by Let q t ∈ {0, 1, ..., n}.Then: 1.A sufficient condition for the convergence of system (5) is 2. The best achievable convergence factor is where Proof From (4), writing H LS = VΛV T and Ĥt = V Λt V T , with Λt = diag(λ 1 , ..., λ qt , ρ t , ..., ρ t ), define θ t+1 ρt,ηt := θ t − η t Ĥ−1 t g t .Recalling that g t = H LS (θ t − θ * ), we have: (10) 1) Given the structure of A t , we have that A t < 1, ∀t if and only if all the eigenvalues of A t are smaller than one in magnitude.It is straightforward to see that to have A t < 1 it needs to be η t ∈ (0, 2).For any ρ t ∈ (0, +∞), if η t ∈ (0, 2), a necessary and sufficient condition for having 2) For some given q t ∈ {1, ..., n}, r t is a function of two tunable parameters, i.e., the tuple (η t , ρ t ).We now prove that r * t can be achieved if and only if The convergence rate is determined by the eigenvalue of (I − η t Λ−1 t Λ) with the greatest absolute value.First, we show that there exists an optimal η * t for which r * t is achieved.If ρ t < λ n , the choice of η t minimizing the maximum absolute value of (I and the achieved factor is We see that the definition of the set D * immediately follows. In the above Theorem, we have shown that the best convergence rate is achievable, by tuning the step size, as long as ρ t ∈ [λ n , λ qt+1 ].In the following Corollary, we provide an optimal choice for the tuple (ρ t , η t ) with respect to the estimation error.
Corollary 2 Among the tuples (ρ t , η t ) ∈ D * , the choice of the tuple (ρ * t , 1), with ρ * t defined in (8), is optimal with respect to the estimation error θ t+1 ρt,ηt − θ * , for any θ t and for any t, in the sense that , where η * t is defined in (11) and ρ * t in (8).Now define z t := V T (θ t −θ * ) and z t+1 ρt,ηt := V T (θ t+1 ρt,ηt −θ * ), where (θ t+1 ρt,ηt − θ * ) is defined in (10).We have because the cross term is 2(B * t z t ) T (δB t z t ) = 0. We see that, for any t and for any θ t , We remark that the bound in ( 6) is tight for r t = r * t .If q t increases, the convergence factor r * t decreases until it becomes zero, when q t = n − 1, thus we can have convergence in a finite number of steps.

Decentralized iterative least squares
We now consider the decentralized scenario described in section 2, in which M agents keep their local data and share optimization parameters to contribute to the learning algorithm.
Assumption 3 In the rest of the paper, we assume that each agent has the same amount of data samples, N i = N M .This allows us to express global functions, such as the gradient, as the arithmetic mean of local functions (e.g., This assumption is made only for notation convenience.It is straightforward to show that all the results are valid also for N i different for each i.To show it, it is sufficient to replace the arithmetic mean of local functions with the weighted average, weighting each local function with p i = N i /N .As an example, the global gradient would be written in the following way: In this subsection we introduce the algorithm for the LSs case (Algorithm 1), that is a special case of Algorithm 4, described in Sec. 5, which is designed for a general convex cost.We refer to Algorithm 1 as Incremental Newton (IN) and it works as follows: at iteration t, each device shares with the Master some of its local Hessian eigenvectors scaled by quantities related to the corresponding eigenvalues.The eigenvectors are shared incrementally, where the order in which they are shared is given by the corresponding eigenvalues.For example, at t = 1 agent i will start by sharing its first Hessian eigenvectors v according to its communication resources (CRs) and then will incrementally send up to v (i) n−1 in the following iterations.To enable the approximation of the local Hessian via a limited number of eigenvectors using (4), each eigenvector is sent to the Master together with λ (i) j and the parameter ρ (i) t .The Master averages the received information to obtain an estimate of the global Hessian as follows: Algorithm 1: Least Squares -Incremental Newton (IN) , g t , ρ t to the Master.Compute Ĥt (as in eq. ( 12)) and g t . 19 Perform Newton-type update (3) with η t = 1.
where Ĥ(i) in which q t is the number of the local Hessian eigenvectors-eigenvalues pairs that agent i has already sent to the Master at iteration t.We denote by d (i) t the increment, meaning the number of eigenvectors that agent i can send to the Master at iteration t.Given the results shown in Section 3.1, in this Section we fix the local approximation parameter to be ρ n )/2 and the step size to be η t = 1.The following results related to the convergence rate allow the value q (i) t to be different for each agent i, so we define By construction, the matrix Ĥt is positive definite, being the sum of positive definite matrices, implying that −p t = − Ĥ−1 t g t is a descent direction.Theorem 3 Consider the problem in (1) in the least squares case.Given Ĥt defined in (12), the update rule defined in (3) is such that, for ρ If Algorithm 1 is applied, c t+1 ≤ c t ∀t, and, if for all i it holds that q The last inequality follows from two inequalities: LS the local Hessian at agent i.We have where the last equality holds because and because ρ This theorem shows that our approach in the LSs case provides convergence in a finite number of iterations, if q (i) t keeps increasing through time for each agent i.Indeed, as in the centralized case, if q (i) t increases for all i, the factor c t decreases until it becomes zero.Furthermore, each agent is free to send at each iterations an arbitrary number of eigenvector-eigenvalue pairs, according to its CRs, and by doing so it can improve the global convergence rate.

From least squares to general convex cost
We want to extend the analysis and algorithm presented in the previous sections of the paper to a general convex cost f (θ t ).With respect to the proposed approach, the general convex case requires special attention for two main reasons: (i) the update rule defined in (3) requires tuning of the step-size η t , usually via backtracking line search, and (ii) the Hessian matrix is in general a function of the parameter θ.In this section, still focusing on the least squares case, we provide some results that are instrumental to the analysis of the general convex case.

Backtracking line search for step size tuning
We recall the well-known Armijo-Goldstein condition for accepting a step size η t via backtracking line search: where α ∈ (0, 1/2).The corresponding line search algorithm is the following: Algorithm 2: Backtracking line search algorithm Input: α ∈ (0, 1/2), β ∈ (0, 1), θ t , p t , g t , f Output: Lemma 4 Consider the problem in (1) in the least squares case.Let p t = Ĥ−1 t g t , with Ĥt defined in (12).A sufficient condition for a step size η t to satisfy Armijo-Goldstein condition (18), for any α ∈ (0, 1/2), is Proof The quadratic cost in θ t can be written as Given that f (θ * ) does not depend on θ t , we can focus on f (θ t ).We have that where we have used identity H LS (θ t − θ * ) = g t and the fact that p T t g t = p T t Ĥt Ĥ−1 t g t = p T t Ĥt p t to get equality (1) and ( 2), respectively.We see that if then condition ( 18) is satisfied.Indeed, in that case, So, we need to find a sufficient condition on η t for (21) to be true.We see that ( 21) is , and, ∀i, all the elements of the diagonal matrix Λ (see Eq. ( 17)), and we see that the choice (19) provides a sufficient condition to satisfy (21), from which we can conclude.
Corollary 5 In the least squares case, choosing ρ , ∀i, Armijo backtracking line search (Algorithm 2) would choose a step size η t = 1.Proof The proof is straightforward from Eqn. ( 19) of Lemma 4.
The results of Lemma 4 and of Corollary 5 are crucial for the design of the algorithm in the general convex case.Indeed, as illustrated in Section 5, a requirement for the theoretical results on the convergence rate is that the step size becomes equal to one, which is not guaranteed by the Armijo backtracking line search, even when considering the least squares case, if ρ . For this reason, the algorithm in the general convex case is designed with ρ

Algorithm with periodic renewals
Now, we introduce a variant of Algorithm 1 that is instrumental to study the convergence rate of the proposed algorithm in the general convex case.The variant is Algorithm 3. The definition of I = {1, T, 2T, ...} implies that every T iterations the incremental strategy is restarted from the first EEPs of H LS , in what we call a periodic renewal.Differently from Algorithm 1, Algorithm 3 can not guarantee convergence in a finite number of steps, because T < n and thus it could be that c t > 0, ∀t (see Theorem 3).We study the convergence rate of the algorithm by focusing on upper bounds on the Lyapunov exponent (Lyapunov, 1992) of the discrete-time dynamical system ruled by the descent algorithm.
The Lyapunov exponent characterizes the rate of exponential (linear) convergence and it is defined as the positive constant a * > 0 such that, considering h(θ t ) := (θ t − θ * )a −t , if a > a * then h(θ t ) vanishes with t, while if a < a * , for some initial condition, h(θ t ) diverges.
The usual definition of Lyapunov exponent for discrete-time linear systems (see Czornik et al., 2012) is, considering the system defined in (5), From ( 14), we have that, for each k, A k ≤ c k = (1 − λn /ρ k ).This implies that, defining it is a * ≤ ā.The following Lemma formalizes this bound and provides an upper bound on the Lyapunov exponent obtained by applying Algorithm 3.
Proof We see that a * ≤ ā from definition (22), because for each k, t−1 + 1, ∀i, t, then ā = āT .To show this latter inequality, let us consider the logarithm of the considered values, so we prove log ā = log āT .Indeed, if q where R = t/T and T = t − RT .

General decentralized convex problem
Given the previous analysis and theoretical results for linear regression with quadratic cost, we are now ready to illustrate our Newton-type algorithm (Algorithm 4) for general convex problems, of which Algorithm 1 is a special case.We refer to the algorithm as IOS, that stands for Incremental algorithm exploiting (possibly) Outdated Second-order information.
Since in a general convex problem the Hessian depends on the current parameter, θ t , we denote by H(θ t ) the global Hessian at the current iterate, while we denote by Ĥt the global approximation, defined similarly to (12), with the difference that now eigenvalues and eigenvectors depend on the parameter for which the Hessian was computed.
The expression of Ĥt thus becomes: where k (i) t ≤ t denotes the iteration in which the local Hessian of agent i was computed.The parameter θ k (i) t is the parameter for which agent i computed the local Hessian, that in turn is being used for the update at iteration t.
The idea of the algorithm is to use previous versions of the Hessian rather than always recomputing it.This is motivated by the fact that as we approach the solution of the optimization problem, the second order approximation becomes more accurate and the Hessian changes more slowly.Hence, recomputing the Hessian and restarting the incremental approach provides less and less advantages as we proceed.From time to time, however, we need to re-compute the Hessian corresponding to the current parameter θ t , because H(θ kt ) could have become too different from H(θ t ).We call this operation a renewal.We denote by I the set of iteration indices at which a renewal takes place.In principle, each agent could have its own set of renewal indices, and decide to recompute the Hessian matrix independently.In this work, we consider for simplicity that the set I is the same for all agents, that means that all agents use the same parameter for the local Hessian, i.e., k At the end of this section we describe a heuristic strategy to choose I with respect to the theoretical analysis.We remark that in the case of a quadratic cost, in which the Hessian is constant, one chooses I = {1}, and so Algorithm 1 is a special case of Algorithm 4.

20
Get η t via decentralized backtracking line search.
The singular value decomposition can be applied to the local Hessian as before, we define For notation convenience we also define ṽ(i) The theoretical results on the convergence rate in this Section require that Ĥ(i) (θ) ≥ H (i) (θ), ∀θ, which in turn requires, defining ρ(i) . Given the results of Theorem 1 on the range of the approximation parameter ρ t achieving the best convergence rate in the centralized least squares case, we set: The local Hessian can be approximated as Clearly, it still holds that Ĥt ≥ ρt I, ∀t, where ρt = 1 M M i=1 ρ(i) t .Furthermore, it is easy to see that Ĥt ≤ KI.The following Lemma is fundamental to prove convergence of the algorithm.We assume to use Armijo backtracking condition, that is recalled in Algorithm 2.
Lemma 7 If g t > ω > 0, for Ĥt defined in (26), in particular, for a backtracking line search with parameters α ∈ (0, 0.5), β ∈ (0, 1), it holds: Proof The proof is the same as the one provided in Boyd and Vandenberghe (2004) for the damped Newton phase (page 489-490), with the difference that the "Newton decrement", that we denote by σ t , here is Furthermore, the property Ĥt ≥ ρt I, ∀t is used in place of strong convexity.Proof When ρt > 0, ∀t, Lemma 7 implies g t → 0. Indeed, if not, there would be some > 0 such that g t > , ∀t.But then (31) immediately implies that f (θ t ) → −∞, that contradicts the strong convexity hypothesis.By strong convexity and differentiability of f , Note that, for the considered cost function (see ( 4)) and choice of ρ t , the presence of a regularization term µ > 0 is sufficient to have ρ (i) t > 0, ∀i, ∀t.Now, we provide results related to the convergence rate of the algorithm.In order to prove the following results, we need to impose some constraints on the renewal indices set I. Specifically, Assumption 4 Denoting I = {C j } j∈N , there exists a finite positive integer l such that C j ≤ C j−1 + l.This is equivalent to state that, writing k t = t − τ , the 'delay' τ is bounded.
Proof [Sketch of proof] The proof works by (i) applying Taylor expansion to the cost function and to the gradient, for which Assumption 2 about twice differentiable cost function is exploited, (ii) using the results of Corollary 5 about step size acceptance (iii) using Assumption 4 to show that there exists an iteration for which the Armijo condition applies.For the complete proof, see Appendix A.
The above Lemma is crucial to prove the results about linear and super-linear convergence.
The next Theorem provides a bound describing the relation between the convergence rate and the increments of outdated Hessians.
Proof the beginning of the proof follows from the proof of Lemma 3.1 in Erdogdu and Montanari ( 2015), (see page 18).In particular, we can get the same inequality as (A.1) in Erdogdu and Montanari (2015) (the Q t here is Ĥt ) with the difference that since we are not sub-sampling, we have (using the notation of Erdogdu and Montanari ( 2015)) S = [n].Setting η t = 1, the following inequality holds: We now focus on the first part of the right hand side of the inequality where the last equality holds being .
The above theorem is a generalization of Lemma 3.1 in Erdogdu and Montanari (2015) with step size equal to one.In particular, the difference is that (i) the dataset is decentralized and (ii) an outdated Hessian matrix is used.The next theorem shows that, in general, Algorithm 4 enjoys linear convergence if Assumption 4 is satisfied.
Theorem 11 Let Assumption 4 be satisfied, then 1. Algorithm 4 enjoys linear convergence.Hence, we can write, for some C > 0, where a * ∈ (0, 1) is the Lyapunov exponent of the estimation error.

If q
(i) t < n − 1, ∀i, ∀t, the Lyapunov exponent of the estimation error, a * , is such that where λo n and ρo k are the average of the n-th eigenvalues and approximation parameters, respectively, computed at the optimum: (θ * ).
2) The bound of Theorem 10 is used together with Assumption 2 on Lipschitz continuity of local cost functions to relate the eigenvalues of the Hessian at the current iteration with the eigenvalues at the optimum.3) One can use calculations similar to the ones of Lemma 6.For the complete proof, see Appendix A.
In the above Theorem we have shown the linear convergence property of the algorithm.Furthermore, through the analysis of the Lyapunov exponent, we have shown that, close enough to the optimum, the convergence rate is related to the number of EEPs of the outdated Hessian analogously to the linear regression case.In result 2) of the Theorem we have kept the general expression, with the lim sup, because this applies to any configuration of the CRs and emphasize how the rate can be always improved by the agents when they do increment the number of EEPs that they share with the Master.
Theorem 12 If the indices set is chosen equal to I = {C j } = {n j}, with n := n − 1, and at each iteration t, q t−1 + 1, ∀i, Algorithm 4 enjoys super-linear convergence, or, equivalently, the Lyapunov exponent of the estimation error dynamical system is a * = 0. Proof [Sketch of proof] The proof exploits the fact that, from Theorem 11, the convergence rate of the algorithm is at least linear.Starting from the bound of Theorem 10 and considering the Theorem conditions on q (i) t , some calculations provide the result.For the complete proof, see Appendix A.
The assumptions of this theorem have been done to simplify the proof, but they can be relaxed.In particular, I could be different from {n j}, but become equal to {n j} only after a certain amount of iterations.Furthermore, with some slight modification of the algorithm, it is possible to show that the Theorem holds also if q (i) t grows differently for each agent, that is the more realistic case.Note that Theorem 12 guarantees superlinearity, but in practice it is only meaningful when the parameter size n is not too big.

Heuristics for the choice of I: the Fibonacci sequence
From the theoretical results provided above we can do a heuristic design for the renewal indices set I. In particular, we see from the result of Lemma 7 and the bound (33) in Theorem 10 that when we are at the first iterations of the optimization we would frequently do the renewal operation, given that the Hessian matrix changes much faster and that the term c 2,t is big.As we converge, instead, we would like to reduce the number of renewal operations, to improve the convergence rate, and this is strongly suggested by the results 2) and 3) related to the Lyapunov exponent in Theorem 11.Furthermore, the superlinear convergence that follows from Theorem 12 requires q (i) t = n − 1 for each agent, so we should let the sequence of indices I increase in some way so that after a certain point renewals occur periodically after all agents have shared the EEPs up to the n − 1-th.
To evaluate the algorithm performance, we obtain the results of the next section by choosing the distance between renewals to be determined by the Fibonacci sequence, so I = {C j }, where F k being the Fibonacci sequence, with F 0 = 0, F 1 = 1.When the sequence C j reaches n − 1, the next values of the sequence are chosen so that C j+1 = C j + n − 1.Other possibilities could be to increase the distance between renewals with constant increases or with growing increase, but further study for the choice of I is left as future work.

Empirical Results
In this section we present empirical results obtained with real datasets.In particular, to illustrate Algorithm 1, thus in the case of regression via least squares, we consider the popular Million Song (1M Songs) Dataset (Bertin-Mahieux et al., 2011) and the Online News Popularity dataset (Fernandes et al., 2015), both taken from the UCI Machine Learning Repository (Dua and Graff, 2017).For the more general non linear convex case, we apply logistic regression to two image datasets, in particular the FMNIST (Xiao et al., 2017) and EMNIST digits (Cohen et al., 2017) datasets, and to the 'w8a' web page dataset available from the libSVM library (Chang and Lin, 2011).First, we show performance assessments related to the theoretical results of the previous sections.In particular, we show how the versatility of the algorithm applies effectively to the FL framework, showing results related to parameters choice and heterogeneity of CRs.Finally, we compare our algorithm against state-of-the-art approaches in both i.i.d. and non i.i.d.data distributions, showing the advantage that can be provided by our approach, especially in the case of heterogeneous CRs.When comparing with other algorithms, we provide results for both i.i.d. and non i.i.d.partitions to show that our algorithm is competitive with state-of-the-art approaches also in the i.i.d.configuration.
In the general convex case, we follow the heuristics discussed at the end of Section 3 and choose the indices set to be I = {C j }, where C j is defined in eq. ( 38).We call this version of the algorithm Fib-IOS.As done previously in the paper, Algorithm 1 is referred to as IN, and Algorithm 4 as IOS, while their periodic variants (Algorithm 3), are referred to as IN-periodic and IOS-periodic, respectively.If the value of d (i) t is not specified, IN and IOS stand for the proposed approach when d (i) t = 1 for each iteration t and agent i, i = 1, ..., M .In all the results shown in this section, we have fixed µ = 10 −5 , but we have obtained similar results with µ = 10 −4 , µ = 10 −6 and µ = 10 −8 , that were the values considered in Wang et al. (2018).In subsection 6.7 when comparing against other algorithms in the convex case, we show results for both µ = 10 −5 and µ = 10 −6 .

Datasets
We consider the following datasets.For each dataset, we partition the dataset for the FL framework in both an i.i.d. and non i.i.d.way.In particular:

Least squares
We consider the popular Milion Song Dataset (Bertin-Mahieux et al., 2011).The objective of the regression is the prediction of the year in which a song has been recorded.The dataset consists in 515K dated tracks starting from 1922 recorded songs.We take a subset of 300K tracks as training set.The feature size of each data sample is n = 90.As preprocessing of the dataset we apply rescaling of the features and labels to have them comprised between 0 and 1.To get a number of samples per agent in line with typical federated learning settings, we spread the dataset to M = 200 agents, so that each agent has around 800 data samples.
As a non i.i.d configuration, we consider a scenario with label distribution skew and unbalancedness (see Kairouz et al. (2021)).In particular, we partition the dataset so that each agent has songs from only one year, and different number of data samples.In this configuration, some agent has a very small number of data samples, while other agents have up to 11K data samples.
We also provide some results using the Online News Popularity (Fernandes et al., 2015), made of 40K online news data samples.The target of the prediction is the number of shares given processed information about the online news.Data is processed analogously to 1M Songs.M = 30 agents are considered with 500 data sample each.

Logistic regression
To validate our algorithm with a non linear convex cost we use two popular and widely adopted image datasets and a web page prediction dataset.Image datasets: We consider the FMNIST (Fashion-MNIST) dataset (Xiao et al., 2017), and the EMNIST (Extended-MNIST) digits dataset (Cohen et al., 2017).For both, we consider a training set of 60K data samples.We apply standard pre-processing to the datasets: we rescale the data to be between 0 and 1 and apply PCA (Sheikh et al., 2020), getting a parameter dimension of n = 300.Given that the number of considered data samples is smaller with respect to 1M-Songs, we distribute the dataset to M = 28 agents.
We focus on one-vs-all binary classification via logistic regression, which is the basic building block of multinomial logistic regression for multiclass classification.In a one-vs-all setup, one class (that we call target class), needs to be distinguished from all the others.The i.i.d.configuration is obtained by uniformly assigning to each agent samples from the target class and mixed samples of the other classes.When the agents perform one-vs all, we provide them with a balanced set where half of the samples belong the target class, and the other half to the remaining classes, equally mixed.In this way, each agent has around 400 data samples.
To partition the FMNIST and EMNIST datasets in a non i.i.d.way, we provide each agent with samples from only two classes: one is the target class and the other is one of the other classes.This same approach was considered in Li et al. (2020b).In this section we show results obtained with a target class corresponding to label 'one' of the dataset, but we have seen that equivalent results are obtained choosing one of the other classes as target class.
Web page dataset: We consider the web page category prediction dataset 'w8a', available from libSVM (Chang and Lin, 2011).The dataset comprises 50K data samples and the objective of the prediction is the distinction between two classes, depending on the website category.The dataset is already pre-processed.Being a binary and strongly unbalanced dataset, we do not consider an i.i.d.configuration, and spread the dataset to the agents similarly to the case of FMNIST, with the difference that some agent only has data samples from one class.The number of data samples per agent is between 700 and 900.

Decentralized backtracking
To tune the step size when there are no guarantees that a step size equal to one decreases the cost, we adopt the same strategy adopted in Wang et al. (2018): an additional communication round takes place in which each agent shares with the master the loss obtained when the parameter is updated via the new descent direction for different values of the step size.In this way, we can apply a decentralized version of the popular Armijo backtracking line search (see Algorithm 2 or Boyd and Vandenberghe (2004), page 464).When showing the results with respect to communication rounds, we always include also the additional communication round due to backtracking.

Heterogeneous channels model
As previously outlined, our algorithm is suitable for FL frameworks in which agents have heterogeneous transmission resources, that is for instance the case in Federated Edge Learning.As an example we consider the case of wireless channels, that are characterized by strong variability.In particular, in the wireless channel subject to fading conditions and interference, at each communication round agents could have very different CRs available (Pase et al., 2021;Wadu et al., 2020;Chen et al., 2020).To show the effectiveness of our algorithm in such a scenario, we consider the Rayleigh fading model.In a scenario with perfect channel knowledge, in which agent i allocates a certain bandwidth B (i) = B (we consider it to be the same for all agents) to the FL learning task, we have that the achievable rate of transmission for the i-th user is: where Γ (i) is a value related to transmission power and environmental attenuation for user i.For simplicity we fix Γ (i) = Γ = 5 for all users (in Pase et al. (2021), for instance, Γ = 1 and Γ = 10 were considered).The only source of variability is then γ ∼ Exp(ν), modeling the Rayleigh fading effect.We fix ν = 1.With respect to our algorithm, for illustration purposes, we compare scenarios in which the number of vectors in R n related to second order information, d t , is deterministically chosen and fixed for all users, with a scenario in which d (i) t is randomly chosen for each round for each user.In particular, we fix: This would correspond to each user allocating a bandwidth of B = d 0 B 0 to the transmission of second order information, where B 0 is the transmission speed needed to transmit a vector in R n per channel use.In the rest of the paper, we fix d 0 = 2.This implies that in average the number of vectors that an agent is able to transmit is 4, and the actual number can vary from 0 (only the gradient is sent) to 10.For simplicity, when adopting this framework, we assume that agents are always able to send at least a vector in R n per communication round (so the gradient is always sent).Further simulations are left as future work.

Choice of ρ t
In Fig. 2 we analyze the optimization performance with respect to the choice of the approximation parameter ρ t , following the theoretical results of Section 3.1, in particular Theorem 1 and Corollary 2. We provide performance analysis in the decentralized setting for least squares with the 1M Song and Online News Popularity datasets in (a) and (b), respectively.We show the error θ t − θ * versus the number of communication rounds.We adopt the notation of Section 5 as it includes least squares as a special case.In this subsection, the same parameter choice is done for all the agents, thus we omit to specify that the parameter is specific of the i-th agent.We compare two choices for the approximation parameter: (i) the one providing the best estimation error according to Corollary 2, ρ * t = ( λqt+1,t + λn,t )/2 with the step size η t = 1, and (ii) the one that was proposed in Erdogdu and Montanari (2015), where ρ t = λqt+1,t .In the latter case, following the result of Theorem 1 related to the best convergence rate we pick the step size to be equal to the η * t in Eq. ( 11).From the plots, we can see the improvement provided by the choice of ρ * t against λqt+1,t , although the performance is very similar for both values of the approximation parameter.

Lyapunov exponent convergence bound
In Fig. 3, we illustrate how the Lyapunov exponent bounds derived in Lemma 6 and Theorem 11 (specifically, Eqs. ( 24) and ( 37)) characterize the linear convergence rate of the algorithms.In particular, we consider the cases of periodic renewals in both the least squares and logistic regression case, considering the same conditions used to derive Eqs. ( 24) and (37), so q (i) t = q (i) t−1 + 1, ∀i, t and periodic renewals with period T .The plots show how the linear convergence rate is dominated by the Lyapunov exponent bound characterized by āT .For illustrative purposes, we show the results for the choice of T = 35 and of T = 25 for least squares on 1M songs and logistic regression on FMNIST, respectively.

Role of d t
In Figure 4, we show the impact of each agent transmitting more eigenvalue-eigenvector pairs per communication rounds, showing the optimization results when the increment, d (i) t , takes different values.We consider the case when the number is the same and fixed (specifically we consider d (i) t ∈ {1, 3, 6, 30}) for all the agents and the case d , where d γ is as defined in Eq. ( 40), with d 0 = 2 and Γ = 5, that means that in average d γ is equal to 4. In this latter case each agent is able to transmit a different random number of eigenvalue-eigenvector pairs.This configuration is relevant as our algorithm allows agents to contribute to the optimization according to their specific communication resources (CRs).In Figures 4-(a)-(c) we show the results for the least squares on 1M Songs, while in Figures 4-(b)-(d) we show the results in the convex case of logistic regression on the FMNIST dataset.We can see from Figures 4-( 24) and ( 37) for (a) and (b), respectively.Note that the upper bound is on the slope of the decreasing cost.
can be significantly reduced by increasing the amount of information transmitted at each round.In particular, at each round, the number of vectors in R n being transmitted is d t +1, since together with the d t scaled eigenvectors, {ṽ t−1 +1 (see Algorithm 2), agents need to transmit also the gradient.In the least squares case, the number of iterations needed for convergence is surely smaller than the number of EEPs sent: when the n − 1-th EEP has been shared, convergence surely occurs.When the number of EEPs is random (d γ ), but equal to 4 in average, we see that we can still get a significant improvement, which is between the choices d t = 3 and d t = 6.
Let us now analyze the case of logistic regression.In Figures 4-(b)-(d) we emphasize the role of the renewal operation, showing also how incrementally adding eigenvectors of the outdated Hessian improves the convergence, as formalized and shown in Theorems 10 and 11.In particular, from Figure 4-(b) it is possible to appreciate the impact of increasing the interval between renewals.
Quantifying the improvement provided by the usage of greater increments analyzing the empirical results, we see that when d t = 3, the number of vectors transmitted per round is 4, against the 2 transmitted when d t = 1, so the communication load per round is twice as much.We see that, despite this increase in data transmitted per round, in the case d t = 3 the overall number of communication rounds is halved with respect to the case d t = 1.An equivalent result occurs for the case d t = 6.Notably, for the case d t = d γ , with an average increment equal to 4, we see that the convergence speed is in between the cases d t = 3 and d t = 6, which shows the effectiveness of the algorithm under heterogeneous channels.For d t = 30, even if we increase the communication by a factor 15, we get a convergence that is only 10 times faster.This is of interest because it shows that if we increase arbitrarily the increment d t , we pay in terms of overall communication load.
In Figures 4-(c)-(d) we plot the error as a function of the amount of data transmitted per agent, where the number of vectors in R n is the unit of measure.These plots show that,   for small values of d t (in particular, d t ∈ {1, 3, 6}), the overall data transmitted does not increase for the considered values of d t , meaning that we can significantly reduce the number of global communication rounds by transmitting more data per round without increasing the overall communication load.This is true in particular also for the case d t = d γ , thus when the agents' channels availability is heterogeneous at each round, showing that our algorithm works even in this relevant scenario without increasing the communication load.
On the other hand, in the case of d t = 30, if we increase the amount of information by an order of magnitude, even if we get to faster convergence, we have as a consequence a significant increase in the communication load that the network has to take care of.

Comparison against other algorithms
In this subsection, IN+ and IOS+ signifies that the number of Hessian EEPs is chosen randomly for each agent i, according to the model for simulating heterogeneous CRs illustrated in Sec.6.3, so d (i) t = d γ , with an average increment equal to 4. We compare the performance of our algorithm with a state-of-the art first-order method, accelerated gradient descent (AGD, with the same implementation of Wang et al. (2018)).
As benchmark second-order methods we consider the following: • a decentralized version of the Newton-type method proposed in Erdogdu and Montanari (2015), to which we refer as Mont-Dec, which is the same as Algorithm 4 with the difference that the renewal occurs at each communication round, so the Hessian is always recomputed and the second-order information is never outdated.In this way, the eigenvectors are always the first ({v j,t } dt j=1 ), so this algorithm is neither incremental nor exploits outdated second-order information.We fix the amount of second-order information sent by the agents to be the same as the previously described IOS+.
• the decentralized optimization state-of-the-art NT approach exploiting the harmonic mean, GIANT (Wang et al., 2018).While in GIANT the local Newton direction is computed via conjugate gradient method, in DONE (T.Dinh et al., 2022), a similar approach has been proposed but using Richardson iterations.To get our results, we provide the compared algorithm with the actual exact harmonic mean, so that we include both the algorithms at their best.Referring to this approach, we write GIANT in the figures.We remark that GIANT requires an extra communication round, because the algorithm requires the agents to get the global gradient.We also implemented the determinantal averaging approach (Dereziński and Mahoney, 2019) to compensate for the inversion bias of the harmonic mean, but we did not see notable improvements so we do not show its results for the sake of plot readability.
• the very recently proposed FedNL method presented in Safaryan et al. (2021).We adopted the best performing variant (FedNL-LS), with rank-1 matrix compression, which initializes the Hessian at the master with the complete Hessian computed at the initial parameter value, so the transmission of the Hessian matrix H(θ 0 ) at the first round is required.This method requires the computation of the Hessian and of an n × n matrix SVD at each iteration.FedNL uses the exact same amount of CRs as IOS (so with d (i) t = 1, ∀i), as it sends the gradient and one eigenvector (always the first) at each iteration.For the Hessian learning step size, we adopt the same choice of Safaryan et al. (2021) (step size equal to 1), and we observed that different values of this parameter did not provide improvements.
We do not compare Mont-Dec and FedNL in the least squares case, as these algorithms are specifically designed for a time-varying Hessian and it does not make sense to use them in a least squares problem.
In the following, we show the details of results for the least squares on 1M Songs dataset and then for logistic regression on the FMNIST dataset.The detailed Figures of convergence results obtained with the EMNIST digits and 'w8a' are provided in Appendix B, while a summary of the results obtained on all the considered datasets is provided in Section 6.7.3.

Least squares on 1M Songs
In Fig. 5 we show the performance of the algorithm in the case of least squares applied on the 1M Songs dataset (thus the procedure is the one described in Algorithm 1).We plot the relative cost f (θ t ) − f (θ * ) versus the communication rounds.We see that AGD is characterized by very slow convergence due to the high condition number.Together with AGD, in Fig. 5 we show three Newton-type approaches: the state-of-the-art GIANT (Wang et al., 2018), IN and IN+.We see how in the i.i.d.case GIANT performs well even if IN+ is the fastest, but all the three algorithms have similar performance compared to the firstorder approach (AGD).In the non i.i.d.case, whose framework has been described in Sec.6.1.1,we can see how GIANT has a major performance degradation, even if still obtaining a better performance with respect to first-order methods.On the other hand, both IN and IN+ do not show performance degradation in the non i.i.d.case.

Logistic regression on FMNIST
In Figs. 6, 7 we show the results obtained with the FMNIST dataset with the setup described in Sec.6.1.2.We show the performance for two values of the regularization parameter, µ, in particular µ = 10 −5 and µ = 10 −6 .We compare our approach (Algorithm 4, IOS) also against FedNL and Mont-Dec.As before, we choose the indices set I to be I = {C j }, as described in Sec.5.1, where C j is related to the Fibonacci sequence, and we thus refer to the algorithms as Fib-IOS and Fib-IOS+.As we did for In the i.i.d.case, for µ = 10 −5 (Fig. 6), we see that Fib-IOS, Fib-IOS+ and GIANT require a similar number of communication rounds to converge, while the amount of overall data transmitted per agent is smaller for GIANT.When µ = 10 −6 , instead, we see from Figure 7 that GIANT requires more communication rounds to converge while the same amount of information needs to be transmitted.Hence, in this case GIANT is more impacted by a lower condition number when compared to our approach.On the other hand, FedNL in both cases shows a much slower convergence speed with respect to Fib-IOS.Notice, for FedNL, in Figures 6-(c)-(d) and 7-(c)-(d), the impact that the transmission of the full Hessian matrices at the first round has on the overall communication load.
In the considered non i.i.d.case, the large advantage that our approach can provide with respect to the other considered algorithms is strongly evident in both data transmitted and communication rounds.
Comparing the IOS approaches against Mont-Dec we see the key role that the incremental strategy exploiting outdated second order information has on the convergence speed of our approach.Indeed, even though in the first iterations the usage of the current Hessian information provides the same performance of the IOS methods, the performance becomes largely inferior in the following rounds.From a computational point of view, both the Mont-Dec and the FedNL approach are much heavier than Fib-IOS as they require that each agent recomputes the Hessian and SVD at each round.Fib-IOS, instead, in the more challenging case of µ = 10 6 , requires the agents to compute the Hessian matrix only 12 times out of the 450 rounds needed for convergence.For more details on this, see Figure 8.

Required computations of the Hessian
Figure 8: In this plots, we show, for three datasets (FMNIST, EMNIST and w8a), the number of times an agent is required to compute the local Hessian matrix in order for an algorithm to converge, comparing the proposed Fib-IOS and the FedNL (see Safaryan et al., 2021) algorithms.

Summary of comparison results
In Appendix B, we show the results obtained with the EMNIST and w8a datasets.In the case of EMNIST, we obtain results similar to FMNIST, except that, when µ = 10 −5 , GIANT is not much impacted by the considered non i.i.d.configuration.Even if in that case GIANT seems to be the best choice, all the other results show that GIANT and related approaches based on the harmonic mean (like DONE) are strongly sensitive to non i.i.d.data distributions.With image datasets (EMNIST and FMNIST), FedNL is largely outperformed by our approach (by both Fib-IOS and Fib-IOS+), while, with the 'w8a' dataset, FedNL is more competitive.However, the Fib-IOS and Fib-IOS+ approaches have the very appealing feature that they require agents to compute the local Hessian matrices only sporadically.The FedNL approach, instead, requires that the Hessian is recomputed by agents at each round, implying a much heavier computational demand.To better illustrate and quantify this advantage, we show, in Figure 8, the number of times that an agent is required to compute the local Hessian matrix in order to obtain convergence, comparing Fib-IOS and FedNL, in the cases of the three datasets.In the case of EMNIST and FMNIST, we are showing the non i.i.d.configurations, but similar results can be obtained with the i.i.d.ones.The results show that, compared with Fib-IOS, the number of times agents are required to compute the Hessian is always at least ten times greater for FedNL to converge.

Conclusion
In this work, we have proposed a Newton-type algorithm to perform FL in heterogeneous communication networks.The algorithm is versatile with respect to agents' (per iteration) communication resources and operates effectively the presence of non i.i.d.data distributions, outperforming state-of-the-art techniques.It achieves better performance with respect to the competing FedNL approach, while involving sporadic Hessian computations.
In the case of i.i.d.data statistics, it is also competitive with GIANT, even though the latter may perform better under certain conditions.We stress that the key advantage of the proposed technique is its robustness under any data distribution, and its effectiveness and versatility when communication resources differ across nodes and links.Future work includes the application of the algorithm to more specific scenarios and applications, like, for example, wireless networks, and also the study of new heuristics for the renewal operation in the general convex case.Furthermore, future research directions include the study of compression techniques for the Hessian eigenvectors as well as the development of heuristic algorithms combining different second-order methods, like FedNL and IOS.Lastly, the spectral characteristics of the data could also be exploited to tailor the proposed algorithm to specific FL problems, leading to further benefits in terms of convergence speed and accuracy of the final model.

Proof of Theorem 11
1) We suppose to be in the condition in which t is such that, for some t, t ≥ t implies that η t = 1 is chosen by Armijo backtracking line search.From Lemma 9, such t exists.From (33) of Theorem 10, we have that, for t ≥ t where B = L ρt + L 2ρt , and where we have used triangular inequality to get θ kt − θ t ≤ kt − θ * + θ t − θ * .Given Assumption 4 and Corollary 8, we see that there surely exists t such that for all t ≥ t , ct < 1, because k t = t − τ , with τ bounded, and this proves (35).2) From ( 35), letting a * ∈ (0, 1) be the Lyapunov exponent of θ t − θ * , we get, considering ct and B as in ( 43    where equalities (1) and (2) follow from calculations equivalent to the ones used to obtain (44).
3) Starting from 2), we can apply calculations analogous to the ones used in equation ( 25) of Lemma 6.

Proof of Theorem 12
We show that the Lyapunov exponent is a * = 0. From Lemma 9, there exists a t such that η t = 1 is chosen by the backtracking line search for any t ≥ t.Local linear convergence proven in Theorem 11 implies that there exists a t such that, for all t ≥ t , θ t+1 − θ * ≤ ct θ t − θ * , with ct < 1.In particular, we have, from (43 Linear convergence implies that, for some constant C > 0, we can write θ t − θ * ≤ Ca t for some a ∈ (0, 1).Furthermore, given that k t = t − τ with τ bounded, we have θ kt − θ * = θ t−τ − θ * ≤ C a t for some bounded C .It is easy to see that, given that the algorithm convergence is at least linear with some Lyapunov exponent a * , ā defined as ā := lim sup t ( t k=1 ck ) 1/t is such that ā ≥ a * .For k < t , ck is some bounded value that could be greater than 1.
To prove the Theorem we use the fact that, for q (i) t = n , ρ(i) t = λ(i) n,t for all i.Let t = Rn + T for some T < n and R ∈ N. Given the conditions required by the theorem, we see that q (i) kn = n for k = 1, 2, ..., R. When t = kn ≥ t, ct is as in (7), and we have q

Figure 2 :
Figure 2: Performance comparison for different values of ρ t .The best choice in termsof estimation error from Corollary 2 is compared against the choice that was proposed inErdogdu and Montanari (2015), that is ρ t = λ qt+1

Figure 3 :
Figure 3: Linear convergence with periodic renewals illustrated via the study of the upper bound on the dominant Lyapunov exponent of the estimation error from Lemma 6 and Theorem 11. āT is as in eqs.(24) and (37) for (a) and (b), respectively.Note that the upper bound is on the slope of the decreasing cost.

Figure 4 :
Figure 4: Performance comparison for different values of d t for (i) linear regression on 1M Songs ((a) and (c)) and (ii) logistic regression on FMNIST ((b) and (d)).d γ is as defined in Eq. (40), with d 0 = 2, so d γ is in average equal to 4. In (b), we emphasize three points where the renewal operation (see Algorithm 4) takes place.
L ρt C 1 a −T * + B C 2 , bounded.For any iteration t, we can write θ t+1 − θ * ≤ ct θ t − θ * for some ct that could also be greater than one, if t < t , but it is easy to see that ct is always bounded.Now, we consider ā := lim sup t ( t k=1 ck ) 1/t .It is straightforward to see that, as in the least squares case, it is ā ≥ a * .We can write log ā = lim sup < n − 1, ∀i, ∀t and thus ρk > λn,k (note that we can assume, without loss of generality, that all eigenvalues are different).Now, we see that Now we use local Lipschitz continuity (Assumption 2) to conclude proof.It is straightforward to see that Lipschitz continuity of f (i) (θ) implies that, for any k ∈ {1, ..., n}, |λ * + B θ t − θ * t