A GEOMETRIC INTEGRATION APPROACH TO SMOOTH OPTIMISATION: FOUNDATIONS OF THE DISCRETE GRADIENT METHOD

. Discrete gradient methods are geometric integration techniques that can preserve the dissipative structure of gradient ﬂows. Due to the mono-tonic decay of the function values, they are well suited for general convex and nonconvex optimisation problems. Both zero-and ﬁrst-order algorithms can be derived from the discrete gradient method by selecting diﬀerent discrete gradients. In this paper, we present a thorough analysis of the discrete gradient method for optimisation which provides a solid theoretical foundation. We show that the discrete gradient method is well-posed by proving the existence of iterates for any positive time step, as well as uniqueness in some cases, and propose an eﬃcient method for solving the associated discrete gradient equation. Moreover, we establish an O(1 / k) convergence rate for convex objectives and prove linear convergence if instead the Polyak– Lojasiewicz inequality is satisﬁed. The analysis is carried out for three discrete gradients—the Gon-zalez discrete gradient, the mean value discrete gradient, and the Itoh–Abe discrete gradient—as well as for a randomised Itoh–Abe method. Our theoretical results are illustrated with a variety of numerical experiments, and we furthermore demonstrate that the methods are robust with respect to stiﬀness.


Introduction
Discrete gradients are tools from geometric integration for numerically solving first-order systems of ordinary differential equations (ODEs), while ensuring that certain structures of the continuous system-specifically energy conservation and dissipation, and Lyapunov functions-are preserved in the numerical solution.Energy dissipation refers to the monotonic decrease in value of the objective function over time.
The use of discrete gradient methods to solve optimisation problems has gained increasing attention in recent years (Celledoni et al., 2018;Grimm et al., 2017;Riis et al., 2021;Ringholm et al., 2018), due to their preservation of dissipative structures of ODEs such as gradient flows.This means that the associated iterative scheme monotonically decreases the objective function for all positive time steps, at a rate analogous to that of gradient flow.
In this paper, we consider the unconstrained optimisation problem where the function V : R n → R is continuously differentiable.For an initial guess x 0 ∈ R n and k ∈ N, the discrete gradient method is of the form (1.2) where τ k > 0 is the time step, and ∇V is the discrete gradient, defined as follows.
Definition 1.0.1 (Discrete gradient).Let V be a continuously differentiable function.A discrete gradient is a continuous map ∇V : R n × R n → R n such that for all x, y ∈ R n , ∇V (x, y), y − x = V (y) − V (x) (Mean value property), (1.3) (Consistency property). (1.4) The background for discrete gradients and gradient flows for optimisation is given in Section 2.
There are several aspects of discrete gradient methods which make them attractive for optimisation.Their structure-preserving properties lead to schemes that are unconditionally stable, i.e. converging to a stationary point for arbitrary time steps.This may be particularly beneficial for stiff problems, where explicit schemes require prohibitively small time steps, due to local, rapid variations in the objective function gradient.We demonstrate this numerically in Section 8.5.Furthermore, the Itoh-Abe discrete gradient (2.5) is derivative-free, enabling derivative-free optimisation 1 e.g. for nonsmooth, nonconvex functions and black-box problems (Riis et al., 2021), and for functions whose gradients are expensive to compute (Ringholm et al., 2018).Beyond Euclidean gradient flow, discrete gradients can also preserve the dissipative structure for inverse scale space and Bregman distances (Benning et al., 2020).
1.1.Contributions and structure.The aim of this paper is to give a comprehensive optimisation analysis of discrete gradient methods.While having a solid theoretical foundation in geometric integration, the methods have only recently been explored as optimisation schemes, leaving several gaps in our understanding of their properties.
To this end, we address several theoretical questions.Namely, we prove wellposedness of the implicit update equation (1.2), propose efficient and stable methods for solving the update equation, and obtain convergence rates of the methods for different classes of objective functions.Furthermore, we provide various numerical examples to see how the methods perform in practice.
In Section 2 we define discrete gradients and introduce the four discrete gradient methods considered in this paper.In Section 3, we prove that the discrete gradient equation (the update formula) (1.2) is well-posed, meaning that for any time step τ k > 0 and x k ∈ R n , a solution x k+1 exists, under mild assumptions on V .Using Brouwer's fixed point theorem, this is the first existence result for the discrete gradient equation without a bound on the time step.In Section 4, we propose an efficient and stable method for solving the discrete gradient equation and prove convergence guarantees, building on the ideas of Norton and Quispel (2014).
In Section 5, we analyse the dependence of the iterates on the choice of time step, and obtain estimates for preferable time steps in the cases of L-smoothness and strong convexity.In Section 6, we establish convergence rates for convex functions with Lipschitz continuous gradients, and for functions that satisfy the Polyak-Lojasiewicz (P L) inequality (Karimi et al., 2016).In Section 7, we briefly discuss preconditioned discrete gradient methods.
In Section 8, we present numerical results for several test problems, and a numerical comparison of different numerical solvers for the discrete gradient equation (1.2).We conclude and present an outlook for future work in Section 9.
We emphasise that the majority of these results hold for nonconvex functions.Our contributions to the foundations of the discrete gradient method opens the door for future applications and research on discrete gradient methods for optimisation, with a deeper understanding of their numerical properties.
1.2.Related work.We list some applications of discrete gradient methods for optimisation.Grimm et al. (2017) propose applying them to solve variational problems in image analysis and prove convergence to stationary points for continously differentiable functions.Ringholm et al. (2018) apply the Itoh-Abe discrete gradient method to nonconvex imaging problems regularised with Euler's elastica.Miyatake et al. (2018) point out the equivalence between the Itoh-Abe discrete gradient method for quadratic problems and the well-known Gauss-Seidel and successive-over-relaxation (SOR) methods.
Several recent works look at discrete gradient methods in other optimisation settings.Concerning nonsmooth, derivative-free optimisation, Riis et al. ( 2021) study the Itoh-Abe discrete gradient method for solving nonconvex, nonsmooth functions, proving converge to a set of stationary points in the Clarke subdifferential framework.Celledoni et al. (2018) extend the Itoh-Abe discrete gradient method for optimisation on Riemannian manifolds.Hernández-Solano et al. (2015) combine a discrete gradient method with Hopfield networks to preserve a Lyapunov function for optimisation problems.Benning et al. (2020) apply the discrete gradient method to inverse scale space flow for sparse optimisation.
More generally, there is a wide range of research that studies connections between optimisation schemes and systems of ODEs.Su et al. (2016) and Wibisono et al. (2016) study second-order ODEs that can be seen as continuous-time limits of Nesterov acceleration schemes.In the former case, they shed light on the dynamics of these schemes, e.g. the oscillatory behaviour, by interpreting the ODEs as damping systems, while in the latter case, they present a family of Bregman Lagrangian functionals which generate the original and new acceleration schemes.Furthermore, they demonstrate that the choice of ODE discretisation method is central for whether the acceleration phenomena is retained in the iterative scheme.Attouch and Alvarez (2000) consider the second-order heavy ball with friction dynamical system for convex optimisation problems in Hilbert spaces.Several works have contributed to the setting of numerical analysis of acceleration methods which bridges continous-and discrete-time dynamics.Wilson et al. (2021) approach this from the perspective of Lyapunov theory, presenting Lyapunov functions that account for continuous-and discrete-time.Betancourt et al. (2018) present a framework of sympletic optimisation, considering perspectives of Hamiltonian dynamics and symplectic structure-preserving methods.In a similar vein, recent papers by Maddison et al. (2018) and França et al. (2020) study conformal Hamiltonian systems, with the former using information about the objective function's convex conjugate to obtain stronger convergence rates, and the latter considering structure-preserving numerical methods and their relation to some iterative schemes.1.3.Notation and preliminaries.We denote by S n−1 the unit sphere x ∈ R n : x = 1 .The line segment between two points is defined as [x, y] In this paper, we consider both deterministic schemes and stochastic schemes.For the stochastic schemes, there is a random distribution Ξ on S n−1 such that each iterate x k depends on a descent direction d k which is independently drawn from Ξ.We denote by ξ k the joint distribution of (d i ) k i=1 , and by V k the expectation of We also use the shorthand notation To unify notation for all the methods in this paper, we will write V k instead of V (x k ) for the deterministic methods as well.
Throughout the paper, we will consider two classes of functions, L-smooth and µ-convex functions.We here provide definitions and some basic properties.
We state some properties of L-smooth functions.
Proposition 1.1.If V : R n → R is L-smooth, then for all x, y ∈ R n , the following holds.
When µ > 0, we say that V is strongly convex.

Discrete gradient methods
2.1.The discrete gradient method and gradient flow.We motivate the use of discrete gradients by considering the gradient flow of V , where ẋ denotes the derivative of x with respect to time.This system is fundamental to optimisation, and underpins many gradient-based schemes.Applying the chain rule, we obtain The gradient flow has an energy dissipative structure, since the function value V (x(t)) decreases monotonically along any solution x(t) to (2.1).Furthermore, the rate of dissipation is given in terms of the norm of ∇V or equivalently the norm of ẋ.
In geometric integration, one studies methods for numerically solving ODEs while preserving certain structures of the continuous system-see Hairer et al. (2006); McLachlan and Quispel (2001) for an introduction.Discrete gradients are tools for solving first-order ODEs that preserve energy conservation laws, dissipation laws, and Lyapunov functions (Gonzalez, 1996;Itoh and Abe, 1988;McLachlan et al., 1999;Quispel and Turner, 1996).
For a sequence of strictly positive time steps (τ k ) k∈N and a starting point x 0 ∈ R n , the discrete gradient method applied to (2.1) is given by (1.2).This scheme preserves the dissipative structure of gradient flow, as we see by applying (1.3), Similarly to the dissipation law (2.2) of gradient flow, the decrease of the objective function value is given in terms of the norm of both the step x k+1 − x k and of the discrete gradient.
Throughout the paper, we assume there are bounds τ max ≥ τ min > 0 such that No restrictions are required on these bounds.Grimm et al. (2017) prove that if V is continuously differentiable and coercive-the latter meaning that the level set x ∈ R n : V (x) ≤ M is bounded for each M ∈ R-and if (τ k ) k∈N satisfy (2.4), then the iterates (x k ) k∈N of (1.2) converge to a set of stationary points, i.e. points x * ∈ R n such that ∇V (x * ) = 0.
We may compare the discrete gradient method to explicit gradient descent, x k+1 = x k − τ k ∇V (x k ).Unlike discrete gradient methods, gradient descent only decreases the objective function value for sufficiently small time steps τ k .To ensure decrease and convergence for this scheme, the time steps must be restricted based on estimates of the smoothness of the gradient of V , which might be unavailable, or lead to prohibitively small time steps.Conversely, implicit gradient descent, or the proximal point method (Chambolle and Pock, 2016), is unconditionally stable with respect to the time step, but does not necessarily exhibit structure preservation (Grimm et al., 2017).
One may also consider other numerical integration methods, such as implicit Runge-Kutta methods, where energy dissipation is ensured under mild time step restrictions (Hairer and Lubich, 2013), and explicit stabilised methods for solving strongly convex problems (Eftekhari et al., 2021).
2.2.Four discrete gradient methods.We now introduce the four discrete gradients considered in this paper.1.The Gonzalez discrete gradient (Gonzalez, 1996) is given by 2. The mean value discrete gradient (Harten et al., 1983), used for example in the average vector field method (Celledoni et al., 2012), is given by 3. The Itoh-Abe discrete gradient (Itoh and Abe, 1988) (also known as the coordinate increment discrete gradient) is given by , where 0/0 is interpreted as While the first two discrete gradients are gradient-based and can be seen as approximations to the midpoint gradient ∇V ( x+y 2 ), the Itoh-Abe discrete gradient is derivative-free, and is evaluated by computing successive coordinate-wise difference quotients.As a result, the corresponding implicit equation (1.2) uncouples, and can be treated as solving a series of scalar equations for k = 1, . . ., n.In an optimisation setting, the Itoh-Abe discrete gradient is often preferable to the others, as it is relatively computationally inexpensive.
4. The Randomised Itoh-Abe method (Riis et al., 2021) is an extension of the Itoh-Abe discrete gradient method, where the directions of descent are randomly chosen.Given a sequence of directions (d k ) k∈N ⊂ S n−1 drawn independently from a random distribution Ξ, we solve where x k+1 = x k is considered a solution whenever ∇V (x k ), d k+1 = 0.This scheme generalises the Itoh-Abe discrete gradient method, in that the methods are equivalent if (d k ) k∈N cycle through the standard coordinates with the rule d k = e [(k−1) mod n]+1 , k = 1, 2, . ... However, the computational effort of one iterate of the Itoh-Abe discrete gradient method equals n steps of the randomised method, so their efficiency should be judged accordingly.
While the randomised Itoh-Abe discrete gradient method does not retain the discrete gradient structure of the Itoh-Abe discrete gradient, it retains a dissipative structure akin to (2.2).
We also define the constant (2.7) and assume that Ξ is such that ζ > 0. For example, for the uniform random distribution on both S n−1 and on the standard coordinates (e i ) n i=1 , we have ζ = 1/n.See (Stich, 2014, Table 4.1) for estimates of (2.7) for these cases and others.
The motivation for introducing this randomised extension of the Itoh-Abe method is, first, to tie in discrete gradient methods with other optimisation methods such as stochastic coordinate descent (Fercoq and Richtárik, 2015;Qu et al., 2015;Wright, 2015) and random pursuit (Nesterov and Spokoiny, 2017;Stich, 2014), and, second, because this method extends to the nonsmooth, nonconvex setting (Riis et al., 2021).

Existence of solutions to the discrete gradient steps
In this section, we prove that the discrete gradient equation admits a solution y, for all τ > 0 and x ∈ R n , under mild assumptions on V and ∇V .
To the authors' knowledge, the following result is the first without a restriction on time steps.Norton and Quispel (2014) provide an existence and uniqueness result for small time steps for a large class of discrete gradients, via the Banach fixed point theorem.Furthermore, the existence of a solution for the Gonzalez discrete gradient is established for sufficiently small time steps via the implicit function theorem in (Stuart and Humphries, 1996, Theorem 8.5.4).
We use the following set notation.For δ > 0, the closed ball of radius δ about x is defined as We make two assumptions for the discrete gradient, namely that boundedness of the gradient implies boundedness of the discrete gradient, and that if two functions coincide on an open set, their discrete gradients also coincide.
Assumption 3.1.There is a constant C n that depends on the discrete gradient but is independent of V , and a continuous, nondecreasing function where δ(0) = 0 and δ(∞) := lim r→∞ δ(r), such that the following holds.
For any V ∈ C 1 (R n ; R) and any convex set U ⊂ R n with nonempty interior, the two following properties are satisfied.
Remark 3.2.It is straightforward to verify that if a finite collection of discrete gradient constructions satisfy these assumptions, then their convex combinations satisfy them too.
The following result, which is proved in Appendix A.1, shows that the discrete gradients considered in this paper satisfy the above assumption.
Remark 3.4.In practice, the Gonzalez, Itoh-Abe, and mean value discrete gradients account for the vast majority of applications and theoretical studies (Dahlby et al., 2011;Grimm et al., 2017;McLachlan et al., 1999;Quispel and Turner, 1996).Thus, while Assumption 3.1 may not hold for all constructions of discrete gradients, we consider them to be adequate for most purposes.
The existence proof is based on the Brouwer fixed point theorem (Brouwer, 1911).
Proposition 3.5 (Brouwer fixed point theorem).Let U ⊂ R n be a convex, compact set and g : U → U a continuous function.Then g has a fixed point in U .
We proceed to state the existence theorem.
Theorem 3.6 (Discrete gradient existence theorem).Suppose V is continously differentiable and that ∇ satisfies Assumption 3.1.Then there exists a solution y to (3.1) for any τ > 0 and x ∈ R n , if V satisfies any of the following properties.
(i) The gradient of V is uniformly bounded.
(iii) Both V and the gradient of V are uniformly bounded on co({y : V (y) ≤ V (x)}) (the bounds may depend on x), and δ ≡ 0 in Assumption 3.1.
Proof.Part (i).We want to show that the function g(y) = x − τ ∇V (x, y) has a fixed point, y = g(y).There is K > 0 such that ∇V (y) ≤ K for all y ∈ R n .Therefore, by Assumption 3.1, ∇V (x, y) ≤ C n K for all y ∈ R n .This implies that g(y) ∈ B τ CnK (x) for all y ∈ R n .Specifically, g maps B τ CnK (x) into itself.As g is continuous, it follows from Proposition 3.5 that there exists a point y ∈ B τ CnK (x) such that g(y) = y, and we are done.
which is continuously differentiable and supp(∇W ) ⊂ U δ+σ .Therefore, ∇W is uniformly bounded, so by part (i) there is y such that By (2.3), W (y) < W (x), which, by construction of W , implies that V (y) < V (x) and hence that y ∈ U .Lastly, since V and W coincide on U δ , and x and y both belong to U , it follows from Assumption 3.1 (ii) that ∇V (x, y) = ∇W (x, y).Hence a solution y = x − τ ∇V (x, y) exists.
Part (iii).Set U = co y : V (y) ≤ V (x) and M = sup y∈U V (y).Furthermore let η > 0 and set K = sup ∇V (y) : V (y) ≤ M + η and F = y : V (y) ≥ M + η .For any z ∈ F and y ∈ U , there is z ∈ [z, y] such that V (z) = M + η.The mean value theorem (MVT) (Nocedal and Wright, 1999, Equation A.55) and the boundedness of ∇V imply that there is λ ∈ (0, 1) such that Therefore, for all y ∈ U and z ∈ F , one has y − z ≥ η/K.As in the previous case, there exists a cutoff function ϕ ∈ C ∞ (R n ; [0, 1]) with uniformly bounded gradient, such that ϕ | U ≡ 1 and ϕ | F ≡ 0. To verify this, we may consider e.g.(Leoni, 2017, Theorem C.20), with ε = η/K and u(•) as the indicator function of Consider W : R n → R as defined in the previous case.The gradient of W is uniformly bounded, so there is a fixed point y such that y = x− τ ∇W (x, y).By the same arguments as in case (ii), ∇V (x, y) = ∇W (x, y), which implies that y solves y = x − τ ∇V (x, y).
The third case in Theorem 3.6 covers problems where V is not coercive, which includes linear systems with nonempty kernel and logistic regression problems (Lee et al., 2006) without regularisation.
While the above theorem holds also for the Itoh-Abe methods, there is a much simpler existence result in this case, given in (Riis et al., 2021), for which continuity of the objective function is sufficient.

Solving the discrete gradient equation
In the previous section, we proved that the discrete gradient equation (3.1) admits a solution for y for all τ > 0 and x ∈ R n .In what follows, we propose a relaxed fixed point method for solving (3.1), and prove linear convergence rates of the method for the mean value discrete gradient equation.This analysis can trivially be extended to the Itoh-Abe discrete gradient equation, as solving this equation corresponds to solving a succession of mean value discrete gradient equations in one dimension (since all discrete gradients are the same in one dimension, being implicitly defined by the mean value property (1.3)).
Remark 4.1.We were unable to prove linear convergence for the Gonzalez discrete gradient equation.However, in practice the scheme converged to a solution as quickly as for the mean value case.Norton and Quispel (2014) show that for sufficiently small time steps, there exists a unique solution to (3.1) that can be approximated by the fixed point iterations (4.1) y j+1 = T τ (y j ), for j ∈ N, where T τ (y) := x − τ ∇V (x, y), y 0 ∈ R n i.e. such that the iterates converge to a fixed point y * = T τ (y * ), or, equivalently, a solution to (3.1).For this, it is assumed that τ is less than 1/(10L DG ), where L DG is the Lipschitz constant for a given x of the mapping y → ∇V (x, y).
One is often interested in taking larger time steps, and particularly for the optimisation of L-smooth functions, optimal time steps may be closer to 2/L-see Section 6.Furthermore, as Theorem 3.6 ensures the existence of a solution for arbitrarily large time steps, we would like a constructive method for locating such solutions.We therefore propose to use the relaxed fixed point method, which for θ ∈ (0, 1] is given by (4.2) y j+1 = S θ,τ (y j ), S θ,τ (y) := (1 − θ)y + θT τ (y).
For θ = 1, this reduces to (4.1).In the remainder of this section, we will prove convergence guarantees of (4.2) for all time steps.In Section 8, we demonstrate its numerical efficiency.
In the following, we assume that the discrete gradient inherits smoothness and strong convexity properties from the gradient.Assumption 4.2.There are constants c L , c µ > 0 such that We write L DG := c L L and µ DG := c µ µ.
It is straightforward to verify these properties for the mean value discrete gradient.
Proposition 4.3.The mean value discrete gradient satisfies Assumption 4.2 with L DG = L/2 and µ DG = µ/2.Proof.To show that the first property holds, we write Similarly, to show the second property, we write To demonstrate that the scheme defined in (4.2) converges to a unique solution of y * = T τ (y * ), we will use Banach's Fixed Point Theorem (see e.g.(Granas and Dugundji, 2003, Theorem 1.1)).
Theorem 4.4 (Banach's Fixed Point Theorem).Let (X, d) be a non-empty complete metric space.Suppose T : X → X is a contraction mapping on X, i.e. there is q ∈ [0, 1) such that d(T (x), T (y)) ≤ qd(x, y) for all x, y ∈ X.
Then T has a unique fixed point y * ∈ X, and for any y 0 ∈ X, the sequence (y j ) j∈N defined by y j+1 = T (y j ) converges linearly to y * .
The following result demonstrates that for convex objective functions, the scheme (4.2) converges to a fixed point y * = T τ (y * ) for arbitrary time steps τ .
Theorem 4.5.If V is L-smooth and ∇ satisfies Assumption 4.2, then the mapping S θ,τ defined in (4.2) is a contraction mapping on R n if either of the following holds.
Case (ii).In a similar fashion, we write One can check that the coefficient ω(θ) is less than 1 provided θ belongs to the interval stated in the theorem.This concludes the proof.
Remark 4.6.In the second case of Theorem 4.5, the coefficient ω(θ) is minimised for which yields the linear convergence rate We note from this that the scheme converges faster for smaller time steps and for objective functions with smaller condition numbers L/µ ≈ L DG /µ DG =: κ DG .Furthermore, if τ = 1/(aL DG ) for some a ≥ 1, where a typical choice is a = 1, then we obtain This shows that the fixed point scheme (4.2) is robust to ill-conditioned problems, both with regards to appropriate choices of θ and the rate of convergence.
In Section 8.6, we compare the efficiency of the above scheme for different θ and of the built-in solver scipy.optimize.fsolve in Python.

Analysis of time steps for discrete gradient methods
In this section, we study the implicit dependence of x k+1 (τ ) on the choice of time step τ .We first establish a uniqueness result for the mean value and Itoh-Abe discrete gradient methods.Then we restrict our focus to Itoh-Abe methods, where we ascertain bounds on optimal time steps with respect to the decrease in V , for L-smooth, convex functions as well as strongly convex functions. 5.1.Uniqueness.
Lemma 5.1.If V is µ-convex, then the solution y to the discrete gradient equation (3.1) is unique for the mean value discrete gradient and the Itoh-Abe discrete gradient.
Proof.We first consider the mean value discrete gradient.Suppose y 1 , y 2 , solve Furthermore, as the Itoh-Abe discrete gradient method is a succession of scalar updates, each corresponding to a 1D mean value discrete gradient update, uniqueness follows.
5.2.Implicit dependence on the time step for Itoh-Abe methods.For the remainder of the section, we restrict our focus to Itoh-Abe methods.We fix a starting point x, direction d ∈ S n−1 and time step τ , and study the solution y to (5.1) By the analysis in Section 3, there exists a solution y for all τ > 0. For convenience and to exclude the case y = x, we assume ∇V (x), d > 0. For notational brevity, we rewrite the optimisation problem in terms of a scalar function f , i.e. solve (5.2) For optimisation schemes with a time step τ , it is common to assume that the distance between x and y increases with the time step.For explicit schemes, this naturally holds.However, for implicit schemes, such as the discrete gradient method, this is not always the case.We demonstrate this with a simple example in one dimension.
Example 5.2.Define V (x) := −x 3 and x = 0.For all τ > 0, (5.1) is solved by y = 1 τ .Then, as τ → 0, we have y → ∞, and as τ → ∞, we have y → x.The above example illustrates that for nonconvex functions, decreasing the time step might lead to a larger step y ← x and vice versa.
We now show that for convex functions, the distance y − x does increase with τ .Set R = sup r : V (x − αd) < V (x) for all α ∈ (0, r) .By the assumption that ∇V (x), d > 0, we have R > 0.
Proof.We first establish continuity and strict monotonicity.We use the alternative characterisation of convex functions in one dimension, which states that is monotonically nondecreasing in α.Since f (α) < 0 for α ∈ (0, R), it follows that α → f (α)/α 2 is strictly increasing on (0, R).We can thus apply e.g. the implicit function theorem for strictly monotone functions (Dontchev and Rockafellar, 2014, Theorem 1H.3) to conclude that the mapping τ → α(τ ) is continuous for all τ > 0. Furthermore, for any τ 2 > τ 1 > 0 and corresponding solutions α 1 , α 2 , using (5.2), we get Next, we show that α(τ ) → 0 as τ → 0. This can be seen by inspecting The left-hand side is bounded by the derivative f ′ (0) = − ∇V (x), d .Hence, as τ goes to zero, α(τ ) must also go to zero to prevent the right-hand side from blowing up.
Remark 5.4.The above proposition can also be shown to hold for non-differentiable, convex functions, by replacing the derivative with a subgradient.
Remark 5.7.This result also holds for strongly convex, non-differentiable functions.

Convergence rate analysis
In this section we derive convergence rates for L-smooth functions, µ-convex functions, and functions that satisfy the Polyak-Lojasiewicz (P L) inequality.We follow the arguments in (Beck and Tetruashvili, 2013;Nesterov, 2012), on convergence rates of coordinate descent.We assume throughout that the iterates (x k ) k∈N solve the discrete gradient equation exactly, and leave the impact of inexact updates on the convergence rates for future work.
We recall the notation in (1.5), , where V k = V (x k ) for deterministic methods.Estimates of the following form will be crucial to the analysis.(6.1) We first provide this estimate for each of the four methods.We assume throughout that the time steps (τ k ) k∈N satisfy arbitrary bounds (2.4).We consider coordinate-wise Lipschitz constants for the gradient of V as well as a directional Lipschitz constant.For i = 1, . . ., n, we suppose ∂ i V : R n → R n is Lipschitz continuous with Lipschitz constant L i ≤ L. We denote by L sum the ℓ 2 -norm of the coordinate-wise Lipschitz constants, Table 1.Estimates of β, as well as optimal time steps τ * and β * , with ζ given in (2.7).

Discrete gradient method
Furthermore, for a direction d ∈ S n−1 , we consider the Lipschitz constant L d ≤ L, such that for all x ∈ R n and α ∈ R, we have For the Itoh-Abe discrete gradient method or when Ξ only draws from the standard coordinates, we write L i instead of L e i .We define L max ≤ L to be the supremum of L d over all d in the support of the probability density function of Ξ.That is, L max ≥ L d for all d ∼ Ξ.In the case when Ξ draws from a restricted set, such as the standard coordinates, L max can be notably smaller than L. Hereby, we refine the L-smoothness property in Proposition 1.1 (i) to (6.2) for all α ∈ R and d in the support of the density of Ξ (Beck and Tetruashvili, 2013, Lemma 3.2).
Lemma 6.1.If V is L-smooth, then the three discrete gradient methods and the randomised Itoh-Abe method satisfy (6.1) with values for β given in Table 1.
A proof of this lemma is given in the Appendix A.2.Note that these estimates do not require convexity of V .6.1.Optimal time steps and estimates of β.Lower values for β in (6.1) correspond to better convergence rates, as can be seen in Theorems 6.4 and 6.6.In what follows, we briefly discuss the time steps that yield minimal values of β, denoted by τ * and β * in Table 1.
For the Gonzalez and mean value discrete gradient methods, it is natural to compare rates to those of explicit gradient descent, which has the estimate β * = 2L (Nesterov, 2004, Section 2.1.5).Hence, the mean value discrete gradient method recovers the optimal rates of gradient descent, while the estimate for the Gonzalez discrete gradient is worse by a factor of √ 2. In contrast, for the proximal point method, it can be shown that β = 2/τ and can thus be made arbitrarily small by taking larger time steps.This follows from the fact that the proximal point method with time step τ applied to V is equivalent to explicit gradient descent with time step τ applied to the Moreau-Yosida regularisation of V with parameter τ , V τ , which is 1/τ -smooth (Chambolle and Pock, 2016, Section 4.2).It follows that β = 2/τ .
For the Itoh-Abe discrete gradient method, we compare its rates to those obtained for cyclic coordinate descent (CD) schemes in (Wright, 2015, Theorem 3) and (Beck and Tetruashvili, 2013, Lemma 3.3), β * = 8 √ nL, where we have set their parameters L max and L min to √ nL.Hence, the estimate for the Itoh-Abe discrete gradient method is stronger, being at most half that of CD, even in the worst-case scenario L sum = √ nL.
Remark 6.2.Note however that we can improve the estimate for the CD scheme to recover the same rate.See Appendix B.
We give one motivating example for considering the parameter L sum .
Example 6.3.Let V be a least squares problem V (x) = Ax − f 2 /2.We then have Thus, for low-rank system where rank(A) ≪ n, the convergence speed of the Itoh-Abe discrete gradient method improves considerably.
To derive (6.3), one can show that L = A * A 2 and L sum = A * A F , where • 2 and • F denote the operator norm and Frobenius norm.The bound follows from the properties B F ≤ rank(B) B 2 (Higham, 2002, Table 6.2) and rank(A * A) = rank(A) (Meyer, 2000, Statement 4.5.4).
We compare the rates for the randomised Itoh-Abe methods to randomised coordinate descent (RCD).Recall that when Ξ is the uniform distribution on the coordinates (e i ) n i=1 or on the unit sphere S n−1 , we have ζ = 1/n.This gives us β * = 2nL max for the randomised Itoh-Abe methods, which is the optimal bound for randomised coordinate descent (Wright, 2015).6.2.Lipschitz continuous gradients.For the next result, we use the notation Theorem 6.4.Let V be an L-smooth, convex, coercive function.Then for all four methods, β given in Table 1, and V * := min x V (x), we have Proof.Let x * be a minimiser of V .By respectively convexity, the Cauchy-Schwarz inequality, and Lemma 6.1, we have Taking expectation on both sides with respect to ξ k , and applying Jensen's inequality (Rudin, 1987, Theorem 3.3), we obtain By monotonicity of V k and following the steps in the proof of (Nesterov, 2012, Theorem 1), we obtain To eliminate dependence on the starting point, we derive V (x 0 ) − V * ≤ L 2 R(x 0 ) 2 from Proposition 1.1 (i), which gives us the desired result (6.4).
6.3.The Polyak-Lojasiewicz inequality.The next result states that for Lsmooth functions that satisfy the P L inequality, we achieve a linear convergence rate.A function is said to satisfy the P L inequality with parameter µ > 0 if, for all x ∈ R n , (6.5) Originally formulated by Polyak (1963), it was recently shown that this inequality is weaker than other properties commonly used to prove linear convergence (Bolte et al., 2010(Bolte et al., , 2017;;Csiba and Richtárik, 2017;Karimi et al., 2016;Necoara et al., 2018).This is useful for extending linear convergence rates to functions that are not strongly convex, including some nonconvex functions.
We now proceed to the main result of this subsection.
Theorem 6.6.Let V be L-smooth and satisfy the P L inequality (6.5) with parameter µ.Then, for β given in Table 1, the three discrete gradient methods and the randomised Itoh-Abe method obtain the linear convergence rate Proof.This is a standard argument, see e.g.(Karimi et al., 2016).Combining the P L inequality (6.5) with the estimate in Lemma 6.1, and taking expectation with respect to ξ k on both sides, we get

Preconditioned discrete gradient method
We briefly discuss the generalisation of the discrete gradient method (1.2) to a preconditioned version (7.1) where (A k ) k∈N ⊂ R n×n is a sequence of positive-definite matrices.Denoting by σ 1,k and σ n,k the smallest and largest singular values of A k respectively, we have, for all x, σ 1,k x ≤ A k x ≤ σ n,k x .It is straightforward to extend the results in Section 3 and Section 6 to this setting, under the assumption that there are σ max ≥ σ min > 0 such that σ min ≤ σ 1,k , σ n,k ≤ σ max for all k ∈ N.
There are several motivations to precondition.In the context of geometric integration, it is typical to group the gradient flow system (2.1) with the more general dissipative system ẋ = −A(x)∇V (x), where A(x) ∈ R n×n is positive-definite for all x ∈ R n (Quispel and Turner, 1996).This yields numerical schemes of the form (7.1), where we absorb τ k into A k .There are optimisation problems in which the time step τ k should vary for each coordinate.This is, for example, the case when one derives the SOR method from the Itoh-Abe discrete gradient method (Miyatake et al., 2018).
More generally, if one has coordinate-wise Lipschitz constants for the gradient of the objective function, it may be beneficial to scale the coordinate-wise time steps accordingly.

Numerical experiments
In this section, we apply the discrete gradient methods to various test problems.The code for the figures has been implemented in Python and MATLAB and is available at https://github.com/riis-academic/discrete-gradient-method-smooth-optimisation/.For solving the discrete gradient equation (1.2) with the Gonzalez and mean value discrete gradients, we use the fixed point method (4.2) detailed in Section 4 and tested numerically in Section 8.6 under the label 'R'.For solving (1.2) for the Itoh-Abe method, we use the built-in solver scipy.optimize.fsolve in Python.
For these test problems, we generally assume knowledge of the Lipschitz constant L when setting time steps.However, in Section 8.5 we assess Itoh-Abe methods with a wider range of time steps.8.1.Setup.We use the following time steps for the different methods, unless otherwise specified.For the mean value discrete gradient method, we use τ MV = 2/L, for the Gonzalez discrete gradient method, we use τ G = 2/L, and for the Itoh-Abe methods, we use the coordinate-dependent time steps τ IA,i = τ RIA,i = 2/L i .Note that the time steps for the Itoh-Abe discrete gradient method are not the optimal choice suggested in Table 1, but were heuristically optimal for the test problems we considered.An analysis of coordinate-dependent time step strategies is an open topic for future research.
In figure captions and legends, the abbreviations CIA and RIA refer respectively to the (cyclic) Itoh-Abe discrete gradient method and the randomised Itoh-Abe method drawing uniformly from the standard coordinates.For the sake of comparison, we define one iterate of the randomised Itoh-Abe methods as n scalar updates, so that the computational time is comparable to the standard Itoh-Abe discrete gradient method.
Unless otherwise specified, matrices and vectors are constructed by standard Gaussian draws.To provide the matrix with a given condition number, we compute its singular value decomposition and linearly transform its eigenvalues accordingly.8.2.Linear systems.We first solve linear systems of the form For linear systems, the Gonzalez and the mean value discrete gradient are both given by ∇V (x, y) = ∇V ( x+y 2 ) = A * (A x+y 2 − b), so we consider these jointly.As discussed previously, the Itoh-Abe methods reduce to SOR methods for linear systems and are therefore explicit.8.2.1.Effect of the condition number.We set n = 500 and consider two linear systems respectively with a low condition number κ = L/µ = 10 and a high one κ = 1, 000.In both cases, we set x 0 = 0. See Figure 1 for the results for both cases.
. Linear rate is observed for all methods and is sensitive to condition number.8.2.2.Sharpness of rates.We test the sharpness of the convergence rate (6.6) for the randomised Itoh-Abe method.To do so, we run 100 instances of the numerical experiment in the previous subsection and plot the mean convergence rate and 90%confidence intervals, and compare the results to the proven rate.We do this for two condition numbers, κ = 1.2 and 10.The results are presented in Figure 2.These plots suggest that the proven convergence rate estimate is sharp for the randomised Itoh-Abe method.Comparison of observed convergence rate with theoretical convergence rate (6.6), for randomised Itoh-Abe method applied to linear system with condition numbers κ = 1.2 (left) and κ = 10 (right).Average convergence rate and confidence intervals as estimated from 100 runs on the same system.The sharpness of the proven convergence rate is observed in both cases.
8.2.3.Linear system with kernel.Next we consider linear systems where the operator A has a nontrivial kernel, meaning that the objective function is not strongly convex, but nevertheless satisfies the P L inequality.We let A ∈ R m×n and b ∈ R m , where n = 800 and m = 400, meaning the kernel of A has dimension 400.See Figure 3 for results. .DG methods for linear systems with nontrivial kernel, and convergence rate plotted as relative objective.The function is not strongly convex but satisfies the P L inequality, yielding linear convergence rates.8.2.4.A note of caution.The performance of coordinate descent methods and their optimal time steps vary significantly with the structure of the optimisation problem (Gürbüzbalaban et al., 2017;Sun and Ye, 2021;Wright and Lee, 2020).If the linear systems above were constructed using a random distribution whose mean is not zero, then the results would look different.We demonstrate this with a numerical test with results in Figure 4.
We compare two time steps for the cyclic Itoh-Abe method, τ i = 2/L i and τ i = 2/(L i √ n), denoted by the curves labelled 'heuristic' and 'proven' respectively.While the heuristic time step was superior for most of the test problems considered in this section, it performs significantly worse for this example.Furthermore, in this case the randomised Itoh-Abe method converges faster than the cyclic one.where C > 0. We set n = 100, m = 200, C = 1, and the elements of (y i ) m i=1 is drawn from {−1, 1} with equal probability.See Figure 5 for the numerical results.Convergence rate plotted as relative objective.The rates of randomised and cyclic Itoh-Abe methods almost coincide, and so do the mean value and Gonzalez discrete gradient methods.8.4.Nonconvex function.We solve the nonconvex problem (8.3) min x∈R n V (x) = Ax2 + 3 sin 2 ( c, x ), where A ∈ R n×n is a square, nonsingular matrix, and c ∈ R n satisfies Ac = c and c = 1.This is a higher-dimensional extension of the scalar function x 2 + 3 sin 2 (x) considered by Karimi et al. (2016).This scalar function satisfies the P L inequality (6.5) for µ = 1/32, and it follows that V satisfies it for µ = 1/(32κ), where κ is the condition number of A * A. Furthermore, the nonconvexity of V is observed by considering the restriction of V to x = λc for λ ∈ R, which has the form of the original scalar function.The function has the unique minimiser x * = 0.
We set n = 50 and choose x 0 constructed by random, independent draws from a Gaussian distribution.See Figure 6 for the numerical results.8.5.Comparison of Itoh-Abe and coordinate descent for stiff problems.In image analysis and signal processing, variational optimisation problems often feature nonsmooth regularisation terms to promote sparsity, e.g. in the gradient domain or a wavelet basis.While one may replace these terms with smooth approximations, this can lead to stiffness of the optimisation problem, i.e. local, rapid variations in the gradient, requiring the use of severely small time steps for explicit numerical methods.In such cases, the cost of solving an implicit equation such as (1.2) may be preferrable to explicit methods.
We investigate this scenario, by comparing the Itoh-Abe discrete gradient method to cyclic coordinate descent (CD) (B.1) for solving (smoothed) total variation denoising problems.We denote by x true ∈ R n a ground truth image 2 , and by ) .Linear convergence rates are observed for the objective value and the gradient of the norm.
In Figure 7, we compare the DG method for a range of time steps to CD.This demonstrates that the superior convergence rate of the DG method is stable with respect to a wide range of time steps.In Figure 8, we compare the DG method to CD for different values of ε, demonstrating that the benefits of using the DG method increases as ε gets smaller.In Figure 9, we compare different time steps for CD to the DG method, showing that for large time steps, the CD scheme is unstable and fails to decrease while for small time steps, the iterates decrease too slowly.In Figure 10, we employ a simple backtracking line search (LS) method based on the Armijo-Goldstein condition, and compare this to the DG method.8.6.Comparison of methods for solving the discrete gradient equation.We test the numerical performance of four methods for solving the discrete gradient equation (1.2), building on the fixed point theory in Section 4.
The first method, denoted F, is the fixed point updates (4.1) proposed by Norton and Quispel (2014) (θ = 1).The second method, denoted R, is the relaxed fixed point method (4.2),where θ is optimised according to (4.3) if V is convex, and is otherwise set to 1/2.The third method, denoted F+R, is also the updates (4.2) with θ = 1 by default, but whenever the discrepancy T (y k+1 ) − y k+1 is greater than T (y k ) − y k , then the update is repeated with θ set to half its previous value.This third option might be desirable in cases where θ = 1 is expected to give faster convergence but also be unstable.The fourth method is the built-in solver scipy.optimize.fsolve in Python.As the problem becomes more ill-conditioned-when ε is reduced-the performance of CIA improves relative to CD, demonstrating the CIA scheme's resilience to nonsmooth features.
To test these methods, we performed 50 iterations of the discrete gradient method for different test problems, where at each iterate the discrete gradient solver would run until for some tolerance ε > 0, or until k reaches a maximum K max .We then compare the average CPU time (s) for each of these methods.If a method fails to converge for a significant number of the iterations (> 10%), we consider the method inapplicable for that test problem.We test the methods for the mean value discrete gradient applied to three of the previous test problems, for = 10 −6 and 10 −12 .We have not included results for the Gonzalez discrete gradient and other tolerances, as the results were largely the same.
The results are given in Table 2, where the best result is highlighted in bold for each test problem.We see that R is superior in stability, being the only method that locates the minimiser in every case.In all cases, R or F+R were the most efficient or close to the most efficient method.However, the relative performances of the different methods vary notably for the different test problems.

Conclusion
In this paper, we studied the discrete gradient method for optimisation, and provided several fundamental results on well-posedness, convergence rates and optimal time steps.We focused on four methods, using the Gonzalez discrete gradient, the mean value discrete gradient, the Itoh-Abe discrete gradient, and a randomised version of the Itoh-Abe method.Several of the proven convergence rates match the optimal rates of classical methods such as gradient descent and stochastic coordinate descent.For the Itoh-Abe discrete gradient method, the proven rates are better than previously established rates for comparable methods, i.e. cyclic coordinate descent methods (Wright, 2015).
There are open problems to be addressed in future work.Similar to acceleration for gradient descent, proximal gradient descent (Chambolle and Pock, 2016), and coordinate descent (Beck and Tetruashvili, 2013;Nesterov, 1983Nesterov, , 2012;;Wright, 2015), we will study acceleration of the discrete gradient method to improve the convergence rate from O(1/k) to O(1/k 2 ).We would furthermore like to consider generalisations of the discrete gradient method to discretise gradient flows with respect to other measures of distance than the Euclidean inner product (Benning et al., 2013;Burger et al., 2009).
In (Beck and Tetruashvili, 2013) (see Lemma 3.3) and referenced by Wright (2015), the estimate β = 4L max 1 + nL 2 /L 2 min , is obtained, using the time step τ i = 1/L i .This rate is optimised with respect to L min , L max when setting L min = L max = √ nL, yielding β = 8 √ nL.However, we show in Section 6.1 that the Itoh-Abe discrete gradient method achieves the stronger bound β = 4L sum ≤ 4 √ nL.We therefore include an analysis to show that the bound for CD can similarly be improved.
By the coordinate-wise descent lemma (6.2), we have For some α ∈ (0, 2), we choose τ i = α/L i , and substitute into the above inequality to get We then compute Setting α = 1/ √ n and L i = L, we obtain the new estimate for β, This is approximately the same rate as that of the Itoh-Abe discrete gradient method.
Coordinate descent methods are often extended to block coordinate descent methods.The above analysis can be extended to this setting by replacing n with the number of blocks.

Figure 1 .
Figure 1.DG methods for linear systems with condition number κ = 10 (left) and κ = 1, 000 (right).Convergence rate plotted as relative objective [V (x k ) − V * ]/[V (x 0 ) − V * ].Linear rate is observed for all methods and is sensitive to condition number.
Figure 2.Comparison of observed convergence rate with theoretical convergence rate (6.6), for randomised Itoh-Abe method applied to linear system with condition numbers κ = 1.2 (left) and κ = 10 (right).Average convergence rate and confidence intervals as estimated from 100 runs on the same system.The sharpness of the proven convergence rate is observed in both cases.
Figure3.DG methods for linear systems with nontrivial kernel, and convergence rate plotted as relative objective.The function is not strongly convex but satisfies the P L inequality, yielding linear convergence rates.

Figure 4 .
Figure 4. CIA and RIA methods applied to linear system, with matrix entries created from uniform distribution.CIA with the time step τ = 1/[ √ nL] (orange, circle) performs better than the same method with heuristic time step τ = 2/L (blue, triangle), but worse than RIA.This is the reverse of what was observed in previous examples.

Figure 5 .
Figure 5. DG methods for l 2 -regularised logistic regression (8.2).Convergence rate plotted as relative objective.The rates of randomised and cyclic Itoh-Abe methods almost coincide, and so do the mean value and Gonzalez discrete gradient methods.

Figure 6 .
Figure6.DG methods applied to nonconvex problem (8.3) that satisfies the P L inequality.Left: Relative objective.Right: Norm of gradient ∇V (x k ) / ∇V (x 0 ) .Linear convergence rates are observed for the objective value and the gradient of the norm.

Figure 7 .Figure 8 .
Figure 7.Comparison of CIA with different time steps to CD with standard time step 1/L for solving the total variation problem (8.4).Left: Relative objective.Middle: Noisy image.Right: Reconstructed image.For the larger time steps, the CIA method converges significantly faster than both the CD method and the CIA method with the usual time step τ i = 2/L.

Figure 9 .Figure 10 .
Figure9.Comparison CD using different time steps versus CIA using a fixed time step, for the total variation problem (8.4).For smaller time steps, the CD iterates decrease too slowly, and for larger steps, they become unstable and fail to decrease.