Stochastic Heavy Ball

This paper deals with a natural stochastic optimization procedure derived from the so-called Heavy-ball method differential equation, which was introduced by Polyak in the 1960s with his seminal contribution [Pol64]. The Heavy-ball method is a second-order dynamics that was investigated to minimize convex functions f . The family of second-order methods recently received a large amount of attention, until the famous contribution of Nesterov [Nes83], leading to the explosion of large-scale optimization problems. This work provides an in-depth description of the stochastic heavy-ball method, which is an adaptation of the deterministic one when only unbiased evalutions of the gradient are available and used throughout the iterations of the algorithm. We first describe some almost sure convergence results in the case of general non-convex coercive functions f . We then examine the situation of convex and strongly convex potentials and derive some non-asymptotic results about the stochastic heavy-ball method. We end our study with limit theorems on several rescaled algorithms.


Introduction
Finding the minimum of a function f over a set Ω with an iterative procedure is very popular among numerous scientific communities and has many applications in optimization, image processing, economics and statistics, to name a few. We refer to [NY83] for a general survey on optimization algorithms and discussions related to complexity theory, and to [Nes04,BV04] for a more focused presentation on convex optimization problems and solutions. The most widespread approaches rely on some first-order strategies, with a sequence pX k q kě0 that evolves over Ω with a first-order recursive formula X k`1 " ΨrX k , f pX k q, ∇f pX k qs that uses a local approximation of f at point X k , where this approximation is built with the knowledge of f pX k q and ∇f pX k q alone. Among them, we refer to the steepest descent strategy in the convex unconstrained case, and to the Frank-Wolfe [FW56] algorithm in the compact convex constrained case. A lot is known about first-order methods concerning their rates of convergence and their complexity. In comparison to second-order methods, first-order methods are generally slower and are significantly degraded on ill-conditioned optimization problems. However, the complexity of each update involved in first-order methods is relatively limited and therefore useful when dealing with a large-scale optimization problem, which is generally expensive in the case of Interior Point and Newton-like methods. A second-order "optimal" method was proposed in [Nes83] in the 1980s' (also see [AB09] for an extension of this method with proximal operators). The so-called Nesterov Accelerated Gradient Descent (NAGD) has particularly raised considerable interest due to its numerical simplicity, to its low complexity and to its mysterious behavior, making this method very attractive for large-scale machine learning problems. Among the available interpretations of NAGD, some recent advances have been proposed concerning the second-order dynamical system by [WSC16], being a particular case of the generalized Heavy Ball with Friction method (referred to as HBF in the text), as previously pointed out in [CEG09a,CEG09b]. In particular, as highlighted in [CEG09a], NAGD may be seen as a specific case of HBF after a time rescaling t " ? s, thus making the acceleration explicit through this change of variable, as well as being closely linked to the modified Bessel functions when f is quadratic.
A growing field of interest related to these optimization algorithms concerns the development of efficient procedures when only noisy gradients are available at each iteration of the procedure. On the practical side, this question was first introduced in the seminal contributions on stochastic approximation and optimization of [RM51] and [KW52]. Even though the Robbins-Monro algorithm is able to achieve an optimal Op1{nq rate of convergence for strongly convex functions, its ability is highly sensitive to the step sizes used. This remark led [PJ92] to develop an averaging method that makes it possible to use longer step sizes of the Robbins-Monro algorithm, and to then average these iterates with a Cesaro procedure so that this method produces optimal results in the minimax sense (see [NY83]) for convex and strongly convex minimization problems, as pointed out in [BM11].
On the theoretical side, numerous studies have addressed a dynamical system point of view and studied the close links between stochastic algorithms and their deterministic counterparts for some general function f (i.e., even non convex). These links originate in the famous Kushner-Clark Theorem (see [KY03]) and successful improvements have been obtained using differential geometry by [BH96,Ben06] on the long-time behavior of stochastic algorithms. In particular, a growing field of interest concerns the behavior of self-interacting stochastic algorithms (see, among others, [BLR02] and [GP14]) because these non-Markovian processes produce interesting features from the modeling point of view (an illustration may be found in [GMP15]).
Several theoretical contributions to the study of second-order stochastic optimization algorithms exist. [Lan12] explores some adaptations of the NAGD in the stochastic case for composite (strongly or not) convex functions. Other authors [GL13,GL16] obtained convergence results for the stochastic version of a variant of NAGD for non-convex optimization for gradient Lipschitz functions but these methods cannot be used for the analysis of the Heavy-ball algorithm. Finally, a recent work [YLL16] proposes a unified study of some stochastic momentum algorithms while assuming restrictive conditions on the noise of each gradient evaluation and on the constant step size used. It should be noted that [YLL16] provides a preliminary result on the behavior of the stochastic momentum algorithms in the non-convex case with possible multi-well situations. Our work aims to study the properties of a stochastic optimization algorithm naturally derived from the generalized heavy ball with friction method.
Our paper is organized as follows: Section 2 introduces the stochastic algorithm as well as the main assumptions needed to obtain some results on this optimization algorithm. For the sake of readability, these results are then provided in Section 2.4 without too many technicalities. The rest of the paper then deals with the proof of these results. Section 3 is dedicated to the almost sure converegence result we can obtain in the case of a non-convex function f with several local minima. Section 4 establishes the convergence rates of the stochastic heavy ball in the strongly convex case. Section 5 provides a central limit theorem in a particular case of the algorithm. Appendix A consists of some important results on the supremum of certain random variables needed for the non-convex case.

Stochastic Heavy Ball
We begin with a brief description of what is known about the underlying dynamical system.

Deterministic Heavy Ball
This method introduced by Polyak in [Pol64] is inspired from the physical idea of producing some inertia on the trajectory to speed up the evolution of the underlying dynamical system: a ball evolves over the graph of a function f and is submitted to both damping (due to a friction on the graph of f ) and acceleration. More precisely, this method is a second-order dynamical system described by the following O.D.E.: : xt`γt 9 xt`∇f pxtq " 0, where pγtqtě0 corresponds to the damping coefficient, which is a key parameter of the method. In particular, it is shown in [CEG09a] that the trajectory converges only under some restrictive conditions on the function pγtqtě0, namely: • if`8 ş Intuitively, these conditions translate the oscillating nature of the solutions of (1) into a quantitative setting for the convergence of the trajectories: if γt ÝÑ 0 is sufficiently fast, then the trajectory cannot converge (the limiting case being : x`∇f pxq " 0). These properties lead us to consider two natural families of functions pγtqtě0: γt " r{t with r ą 1 and γt " γ ą 0. To convert (1) into a tractable iterative algorithm, it is necessary to rewrite this O.D.E. using a coupled momentum equation. Consistent with [CEG09b], (1) is equivalent to the following integro-differential equation: where h and k are two memory functions related to γ. In the natural situation of two positive increasing functions h and k, if pxtqtě0 is a solution of (2), then pxsqsě0 is solution of (1) with: xs " x τ psq and 9 τ psq " a pkh´1qpτ psqq with γs " 9 kh`k 9 h 2h 3{2 k 1{2˝τ psq. We can consider two typical situations where the deterministic HBF (1) converges (see [CEG09a] for further details): • The exponentially memoried HBF corresponds to the choice kptq " λe λt and hptq " e λt and to a constant damping function γs " ? λ when the time scale is given by τ psq " ? λs. Note that in this situation, the two convergence conditions are satisfied since: • The polynomially memoried HBF corresponds to the choice kptq " t α`1 and hptq " pα`1qt α and is associated with an asymptotically vanishing damping γs " 2α`1 s and a time scale τ psq " s 2 4pα`1q , where the choice α " 1 is associated with the NAGD (see [WSC16] and their "magic" constant 3 " 2α`1 in that case).

Stochastic HBF
All these remarks lead to the consideration of a natural stochastic version of (2) when h " 9 k. As pointed out by [GP14], the introduction of an auxiliary function yt " kptq´1 ş t 0 hpsq∇f pxsqds makes it possible to obtain a first-order Markov evolution because 9 yt " rtp∇f pxtq´ytq with rt " hptq kptq . Hence, we define the stochastic Heavy Ball system as pX0, Y0q " px, yq P R 2d and:

#
Xn`1 " Xn´γn`1Yn Yn`1 " Yn`γn`1rnp∇f pXnq´Ynq`γn`1rn∆Mn`1, where the natural filtration of the sequence pXn, Ynqně0 is denoted pFnqně1 and: • p∆Mnq is a sequence of Fnq-martingale increments. For applications, ∆Mn`1 usually represents the difference between the "true" value of ∇f pXnq and the one observed at iteration n denoted BxF pXn, ξnq, where pξnqn is a sequence of i.i.d. random variables and F is an R d -valued measurable function such that: In this case, ∆Mn`1 " ∇f pXnq´BxF pXn, ξnq.
The randomness appears in the second component of the algorithm (3), whereas it was handled in the first component in [GP14]. We will introduce some assumptions on f and on the martingale sequence later.
• pγnqně1 corresponds to the step size used in the stochastic algorithm, associated with the "time" of the algorithm represented by: Γn "`8.
For the sake of convenience, we also define: which may converge or not according to the choice of the sequence pγ k q kě1 .
• prnqně1 is a deterministic sequence that mimics the function t Þ ÝÑ rt defined as: In particular, when an exponentially weighted HBF with kptq " e rt is chosen, we have rn " r ą 0, regardless of the value of n. In the other situation where kptq " t r , we obtain rn " rΓ´1 n .

Baseline assumptions
We introduce some of the general assumptions we will work with below. Some of these conditions are very general, whereas others are more specifically dedicated to the analysis of the strongly convex situation. We will use the notation }.} (resp. }.}F ) below to refer to the Euclidean norm on R d (resp. the Frobenius norm on M d,d pRq). Finally, when A P M d,d pRq, }A}8 will refer to the maximal size of the modulus of the coefficients of A: }A}8 :" sup i,j |Ai,j|. Our theoretical results will obviously not involve all of these hypotheses simultaneously.
Function f We begin with a brief enumeration of assumptions on the function f .
‚ Assumption pHsq : f is a function in C 2 pR d , Rq such that: The assumption pHsq is weak: it essentially requires that f be smooth, coercive and have, at the most, a quadratic growth on 8. In particular, no convexity hypothesis is made when f satisfies pHsq. It would be possible to extend most of our results to the situation where f is L-smooth (with a L-Lipschitz gradient), but we preferred to work with a slightly more stringent condition to avoid additional technicalities.
‚ Assumption pHSC pαqq : f is a convex function such that α " inf xPR d Sp`D 2 f pxq˘ą 0 and D 2 f is Lipschitz.
In particular, pHSC pαqq implies that f is α-strongly convex, meaning that: Of course, pHSC pαqq is still standard and is the most favorable case when dealing with convex optimization problems, leading to the best possible achievable rates. pHSC pαqq translates the fact that the spectrum of the Hessian matrix at point x, denoted by Sp`D 2 f pxq˘, is lower bounded by α ą 0, uniformly over R d . The fact that D 2 f is assumed to be Lipschitz will be useful to achieve convergence rates in Section 4.2.
Noise sequence p∆M n`1 q ně1 We will essentially use three types of assumptions alternatively on the noise of the stochastic algorithm (3). The first and second assumptions are concerned with a concentration-like hypothesis. The first one is very weak and asserts that the noise has a bounded L 2 norm.
‚ Assumption pHσ,pq : (p ě 1) For any integer n, we have: The assumption pHσ,2q is a standard convergence assumption for general stochastic algorithms. For some nonasymptotic rates of convergence results, we will rely on pHσ,pq for any p ě 1. In this case, we will denote the assumption by pHσ,8q. Finally, let us note that the condition could be slightly alleviated by replacing the right-hand member by σ 2 p1`f pXnq`|Yn| 2 q p . However, in view of the standard case (4), this improvement has little interest in practice, which explains our choice.
‚ Assumption pH Gauss,σ q : For any integer n, the Laplace transform of the noise satisfies: This hypothesis is much stronger than pHσ,pq and translates a sub-Gaussian behavior of p∆Mn`1qně1. In particular, it can be easily shown that pH Gauss,σ q implies pHσ,pq. Hence, pH Gauss,σ q is somewhat restrictive and will be used only to obtain one important result in the non-convex situation for the almost sure limit of the stochastic heavy ball with multiple wells.
‚ Assumption pHE q : For any iteration n, the noise of the stochastic algorithm satisfies: This assumption will be essential to derive an almost sure convergence result towards minimizers of f . Roughly speaking, this assumption states that the noise is uniformly elliptic given any current position of the algorithm at step n: the projection of the noise has a non-vanishing component over all directions v. We will use this assumption to guarantee the ability of (3) to get out of any unstable point.
Step sizes One important step in the use of stochastic minimization algorithms relies on an efficient choice of the step sizes involved in the recursive formula (e.g. in Equation 3). We will deal with the following sequences pγnqně0 below.
Memory size We consider the exponentially and polynomially-weighted HBF as a unique stochastic algorithm parameterized by the memory function prnqně1. From the definition of rn given in (5), we note that in the exponential case, rn " r remains constant while the inertia brought by the memory term in the polynomial case prnq nPN is defined by rn " r Γn . Under Assumption pH γ β q, we can show that regardless of the memory, we have: This is true when rn " r because γn " γn´β with β ď 1. It is also true when we deal with a polynomial memory since in that case: • if β ă 1, then γnrn " γn´βˆrp1´βqγ´1n´1`β " rp1´βqn´1 • if β " 1, then γnrn " r n log n and ř kďn γ k r k " logplog nq.
Similarly, we also have that in the polynomial case, regardless of β: although this bound holds in the exponential situation when β ą 1{2. Below, we will use these properties on the sequences pγnqně0 and prnqně0 and define the next set of assumptions: ‚ Assumption pHrq: The sequence prnqně0 is a non-increasing sequence such that: ÿ ně1 γn`1rn "`8 and ÿ ně1 γ 2 n`1 rn ă`8 and lim sup nÑ`8 1 2γn`1ˆ1 rn´1 rn´1˙" : cr ă 1.
In the exponential case, cr " 0, whereas if rn " r{Γn, it can be shown that cr " 1 2r and the last point is true when r ą 1{2. In any case, r8 will refer to the limiting value of rn when n ÝÑ`8, which is either 0 or r ą 0.

Main results
Section 3 is dedicated to the situation of a general coercive function f . We obtain the almost sure convergence of the stochastic HBF towards a critical point of f .
Theorem 1 Assume that f satisfies pHsq, that pHσ,2q holds and that and the sequences pγnqně1 and prnqně1 are chosen such that pH γ β q and pHrq are fulfilled. If for any z, tx, f pxq " zu X tx, ∇f pxq " 0u is locally finite, then pXnq a.s. converges towards a critical point of f . This result obviously implies the convergence when f has a unique critical point. In the next theorem, we focus on the case where this uniqueness assumption fails, under the additional elliptic assumption pHE q.
Theorem 2 Assume that f satisfies pHsq, that the noise is elliptic, i.e., pHE q holds, and the sequence pγnqně1 is chosen such that pH γ β q and pHrq are fulfilled. If for any z, tx, f pxq " zu X tx, ∇f pxq " 0u is locally finite, we have: paq If rn " r (exponential memory) and pHσ,2q holds, then pXnq a.s. converges towards a local minimum of f .
Remark 3 £ The previous result provides some guarantees when f is a multiwell potential. In paq, we consider the exponentially weighted HBF and show that the convergence towards a local minimum of f always holds under the additional assumption pHE q. To derive this result, we will essentially use the former results of [BD96] on "homogeneous" stochastic algorithms. £ Point pbq is concerned by polynomially-weighted HBF and deserves more comment: • First, the result is rather difficult because of the time inhomogeneity of the stochastic algorithm, which can be written as Zn`1 " Zn`γn`1FnpZnq`γn`1∆Mn`1: the drift term Fn depends on Zn and on the integer n, which will induce technical difficulties in the proof of the result. In particular, the assumption β ă 1{3 will be necessary to obtain a good lower bound of the drift term in the unstable manifold direction with the help of the Poincaré Lemma near hyperbolic equilibrium of a differential equation.
• Second, the sub-Gaussian assumption pH Gauss,σ q is less general than pHσ,2q even though it is still a reasonable assumption within the framework of a stochastic algorithm. To prove pbq, we will need to control the fluctuations of the stochastic algorithm around its deterministic drift, which will be quantified by the expectation of the random variable sup kěn γ 2 k }∆M k } 2 . The sub-Gaussian assumption will be mainly used to obtain an upper bound of such an expectation, with the help of a coupling argument. Our proof will follow a strategy used in [Pem90] and [Ben06] where this kind of expectation has to be upper bounded. Nevertheless, the novelty of our work is also to generalize the approach to unbounded martingale increments: the arguments of [Pem90,Ben06] are only valid for a bounded martingale increment, which is a somewhat restrictive framework.
In Section 4, we focus on the consistency rate under stronger assumptions on the convexity of f . In the exponential memory case, we are able to control the quadratic error and to establish a CLT for the stochastic algorithm under the general assumption pHSC pαqq. In the polynomial case, the problem is more involved and we propose a result for the quadratic error only when f is a quadratic function (see Remark 5 for further comments on this restriction). More precisely, using the notation À to refer to an inequality, up to a universal multiplicative constant, we establish the following results.
Theorem 4 Denote by x ‹ the unique minimizer of f and assume that pH γ β q, pHsq, pHSC pαqq and pHσ,2q hold, we have: paq When rn " r (exponential memory) and β ă 1, we have: If pHσ,8q holds and β " 1, set αr " rˆ1´b1´p 4λq^r r˙w here λ denotes the smallest eigenvalue of D 2 f px ‹ q. We have, for any ε ą 0: pbq Let f : R d Ñ R be a quadratic function. Assume that rn " rΓ´1 n (polynomial memory) with β ă 1. Then, if r ą 1`β 2p1´βq , we have: When rn " rΓ´1 n (polynomial memory) and β " 1, we have: For paq, the case β ă 1 is a consequence of Proposition 20 (or Proposition 15 in the quadratic case), whereas the (more involved) case β " 1 is dealt with Propositions 15 and 22 for the quadratic and the non-quadratic cases, respectively. We first stress that that when β ă 1, the noise only needs to satisfy pHσ,pq to obtain our upper bound. When we deal with β " 1, we could prove a positive result in the quadratic case when we only assume pHσ,pq. Nevertheless, the stronger assumption pHσ,8q is necessary to produce a result in the general strongly convex situation. Finally, pbq is a consequence of Proposition 17.
Remark 5 £ It is worth noting that in paq (β " 1), the dependency of the parameter αr in D 2 f only appears through the smallest eigenvalue of D 2 f px ‹ q. In particular, it does not depend on inf xPR d λ D 2 f pxq as it could be expected in this type of result. In other words, we are almost able to retrieve the conditions that appear when f is quadratic. This optimization of the constraint is achieved with a "power increase" argument, but this involves a stronger assumption pHσ,8q on the noise. £ The restriction to quadratic functions in the polynomial case may appear surprising. In fact, the "power increase" argument does not work in this non-homogeneous case. However, when β ă 1, it would be possible to extend to non-quadratic functions through a Lyapunov argument (on this topic, see Remark 21), but under some quite involved conditions on r, β and the Hessian of f . Hence, we chose to only focus on the quadratic case and to try to obtain some potentially optimal conditions on r and β only (in particular, there is no dependence to the spectrum of D 2 f ). The interesting point is that it is possible to preserve the standard rate order when β ă 1 but under the constraint r ą 1`β 2p1´βq , which increases with β. In particular, the rate Opn´1q cannot be attained in this case (see Remark 18 for more details). Finally, we conclude by a central limit theorem related to the stochastic algorithm the exponential memory case.
Theorem 6 Assume pHsq and pHSC pαqq are true. Suppose that rn " r and that pH γ β q holds with β P p0, 1q or, β " 1 and γαr ą 1. Assume that pHσ,pq holds with p ą 2 when β ă 1 and p " 8 when β " 1. Finally, suppose that the following condition is fulfilled: where V is a symmetric positive dˆd-matrix. Let σ be a dˆd-matrix such that σσ t " V. Then, piq The normalized algorithm´X n ? γn , Yn ? γn¯n converges in law to a centered Gaussian distribution µ pβq 8 , which is the invariant distribution of the (linear) diffusion with infinitesimal generator L defined on C 2 -functions by: piiq In the simple situation where V " σ 2 0 I d (σ0 ą 0) and β ă 1. In this case, the covariance of µ pβq 8 is given by Remark 7 £ As a first comment of the above theorem, let us note that in the fundamental example where: ∆Mn`1 " ∇f pXnq´BxF pXn, ξnq, n ě 1, the additional assumption (6) is a continuity assumption. Actually, in this case: Er∆Mn∆M t n |Fn´1s "VpXnq, withVpxq " CovpF px, ξ1qq.
Thus, since Xn Ñ x ‹ a.s., Assumption (6) is equivalent to the continuity ofV in x ‹ so that: £ Point piiq of Theorem 6 reveals the behavior of the asymptotic variance of Y increases with r. This translates the fact that the instantaneous speed coordinate Y is proportional to r in Equation (3), which then implies a large variance of the Y coordinate when we use an important value of r. £ When β " 1, it is also possible (but rather technical) to make the limit variance explicit. The expression obtained with the classical stochastic gradient descent with step-size γn´1 and Hessian λ, the asymptotic variance is γ{p2λγ´1q, whose optimal value is attained when γ " λ´1 (it attains the Cramer-Rao lower bound). Concerning now the stochastic HBF, for example, when d " 1 and r ě 4λ (the result is still valid in higher dimensions, see Section 5), we can show that: whereα`" 1`b1´4 λ r andα´" 1´b1´4 λ r . Similar expressions may be obtained when r ă 4λ. Note also that we assumed that γαr ą 1, and it is easy to check that this condition implies that γr ą 1 because αr ď r, regardless of r. In the meantime, this condition also implies that 2λγ ąα`ěα´.
Finally, This explicit value could be used to find the optimal calibration of the parameters to obtain the best asymptotic variance. Unfortunately, the expressions are rather technical and we can see that such calibrations are far from being independent of λ, the a priori unknown Hessian of f on x ‹ .

Almost sure convergence of the stochastic heavy ball
In this section, the baseline assumption on the function f is pHsq, and we are thus interested in the almost sure convergence of the stochastic HBF. In particular, we do not make any convexity assumption on f . Below, we will sometimes use standard and sometimes more intricate normalizations for the coupled process Zn " pXn, Ynq. These normalizations will be of a different nature and, to be as clear as possible, we will always use the same notation q Zn andZn to refer to a rotation of the initial vector Zn, whereas r Zn will introduce a scaling in the Yn component of Zn by a factor ? rn.

Preliminary result
We first state a useful upper bound that makes it possible to derive a Lyapunov-type control for the mean evolution of the stochastic algorithm pXn, Ynqně1 described by (3). This result is based on the important function px, yq Þ ÝÑ Vnpx, yq that depends on two parameters pa, bq P R 2 defined by: Vnpx, yq " pa`brn´1qf pxq`a 2rn´1 }y} 2´b x∇f pxq, yy.
We will show that Vn plays the role of a (potentially time-dependent) Lyapunov function for the sequence pXn, Ynqně1. The construction of Vn shares a lot of similarity with other Lyapunov functions built to control second-order systems.
If the two first terms are classical and generate a´}y} 2 term, the last one is more specific to hypo-coercive dynamics and was already used in [Har91]. Recent works fruitfully exploit this kind of Lyapunov function (see, among others, the kinetic Fokker-Planck equations in [Vil09] and the memory gradient diffusion in [GP14]). This function is obtained by the introduction of some Lie brackets of differential operators, leading to the presence of x∇f pxq, yy that generates a mean reverting effect on the variable x.
It follows from Assumption pHsq that }∇f } 2 ď c f f . Using the above inequality, we obtain that for any x, y P R d : Choosing now a and b such that a ą b{2 and a ą br8pc f´1 q, we obtain the first assertion follows from (8).P oint piiq: The Taylor formula ensures the existence of ξn`1,1 and ξn`1,2 in rXn, Xn`1s such that: Combining the similar terms leads to: Vn`1pXn`1, Yn`1q " VnpXn, Ynq´bprn´rn´1qf pXnq γn`1x∇f pXnq, Yny¨´a´brn`a`brn looooooooooomooooooooooon "0‚´γ where p∆Nnqně1 is a sequence of martingale increments, Dn is a dˆd-matrix defined by: and ∆Rn`1 is a remainder term. Using pHsq, we know that D 2 f is bounded, and we have the following bound for ∆Rn`1: where C2 is a deterministic positive constant independent of n. The fact that prnqně1 is a bounded sequence combined with Assumptions pHσ,2q and pHsq yields Er}∆Rn`1}|Fns ď C2γn`1rn`1`}Yn} 2`f pXnq˘. It follows that: Second, the condition given by (8) shows that an integer n1 ě n0 and a constant c a,b ą 0 exist such that: Using the previous bounds in Vn`1pXn`1, Yn`1q and the fact that prnq nPN is non-increasing shows that: Dn2 ě n1 @n ě n2 : ErVn`1pXn`1, Yn`1q|Fns ď VnpXn, Ynqp1`Cγ 2 n`1 rnq´c a,b γn`1}Yn} 2´b γn`1rn}∇f pXnq} 2 .N ote that if pHrq holds, then Equation (10) provides a strong repelling effect on the system px, yq because in that case, ř γn`1rn "`8. This makes it possible to obtain a more precise a.s. convergence result, stated below.
Corollary 9 If pHσ,2q and pHsq hold and prnqně1 satisfies pHrq, then we have: (iv) pYn{ ? rnqně0 tends to 0 since n Ñ`8 and every limit point of pXnqně0 belong to tx, ∇f pxq " 0u. Furthermore, if for any z, tx, f pxq " zu X tx, ∇f pxq " 0u is locally finite, pXnqně0 converges towards a critical point of f .

Proof
Proof of piq´piiq´piiiq: Under the conditions on prnq, we can check that some positive a and b exist such that the conclusions of the previous lemma hold true. We then deduce that: ErVn`1pXn`1, Yn`1q|Fns ď VnpXn, Ynqp1`Cαn`1q´Un`1, with αn " γ 2 n rn and Un`1 " c a,b γn`1}Yn} 2`b γn`1rn}∇f pXnq} 2 . Subsequently, using the Robbins-Siegmund Theorem (see, e.g., Theorem 27 in Section A.1, borrowed from [Duf97]), we deduce, on the one hand, that sup ně1 ErVnpXn, Ynqs ằ 8 and that pVnpXn, Ynqqně1 almost surely (and in L 1 ) converge towards a random variable V8 P R`. In particular, the coercivity of f implies the a.s.-boundedness of pXnqně0. On the other hand, the Robbins-Siegmund Theorem also implies that: Hence, the three first statements follow.P roof of pivq: The proof relies on the so-called ODE method (see, e.g., [Ben06]). Set r8 " limnÑ`8 rn. We deal with cases r8 ą 0 and r8 " 0 separately.
Case r8 ą 0 (exponential memory): Set Γn " ř n k"0 γ k with the convention γ0 " 0. Denote by pzptqqtě0 the interpolated process defined byzpΓnq " Zn " pXn, Ynq 1 , n ě 0, with linear interpolations between times Γn and Γn`1 and letz pnq be the associated shifted-sequence defined by: Setting εn " p0, prn´1´r8qp∇f pXnq´Ynq`∆Mnq 1 and hpx, yq " p´y, r8p∇f pxq´yqq 1 , we have: Set N pn, tq " inftk ě n, γn`1`. . .`γ k ě tu (with the convention inf H " n). Then, since pZnqně0 is a.s.-bounded, it is a classical result on stochastic algorithm theory (see, e.g., [Duf97], Theorem 9.2.8 and the remark below) that if for any T ą 0, then pz pnq qně0 is relatively compact (for the topology of uniform convergence on compact sets) and its limit points are solutions to the ODE 9 z " hpzq. Let us prove (11). Let T ą 0. Using the Cauchy-Schwarz inequality, we have, for every t P r0, T s: where the last convergence follows from piiiq. On the basis of Assumption pHσ,2q and piiiq, we also note that px ř n k"1 γ k ∆M k yqně1 is a.s.-convergent so that ř γn∆Mn. It easily follows that: and that (11) is satisfied. Now, we again deduce from (12) that for any T ą 0, so that each limit point is stationary. At this stage, we have thus proven that every limit point of pz n qně0 is a stationary solution to 9 z " hpzq. This implies that any limit point Z8 of pZnqně0 satisfies hpZ8q " 0 (and thus Y8 " ∇f pX8q " 0). Actually, let pZn k q kě1 be a convergent subsequence of the (a.s. bounded) sequence pZnqně0 and denote its limit by Z8. Up to a second extraction, pz pn k q q converges to a stationary solutionz 8 of 9 z " hpzq. As a consequence, hpz 8 ptqq " 0 for any t ě 0. In particular, hpz 8 p0qq " hpZ8q " 0. By piiq and the fact that pYnqně0 converges to 0, we also deduce that pf pXnqqně0 is a.s.-convergent. To conclude the proof, it remains to observe that the set of possible limits of subsequences of pXnqně1 is connected. This is true since Xn´Xn´1 "´γnYn´1 Ñ 0 as n Ñ`8.C ase r8 " 0 (polynomial memory): In this case, the proof is somewhat similar but the identification of the asymptotic dynamics requires an appropriate normalization of Yn 1 . Let us set: Also set by r Zn " p r Xn, r Ynq 1 . The dynamic of r Zn is described by Lemma 10 below. We denote as pr zptqqtě0 the interpolated process, i.e. defined byzp r Γnq " r Zn, n ě 0, with linear interpolations between times r Γn and r Γn`1 and let z pnq be the associated shifted-sequence defined bỹ z pnq ptq "zpt`r Γnq t ě 0.
With this setting, the idea is to show that the sequence pz pnq ptqqtě0 is tight with limits being stationary solutions of a homogeneous O.D.E. 9 z "hpzq (h being the drift to be determined). The sequence p r Znqně0 satisfies Lemma 10 that shows that r Zn`1 " r Zn`r γn`1´hp r Znq`εn`1¯withhpx,ỹq :" p´ỹ, ∇f pxqq 1 and: where υ p1q n and υ p2q n are given in the statement of Lemma 10. On the basis of Assumption pHrq, we know that: lim sup nÑ`8 1 2γ n`1´1 r n`1´1 rn¯ă 1 so that: Thus, pυ p1q n qně1 and pυ p2q n qně1 converge to 0 as n Ñ`8. We can now repeat the arguments used in the situation r8 ą 0 and we obtain: where r N pn, tq " inftk ě n, r γn`1`. . .`r γ k ě tu. We can still combine (12) and piiiq to obtain sup tďT |z pnq ptqź pnq p0q| nÑ`8 ÝÝÝÝÑ 0 for any T ą 0. We conclude that pz pnq qně0 is relatively compact and that its limits are stationary solutions of 9 z "hpzq. The end of the proof is exactly the same as in the case r8 ą 0.˛1 In fact, due to the asymptotic stationarity, the limiting dynamics is not intrinsic.

Proof
First, the fact that r Xn`1 " r Xn´γn`1 r Yn is obvious. Second, The lemma follows.3 .

Convergence to a local minimum
To motivate the next theoretical result, we address the result of Corollary 9. We have shown the almost sure convergence of (3) towards a point of the form px8, 0q in both exponential and polynomial cases where x8 is a critical point of f . This result is obtained under very weak assumptions on f and on the noise p∆Mn`1qně1 and is rather close to Theorems 3-4 of [YLL16] (obtained within a different framework). Unfortunately, it this only provides a very partial answer to the problem of minimizing f because nothing is said about the stability of the limit of the sequence pXnqně0 by Corollary 9: the attained critical point may be a local maximum, a saddle point or a local minimum. This result is made more precise below and we establish some sufficient guarantees for the a.s. convergence of pXnq towards a minimum of f , even if f possesses some local traps with the additional assumption pHE q. This proof follows the approach described in [BD96] and [Ben06] but requires some careful adaptations because of the hypo-elliptic noise of the algorithm (there is no noise on the x-component) for both the exponentially and polynomially-weighted memory. Moreover, the linearization of the inhomogeneous drift around a critical point of f in the polynomial memory case is a supplementary difficulty we need to bypass.
Note that some recent works on stochastic algorithms (see, e.g., [LSJR16]) deal with the convergence to minimizers of f of deterministic gradient descent with a randomized initialization. In our case, we will obtain a rather different result because of the randomization of the algorithm at each iteration. Note, however that the main ingredient of the proofs below will be the stable manifold theorem (the Poincaré Lemma on stable/unstable hyperbolic points of [Poi86]) and its consequence around hyperbolic points. This geometrical result is also used in [LSJR16].
3.2.1 Exponential memory r n " r ą 0 The exponential memory case may be (almost) seen as an application of Theorem 1 of [BD96]. More precisely, if Zn " pXn, Ynq and hpx, yq " p´y, r∇f pxq´ryq, then the underlying stochastic algorithm may be written as: When rn " r ą 0 (exponential memory), Corollary 9 applies and Zn a.s.
For the analysis of the dynamics around a critical point of the drift, the critical poinf of f is denoted x0 and we can linearize the drift around px0, 0q P R dˆRd as: where I d is the dˆd identity-squared matrix and D 2 pf qpx0q is the Hessian matrix of f at point x0. When x0 is not a local minimum of f , the spectral decomposition of D 2 pf qpx0q leads to the spectral decomposition: where Λ is a diagonal matrix with at least one negative eigenvalue λ ă 0. Considering now q Zn " p q Xn, q Ynq where q Xn " P Xn and q Yn " P Yn, we have: where q h may be linearized as: In particular, if e λ is an eigenvector associated with the eigenvalue λ ă 0 of D 2 f px0q, we can see that the linearization ofh on the space Spanpe λ q b p1, 0, . . . , 0q acts as: Its spectrum is SppA λ,r q "´r 2˘b r 2 4´r λ. The important fact is that when λ ă 0, the eigenvalue´r 2`b r 2 4´r λ is positive and whose corresponding eigenspace is Eλ "´1, 1 2´b 1 4´λ {r¯. In the initial space R dˆRd (without applying the change of basis through P b P ), the corresponding eigenvector is: Consequently, when x0 is not a local minimum of f , it generates a hyperbolic equilibrium of h and we can apply the "general" local trap Theorem 1 of [BD96]. If Π Eλ denotes the projection on the eigenspace Spanpeλ q, then the noise in the direction Eλ is: We can then apply Theorem 1 of [BD96] and conclude the following result.
Theorem 11 If pHσ,2q , pHsq and pHE q hold and rn " r, then Xn a.s. converges towards a local minimum of f .

Polynomial memory
We introduce a key normalization of the speed coordinate and define the rescaled process: Hence, the couple p r Xn, r Ynq evolves as an almost standard stochastic algorithm, whose step size is r where qn`1 " a Γn`1{Γn " 1`opn´1q as n ÝÑ`8 and pUn`1qně1 is defined by: This dynamical system is related to the deterministic one # 9 xt "´yt 9 yt " r∇f pxtq or equivalently: It is easy to see that when x8 is a local maximum of f , then the above drift is unstable near z8 " px8, 0q. Unfortunately, Theorem 1 of [BD96] cannot be applied because of the size of the remainder terms involved in (13) and the a.s. convergence of pXn, Ynqně0 requires further investigation. From [Ben06], we borrow a tractable construction of a "Lyapunov" function η in the neighborhood of each hyperbolic point, which translates a mean repelling effect of the unstable points. This construction still relies on the Poincaré Lemma (see [Poi86] and [Har82] for a recent reference). Again, in the neighborhood of any hyperbolic point, we will treat the projection Π`as a projection on the unstable manifold.
Proposition 12 ( [Ben06]) For any local maximum point x8 of f , a compact neighborhood N of z8 " px8, 0q and a positive function η P C 2 pR dˆRd , R˚q exist such that: piq @z " px, yq P N , Dηpzq : R dˆRd ÝÑ R dˆRd is Lipschitz, convex and positively homogeneous.
piiq Two constants k ą 0 and c1 ą 0 and a neighborhood U of p0, 0q exist such that: and if t u`denotes the positive part: piiiq A positive constant κ exists such that: When d " 1, it is possible to check that if λ is a negative eigenvalue of the Hessian of f around a local maximum x8, then the drift may be linearized in p´y, λpx´x8qq and a reasonable approximation of η is given by ηpx, yq " 1 2 }y´?´λx} 2 . Nevertheless, the situation is more involved in higher dimensions and the construction of the function η relies on the stable manifold theorem. We are now able to state the next important result.
Theorem 13 Assume that the noise satisfies pH Gauss,σ q and pHE q, that the function satisfies pHsq, and that γn " γn´β with β ă 1{3, then pXnqně0 a.s. converges towards a local minimum of f .
The proof relies on an argument of [Pem90,Ben06] even though it requires major modifications to deal with the time inhomogeneity of the process and the unbounded noise, which are assumed in these previous works. We denote N as any neighborhood of z8 and consider any integer n0 P N. We then introduce r Zn " p r Xn, r Ynq and the stopping time: T We will show that PpT ă`8q " 1, which implies the conclusion. We introduce two sequences pΩnqněn 0 and pSnqněn 0 : Note that the construction of η implies that z Þ ÝÑ Dηpzq is Lipschitz, so that the following inequality holds: This inequality provides some information when u is small. In the meantime, η is positive so that: The family of inequalities described in (16) will be used with an appropriate value of α in the next result.
piiiq Assume that β ă 1 3 , then pS 2 n qně0 has a submartingale increment: for a small enough constant a.

Proof:
Proof of piq. When n ě T , we have Ωn`1 "γn`1 by definition and the conclusion follows. In the other situation when n ď T , we use the Lipschitz continuity of η: if m " sup zPN }Dηpzq}, then Equation (13) yields: The neighborhood N being compact, we deduce from the previous inequality that a constant C ą 0 exists such that: where we used a uniform upper bound on Er}∆Mn`1} 2 1năT |Fns, leading to the proof of piq.1 Proof of piiq. Note that 1năT and 1něT are Fn measurable and we have: 1něT E rΩn`1|Fns " 1něTγn`1 ě 0.
On the complementary set, we also have: Hence, we can use the lower bound given by (16): for any value of α P p0, 1s: where we used the triangle inequality in the last line to derive an upper bound of } r Zn`1´r Zn}. When n ă T , r Zn is bounded and we have Er}∆Mn`1} 2 |Fns ď σ 2 M for a large enough M . Hence, the Hölder inequality implies that: Therefore, we can find a large enough constant C1 ą 0 such that: The lower bound piiiq of Proposition 12 and the definition of Un`1 implies that a constant C2 exists such that: , which corresponds to the choice: , we then deduce that if n ă T , then Sn " ηp r Znq so that: 1S n ě n E rΩn`1|Fns ě 0, which concludes the proof. In particular, n must be chosen on the orderγ α n`1 (or on the order Γ´1 {2 n " n´p 1´βq{2 ).P roof of piiiq. Observe that S 2 n`1´S 2 n " Ω 2 n`1`2 SnΩn`1. Now, if Sn ě n, then we have seen in the proof of piiq that: 1S ně n ErS 2 n`1´S 2 n |Fns " 1S n ě n ErΩ 2 n`1 |Fns`2Sn1S n ě n ErΩn`1|Fns ě 1S n ě n ErΩ 2 n`1 |Fns. In the other situation, we have Sn ď n, meaning that n ă T and we have seen in the proof of piiq that: 1năT ErΩn`1|Fns ě 1năT "γ n`1κηp r Znq`γn`1xDηp r Znq, Un`1y Consequently, because of the positivity of η, we deduce that: 1năT ErΩn`1|Fns ě´}Dηp r Znq}ˆOpγn`1Γ´1 {2 n q´Opγ 2 n`1 q.
We know that Dη is locally bounded on N , we then obtain: 1S nď n ErΩn`1|Fns " 1S n ď n 1năT ErΩn`1|Fns " 1 ηp r Znqď n 1năT ErΩn`1|Fns for a large enough constant C. In the two situations, we then have: Finally, Lemma 9.7 of [Ben06] and our hypoelliptic assumption pHE q implies that for small enough c: The conclusion follows if nγn`1Γ´1 {2 n " o`γ 2 n`1˘. Since n is chosen on the order Γ´1 {2 n "γ α n`1 with α " p1´βq{p1`βq, this condition is equivalent to:γ 1`2α meaning that α ą 1{2. It then implies that β should be less than 1{3.˛W e use now the key estimations derived from Proposition 14 to obtain the proof of the main result of this Section. Proof of Theorem 13: The proof is split into three parts. We consider: In our case, we have chosen β P p0, 1{3q and we can check that: γn " n´p 1`βq{2 so that δn " n´β.
We consider the sequence n defined in Proposition 14: In this case, we have: The proof now proceeds by considering the sequential crossings Sn ď c ? δn and Sn ě c ? δn for a suitable value of c.
Step 1: Sn becomes greater than ? bδn with a positive probability. For a given constant b and a positive n P N, we introduce the stopping time: and we show that an ą 0 exists such that P pT ă 8q ě 1´ . For a given by piiiq of Proposition 14, we consider: pM k q kěn is a submartingale, so that pM k^T q kěn is also a stopped submartingale. This yields: In the meantime, we can decompose S 2 m^T´S 2 n into: Since pδ k q kěn is decreasing, we then have δm^T´1 ď δn. We then study the remaining term. We can use Equation (13) and the Lipschitz continuity of η over the neighborhood N (before time T ) to obtain a large enough C such that: However, nothing more is known about the stopped process }∆Mm^T } 2 and we are forced to use: Given that all ∆M k are independent sub-Gaussian random variables that satisfy Inequality (56), we can use Theorem 30 and obtain that a constant C large enough exists such that for any ą 0: We can plug the estimate (19) into Inequality (18) to obtain: Letting m ÝÑ`8, we deduce that: Cγ 2 n logpγ´2 n q aδn .
According to the calibration (17), we haveγ 2 n logpγ´2 n q " opδnq. Consequently, we can choose n large enough such that: tep 2: The sequence pS k q kěn may remain larger than a b{2δn with a positive probability. We introduce the stopping time S and the event En P Fn: ) .
Since the sequence pδiqiěn is non-increasing, piiq of Proposition 14 yields: Hence, pSi^S qiěn is a submartingale and the Doob decomposition reads Si^S " Mi`Ii where pMiqiěn is a Martingale and pIiq is a predictable increasing process such that In " 0. Hence, δn | Fn˙1E n .
The rest of the proof follows a standard martingale argument: where we used the upper bound given by piq of Proposition 14 in the last line. Now, the Doob inequality implies that: E`pMm´Mnq 2 |Fn˘`t 2 ps`tq 2 " cδn`t 2 ps`tq 2 .
We apply this inequality with s " δn and use ps`tq 2 ď p1`ϑqs 2`p 1`ϑ´1qt 2 for any ϑ ą 0. It leads to: We now choose ϑ " 4c{b , t " ? δn and deduce that: Consequently, we deduce that: PpS " 8|Fnq1E n ě P |Fnˆ@ i ě n : Mi ě tep 3: pSnqně0 does not converge to 0 with probability 1. We denote G as the event that pSnqně0 does not converge to 0. For any integer n, we have the inclusion: which implies: Since 1G P F8, we have limnÝÑ`8 Er1G|Fns " 1G. The previous lower bound implies that G almost surely holds.C onclusion of the proof: The stochastic algorithm does not converge to a local trap. Consider N a neighborhood of a local maximum of f , and its associated function η given by Proposition 12. We then consider the random variables pΩnqně0 and pSnqně0. We have seen that Sn does not converge to 0 with probability 1. We define: and assume that TN "`8. In that case, we always have: Ωn`1 " ηp r Zn`1q´ηp r Znq and Sn " ηp r Znq.
The limit set of p r Znqně0 is a non empty compact subset of N , which is left invariant by the flow pΦtqtě0 of the O.D.E. whose drift is F . Now, consider z in p r Znqně0 and apply piiiq of Proposition 12. We then have ηpΦtpzqq ě e κt ηpyq. Since ηpΦtpzqq ď sup N η, we therefore deduce that ηpzq " 0. Hence, the unique limiting value for pSnqně0 is zero, meaning that Sn ÝÑ 0 as n ÝÑ`8. However, we have seen in Step 3 that Sn does not converge to 0 with probability 1. Therefore, PpTN "`8q " 0 and the process does not converge towards a local maximum of f with probability 1.4

Convergence rates for strongly convex functions
This section focuses on the convergence rates of algorithm (3) according to the step-size γn " γn´β for λ-strongly convex function f with a L-Lipschitz gradient, corresponding to the assumptions pHSC pλqq and pHsq.

Quadratic case
We first study the benchmark case of a purely quadratic function f , meaning that ∇f is linear. In this case, f pxq " 1 2 }Ax} 2 and ∇f pxq " Sx, leading to the following form of the algorithm: # Xn`1 " Xn´γn`1Yn Yn`1 " Yn`γn`1rnpSXn´Ynq`γn`1rn∆Mn`1, where S is a dˆd squared matrix defined by S " A 1 A. The matrix S is assumed to be positive definite with lower bounded eigenvalues, e.g., SppSq Ă rλ,`8r when f is pHSC pλqq with λ ą 0.

Reduction to a two dimensional system
Equation (20) may be parameterized in a simpler form using the spectral decomposition of S " P´1ΛP , where P is orthogonal, and Λ is a diagonal matrix: @pi, jq P t1 . . . du 2 Λi,j " λiδi,j ě λ ą 0.
Keeping the notation p q Xn, q Ynqně1 for the change of basis induced by P , we define q Xn " P Xn and q Yn " P Yn and obtain: # q Xn`1 " q Xn´γn`1 r Yn q Yn`1 " q Yn`γn`1rnpΛ q Xn´q Ynq`γn`1rnP ∆Mn`1, Since Λ is diagonal, we are now led to study the evolution of d couples of stochastic algorithms: where we used the notations q Xn " pq x piq n q 1ďiďd and q Yn " pq y piq n q 1ďiďd . Consequently, in the quadratic case, the stochastic HBF may be reduced to d couples of 2-dimensional random dynamical systems: where q Z piq n :" pq x piq n , q y piq n q and C piq n "ˆ0´1 λ piq rn´rn˙a nd Σ2 "ˆ0 0 0 1˙, λ piq " Λi,i ě λ ą 0 and p∆N piq n qně1 is a sequence of martingale increments. It is worth noting that due to the multiplication by the matrix P , the martingale increment ∆N piq n`1 potentially depends on the whole coordinate p q Z pjq n q 1ďjďd . In a completely general case, this involves technicalities mainly due to the fact that the system (21) is not completely autonomous (in general, the components q Z piq n and q Z pjq n do not evolve independently). To overcome this difficulty, the idea is to obtain some general controls for a system solution to (21) and to then bring the controls of each coordinate together. For the sake of simplicity, we propose in the sequel to state the results in the general case but to only make the proof for (21) with the assumption that: From now on, we will omit the indexation by j to alleviate the notations. An easy computation shows that the characteristic polynomial of Cn is given by: .
We now consider the two different cases: • For all n ě 1, Cn has two real or complex eigenvalues whose values do not change from n to n, which corresponds to rn " r. This case necessarily corresponds to an exponentially-weighted memory and rn is thus kept fixed constant: rn " r ě 4λ or rn " r ă 4λ.
• For a large enough n, Cn has two complex conjugate and vanishing eigenvalues. This situation may occur if we use a polynomially-weighted memory because, in that case, rn ÝÑ 0 as n ÝÑ`8.

Exponential memory r n " r
We first study the situation when rn " r, which is easier to deal with from a technical point of view.
piiq If β " 1, then a constant c r,λ,γ exists such that: Proof: According to Subsection 4.1.1, we only make the proof for a system solution to (21) with the assumption that (22) holds. We begin with the simplest case where r ě 4λ. The above computations show that: SppCnq " # µ`"´r`a pr´4λqr 2 ; µ´"´r´a pr´4λqr 2 while the associated eigenvectors are given by e`"ˆ1 µ`˙a nd e´"ˆ1 µ´˙a nd are kept fixed throughout the iterations of the algorithm. Consequently, (21) may be rewritten in an even simpler way: where q Zn " QZn (pZnq being defined by (21) ) where Q is an invertible matrix such that Cn " Q´1ˆµ`0 0 µ´˙Q and q ξn`1 " QΣ2∆Nn`1. The squared norm of p q Znqně1 is now controlled using a standard martingale argument and Assumption pHσ,2q: } q Zn`1} 2 |Fn ı ď rp1`µ`γn`1q 2`C γ 2 n`1 s} q Zn} 2`C γ 2 n`1 , so that by setting un " Er} q Zn} 2 s, this yields: un`1 ď p1`2µ`γn`1`C1γ 2 n`1 q`C2γ 2 n`1 .
The result then follows from Propositions 28 piiiq and 29 piiiq (see Appendix A). We now study the situation r ă 4λ. In this case, Cn possesses two conjugate complex eigenvalues: Once again, we use the notation p q Znqně1 defined as q Zn " QZn with Q an invertible (complex) matrix such that Sn " Q´1ˆµ`0 0 µ´˙Q and q ξn`1 " QΣ2∆Nn`1. The squared norm of p q Znqně1 may be controlled while paying attention to the modulus of complex numbers, and we obtain an inequality similar to (25). Once again, we can apply piiiq of Propositions 28piiiq and 29piiiq to obtain the desired conclusion.R emark 16 In the above proposition, the constants c r,λ,γ are not made explicit. However, it is possible to obtain an estimation if we assume that Er∆Mn`1| 2 s ď σ 2 and r ě 4λ. In this particular case, with the notations of (25), we have: un`1 ď p1´αrγnq un`r 2 σ 2 }Qr} 2 γ 2 n`1 , where un " E} q Zn} 2 . The Propositions 28 piiiq and 29 piiiq now imply that: which, in the end, provide an explicit upper bound of E}Zn} 2 since Zn " Q´1 r q Zn. A more important issue concerns the rate obtained when β " 1 and we can remark in the statement of Proposition 15 that this rate depends on the size of γ and of αr. In particular, the best rate (of order Opn´1q) is obtained when γαr ą 1, meaning that αr must be as large as possible to optimize the performance of the algorithm and we therefore obtain a non-adaptive rate. It is easy to see that r Þ ÝÑ αr increases on r0, 4λs and decreases on r4λ,`8q. It attains its maximal value (maxr αr " 4λ) when r " 4λ. This maximal value is twice the size of the eigenvalue of the (standard) stochastic gradient descent (SGD). Finally, limrÝÑ`8 αr " 2λ. This limiting value 2λ corresponds to the size of the eigenvalue of the SGD. In other words, the limit r "`8 in HBF may be seen as an almost identical situation to SGD.
If we compare the rate of convergence of HBF to the one of SGD using the same step size γn " γn´1, we see that choosing a reasonably large r makes it possible to obtain a less stringent condition on γ to recover the (optimal) rate Opn´1q. In particular, the rate of the HBF is better when r ě 2λ than the one attained by the SGD. Unfortunately, it seems impossible to obtain an adaptive procedure on the choice of pγ, rq that guarantees the rate Opn´1q, unlike the Polyak-Ruppert averaging procedure. 4.1.3 Polynomial memory r n " rΓ´1 n ÝÑ 0 This case is more intricate because of the variations with n of the eigenvectors of the matrix Cn defined in (21).
piiq If β " 1, a constant C exists such that: @n ě 1 E}Xn} 2 ď C log n and @n ě 1 E}Yn} 2 ď C n log n Remark 18 We can observe that when β ă 1, the rates of the exponential case are preserved under a constraint on r which becomes harder and harder when β is close to 1: r needs to be greater than 1`β 2p1´βq . Carefully following the proof of this result, we could in fact show that when 1{2 ă r ă 1`β 2p1´βq , then E}Xn} 2 ď Cn´p r´1 2 qp1´βq . Since pr´1 2 qp1´βq ÝÑ 0 as β ÝÑ 1, our upper bound in plog nq´1 related to the case β " 1 becomes reasonable. Another possible interpretation of the poor convergence rate in that case is that the size of the negative real part of the eigenvalues of Cn is on the order 1 n log n , which leads to a contraction of the bias equivalent to O´e´c Regardless of c, we cannot obtain a polynomial rate of convergence in that case since ř n 1 1 k log k " log log n.

Proof
Proof of piq: We study the case β ă 1 here. According to the arguments used in the proof of Proposition 15 and Subsection 4.1.1, the dynamical system may be reduced to d couples of systems in the form px piq n , y piq n qně1 so that we only make the proof for a system solution to (21) under assumption (22). Another key feature of the polynomial case has been observed in the proof of the a.s. convergence of the algorithm (Theorem 13): the study of the rate in the polynomial case involves a normalization of the algorithm with a ? rn-scaling of the Y coordinate. Therefore, we set Zn " pXn,Ỹnq withXn " Xn andỸn " Yn{ ? rn. With these notations, we obtain (similar to Lemma 10): Zn`1 " pI2`γn`1CnqZn`γn`1 c rn rn`1 Σ2∆Nn`1, withγn`1 " γn`1 ? rn and:C n "˜0´1 λ b rn r n`1 ρnw ith ρn :" 1 γn`1ˆc rn rn`1´1˙´r n ? rn`1 .
Since rn " rΓ´1 n , the following expansion holds: In particular, for a large enough n, ρn ă 0 if and only if r ą 1{2. Furthermore, an integer n0 P N exists such that for any n ě n0,Cn has complex eigenvalues given by: We define the diagonal matrix: Λn :"˜µ pnq 0 0 µ pnq2 and let Qn be the matrix that satisfies Q´1 n ΛnQn "Cn. We have: We can now introduce the change of basis brought by Qn and the new coordinates q Zn :" Qn r Zn. We have: q Zn`1 " Qn`1pI2`γn`1CnqQ´1 n q Zn`γn`1 c rn rn`1 Qn`1Σ2∆Nn`1 " Qn`1Q´1 n pI2`γn`1Λnq q Zn`γn`1 c rn rn`1 Qn`1Σ2∆Nn`1.
From the above, we obtain, for any z P R 2 , which after several computations yields: Note that a universal constant C (independent of n) exists such that }Qn`1}8 ď C and the upper bounds above can be used into (28) to deduce that: where p∆ | Mnqně1 is a sequence of martingale increments and b a large enough constant. When γn " γn´β with β ă 1, the fact that Γn " n 1´β 1´β`O p1q combined with the upper bound of the variance of the martingale (22) imply that: where α :" pr´1 2 qp1´βq. Under the condition r ą 1`β 2p1´βq , we observe that: α ą β.
An induction based on Inequality (30) yields: Cn´β where in the second line, we repeated an argument used in the proof of Propositions 29 and made use of the property α ą β. To conclude the proof, it remains to observe that }Q´1 n`1 }8 ď C regardless of n.p iiq When β " 1, Inequality 29 leads to: Er} q Zn`1} 2 s ďˆ1´α n log n`b n 2 log n˙E r} q Zn} 2 s`C n 2 log n and a procedure similar to the one used above (given that ř n k"1 pk log kq´1 " logplog nq) leads to the desired result.˛2

The non-quadratic case under exponential memory
The objective of this subsection is to extend the results of the quadratic case to strongly convex functions satisfying pHSC pαqq for a given positive α. As pointed out in Remark 5, we are not able to obtain neat and somewhat intrinsic results in the polynomial memory case, so we therefore preferred to only consider the exponential memory one. With the help of Subsection 4.1.1, we can restrain the study to the situation where d " 1 and f has a unique minimum in x ‹ and we denote λ " f 2 px ‹ q, which is assumed to be positive. We also assume that f 2 " inf xPR f 2 pxq ą 0. It is worth noting that in this setting, we are able to obtain some non-asymptotic bounds with some assumptions on λ only. This means that our results do not involve the quantity f 2 . To only involve the value of the second derivative in x ‹ , the main argument is a power increase stated in the next lemma.
Lemma 19 Let pu pkq n q ně0,kě1 be a sequence of non-negative numbers satisfying for every integers n ě 0 and k ě 1, u pkq n`1 ď p1´a k γn`1`b k γ 2 n`1 qu pkq n`Ck pγ 2 n`1`γn`1 u pk`1q where pa k q kě1 and pb k q kě1 are sequences of positive numbers. Furthermore, assume that K ě 2 exists and a constant C ą 0 exists such that: @n ě 1, u pKq n ď Cγn.

Proof
Let K ě 2. We proceed by a decreasing induction on k P t1, . . . , Ku. The initialization is given by (32). Then, let k P t1, . . . , K´1u and assume that u pk`1q n ď C k`1 γn (where C k is a positive constant that does not depend on n). We can use this upper bound in the second term of the right hand side of (31) and obtain: u pkq n`1 ď p1´aγn`1`bγ 2 n`1 qu pkq n`C γ 2 n`1 where C is a constant that does not depend on n.
When β ă 1, it follows from Proposition 28piiiq that: @n ě 1, u pkq n À γn.Į f β " 1 and aγ ą 1 now, the above control is a consequence of Proposition 29piiiq. This concludes the proof.˛W e will apply this lemma to u pkq n " Er| q Zn| 2k s where q Zn is an appropriate linear transformation of Zn Therefore, we will mainly have to check that Conditions (31) and (32) hold.
Proposition 20 Assume pHsq, pHSC pαqq and pHσ,8q with p ě 1. Let a and b be some positive numbers such that (8) holds. Then, an integer K ě 1 exists such that for any p ě K: Furthermore, if rn " r and γn " γn´β with β P p0, 1q, then (34) holds for p " K " 1 under pHσ,2q instead of pHσ,8q. As a consequence, Remark 21 Note that the second assertion (35) easily follows from Equations (9) and (34) and from the fact that under pHSC pαqq, a constant c exists such that for all x, f pxq ě c}x} 2 . Moreover, note that this proposition is not restricted to the exponential memory case. In particular, as suggested in Remark 5, this Lyapunov approach could lead to some (rough) controls of the quadratic error in the polynomial case when the function is not quadratic.

Proof
We begin by the first assertion under Assumption pHσ,8q. Going back to the proof of Lemma 8 (and to the associated notations), we obtain the existence of some positive a and b such that Vn`1pXn`1, Yn`1q ď VnpXn, Ynq`γn`1∆n`1 with ∆n`1 "´c a,b }Yn} 2´r nb}∇f pXnq} 2´b rnx∇f pXnq, ∆Mn`1y`∆Rn`1 pc a,b ą 0q.
Following the arguments of the proof of Lemma 8 once again, we can easily deduce the existence of some positive ε and C such that: Er∆n`1|Fns ď p´ε`Cγn`1qrnVnpXn, Ynq`Cγn`1rn. Using pHσ,8q, we also obtain for every r ě 1: Er}∆n`1} r |Fns ď Crp1`V r n pXn, Ynqq.
We have: k¯.

Proof
The starting point is to linearize the gradient: Since f 2 is Lipschitz continuous, then: Let us begin with the case where the matrix Cn defined in (21) has real eigenvalues µ`and µ´(given by (23)). With the notations introduced in (24), q Zn`1 "ˆ1`γ n`1µ`0 0 1`γn`1µ´˙q Zn`rγn`1Qˆ0 φn˙`r γn`1 q ξn`1.
At this stage, we observe that Assumption (31) is satisfied with u pkq n " Er} q Zn} 2k s and a k " kp2µ``εq. Using Proposition 20 and Lemma 8piq, we check that the second assumption of Lemma 19 also holds. Thus, the result follows in this case from this lemma.5

Limit of the rescaled algorithm
In this paragraph, we establish a (functional) Central Limit Theorem when the memory is exponential, i.e., when rn " r and when pHSC pαqq holds. In particular, f admits a unique minimum x ‹ . Without loss of generality, we assume that x ‹ " 0.

Rescaling stochastic HBF
We start with an appropriate rescaling by a factor ? γn. More precisely, we define a sequence pZnqně1: Given that f is C 2 (and that x ‹ " 0), we "linearize" ∇f around 0 with a Taylor formula and obtain that ξn P r0, Xns exists such that: ∇f pXnq " D 2 f pξnqXn.
We used the standard notations t n " ΓÑ pn,tq´Γ n above where N pn, tq " min # m ě n, m ř k"n`1 To obtain a CLT, we show that pZ pnq qně1 converges in distribution to a stationary diffusion, following a classical roadmap based on a tightness result and on an identification of the limit as a solution to a martingale problem.

Tightness
The next lemma holds for any sequence of processes that satisfy (41). Lemma 23 Assume that D 2 f is bounded, that sup kě1 Er}Z k } 2 s ă`8 and that a p ą 2 exists such that sup kě1 Er}∆M k } p s ằ 8, then pZ pnq qně1 is tight for the weak topology induced by the weak convergence on compact intervals.

Proof
First, note thatZ pnq 0 "Zn, the assumption sup kě1 Er}Z k } 2 s ă`8 implies the tightness of pZ pnq 0 qně1 (on R 2d ). Then, by a classical criterion (see, e.g., [Bil95,Theorem 8.3]), we deduce that a sufficient condition for the tightness of pZ pnq qně1 (for the weak topology induced by the uniform convergence on compacts intervals) is the following property: for any T ą 0, for any positive ε and η, a δ ą 0 exist and an integer n0 such that for any t P r0, T s and n ě n0, We consider B pnq and M pnq separately and begin by the drift term B pnq . On the one hand, The Chebyschev inequality and the fact that }b k pzq} ď Cp1`}z}q (where C does not depend on k) yield: The Jensen inequality and the fact that ř N pn,t`δq`1 k"N pn,tq γ k ď 2δ when n is large enough imply that a constant C exists such that for large enough n and for a small enough δ: Under the assumptions of the lemma, Err}∆M k } p s ď C. Furthermore, we can use the rough upper bound: n N pn,t`δq`1 ÿ k"N pn,tq`1 γ k ď ηδ for large enough n. This concludes the proof.˛C orollary 24 Let the assumptions of Theorem 6 hold, then pZ pnq qně1 is tight.

Proof
To prove this result, it is enough to check that the assumptions of Lemma 23 are satisfied. First, one remarks that the assumptions of Theorem 6 imply the ones of Theorem 4paq so that Er}Zn´z ‹ } 2 s ď Cγn (this also holds when β " 1 since we assume that γαr ą 1). As a consequence, sup kě1 Er}Z k } 2 s ă`8. On the other hand, since pHσ,pq holds for a given p ą 2, we can derive by following the lines of the proof of Proposition 20 that sup ně1 ErV p pXn, Ynqs ă`8. As a consequence, sup n Erf p pXnqs ă`8 and pHσ,pq leads to: Er}∆Mn} p s À sup n Erf p pXnqs ă`8.2

Identification of the limit
Starting from our compactness result above, we now characterize the potential weak limits of pZ pnq qně1. This step is strongly based on the following lemma.
Lemma 25 Suppose that the assumptions of Lemma 23 hold and that: where σ 2 is a positive symmetric dˆd-matrix. Then, for every C 2 -function g : R 2d Ñ R, compactly supported with Lipschitz continuous second derivatives, we have: EpgpZn`1q´gpZnq|Fnq " γn`1LgpZnq`R g n where γ´1 n`1 R g n Ñ 0 in L 1 and L is the infinitesimal generator defined in Theorem 6.

Remark 26
We recall that L is the infinitesimal generator of the following stochastic differential equation: dZt "HZtdt`ΣdBt where:H " 1 2γ 1 tβ"1u I 2d`H and Σ is defined in Theorem 6. pZtqtě0 lies in the family of Ornstein-Uhlenbeck processes: on the one hand, the drift and diffusion coefficients being respectively linear and constant, pZtqtě0 is a Gaussian diffusion; on the other hand, sinceH has negative eigenvalues, pZtqtě0 is ergodic.

Proof
C will denote an absolute constant whose value may change from line to line, for the sake of convenience. We use a Taylor expansion betweenZn andZn`1 and obtain that θn exists in r0, 1s such that: We first deal with the remainder term R p1q n`1 and observe that pCnq introduced in (39) is uniformly bounded so that a constant C exists such that }bnpzq} ď C}z}. We thus conclude that: }Zn`1´Zn} ď C`γn`1}Zn}`?γn`1}∆Mn`1}˘.
Then, given that D 2 f is Lipschitz (and that x ‹ " 0), it follows that: where pεnqně1 is a deterministic sequence such that limnÑ`8 εn " 0. Under the conditions of Theorem 6, we may apply the convergence rates obtained in Theorem 4 and observe that sup n Er}Xn} 2 s À γn, meaning that sup n Er}Zn} 2 s ă`8. Since }Xn} ď }Zn}, we deduce that: n where γ´1 n`1 R p2q n Ñ 0 in L 1 as n Ñ`8. Let us now consider the second term of the right-hand side of (42). We have: ErpZn`1´Znq T D 2 gpZnqpZn`1´Znq|Fns " γn`1 ÿ i,j D 2 y i y j gpZnqEr∆M i n`1 ∆M j n`1 |Fns`R p3q n where |γ´1 n`1 R p3q n | ď Cγn`1}Zn} 2 nÑ`8 ÝÝÝÝÑ 0 in L 1 under the assumptions of the lemma. To conclude the proof, it remains to note that under the assumptions of the lemma for any i and j, pEr∆M i n`1 ∆M j n`1 |Fnsqně1 is a uniformly integrable sequence that satisfies: Er∆M i n`1 ∆M j n`1 |Fns " Vi,j in probability.
Thus, the convergence also holds in L 1 . The conclusion of the lemma easily follows from the boundedness of D 2 g.˛W e are now able to prove Theorem 6: Proof of Theorem 6, piq: Note that under the assumptions of Theorem 6, we can apply Lemma 23 and Lemma 25 and obtain that the sequence of processes pZ pnq qně1 is tight. The rest of the proof is then divided into two steps. In the first one, we prove that every weak limit of pZ pnq qně1 is a solution of the martingale problem pL, Cq where C denotes the class of C 2 -functions with compact support and Lipschitz-continuous second derivatives. Before going further, let us recall that, owing to the Lipschitz continuity of the coefficients, this martingale problem is well-posed, i.e., tha,t existence and uniqueness hold for the weak solution starting from a given initial distribution µ (see, e.g., [EK86] or [SV06]). In a second step, we prove the uniqueness of the invariant distribution related to the operator L and the convergence in distribution to this invariant measure. We end this proof by showing that pZ pnq q converges to this invariant distribution, so that the sequence pZ pnq qně1 converges to a stationary solution of the previously introduced martingale problem. We will characterize this invariant (Gaussian) distribution in the next paragraph.
Step 1: Let g belong to C and let pF pnq t qtě0 be the natural filtration ofZ pnq . To prove that any weak limit of pZ pnq qně1 solves the martingale problem pL, Cq, it is enough to show that: This inequality combined with the Lipschitz continuity of g and its derivatives implies that the first three terms tend to 0 when n Ñ`8. Now, concerning the last one, the previous lemma yields: tep 2: First, let us prove that uniqueness holds for the invariant distribution related to L. We denote it by µ pβq 8 below. In this simple setting where the coefficients are linear, we could use the fact that the process, which is solution to the martingale problem, is Gaussian so that any invariant distribution is so. Uniqueness could then be deduced through the characterization of the mean and the variance through the relationship ş Lf pxqµ pβq 8 pdxq " 0 (see next subsection for such an approach). However, at this stage, we prefer to use a more general strategy related to the hypoellipticity of L (see, e.g., [GP14] for a similar approach). More precisely, set LD :"´xy, Bxy`rxD 2 f px ‹ qx´ys, Byy and σi :" ř d j"1 σ j i By j , where σ satisfies σσ t " V (where V is defined by (6)). We have assumed that σ is invertible, so that: spanpσ1, . . . , σ d q " spanpBy 1 , . . . , By d q.
Therefore, Lie pLD, σ1, . . . , σ d q " Lie pLD, By 1 , . . . , By d q Now, it is straightforward to check that: @i P t1, . . . , du rLD, By i s pf q "´Bx i pf q, and we deduce that Lie pLD, σ1, . . . , σ d q " Lie pBx 1 , . . . , Bx d , By 1 , . . . , By d q. This means that the Hormandër bracket condition holds at any point z of R 2d , which implies that the process admits a density pptpz, .qqtě0 such that for any t ą 0, pz, z 1 q Þ Ñ ptpz, z 1 q, which is smooth on R 2dˆR2d . It is moreover possible to show that these densities are positive, for any t ą 0, given that the linear vector field is approximately controllable: for any time T ą 0, any η ą 0 and any couple of initial points px0, y0q and ending points pxT , yT q, we can build a function ϕ such that 9 ϕ P L 2 and such that the controlled trajectory: " 9 xptq "´yptq 9 yptq " rptqp∇U pxptqq´yptqq`σ 9 ϕ, satisfies: z0 " px0, y0q and }zT´pxT , yT q} ď η. This implies the irreducibility of the diffusion and, therefore, the uniqueness of the invariant distribution. We refer to [GP14] for more details on this controllability problem. Then, checking that L}x} 2 ď β´α}x} 2 for positive α and β, it can be classically deduced from the Meyn-Tweedietype arguments (see [MT93]) that the process converges locally uniformly, exponentially fast in total variation to µ pβq 8 . For more details, we refer to [MSH02,Theorem 4.4]. Below, we will only use the following corollary: for any bounded Lipschitz-continuous function f , for any compact set K of R 2d , where pPtqtě0 denotes the semi-group related to the (well-posed) martingale problem pL, Cq.S tep 3: Let pZn k q kě1 be a (weakly) convergent subsequence of pZnqně1 to a probability ν. We have to prove that ν " µ pβq 8 . To do this, we take advantage of the "shifted" construction of the sequence pZ pnq q nPN . More precisely, as a result of construction, for any positive T , a sequence pψpn k , T qq kě1 exists such that: N pT, ψpn k , T qq " n k .
In other words,Z pψpn k ,T qq T ψpn k ,T q "Zn k .
At the price of a potential extraction, pZ pψpn k ,T qq q kě1 is convergent to a continuous process, which is denoted by Z 8,T below. Given thatZ pnq T´Z pnq T n tends to 0 as n Ñ`8 in probability, it follows that Z 8,T T has distribution ν. However, according to Step 1, Z 8,T is also a solution to the martingale problem pL, Cq so that for any Lipschitz continuous function f , Step 2 that the right-hand member tends to 0 as T Ñ`8. From this, we can therefore conclude that for any bounded Lipschitz-continuous function f , a large enough T exists such that:ˇˇE rf pZ 8,T T qs´µ pβq 8 pf qˇˇď C f ε.
Since Erf pZ 8,T T qs " νpf q, it follows that νpf q " µ pβq 8 pf q. Finally, the set P is reduced to a single element P " tµ pβq 8 u, and the whole sequence pZnqně1 converges to µ pβq 8 . Before ending this section, let us note that µ pβq 8 is a Gaussian centered distribution is a simple consequence of Remark 26. We therefore leave this point to the reader.˛5

.4 Limit variance
We end this section on the analysis of the rescaled algorithm with some considerations on the invariant measure µ pβq 8 involved in Theorem 6 for the exponential memoried stochastic HBF, i.e. when rn " r. As shown in the above paragraph, this invariant measure describes the exact asymptotic variance of the initial algorithm. We now focus on its characterization i.e., on the proof of Theorem 6piiq. In particular, to ease the presentation, we assume that the covariance matrix V related to p∆Mn`1qně1 is proportional to the identity matrix: We also assume that γn " γn´β with β ă 1. Then, piq of Theorem 6 states that pZnqně1 weakly converges toward a diffusion process, whose generator L is the one of an Ornstein-Uhlenbeck process. Assumption (46) leads to a simpler expression: Lpφqpx, yq "´xy, ∇xφy`rxD 2 pf qpx ‹ qx´y, ∇yφy`r 2 σ0 2 2 ∆yφ.
A particular feature of Equation (47) when γn " γn´β is that L does not depend on β nor γ. The invariant measure µ pβq 8 is a multivariate Gaussian distribution that may be well described in the basis given by the eigenvectors of the Hessian D 2 pf qpx ‹ q. The reduction to d couples of two-dimensional system used in Section 4.1.1 makes it possible to use the spectral decomposition of D 2 pf qpx ‹ q " P´1ΛP where P is an orthonormal matrix and Λ a diagonal matrix with positive eigenvalues. The process p q Xn, q Ynq " pPXn, PȲnq is therefore centered and Gaussianly distributed asymptotically. This process is associated with d 2ˆ2 blockwise independent Ornstein-Uhlenbeck processes, whose generator is now Lpφqpx,yq "´xy, ∇xφy`rxΛx´y, ∇yφy`r 2 σ0 2 2 ∆yφ, where we used Tr`P t D 2 y P˘" Tr`D 2 y P P t˘" Tr`D 2 y˘i n the last line because P t P " I d . If we denoteμ pβq 8 the associated invariant gaussian measure, the tensor structure ofĽ leads to Finally, we chose φpx,yq "x piqypiq and obtainĽ´x piqypiq¯px ,yq "´!y piq ) 2`r λitx piq u 2´rxpiqypiq . Therefore, we deduce that: We can sum-up formulae (48)-(51) inμ pβq 8 " N p0, Dr,σ 0 q with Dr,σ 0 " σ 0 2 2ˆΛ´1 0 dˆd 0 dˆd rI d˙. Since pXn,Ȳnq " pP´1 q Xn, P´1 q Ynq, we deduce that: This situation is more involved since we can observe that the drift of the limit diffusion is modified according to the size of γ. In particular, the generator L in that case is shifted from the one above by 1 2γ I so that: Lpφqpx, yq " 1 2γ rx∇xφ, xy`x∇yφ, yys´xy, ∇xφy`rxD 2 f px ‹ qx´y, ∇yφy`r 2 σ0 2 2 ∆yφ.
Again, we can use the decomposition D 2 f px ‹ q " P´1ΛP where P is an orthonormal matrix, and the generator of the rotated process p q Xn, q Ynq " pPXn, PȲnq is: The associated Ornstein-Uhlenbeck process has a unique Gaussian invariant measureμ p1q 8 if and only if γαr ą 1 where αr is the constant defined in the statement of Proposition 15. The following equations still hold: To determine the rest of the covariance matrix, we follow the same strategy and only address the case d " 1 for the sake of convenience. We define: σ 2 x :" E px,yq"μ p1q 8 "x 2 ‰ , σ 2 y :" E px,yq"μ p1q 8 "y 2 ‰ and σx,y :" E px,yq"μ p1q 8 rxys.

Numerical experiments
In this short paragraph, we briefly investigate the behavior of several algorithms, widely used in the field of stochastic approximation. In particular, we are interested in the convergence rates of each algorithm, as well as their behavior in the case of non-convex potential f with multiple wells, to illustrate both Theorem 4 and Theorem 2.
Convergence rates We are first concerned by the typical behavior of the heavy ball stochastic approximation algorithm in the convex case. In particular, we are interested in the role played by the parameter r that varies, both in the polynomial case and in the exponential case. Figure 1 represents the logarithmic loss of the algorithms with respect to the logarithm of the number of iterations in the 1 dimensional case with f pxq " x 2 2 . The step size used is γ k " k´1. We immediately observe that for small values of r (that correspond to a long-term memory case), the algorithm possesses a lengthy oscillating behavior, which is a feature of second-order algorithms with a very mild damping effect. This phenomenon has also been observed in previous works (see, e.g., [FB15] and the references therein). We also observe that the use of an excessively large value of r (say, when r is greater than 10) creates a numerical instability at the beginning of the iterations. This could be fixed by using a supplementary truncating trick introduced in [Lem07]. Finally, the obtained rates are better (from a numerical point of view) when r is chosen at around 5 in the exponential case, and at around 10 in the polynomial case, mainly because of the oscillations that deteriorate the convergence when r is too small. Figure 2 then compares several stochastic optimization algorithms in the toy example f pxq " |x| p {p: the standard Robbins-Monro stochastic gradient descent introduced in [RM51] (SGD) and several second order algorithms: the "optimal" Ruppert-Polyak averaging algorithm (see [PJ92,Rup88]), the Nesterov accelerated gradient descent [Nes83] adapted in the stochastic framework in a straightforward way using an unbiased evaluation of the gradient in each iteration, and the recent SAGE method introduced in [HPK09]. Note that the Rupper-Polyak averaging algorithm is used according to the recommendation of [Bac14] with γ k " k´1 {2 . The first elementary remark is that the rate is of course deteriorated by the loss of strong convexity (left side, Figure 2). In this case, the Ruppert-Polyak averaging outperforms other methods and attains the Op1{ ? nq minimax rate (see [NY83]). When f is strongly convex, the second-order algorithms then all share an equivalent efficiency with, apparently a Op1{nq convergence rate. This corresponds to piiq of Theorem 4 when the Hessian at the critical point is sufficiently large to make this minimax optimal rate possible. Nevertheless, the ability of the stochastic heavy ball in a more general situation may deserve further numerical investigation, which is beyond the scope of this paper. The SGD seems to be a little bit less effective in the strongly convex case. Finally, the Nesterov adaptation to the stochastic case does not lead to an efficient algorithm (in comparison to the other methods tested). However, this remark should be balanced by the fact that we did not use the Lan adaptation of the Nesterov accelerated gradient descent introduced in [Lan12]. It appears that this modification that consists in an addition of an intermediary point in the NAGD seems important to optimize the behavior of the algorithm in the stochastic case.
Non-convex case In this paragraph, we investigate the ability of the stochastic algorithm to avoid local traps and, in particular, we focus on the behavior of second order algorithms that may be an intermediary step towards global optimization methods such as simulated annealing. For this purpose, we defined f as: with a " 1{40 and b "´1{5. These values have been fixed to guarantee the numerical stability of the stochastic procedures, but the results we obtained may be replicated for other values. The values of a and b above yield a double-well potential with a global minimizer of f of around x ‹ »´4.9, although f has a local trap on the positive part at around x`» 4. The function f is represented on the top left of Figure 3. We used γ k " k´1 for all of the methods and we varied the initialization point of each algorithm from´10 to 10 with 100 Monte-Carlo replications. For each simulation, we arbitrarily stopped the evolution of the algorithm after T " 10 4 iterations, and considered that optimization was successful when |xT´x ‹ | ď 1. This criterion may be replaced by a more stringent inequality, at the price of an increase of T , without really changing the main conclusions below.
Performances are reported in Figure 3. We observe that both SGD and Ruppert-Polyak algorithms have the same behavior. This fact is absolutely clear because Polyak averaging is built with a Cesaro average of SGD. The target convergence point of SGD and of Polyak averaging are thus the same. We can also note that in the almost no noise setting, the basin of attraction of x ‹ for SGD may be roughly approximated by s´8, 1s. Nevertheless, both SAGE and HBF seem to behave better behaviour with a somewhat larger basin of attraction: in particular, it is possible to start from an initialization point x1 " 8 and still obtain convergence of SAGE or HBF towards x ‹ . This last point is clearly impossible with SGD. The same conclusions hold for different values of σ (see Figure 3, bottom left and right). Finally, we observe that NAGD does not present very good behavior: the probability of failure when the algorithm is initialized at´4 is lower than 1 for σ " 1 or σ " 2.
We can calculate a more quantitative indicator of this behavior with the computation of the average rate of success of each algorithm when the initialization point is sampled uniformly over r´10; 10s. Table 1 seems to indicate that the stochastic heavy ball leads to a better exploration of the state space, in particular, with reasonable values of r (see Table 2). These conclusions should be understood as numerical observations of experimental results on this particular   • (ii) sup ω ś n p1`αnpωqq ă 8, ř n Epβnq ă 8.
• (iii) @n P N, EpVn`1|Fnq ď Vnp1`αn`1q`βn`1´Un`1 The function x Þ ÝÑ x´2 β e aγ 1´β x 1´β´bγ 2 1´2β x 1´2β being increasing for x ě c a,b,γ,β , we then obtain considering an integer t ą c a,b,γ,β : Proof: The upper bounds involved in piq and piiq are straightforward.P roof of piiiq: The situation is easier than the one involved in point piiq of Proposition 28 because in that case, we have: @n ě 1 Γ p2q n ď γ 2 π 2 {6. Therefore, we can repeat the computations above and get: We then deduce that: .2 Expectation of the supremum of the square of sub-Gaussian random variables We consider a sequence of independent random variables pξiqiěn of R d such that each coordinate satisfies a sub-Gaussian assumption pH Gauss,σ q: @λ P R @j P t1, . . . , du @i ě n log E " e λξ j i ı ď λ 2 σ 2 2 , where σ 2 is a variance factor. If pγ k q kěn is a decreasing sequence in 2 pNq, we are looking for an upper bound of: For any ν ą 0 and any decreasing sequence γn " γn´ν , we establish the following result (useful for Theorem 13).
Theorem 30 If each coordinate ξ j i is absolutely continuous w.r.t. the Lebesgue measure and satisfies pH Gauss,σ q, then: m ‹ n À σ 2 d γ 2 n logpγ´2 n q, where À refers to an inequality up to a universal constant.
We begin with a preliminary lemma.

Lemma 31
Assume that X is a real random variable that satisfies pH Gauss,σ q with median 0: P pX ą 0q " P pX ă 0q " 1 2 .
Then, we can find Y " N p0, σ 2 q on the same probability space and c large enough s.t.

Proof:
We use a coupling argument. We denote FX as the cumulative distribution function: Similarly, we also denote Ψ σ 2 as the cumulative distribution function of a Gaussian random variable N p0, σ 2 q: dx " PrN p0, σ 2 q ď ts.
Our assumption on the distribution on X shows that the generalized inverse of FX (denoted F´1 X ) exists and if U is a uniform random variable between on r0, 1s, then X " F´1 X pUq. We now consider the random variable Y " F´1 σ 2 pUq built with the same realization of U. Of course, Y is distributed according to a Gaussian random variable N p0, σ 2 q. We need to show that a sufficiently large c ą 0 exists such that |X| ď c|Y |, that is:ˇF´1 Using the fact that FX is an increasing function, and letting u " Ψ σ 2 pyq, it is then equivalent to show that: We now study two different situations for y. If y " 0, then Inequality (59) holds since the median of X is 0. If |y| ď η is close to 0, the same inequality is satisfied with a first-order Taylor expansion. For example, the right hand side reads: FX pc|y|q " 1 2`ż c|y| 0 fX puqdu ě 1 2`c fX p0q|y|`op|y|q, which is greater than Ψ σ 2 p|y|q for c large enough. Hence, we deduce that Inequality (59) holds around 0. Now, we assume that |y| ą η ą 0, the desired upper bound (59) is equivalent to: 1´FX pc|y|q ď 1´Ψ σ 2 p|y|q.
At the same time, the lower bound of the Gaussian tail is given by: with κpδq a constant independent of |y| ě δ. Hence, the right hand side of (59) holds for a large enough c (independent on σ 2 ). A symmetry argument permits to conclude for the left hand side of (59). Inequality (59) being equivalent to (58), the conclusion of the proof follows.W e are now looking at to the proof of Theorem 30.

Proof of Theorem 30:
We will shift all of the coordinates of the random variables pξiqiěn by their corresponding medians. Assuming pH Gauss,σ q, the coordinates pξ j i q 1ďjďd are centered and have a second-order moment upper bounded by σ 2 (see [Str94], for example): @i ě n @j P t1, . . . , du Ertξ i j u 2 s ď σ 2 . The Tchebychev inequality implies that each median m j i of the random variables ξ j i are bounded by: @i ě n @j P t1, . . . , du |m j i | ď ? 2σ.
We then consider the centered (w.r.t. their medians) random variables: We can use Lemma 31 and deduce that up to a multiplicative universal constant: where each pZ k q kěn are i.i.d. realizations of Gaussian random variables N p0, σ 2 I d q.
We now aim to apply a chaining argument to control the supremum of the empirical process above. To apply Lemma 32, we define Tn :" n;`8 and compute the Laplace transform of the chi-square-like random variables: log Ee λrγ 2 k }Z k } 2´γ2 j }Z j } 2 s " d 2 logˆ1´2 λγ 2 j 1´2λγ 2 kẆ e can check that up to a universal multiplicative constant, we have: @λ P R`@pa, bq P R`ˆR`: log 1´aλ 1´bλ À λ|a´b|`| a´b| 2 λ 2 1´λ|a´b| .
We are naturally driven to define the pseudo-metric on Tn by: @pi, jq P T 2 n dpi, jq "ˇˇγ 2 i´γ 2 jˇ.
It remains to upper bound the covering number of Tn according to d for any radius ą 0. Indeed, when 2γ 2 n ď , we have N p , Tnq " 1 although when ď 2γ 2 n , we use the rough bound: N p , Tnq ď inf j ě n : 2γ 2 j ď ( .
We apply Lemma 32 and obtain an upper bound for the right hand side of (61). The first term is proportionnal to γ 2 n . The other terms lead to the computation of the two integrals (up to some universal multiplicative constants): The change of variable " e´x and an integration by parts leads to an upper bound whose size is logpγ´2 n qγ 2 n .˝The next Lemma, borrowed from [BLM13] (see Lemma 13.1, Chapter 13), provides a key estimate for the expectation of the suppremum of an empirical process indexed by a pseudo metric space pT , dq. This estimate involves the covering numbers N pδ, T q associated with the set T and the pseudo-metric d.
Lemma 32 Let T be a separable metric space and pXtqtPT be a collection of random variables such that for some constants a, v, c ą 0, log Ee λrX i´Xj s ď aλdpi, jq`v λ 2 d 2 pi, jq 2p1´cλdpi, jq for all pi, jq P T 2 and all 0 ă λ ă tcdpi, jqu´1. Then, for any i0 P T :