Transport-entropy inequalities and deviation estimates for stochastic approximation schemes

We obtain new transport-entropy inequalities and, as a by-product, new deviation estimates for the laws of two kinds of discrete stochastic approximation schemes. The first concerns the law of an Euler-like discretization scheme of a diffusion process at a fixed deterministic date, and the second the law of a stochastic approximation algorithm at a given time step. Our results notably improve and complete those obtained in [Frikha, Menozzi, 2012]. The key point is to properly quantify the contribution of the diffusion term to the concentration regime. We also derive a general non-asymptotic deviation bound for the difference between a function of the trajectory of a continuous Euler scheme associated to a diffusion process and its mean. Finally, we obtain non-asymptotic bounds for stochastic approximation with averaging of trajectories; in particular, we prove that averaging a stochastic approximation algorithm with a slowly decreasing step sequence gives rise to the optimal concentration rate.


Introduction
In this work, we derive transport-entropy inequalities and, as a consequence, non-asymptotic deviation estimates for the laws at a given time step of two kinds of discrete-time, d-dimensional stochastic evolution schemes of the form

X_{n+1} = X_n + γ_{n+1} H(n, X_n, U_{n+1}), n ≥ 0, (1.1)

where the innovations (U_n)_{n≥1} satisfy a Gaussian concentration property (GC(β)). It is well known that (GC(β)) implies a Gaussian deviation bound for Lipschitz functionals. Examples of random variables satisfying this property include Gaussian random variables as well as bounded ones. Moreover, (GC(β)) is equivalent to the Gaussian tail property of U_1, namely the existence of ε > 0 such that E[exp(ε|U_1|²)] < +∞; see e.g. Bolley and Villani [BV05].
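Since different normalizations of the Gaussian concentration property coexist in the literature, the following is a sketch of one common convention, chosen so that the standard Gaussian law satisfies it with β = 2; the paper's exact constants may differ.

```latex
% (GC(\beta)) for a probability measure \mu on \mathbb{R}^q (one common
% normalization): for every 1-Lipschitz f and every \lambda \ge 0,
\mathbb{E}\!\left[e^{\lambda\,(f(U_1)-\mathbb{E}[f(U_1)])}\right]
  \le e^{\beta\lambda^{2}/4}, \qquad U_1 \sim \mu .
% By the Chernoff argument (optimizing at \lambda = 2r/\beta), this yields
% the deviation bound
\mathbb{P}\!\left(f(U_1)-\mathbb{E}[f(U_1)] \ge r\right)
  \le e^{-r^{2}/\beta}, \qquad r \ge 0 .
```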
We are interested in furthering the discussion, initiated in [FM12], of non-asymptotic deviation bounds for two specific problems related to evolution schemes of the form (1.1). The first one is the deviation between a function of an Euler-like discretization scheme of a diffusion process at a fixed deterministic date and its mean. The second one is the deviation between a stochastic approximation algorithm at a given time step and its target. Under some mild assumptions, in particular that the function u → H(n, x, u) is Lipschitz uniformly in space and time, it is proved in [FM12] that both recursive schemes share the Gaussian concentration property of the innovation.
In the present work, we single out the contribution of the diffusion term to the concentration rate, which to our knowledge is new. This covers many situations and gives rise to different regimes, ranging from exponential to Gaussian. We also derive a general non-asymptotic deviation bound for the difference between a function of the trajectory of a continuous Euler scheme associated to a diffusion process and its mean. It turns out that, under mild assumptions, the concentration regime is log-normal. Finally, we study non-asymptotic deviation bounds for stochastic approximation with averaging of trajectories, according to the averaging principle of Ruppert and Polyak, see e.g. [Rup91] and [PJ92].

Euler like Scheme of a Diffusion Process
We consider a Brownian diffusion process (X_t)_{t≥0}, defined on a filtered probability space (Ω, F, (F_t)_{t≥0}, P) satisfying the usual conditions, and solution to the following stochastic differential equation (SDE_{b,σ}), where (W_t)_{t≥0} is a q-dimensional (F_t)_{t≥0}-Brownian motion and the coefficients b, σ are assumed to be uniformly Lipschitz continuous in space and measurable in time.
A basic problem in Numerical Probability is to compute quantities of the form E_x[f(X_T)] for a given Lipschitz continuous function f and a fixed deterministic time horizon T using Monte Carlo simulation. For instance, such a quantity appears in mathematical finance as the price of a European option with maturity T when the dynamics of the underlying asset are given by (SDE_{b,σ}). Under suitable assumptions on the function f and the coefficients b, σ, namely smoothness or non-degeneracy, it can also be related to the Feynman-Kac representation of the heat equation associated to the generator of X. To this end, we first introduce discretization schemes of (SDE_{b,σ}) that can easily be simulated. For a fixed time step ∆ = T/N, N ∈ N*, we set t_i := i∆ for all i ∈ N and define an Euler-like scheme (1.2), where (U_i)_{i∈N*} is a sequence of R^q-valued i.i.d. random variables with law µ satisfying E[U_1] = 0_q and E[U_1 U_1*] = I_q, where U_1* denotes the transpose of the column vector U_1 and 0_q, I_q respectively denote the zero vector of R^q and the identity matrix of R^q ⊗ R^q. We also assume that µ satisfies (GC(β)) for some β > 0. The main advantage of this setting is that it includes the case of the standard Euler scheme, where U_1 has law N(0, I_q) (satisfying (GC(β)) with β = 2), and the case of the Bernoulli law, where U_1 = (B_1, · · · , B_q) with (B_k)_{k∈[[1,q]]} i.i.d. random variables of law µ = (δ_{−1} + δ_1)/2, which turns out to be one of the only realistic options when the dimension is large. It is well known that if f(X^{∆,x}_T) belongs to L²(P), the central limit theorem provides an asymptotic rate of convergence of order M^{−1/2}. If f(X^{∆,x}_T) ∈ L³(P), a non-asymptotic result is given by the Berry-Esseen theorem. However, in practical implementations, one is interested in deviation bounds in probability for a fixed M and a given threshold r > 0, that is, in explicitly controlling the quantity P(E_Emp(M, ∆) ≥ r).
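The Monte Carlo procedure described above can be sketched as follows; the coefficients b and σ below are illustrative toy choices (uniformly Lipschitz in space), not taken from the paper, and both admissible innovation laws discussed in the text are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coefficients (illustrative, not from the paper):
# b(t, x) = -x and sigma(t, x) = 1 + 0.2 x, both uniformly Lipschitz in space.
def b(t, x):
    return -x

def sigma(t, x):
    return 1.0 + 0.2 * x

def euler_scheme(x0, T, N, M, innovation="gaussian"):
    """Simulate M paths of the Euler-like scheme (1.2) at time T = N * Delta.

    innovation: "gaussian"  -> U ~ N(0, 1), satisfying (GC(beta)) with beta = 2;
                "bernoulli" -> U ~ (delta_{-1} + delta_{+1}) / 2, bounded, hence
                               also satisfying (GC(beta)).
    Both choices are centered with unit variance, as required.
    """
    dt = T / N
    x = np.full(M, x0, dtype=float)
    for i in range(N):
        t = i * dt
        if innovation == "gaussian":
            u = rng.standard_normal(M)
        else:
            u = rng.choice([-1.0, 1.0], size=M)
        x = x + b(t, x) * dt + sigma(t, x) * np.sqrt(dt) * u
    return x

# Monte Carlo estimate of E_x[f(X_T)] for the 1-Lipschitz function f = |.|.
f = np.abs
paths = euler_scheme(x0=1.0, T=1.0, N=50, M=100_000, innovation="bernoulli")
estimate = f(paths).mean()
print(estimate)
```

The deviation bounds discussed in this work control, non-asymptotically in M, the probability that `estimate` deviates from its mean by more than a threshold r.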
In this context, Malrieu and Talay [MT06] obtained Gaussian deviation bounds in an ergodic framework and for a constant diffusion coefficient. Concerning the standard Euler scheme, Lemaire and Menozzi [LM10] obtained two-sided Gaussian bounds, up to a systematic bias, under the assumptions that the diffusion coefficient is uniformly elliptic, that σσ* is Hölder continuous and bounded, and that b is bounded. Getting rid of the non-degeneracy assumption on σ, Frikha and Menozzi [FM12] recently obtained a Gaussian deviation bound under the mild smoothness condition that b, σ are uniformly Lipschitz continuous in space (uniformly in time) and that σ is bounded. The main tool of their analysis is to exploit decompositions similar to those used in [TT90] for the analysis of the weak error. It should be noted that it is the boundedness of σ that gives rise to the Gaussian concentration regime for the deviation of the empirical error.
Using optimal transportation techniques, Blower and Bolley [BB06] obtained Gaussian concentration inequalities and transportation inequalities for the joint law of the first n positions of a stochastic process with values in some Polish state space. However, continuity assumptions in the Wasserstein metric need to be checked, which can be hard in practice; see conditions (ii) in their Theorems 1.1, 1.2 and 2.1. The authors provide a computable sufficient condition, which notably requires the smoothness of the transition law; see Proposition 2.2 in [BB06].
In the current work, we remove the boundedness assumption on σ and only require the Gaussian concentration property of the innovation. We suppose that the coefficients satisfy the following smoothness and domination assumptions:

(HS) The coefficients b, σ are uniformly Lipschitz continuous in space, uniformly in time.
The idea behind assumption (HD_α) is to parameterize the growth of the diffusion coefficient in order to quantify its contribution to the concentration regime. Indeed, under (HS) and (HD_α), with α ∈ [1/2, 1], and if the innovations satisfy (GC(β)) for some positive β, we derive non-asymptotic deviation bounds for the statistical error, ranging from an exponential regime (if α = 1/2) to a Gaussian one (if α = 1). Therefore, we greatly improve the results obtained in [FM12].
Our approach here differs from that of [FM12]. Indeed, in [FM12], the key tool consists in writing the deviation using the same kind of decompositions as those exploited in [TT90] for the analysis of the discretization error. In the current work, we instead use the fact that the Euler-like scheme (1.2) defines an inhomogeneous Markov chain with Feller transitions P_k, k = 0, · · · , N − 1, defined for non-negative or bounded Borel functions f. For every k, p ∈ {0, · · · , N − 1}, k ≤ p, we also define the iterative kernels P_{k,p} for non-negative or bounded Borel functions f. For a 1-Lipschitz function f and λ ≥ 0, using that the law µ of the innovation satisfies (GC(β)) for some positive β, we obtain a first exponential moment estimate. If σ is bounded, the Gaussian concentration property readily follows, provided the iterated kernel functions P_{k,p}(f) are uniformly Lipschitz; under the mild smoothness assumption (HS), this can easily be derived, see Proposition 3.2. Otherwise, using (HD_α), we obtain the inequality (1.3). This last inequality is the first step of our analysis. To investigate the empirical error, the key idea is to exploit recursively from (1.3) that the increments of the scheme (1.2) satisfy (GC(β)), and to adequately quantify the contribution of the diffusion term V^{1−α}(x) to the concentration rate. Under (HS) and (HD_α), the latter is addressed using flow techniques and integrability results on the law of the scheme (1.2), see Propositions 3.1 and 3.3.

Stochastic Approximation Algorithm
Beyond concentration bounds on the empirical error for Euler-like schemes, we want to look at non-asymptotic bounds for stochastic approximation algorithms. Introduced by H. Robbins and S. Monro [RM51], these recursive algorithms aim at finding a zero of a continuous function h : R^d → R^d which is unknown to the experimenter and can only be estimated through experiments. Successfully and widely investigated since this seminal work, such procedures are now commonly used in various contexts, such as convex optimization, since minimizing a function amounts to finding a zero of its gradient.
To be more specific, the aim of such an algorithm is to find a solution θ* to the equation h(θ) := E[H(θ, U)] = 0, where H : R^d × R^q → R^d is a Borel function and U is a given R^q-valued random variable with law µ. The function h is generally not computable, at least not at a reasonable cost. Indeed, it is assumed that the computation of h is costly compared to the computation of H for any couple (θ, u) ∈ R^d × R^q and to the simulation of the random variable U.
A stochastic approximation algorithm corresponds to the simulation-based recursive scheme (1.4), where (U_n)_{n≥1} is an i.i.d. sequence of R^q-valued random variables with law µ defined on a probability space (Ω, F, P), and (γ_n)_{n≥1} is a sequence of non-negative deterministic steps satisfying the usual assumption

∑_{n≥1} γ_n = +∞ and ∑_{n≥1} γ_n² < +∞. (1.5)

When the function h is the gradient of a potential, the recursive procedure (1.4) is a stochastic gradient algorithm: replacing H(θ_n, U_{n+1}) by h(θ_n) in (1.4) leads to the usual deterministic gradient descent method. When h(θ) = M(θ) − ℓ, θ ∈ R, where M is a monotone, say increasing, function, the procedure can be written in this form as well. The key idea of stochastic approximation algorithms is to take advantage of an averaging effect along the scheme due to the specific form h(θ) := E[H(θ, U)]. This makes it possible to avoid the numerical integration of h at each step of a classical first-order optimization algorithm. In the present paper, we make no attempt at a general discussion of convergence results for stochastic approximation algorithms. We refer to [Duf96] and [KY03] for general results on the a.s. convergence of such procedures under the existence of a so-called Lyapunov function, i.e. a continuously differentiable function L : R^d → R_+ such that ∇L is Lipschitz and |∇L|² ≤ C(1 + L) for some positive constant C. See also [LP12] for a convergence theorem under the existence of a pathwise Lyapunov function. For the sake of simplicity, it is assumed in the sequel that θ* is the unique solution of the equation h(θ) = 0 and that the sequence (θ_n)_{n≥0} defined by (1.4) converges a.s. towards θ*.
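A minimal sketch of the Robbins-Monro scheme (1.4) follows; the mean function h and the noisy observation H below are toy illustrative choices, not from the paper, selected so that the target θ* is known exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mean function: h(theta) = E[H(theta, U)] = theta - 1, so theta* = 1.
# Only H is evaluated through simulation of U; h itself is never computed.
def H(theta, u):
    return theta - 1.0 + u  # noisy, unbiased observation of h(theta)

def robbins_monro(theta0, n_iter, gamma=lambda n: 1.0 / n):
    """Scheme (1.4): theta_{n+1} = theta_n - gamma_{n+1} H(theta_n, U_{n+1}).

    The step sequence must satisfy (1.5):
    sum gamma_n = +inf and sum gamma_n^2 < +inf.
    """
    theta = theta0
    for n in range(1, n_iter + 1):
        u = rng.standard_normal()  # innovation U_{n+1}
        theta = theta - gamma(n) * H(theta, u)
    return theta

theta_hat = robbins_monro(theta0=5.0, n_iter=200_000)
print(theta_hat)  # close to theta* = 1
```

Here γ_n = 1/n, i.e. c = 1 with λ = h'(θ*) = 1, so the condition c > 1/(2λ) discussed below is satisfied for this toy example.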
The last assumption (HUA), which already appeared in [FM12], is introduced to derive a sharp estimate of the concentration rate in terms of the step sequence. Let us note that such an assumption appears in the study of the weak convergence rate of the sequence (θ_n)_{n≥1}, as described in [Duf96] or [KY03]. Indeed, it is commonly assumed that the matrix Dh(θ*) is uniformly attractive, that is, Re(λ_min) > 0, where λ_min is the eigenvalue of Dh(θ*) with the smallest real part. In our framework, this local condition on the Jacobian matrix of h at the equilibrium is replaced by the uniform assumption (HUA). This allows us to derive sharp estimates for the concentration rate of the sequence (θ_n)_{n≥1} around its target θ* and to provide a sensitivity analysis for the bias δ_n := E[|θ_n − θ*|] with respect to the starting point θ_0.
The global error between the stochastic approximation procedure θ_n at a given time step n and its target θ* can be decomposed into an empirical error and a bias, with the notations E_Emp(γ, n, H, λ, α) := |θ_n − θ*| − E_{θ_0}[|θ_n − θ*|] and δ_n := E_{θ_0}[|θ_n − θ*|]. The empirical error E_Emp(γ, n, H, λ, α) is the difference between the absolute value of the error at time n and its mean, whereas the bias δ_n is the mean of the absolute value of the difference between the sequence (θ_n)_{n≥0} at time n and its target θ*. Unlike the Euler-like scheme, a bias systematically appears, since we want to derive a deviation bound for the difference between θ_n and its target θ*. This term strongly depends on the choice of the step sequence (γ_n)_{n≥1} and the initial point θ_0, see Proposition 4.4 for a sensitivity analysis.
As for Euler-like schemes, our strategy differs from [FM12]. Indeed, we exploit again the fact that the stochastic approximation scheme (1.4) defines an inhomogeneous Markov chain with Feller transitions P_k, k = 0, · · · , N − 1, defined for non-negative or bounded Borel functions f. For every k, p ∈ {0, · · · , N − 1}, k ≤ p, we also define the iterative kernels P_{k,p} for non-negative or bounded Borel functions f : R^d → R. For a 1-Lipschitz function f and all λ ≥ 0, using (HLS)_α and the fact that the law µ of the innovation satisfies (GC(β)) for some positive β, we obtain the inequality (1.7). Let us note the similarity between (1.3) and (1.7). If (HLS)_α holds with α = 1, then the last term on the right-hand side of (1.7) is uniformly bounded in θ. This assumption corresponds to the framework developed in [FM12] and leads to a Gaussian concentration bound.
Otherwise, the problem is more challenging. Under the mild domination assumption (HLS)_α, the key idea consists again in exploiting recursively from (1.7) that the increments of the stochastic approximation algorithm (1.4) satisfy (GC(β)), and in properly quantifying the contribution of the diffusion term L^{1−α}(θ) to the concentration rate.
As already noticed in [FM12], the concentration rate and the bias strongly depend on the choice of the step sequence. In particular, if γ_n = c/n with c > 0, then the optimal concentration rate and bias are achieved if c > 1/(2λ), see Theorem 2.2 in [FM12]; otherwise, they are sub-optimal. This kind of behavior is well known for the weak convergence rate of stochastic approximation algorithms: if c > 1/(2Re(λ_min)), a Central Limit Theorem holds for the sequence (θ_n)_{n≥1} (see e.g. [Duf96]). Let us note that the conditions c > 1/(2λ) and c > 1/(2Re(λ_min)) are difficult to handle and may lead to a blind choice in practical implementations. To circumvent this difficulty, it is fairly well known that the key idea is to carefully smooth the trajectories of a converging stochastic approximation algorithm by averaging, according to the Ruppert & Polyak averaging principle, see e.g. [Rup91] and [PJ92]. It consists in devising the original stochastic approximation algorithm (1.4) with a slowly decreasing step and simultaneously computing the empirical mean (θ̄_n)_{n≥1} of the sequence (θ_n)_{n≥0} by setting

θ̄_n = (θ_0 + · · · + θ_{n−1})/n = θ̄_{n−1} − (1/n)(θ̄_{n−1} − θ_{n−1}). (1.8)

We will not enter into the technicalities of the subject, but under mild assumptions (see e.g. [Duf96], p. 169) one shows that √n(θ̄_n − θ*) converges in law to a centered Gaussian distribution with the optimal covariance matrix Σ*. For instance, for d = 1, one has Σ* = Var(H(θ*, U))/h′(θ*)². Hence, the optimal weak rate of convergence √n is achieved for free, without any condition on the constant c. However, this result is only asymptotic and so far, to the best of our knowledge, non-asymptotic estimates for the deviation between the empirical mean sequence (θ̄_n)_{n≥0} at a given time step and its target θ*, that is, a non-asymptotic averaging principle, have not been investigated.
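The averaging recursion (1.8), combined with a slowly decreasing step γ_n = c/n^ρ, ρ ∈ (1/2, 1), can be sketched as follows; the observation H below is the same toy illustrative choice as before (h(θ) = θ − 1, θ* = 1), not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def H(theta, u):
    # Toy mean function: h(theta) = E[H(theta, U)] = theta - 1, theta* = 1.
    return theta - 1.0 + u

def polyak_ruppert(theta0, n_iter, c=1.0, rho=0.7):
    """Scheme (1.4) with a slowly decreasing step gamma_n = c / n**rho,
    rho in (1/2, 1), combined with the averaging recursion (1.8):
        bar_theta_n = bar_theta_{n-1} - (1/n)(bar_theta_{n-1} - theta_{n-1}).
    """
    theta = theta0       # theta_0
    theta_bar = theta0   # bar_theta_0
    for n in range(1, n_iter + 1):
        # Update the running average of theta_0, ..., theta_{n-1} first.
        theta_bar = theta_bar - (theta_bar - theta) / n
        u = rng.standard_normal()
        theta = theta - c / n**rho * H(theta, u)
    return theta, theta_bar

theta_n, theta_bar_n = polyak_ruppert(theta0=5.0, n_iter=200_000)
print(theta_n, theta_bar_n)
```

Note that no condition of the type c > 1/(2λ) is needed here: the averaged sequence attains the optimal rate for any c > 0, which is precisely the robustness property discussed above.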
The sequence (z_n)_{n≥0} defined by z_n := (θ̄_{n+1}, θ_n) is adapted, i.e. for all n ≥ 0, z_n is F_n-measurable, where F_n := σ(θ_0, U_k, k ≤ n). Moreover, it defines an inhomogeneous Markov chain with Feller transitions K_k, k = 0, · · · , N − 1, defined for non-negative or bounded Borel functions f. For every k, p ∈ {0, · · · , N − 1}, k ≤ p, we define the iterative kernels K_{k,p} for non-negative or bounded Borel functions. Hence, for any 1-Lipschitz function f and all λ ≥ 0, using again (HLS)_α and the fact that the law µ of the innovation satisfies (GC(β)) for some positive β, one obtains, for all k ∈ {0, · · · , N − 1}, the inequality (1.9). Here again, (1.7) and (1.9) are quite similar, and if α = 1 the concentration regime turns out to be Gaussian. Otherwise, an analysis along the lines of the methodology developed so far provides the concentration regime of the stochastic approximation algorithm with averaging of trajectories.

Transport-Entropy inequalities
As a by-product of our analysis, we derive transport-entropy inequalities for the law of both stochastic approximation schemes. We recall here basic definitions and properties. For a complete overview and recent developments in the theory of transport inequalities, the reader may refer to the recent survey [GL10]. We will denote by P(R d ) the set of probability measures on R d .
For p ≥ 1, we consider the set P_p(R^d) of probability measures with finite moment of order p. The Wasserstein metric of order p between two probability measures µ, ν ∈ P_p(R^d) is defined by

W_p(µ, ν) := inf { ( ∫_{R^d × R^d} |x − y|^p π(dx, dy) )^{1/p} : π_0 = µ, π_1 = ν },

where π_0 and π_1 are the two probability measures standing for the first and second marginals of π ∈ P(R^d × R^d).
For µ ∈ P(R^d), we define the relative entropy of µ w.r.t. ν ∈ P(R^d) as

H(µ, ν) := ∫_{R^d} log(dµ/dν) dµ if µ ≪ ν, and H(µ, ν) := +∞ otherwise.

We are now in a position to define the notion of transport-entropy inequality. Here, as below, Φ : R_+ → R_+ is a convex, increasing function with Φ(0) = 0.
For the sake of simplicity, we will write that µ satisfies T Φ .
The following proposition comes from Corollary 3.4. of [GL10].
Proposition 1.1. The following statements are equivalent:
• The probability measure µ satisfies T_Φ.
• For every 1-Lipschitz function f and every λ ≥ 0, one has E_µ[exp(λ(f − E_µ[f]))] ≤ exp(Φ*(λ)).
Such transport-entropy inequalities are very attractive, especially from a numerical point of view, since they are related to the concentration of measure phenomenon, which allows one to establish non-asymptotic deviation estimates. The next three results emphasize this point. Suppose that (X_n)_{n≥1} is a sequence of independent and identically distributed R^d-valued random variables with common law µ.
Corollary 1.1. If µ satisfies T_Φ, then for every 1-Lipschitz function f, for all r ≥ 0 and all M ≥ 1, one has

Proposition 1.2. If µ satisfies T_Φ, then the empirical measure µ_n defined by µ_n := (1/n) ∑_{k=1}^n δ_{X_k}, where for x ∈ R^d, δ_x stands for the Dirac mass at point x, satisfies the following concentration bound
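In the spirit of [GL10], the deviation bound of Corollary 1.1 for the empirical mean typically takes the following form; this is a hedged reconstruction under the standard tensorization convention, and the exact constants should be read off the displayed corollary.

```latex
% If \mu satisfies T_\Phi, then for every 1-Lipschitz f, M \ge 1, r \ge 0:
\mathbb{P}\!\left(\frac{1}{M}\sum_{k=1}^{M} f(X_k)
   - \mathbb{E}[f(X_1)] \ge r\right) \le e^{-M\,\Phi(r)} .
```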
The quantity E[W_1(µ, µ_n)] goes to zero as n goes to infinity by convergence of empirical measures, but quantitative bounds are still needed. The next result is an adaptation of a result of [RR98] on similar bounds for the distance W_2. For the sake of completeness, we provide a proof in Appendix 4.2.
Proposition 1.3. Assume that µ has a finite moment of order d + 3. Then, one has
In view of the Kantorovich-Rubinstein duality formula, namely

W_1(µ, ν) = sup_{[f]_1 ≤ 1} ( ∫ f dµ − ∫ f dν ),

where [f]_1 denotes the Lipschitz modulus of f, the latter result provides the following concentration bounds
Similar results were first obtained, for different concentration regimes, by Bolley, Guillin and Villani [BGV07], relying on a non-asymptotic version of Sanov's theorem. Some of these results have also been derived by Boissard [Boi11] using concentration inequalities, and were extended to ergodic Markov chains under some contractivity assumptions in the Wasserstein metric on the transition kernel.
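The convergence E[W_1(µ, µ_n)] → 0 can be illustrated numerically in dimension d = 1, where W_1 between two empirical measures with the same number of equally weighted atoms has a closed form via order statistics; the choice µ = N(0, 1) below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

def w1_empirical(xs, ys):
    """W_1 between two empirical measures with the same number n of atoms on R.

    With equal weights 1/n, the monotone coupling (matching order statistics)
    is optimal for the convex cost |x - y|, so
    W_1 = (1/n) * sum_i |x_(i) - y_(i)|.
    """
    return np.abs(np.sort(xs) - np.sort(ys)).mean()

# Monte Carlo illustration: the average W_1 distance between independent
# empirical measures of mu = N(0, 1) decreases as n grows.
results = {}
for n in (100, 10_000):
    results[n] = np.mean([
        w1_empirical(rng.standard_normal(n), rng.standard_normal(n))
        for _ in range(20)
    ])
print(results)
```

In higher dimensions no such closed form is available, which is one reason the quantitative bound of Proposition 1.3 (with its dimension-dependent constant C(d, µ)) is useful.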
Some applications are proposed in [BGV07]. Such results can indeed provide non-asymptotic deviation bounds for the estimation of the density of the invariant measure of a Markov chain. Let us note that the (possibly large) constant C(d, µ) appears as a trade-off to obtain uniform deviations over all Lipschitz functions.
As a consequence of the transport-entropy inequalities obtained for the laws at a given time step of Euler-like schemes and stochastic approximation algorithms, we will derive non-asymptotic deviation bounds in the Wasserstein metric.

Euler like schemes and diffusions
where the positive constants λ_{3.2} and K_{3.2} are defined in Corollary 3.2.
Note that in the above theorem, we do not need any non-degeneracy condition on the diffusion coefficient. In the case α ∈ (1/2, 1], one easily gets the following explicit formula for Φ*_α. Let us note that the linear behavior of Φ*_α on a small interval is due to the fact that Φ_α is not C¹. One may want to replace ρ² ∨ ρ^{2α/(2α−1)} by ρ² + ρ^{2α/(2α−1)} (up to a factor 2) in the expression of Φ_α. However, in this case, an explicit expression for Φ*_α does not exist (except for α = 1) and only its asymptotic behavior can be derived, so that one is led to compute it numerically in practical situations.
In the case α = 1/2, tedious but simple computations give the explicit form of Φ*_{1/2}. This corresponds to a concentration profile that is Gaussian at short distance and exponential at large distance.
Corollary 2.1 (Non-asymptotic deviation bounds). Under the same assumptions as in Theorem 2.1, one has:
• for every real-valued 1-Lipschitz function f defined on R^d, for all α ∈ [1/2, 1], all M ≥ 1 and all r ≥ 0,
• for all α ∈ [1/2, 1], all M ≥ 1 and all r ≥ 0,
where the ((X^∆_T)^k)_{1≤k≤M} are M independent copies of the scheme (1.2) starting at point x at time 0 and evaluated at time T.
Remark 2.1 (Extension to smooth functions of a finite number of time steps). The previous transport inequalities and non-asymptotic bounds can be extended to smooth functions of a finite number of time steps, such as the maximum of a scalar Euler-like scheme. In that case, it suffices to introduce an additional state variable so that the resulting couple is Markovian, and similar arguments extend to Lipschitz functions of both variables.
Remark 2.2 (Transport-entropy inequalities for the law of a diffusion process). The previous transport inequalities and non-asymptotic bounds can be extended to the law at time T of the diffusion process solution to (SDE_{b,σ}) by passing to the limit ∆ → 0. Indeed, it is well known that under (HS), one has X^∆_T → X_T a.s. as ∆ → 0, and by the Lebesgue theorem one deduces from the first result of Corollary 2.1 that the empirical mean of X_T itself satisfies a non-asymptotic deviation bound with a similar deviation function (just pass to the limit ∆ → 0 in all constants). Then, using Corollary 5.1 in [GL10] (equivalence between deviation of the empirical mean and transport-entropy inequalities), one easily derives that the law of X_T satisfies a similar transport-entropy inequality when α ∈ (1/2, 1].
We want to point out that it is the growth of σ that determines the concentration regime, ranging from a Gaussian concentration bound if α = 1 to an exponential one if α = 1/2. However, in many popular models in finance the diffusion coefficient is linear; for instance, practitioners often have to deal with Black-Scholes like dynamics. On the path space equipped with the uniform norm ||f||_∞ := sup_{0≤t≤T} |f(t)|, the expected concentration is then the log-normal one. To deal with this case, we consider the continuous Euler scheme X^{c,∆} associated to (SDE_{b,σ}), defined in (2.1). The next result provides a general non-asymptotic deviation bound for the empirical error under very mild assumptions.
Theorem 2.2 (General non-asymptotic deviation bounds). Denote by X^{c,∆} := (X^{c,∆}_t)_{0≤t≤T} the path of the scheme (2.1) with step ∆ starting from point x at time 0. Assume that for all t ∈ [0, T], the coefficients b(t, ·) and σ(t, ·) are continuous in x and satisfy the linear growth assumption. Then, for every 1-Lipschitz function f : C → R, for all M ∈ N* and all r ≥ 0, one has
where the ((X^{c,∆})^k)_{1≤k≤M} are M independent copies of the scheme (2.1). The result remains valid when one considers the path of the diffusion X solution to (SDE_{b,σ}) instead of the continuous Euler scheme.
As in the case of Euler-like schemes, for α ∈ (1/2, 1], we have an explicit expression for the Legendre transform. For α = 1/2, we obtain the following explicit bound for the Legendre transform of Φ_{1/2,N}. Hence, for fixed N ≥ 1, the following simple asymptotic behaviors can easily be derived:
• when λ is small, Φ*_{1/2,N}(λ) ∼ λ²/(2ϕ C^γ_N);
• when λ goes to infinity, Φ*_{1/2,N}(λ) ∼ λ/s_N.

Corollary 2.2 (Non-asymptotic deviation bounds). Under the same assumptions as in Theorem 2.3, one has
Moreover, the bias δ_N at step N satisfies
with K > 0.

Now, we investigate the impact of the step sequence (γ_n)_{n≥1} on the concentration rate sequences C^γ_N, C^{γ,α}_N, s_N and on the bias δ_N. Let us note that a similar analysis was performed in [FM12]. We obtain the following results:
• If we choose γ_n = c/n with c > 0, then:
  - If c > 1/(2λ), a comparison between the series and the integral yields the stated rates. Let us notice that we find the same critical level for the constant c as in the Central Limit Theorem for stochastic algorithms: if c > 1/(2Re(λ_min)), where λ_min denotes the eigenvalue of Dh(θ*) with the smallest real part, then a Central Limit Theorem holds for (θ_n)_{n≥1} (see e.g. [Duf96], p. 169). Such behavior was already observed in [FM12].
The associated bound for the bias is the following:
• If we choose γ_n = c/n^ρ, c > 0, 1/2 < ρ < 1, then δ_N → 0, Γ_{1,N} ∼ (c/(1 − ρ)) N^{1−ρ} as N → +∞, and elementary computations show that there exists C > 0 such that for all N ≥ 1, Π_{1,N} ≤ C exp(−2λ (c/(1 − ρ)) N^{1−ρ}). Hence, for all ε ∈ (0, 1 − ρ) we have the stated bound; up to a modification of ε, this yields the announced rate. Concerning the bias, from Corollary 2.2 we directly obtain the following bound:
The impact of the initial difference |θ_0 − θ*| is exponentially smaller than in the case γ_n = c/n. This is natural since the step sequence decreases more slowly to 0.
As regards the explicit computation of the Legendre transform of Φ̂_{α,N}, similarly to the previous theorem, for fixed N ≥ 1 the following simple asymptotic behaviors can easily be derived:
- when λ is small, Φ̂*_{1/2,N}(λ) ∼ λ²/(2ϕ Ĉ^γ_N);
- when λ goes to infinity, Φ̂*_{1/2,N}(λ) ∼ λ/ŝ_N.

Corollary 2.3 (Non-asymptotic deviation bounds). Under the same assumptions as in Theorem 2.4, for all N ≥ 1 and all r ≥ 0, one has
Now, we analyze the impact of the step sequence on the concentration rate sequences Ĉ^γ_N, Ĉ^{γ,α}_N, ŝ_N and on the bias δ̂_N. We first simplify the expression of the concentration rate. Since the step sequence (γ_n)_{n≥1} satisfies (1.5), there exists a positive constant K > 0 such that (Π_{1,j} Π_{1,k}^{−1})^{1/2} ≤ K exp(−λ(Γ_{1,j} − Γ_{1,k+1})) for k < j. Moreover, since the function x ↦ exp(−λx) is decreasing on [Γ_{1,p}, Γ_{1,p+1}], one clearly gets, for all i, j ∈ {0, · · · , N − 1}, i < j, a corresponding integral comparison; using the latter bound and an Abel transform, we obtain the bound (2.2).

Now, we are in position to study the impact of the step sequence (γ_n)_{n≥1} on the concentration rate sequences:
• If we select γ_n = c/n with c > 0, then, using that Γ_{1,N} = c log(N) + c′_1 + r_N, c′_1 > 0, with r_N → 0, one easily derives from (2.2) that there exists C > 0 such that the stated estimate holds, and a comparison between the series and the integral yields the following bounds:
  - If λc < 1/2, one has the stated rate.
Hence, we clearly see that in the case γ_n = c/n, averaging the trajectories of a stochastic approximation algorithm is not the key to circumvent the lack of robustness in the choice of the constant c.
The bound for the bias is obtained by averaging the bound previously obtained for δ_N; we easily get the corresponding estimate for δ̂_N.
• If we choose γ_n = c/n^ρ, c > 0, 1/2 < ρ < 1, then we have, for k ≤ p and for some positive constant C which may vary from line to line, the stated bounds, where we use a change of variables in the latter integral. For k large enough, the function involved is monotone, and we finally obtain the announced estimate.
Concerning the bias, by averaging the bias sequence (δ_k)_{1≤k≤N−1} we directly obtain the following bound for δ̂_N:
Hence, we see that the impact of the initial condition no longer decreases sub-exponentially, but only at rate O(N^{−1}). Consequently, this leads us to say that in practical implementations a stochastic approximation algorithm should be averaged after a few iterations, and not directly from the first step.

Euler Scheme: Proof of the Main Results
In this section we will assume that (HS) and (HD α ) are in force.

Proof of Theorem 2.1
The proof of Theorem 2.1 is divided into several propositions.
Proposition 3.2 (Control of the Lipschitz modulus of the iterative kernels). Denote by [b]_1 and [σ]_1 the Lipschitz moduli of b and σ appearing in the diffusion process (SDE_{b,σ}), and by P_k and P_{k,p} = P_k ∘ · · · ∘ P_{p−1}, k, p ∈ {0, · · · , N − 1}, k ≤ p, the (Feller) transition kernel and the iterative kernels of the Markov chain X^∆ defined by the scheme (1.2), respectively. Then for every real-valued Lipschitz function f and all k, p ∈ {0, · · · , N − 1}, k ≤ p, the functions P_{k,p}(f) are Lipschitz continuous and one has
where [f]_1 stands for the Lipschitz modulus of the function f and C(b, σ, ·) is the indicated constant. Using the Cauchy-Schwarz inequality and (HS), for all (x, y) ∈ (R^d)² and all k ∈ {0, · · · , N − 1}, one has
A straightforward induction argument completes the proof.
Proof. As mentioned earlier in the introduction, we begin the proof by using that the law µ of the innovation satisfies (GC(β)) together with (HD_α). Hence, for λ ≥ 0 and k ∈ {0, · · · , N − 1}, one has
Taking expectations on both sides of the last inequality and using the Hölder inequality with conjugate exponents (p, q) (to be specified later on) leads to (3.2)

Now, we apply the last inequality to f := P_{k+1,N}(f) and obtain
Consequently, an elementary induction yields
where we used Proposition 3.2 for the last inequality. Observe now that since (p, q) are conjugate exponents, we have 1/p + 1/q = 1.

Proof of Theorem 2.2
We will prove the result for the process X solution of (SDE b,σ ). The proof for the continuous Euler scheme is similar.

Stochastic Approximation Algorithm: Proof of the main Results
Throughout this section we will assume that (HL), (HLS) α and (HUA) are in force.

Proof of Theorem 2.3
The proof of Theorem 2.3 is divided into several propositions.