Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Learning

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) is a momentum version of stochastic gradient descent with properly injected Gaussian noise to find a global minimum. In this paper, non-asymptotic convergence analysis of SGHMC is given in the context of non-convex optimization, where subsampling techniques are used over an i.i.d dataset for gradient updates. Our results complement those of [RRT17] and improve on those of [GGZ18].


Introduction
Let (Ω, F , P ) be a probability space where all the random objects of this paper will be defined. The expectation of a random variable X with values in a Euclidean space will be denoted by E [X].
We consider the following optimization problem and Z is a random element in some measurable space Z with an unknown probability law µ.
The function x → f (x, z) is assumed continuously differentiable (for each z) but it can possibly be non-convex. Suppose that one has access to i.i.d samples Z = (Z 1 , ..., Z n ) drawn from µ, where n ∈ N is fixed. Our goal is to compute an approximate minimizer X † such that the population risk is minimized, where the expectation is taken with respect to the training data Z and additional randomness generating X † . Since the distribution of Z i , i ∈ N is unknown, we consider the empirical risk minimization problem using the dataset z := {z 1 , ..., z n } Stochastic gradient algorithms based on Langevin Monte Carlo have gained more attention in recent years. Two popular algorithms are Stochastic Gradient Langevin Dynamics (SGLD) and Stochastic Gradient Hamiltonian Monte Carlo (SGHMC). First, we summarize the use of SGLD in optimization, as presented in [RRT17]. Consider the overdamped Langevin stochastic differential equation where (B t ) t≥0 is the standard Brownian motion in R d and β > 0 is the inverse temperature parameter. Under suitable assumptions on f , the SDE (3) admits the Gibbs measure π z (dx) ∝ exp(−βF z (x)) as its unique invariant distribution. In addition, it is known that for sufficiently big β, the Gibbs distribution concentrates around global minimizers of F z . Therefore, one can use the value of X t from (3), (or from its discretized counterpart SGLD), as an approximate solution to the empirical risk problem, provided that t is large and temperature is low.
In this paper, we consider the underdamped (second-order) Langevin diffusion where (X t ) t≥0 , (V t ) t≥0 model the position and the momentum of a particle moving in a field of force F z with random force given by Gaussian noise. It is shown that under some suitable conditions for F z , the Markov process (X, V ) is ergodic and has a unique stationary distribution where Γ z is the normalizing constant It is easy to observe that the x-marginal distribution of π z (dx, dv) is the invariant distribution π z (dx) of (3). We consider the first order Euler discretization of (4), (5), also called Stochastic Gradient Hamiltonian Monte Carlo (SGHMC), given as follows where λ > 0 is a step size parameter and (ξ k ) k∈N is a sequence of i.i.d standard Gaussian random vectors in R d . The initial condition v 0 , x 0 may be random, but independent of (ξ k ) k∈N . In certain contexts, the full knowledge of the gradient F z is not available, however, using the dataset z, one can construct its unbiased estimates. In what follows, we adopt the general setting given by [RRT17]. Let U be a measurable space, and g : R d × U → R d such that for any z ∈ Z n , where U z is a random element in U with probability law Q z . Conditionally on Z = z, the SGHMC algorithm is defined by where (U z,k ) k∈N is a sequence of i.i.d. random elements in U with law Q z . We also assume from now on that v 0 , x 0 , (U z,k ) k∈N , (ξ k ) k∈N are independent. Our ultimate goal is to find approximate global minimizers to the problem (1). Let X † := X λ k be the output of the algorithm (9),(10) after k ∈ N iterations, and ( X * z , V * z ) be such that L( X * z , V * z ) = π z . The excess risk is decomposed as follows, see also [RRT17], The remaining part of the present paper is about finding bounds for these errors. Section 2 summarizes technical conditions and the main results. Comparison of our contributions to previous studies is discussed in Section 3. Proofs are given in Section 4. Notation and conventions. For l ≥ 1, scalar product in R l is denoted by ·, · . We use · to denote the Euclidean norm (where the dimension of the space may vary). B(R l ) denotes the Borel σ-field of R l . For any R l -valued random variable X and for any 1 ≤ p < ∞, let us set X p := E 1/p X p . We denote by L p the set of X with X p < ∞. The Wasserstein distance of order p ∈ [1, ∞) between two probability measures µ and ν on B(R l ) is defined by where Π(µ, ν) is the set of couplings of (µ, ν), see e.g. [Vil08]. For two R l -valued random variables X and Y , we denote W 2 (X, Y ) := W 2 (L(X), L(Y )), where L(X) is the law of X. We do not indicate l in the notation and it may vary.

Asumptions and main results
The following conditions are required throughout the paper.
Assumption 2.1. The function f is continuously differentiable, takes non-negative values, and there are constants A 0 , B ≥ 0 such that for any z ∈ Z, Assumption 2.2. There is M > 0 such that, for each z ∈ Z, Assumption 2.4. For each u ∈ U, it holds that g(0, u) ≤ B and Assumption 2.5. There exists a constant δ > 0 such that for every z ∈ Z n , Assumption 2.6. The law µ 0 of the initial state ( where V is the Lyapunov function defined in (17) below.
Remark 2.7. If the set of global minimizers is bounded, we can always redefine the function f to be quadratic outside a compact set containing the origin while maintaining its minimizers. Hence, Assumption 2.3 can be satisfied in practice. Assumption 2.4 means that the estimated gradient is also Lipschitz when using the same training dataset. For example, at each iteration of SGHMC, we may sample uniformly with replacement a random minibatch of size ℓ. Then we can choose U z = (z I1 , ..., z I ℓ ) where I 1 , ..., I ℓ are i.i.d random variables having distribution Uniform({1, ..., n}). The gradient estimate is thus which is clearly unbiased and Assumption 2.4 will be satisfied whenever Assumptions 2.2 and 2.1 are in force. Assumption 2.5 controls the variance of the gradient estimate.
An auxiliary continuous time process is needed in the subsequent analysis. For a step size λ > 0, denote by with initial condition V s = v, X s = x where v, x may be random but independent of (B λ t ) t≥0 . Our first result tracks the discrepancy between the SGHMC algorithm (9), (10) and the auxiliary processes (13), (14).
Proof. The proof of this theorem is given in Section 4.2.
The following is the main result of the paper.
Proof. The proof of this theorem is given in Section 4.3.

Related work and our contributions
Non-asymptotic convergence rate Langevin dynamics based algorithms for approximate sampling log-concave distributions are intensively studied in recent years. For example, overdamped Langevin dynamics are discussed in [WT11], [Dal17b], [DM16], [DK17], [DM17] and others. Recently, [BCM + 18] treats the case of non-i.i.d. data streams with a certain mixing property. Underdamped Langevin dynamics are examined in [CFG14], [Nea11], [CCBJ17], etc. Further analysis on HMC are discussed on [BBLG17], [Bet17]. Subsampling methods are applied to speed up HMC for large datasets, see [DQK + 17], [QKVT18]. The use of momentum to accelerate optimization methods are discussed intensively in literature, for example [AP16]. In particular, performance of SGHMC is experimentally proved better than SGLD in many applications, see [CDC15], [CFG14]. An important advantage of the underdamped SDE is that convergence to its stationary distribution is faster than that of the overdamped SDE in the 2-Wasserstein distance, as shown in [EGZ17].
Finding an approximate minimizer is similar to sampling distributions concentrate around the true minimizer. This well known connection gives rise to the study of simulated annealing algorithms, see [Hwa80], [Gid85], [Haj85], [CHS87], [HKS89], [GM91], [GM93]. Recently, there are many studies further investigate this connection by means of non asymptotic convergence of Langevin based algorithms and in stochastic non-convex optimization and large-scale data analysis, [CCG + 16], [Dal17a].
Relaxing convexity is a more challenging issue. In [CCAY + 18], the problem of sampling from a target distribution exp(−F (x)) where F is L-smooth everywhere and m-strongly convex outside a ball of finite radius is considered. They provide upper bounds for the number of steps to be within a given precision level ε of the 1-Wasserstein distance between the HMC algorithm and the equilibrium distribution. In a similar setting, [MMS18] obtains bounds in both the W 1 and W 2 distances for overdamped Langevin dynamics with stochastic gradients. [XCZG18] studies the convergence of the SGLD algorithm and the variance reduced SGLD to global minima of nonconvex functions satisfying the dissipativity condition.
Our work continues these lines of research, the most similar setting to ours is the recent paper [GGZ18]. We summarize our contributions below: • Diffusion approximation. In Lemma 10 of [GGZ18], the upper bound for the 2-Wasserstein distance between the SGHMC algorithm at step k and underdamped SDE at time t = kλ is (up to constants) given by which depends on the number of iteration k. Therefore obtaining a precision ε requires a careful choice of k, λ and even kλ. By introducing the auxiliary SDEs (13), (14), we are able to achieve the rate (δ 1/4 + λ 1/4 ), see Theorem 2.8 for the case p = 2. This upper bound is better in the number of iterations and hence, improves Lemma 10 of [GGZ18]. Our analysis for variance of the algorithm is also different. The iteration does not accumulate mean squared errors, as the number of step goes to infinity.
• Our proof for Theorem 2.8 is relatively simple and we do not need to adopt the techniques of [RRT17] which involve heavy functional analysis, e.g. the weighted Csiszár -Kullback -Pinsker inequalities in [BV05] is not needed.
• Dependence structure of the dataset in the sampling mechanism, can be arbitrary, see the proof of Theorem 2.8. The i.i.d assumption on dataset is used only for the generalization error. We could also incorporate non-i.i.d data in our analysis, see Remark 4.5, but this is left for further research.

A contraction result
In this section, we recall a contraction result of [EGZ17]. First, it should be noticed that the constant u and the function U in their paper are β −1 and βF z in the present paper, respectively. Here, the subscript c stands for "contraction". Using the upper bound of Lemma 5.1 for f below, there exist constants λ c ∈ 0, min{1/4, m/(M + 2B + γ 2 /2)} small enough and A c ≥ β/2(b + 2B + A 0 ) such that Therefore, Assumption 2.1 of [EGZ17] is satisfied, noting that L c := βM and We define the Lyapunov function For any ( where α c , ε c > 0 are suitable positive constants to be fixed later and h : For any two probability measures µ, ν on R 2d , we define Note that ρ and W ρ are semimetrics but not necessarily metrics. A result from [EGZ17] is recalled below. For a probability measure µ on B(R 2d ), we denote by µp t the law of (V t , X t ) when L(V 0 , X 0 ) = µ.
Theorem 4.1. There exists a continuous non-decreasing concave function h with h(0) = 0 such that for all probability measures µ, ν on R 2d , and 1 ≤ p ≤ 2, we have where the following relations hold: and if r ≥ min{1, R 1 } then These bounds and Theorem 2.3 of [EGZ17] imply that The proof is complete.

Proof of Theorem 2.8
Here, we summarize our approach. For a given step size λ > 0, we divide the time axis into intervals of length T = ⌊1/λ⌋. For each time step k ∈ [nT, (n + 1)T ], n ∈ N, we compare the SGHMC to the version with exact gradients relying on the Doob inequality, and then compare the later to the auxiliary continuous-time diffusion ( V (k, 0, (v 0 , x 0 )), X(k, 0, (v 0 , x 0 ))) with the scaled Brownian motion. At this stage we reply on the contraction result from [EGZ17] and uniform boundedness of the Langevin diffusion and its discrete time versions. Since the auxiliary dynamics evolves slower than the original Langevin dynamics, or more precisely at the same speed as that of the SGHCM, our upper bounds do not accumulate errors and are independent from the number of iterations.
Let T := ⌊1/λ⌋. For each n ∈ N, and for each nT ≤ k < (n + 1)T , we set For each n ∈ N, it holds by definition that V λ nT =Ṽ λ nT and the triangle inequality implies for nT ≤ k < (n + 1)T , Denote g k,nT (x) := E [g(x, U z,k )|H nT ] , x ∈ R d . By Assumption 2.4, the estimation continues as follows Using (25), one obtains noting that T λ ≤ 1. Therefore, the estimation in (26) continues as Applying the discrete-time version of Grönwall's lemma and taking squares, noting also that (x + y) 2 ≤ 2(x 2 + y 2 ), x, y ∈ R yield Taking conditional expectation with respect to H nT , the estimation becomes Since the random variables U z,i are independent, the sequence of random variables g(X λ i , U z,i ) − g i,nT (X λ i ), nT ≤ i < (n + 1)T are independent conditionally on H nT , noting thatX λ i is measurable with respect to H nT . In addition, they have zero mean by the tower property of conditional expectation. By Assumption 2.4, by the independence of U z,i , i > nT from H nT . Doob's inequality and (29) imply Taking one more expectation and using Lemma 5.3 give , and therefore, where we define LetṼ int andX int be the continuous-time interpolation ofṼ λ k , and ofX λ k on [nT, (n + 1)T ), respectively, with the initial conditionsṼ int nT =Ṽ nT = V λ nT andX int nT =X nT = X λ nT . For each n ∈ N and for nT ≤ t < (n + 1)T , define also where the dynamics of V , X are given in (13), (14). In this way, the processes ( V t ) t≥0 , ( X t ) t≥0 are right continuous with left limits. From Lemma 4.4, we obtain for nT ≤ t < (n + 1)T Combining (30), (31) and (35) gives Define A t = ( V t , X t ) and B(t, s, (v s , x s )) = ( V (t, s, (v s , x s )), X(t, s, (v s , x s ))) for s ≤ t and v s , x s are R d -valued random variables. The triangle inequality and Theorem 4.1 imply that for nT ≤ t < (n+1)T , and for 1 ≤ p ≤ 2, noting the rate of contraction of ( V t , X t ) is e −c * λt . Using Lemma 5.4, we obtain In L 2 norm, the first and second terms of (38) is bounded by (c 2 + c 7 ) √ λ + c 3 √ δ, see (36) and the fifth term is estimated by √ λ. We consider the third term in (38). From the dynamics of V , we find that for iT − 1 ≤ t ≤ iT ,

The first term T 1
The first term in the right hand side of (45) is rewritten as where µ ⊗n is the product of laws of independent random variables Z 1 , ..., Z n . By Assumptions 2.1 and 2.2, the function F z satisfies ∇F z (x) ≤ M x + B. Using Lemma 5.2, we have where p > 1, q ∈ N, 1/p + 1/(2q) = 1, x 2q π z (dx, dv) 1/(2q) < ∞ by Lemma 5.5. On the other hand, Theorems 2.8 and 4.1 imply Therefore, an upper bound for T 1 is given by

The second term T 2
Since the x-marginal of π z (dx, dv) is π z (dx), the Gibbs measure of (3), we compute Therefore the argument in [RRT17] is adopted, The constant c LS comes from the logarithmic Sobolev inequality for π z and where λ * is the uniform spectral gap for the overdamped Langevin dynamics Remark 4.5. One can also find an upper bound for T 2 when the data z is a realization of some non-Makovian processes. For example, if we assume that f is Lipschitz on the second variable z and Z satisfies a certain mixing property discussed in [CKRS16]) then the term T 2 is bounded by 1/ √ n times a constant, see Theorem 2.5 therein.

The third term T 3
For the third term, we follow [RRT17]. Let x * be any minimizer of F (x). We compute where the last inequality comes from Proposition 3.4 of [RRT17]. The condition β ≥ 2m is not used here, see the explanation in Lemma 16 of [GGZ18].
The next lemma generalizes continuity for functions of quadratic growth in Wasserstein distances given in [PW16].
Lemma 5.2. Let µ, ν be two probability measures on R 2d with finite second moments and let G : R 2d → R be a C 1 function with ∇G(w) ≤ c 1 w + c 2 for some c 1 > 0, c 2 ≥ 0. Then for p > 1, q > 1 such that 1/p + 1/q = 1, we have Proof. Using the Cauchy-Schwartz inequality, we compute Then for any ξ ∈ Π(µ, ν) we have Since this inequality holds true for any ξ ∈ Π(µ, ν), the proof is complete.

Explicit dependence of constants on important parameters
Similar to [GGZ18], we choose µ 0 in such a way that Then we get C c x = C c v = C a x = C a v = O((β + d)/β). It follows that c 2 = c 3 = c 7 = c 16 = O( (β + d)/β). The constant c * , C * are µ * and C respectively in [GGZ18]. In addition, we check .

It is checked that
From Lemma 16 of [GGZ18], we get Furthermore, it is observed that Therefore, for a fixed k, the term B 1 is bounded by Since c * is exponentially small in (β + d), our bound for B 1 is worse than that of J 1 (ε)+ J 0 (ε) given in [GGZ18].