Escaping Saddle Points Efficiently with Occupation-Time-Adapted Perturbations

Motivated by the super-diffusivity of self-repelling random walk, which has roots in statistical physics, this paper develops a new perturbation mechanism for optimization algorithms. In this mechanism, perturbations are adapted to the history of states via the notion of occupation time. After integrating this mechanism into the framework of perturbed gradient descent (PGD) and perturbed accelerated gradient descent (PAGD), two new algorithms are proposed: perturbed gradient descent adapted to occupation time (PGDOT) and its accelerated version (PAGDOT). PGDOT and PAGDOT are shown to converge to second-order stationary points at least as fast as PGD and PAGD, respectively, and thus they are guaranteed to avoid getting stuck at non-degenerate saddle points. The theoretical analysis is corroborated by empirical studies in which the new algorithms consistently escape saddle points and outperform not only their counterparts, PGD and PAGD, but also other popular alternatives including stochastic gradient descent, Adam, AMSGrad, and RMSProp.


Introduction
Gradient descent (GD), which dates back to (Cauchy, 1847), aims to minimize a function f : R^d → R via the iteration x_{t+1} = x_t − η∇f(x_t), t = 0, 1, 2, ..., where η > 0 is the step size and ∇f is the gradient of f. Due to its simple form and fine computational properties, GD and its variants (e.g., stochastic gradient descent) are essential for many machine learning tools: principal component analysis (Candès et al., 2011), phase retrieval (Candès et al., 2015), and deep neural networks (Rumelhart et al., 1986), just to name a few. In the era of data deluge, many problems are concerned with large-scale optimization in which the intrinsic dimension d is large. GD turns out to be efficient in dealing with high-dimensional convex optimization, where a first-order stationary point ∇f(x) = 0 is necessarily a global minimum point. Algorithmically, this amounts to finding a point with small gradient, ||∇f(x)|| < ε. A classical result of (Nesterov, 2004) showed that the time required by GD to find such a point in a possibly non-convex problem is of order ε^{−2}, independent of the dimension d.
In non-convex settings, applying GD will still lead to an approximate first-order stationary point. However, this is not sufficient: for non-convex functions, first-order stationary points can be global minima, local minima, local maxima, or saddle points. As we will explain, saddle points are the main bottleneck for GD in many non-convex problems. The goal of this paper is therefore to develop efficient algorithms that escape saddle points in high-dimensional non-convex problems, and hence overcome the curse of dimensionality.
Escape local minima: Inspired by annealing in metallurgy, (Kirkpatrick et al., 1983) developed simulated annealing to approximate the global minimum of a given function. (Geman & Hwang, 1986) proposed a diffusion version of simulated annealing and proved that it converges to the set of global minimum points. However, subsequent works (Holley et al., 1989; Menz et al., 2018; Miclo, 1992; Monmarché, 2018; Tang & Zhou, 2021) revealed that it might take an exponentially long time (of order exp(d)) for diffusion simulated annealing to get close to the global minimum. Some methods, e.g., those based on Lévy flights (Pavlyukevich, 2007) or cuckoo search (Yang & Deb, 2009), have shown empirically faster convergence to the global minimum, yet their theoretical underpinnings remain limited. There are recent efforts to approximate the global minimum in non-convex problems via Langevin-dynamics-based stochastic gradient descent (Raginsky et al., 2017; Chen et al., 2020), along with its variants using non-reversibility (Hu et al., 2020) and replica exchange (Chen et al., 2019; Dong & Tong, 2021). Typically, these algorithms take polynomial time in the dimension d, and thus may scale poorly when d is large.
Escape saddle points: Fortunately, in many non-convex problems, it suffices to find a local minimum. Indeed, a line of recent work argues that local minima are less problematic, and that many non-convex problems have no spurious local minima; that is, all local minima are comparable in value with the global minimum. Examples include tensor decomposition (Ge et al., 2015; 2018; Ge & Ma, 2017; Sanjabi et al., 2019), semidefinite programming (Bandeira et al., 2016; Mei et al., 2017), dictionary learning (Sun et al., 2017), phase retrieval (Sun et al., 2018), robust regression (Mei et al., 2018), low-rank matrix factorization (Bhojanapalli et al., 2016; Ge et al., 2017; 2016; Park et al., 2017), and certain classes of deep neural networks (Choromanska et al., 2015; Draxler et al., 2018; Kawaguchi, 2016; Kazemipour et al., 2019; Liang et al., 2018; Nguyen & Hein, 2017; Venturi et al., 2019; Wu et al., 2018). Nevertheless, as shown in (Dauphin et al., 2014b; Du et al., 2017; Jain et al., 2017), saddle points may correspond to suboptimal solutions, and it may take an exponentially long time to move from a saddle point to a local minimum. Meanwhile, it has been observed in empirical studies (Dauphin et al., 2014a; Swirszcz et al., 2016) that GD and its variants with momentum, such as Adam (Kingma & Ba, 2015), may be trapped at saddle points. (Ge et al., 2015) took the first step to show that by adding noise at each iteration, GD can escape all saddle points in polynomial time. Additionally, (Du et al., 2018; Lee et al., 2016) proved that with random initialization, GD converges to a local minimizer. Moreover, (Jin et al., 2017) proposed the perturbed gradient descent (PGD) algorithm, which (Jin et al., 2018) further improved to the perturbed accelerated gradient descent (PAGD) algorithm. They showed that PGD and PAGD are efficient: their time complexity is almost independent of the dimension d. See also (Jin et al., 2021) for a summary of results in this direction.
Our idea. Motivated by the "fast exploration" of the self-repelling random walk, this paper develops a new perturbation mechanism by adapting the perturbations to the history of states. Recall that (Jin et al., 2017; 2021) used the following perturbation update when the perturbation conditions hold:

x_{t+1} = x_t + ξ_t, with ξ_t ∼ Unif(B_d(0, r)),

where Unif(B_d(0, r)) denotes a point picked uniformly in the ball of radius r. On the empirical side, (Neelakantan et al., 2015; Zhou et al., 2019) applied this idea of GD with noise to train deep neural networks.
Our idea is to replace Unif(B_d(0, r)) with non-uniform perturbations whose mechanism depends on the current state x_t and the history of states {x_s : s ≤ t}. There are conceivably many ways to add non-uniform perturbations based on the current and previous states; here we choose to adapt the perturbations to the "occupation time".
The intuition is illustrated by the one-dimensional function f(x) = x³ (see Figure 1). There is a saddle point at 0; imagine GD approaches 0 from the right. It can be shown that GD converges monotonically to a stationary point (see Appendix A). The uniform perturbation adds noise with probability 1/2 to the right and to the left. To the right, GD will again get stuck at the saddle point 0. To the left, however, there is a possibility of escaping from 0 and finding a local minimum (−∞ in this case). Therefore, it is reasonable to add noise with a larger probability to the left, since the iterate has spent a long time on the right and has yet to explore the left side.
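As a quick sanity check of this claim, the following minimal Python snippet (with a hypothetical step size and starting point) runs GD on f(x) = x³ and shows that the iterates decrease monotonically toward the saddle point at 0 without ever crossing it:

```python
def gd_cubic(x0, eta=0.01, steps=20000):
    """Run gradient descent on f(x) = x**3, whose gradient is 3*x**2."""
    x = x0
    for _ in range(steps):
        x = x - eta * 3 * x**2
    return x

x_final = gd_cubic(1.0)
print(x_final)  # close to 0, still positive: stuck near the saddle
```

Starting from any x > 0, the update x − 3ηx² stays positive (for small η), so GD can only crawl toward 0 and never discovers the descent direction on the left.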
The previous intuition can be quantified via the notion of occupation times: L_t (the number of {x_s}_{s<t} to the left of x_t) and R_t (the number of {x_s}_{s<t} to the right of x_t). By definition, R_t + L_t = t for each t = 0, 1, .... If L_t is larger, the perturbation pushes the iterate x_t to the right; if R_t is larger, it pushes to the left. More precisely,

x_{t+1} = x_t − r Unif(0, 1) with probability p, and x_{t+1} = x_t + r Unif(0, 1) with probability 1 − p,   (1)

where p = w(R_t)/(w(L_t) + w(R_t)) and w : {0, 1, ...} → (0, ∞) is an increasing weight function on the nonnegative integers (e.g., w(n) = 1 + n^α for α > 0).
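Dynamics (1) can be sketched in a few lines of Python; the weight exponent α = 2 here is an illustrative choice, not a tuned value from the paper:

```python
import random

def w(n, alpha=2.0):
    """Increasing weight function on the nonnegative integers."""
    return 1.0 + n**alpha

def occupation_perturbation_step(history, x, r=0.1):
    """One step of dynamics (1): push the iterate toward the less-explored side.

    history: previous iterates {x_s : s < t}; x: current iterate; r: amplitude.
    """
    L = sum(1 for s in history if s < x)   # occupation time on the left
    R = sum(1 for s in history if s > x)   # occupation time on the right
    p_left = w(R) / (w(L) + w(R))          # many visits on the right => push left
    if random.random() < p_left:
        return x - r * random.random()
    return x + r * random.random()
```

For example, if the whole history lies to the right of the current iterate, p_left is close to 1, so the perturbation almost surely explores the left side.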
The dynamics (1) is closely related to the vertex-repelling random walk defined by

Z_{t+1} = Z_t + 1 with probability w(L̃_t)/(w(L̃_t) + w(R̃_t)), and Z_{t+1} = Z_t − 1 with probability w(R̃_t)/(w(L̃_t) + w(R̃_t)),   (2)

where R̃_t := #{s < t : Z_s = Z_t + 1} and L̃_t := #{s < t : Z_s = Z_t − 1}. This (non-Markovian) random walk model was introduced by (Peliti & Pietronero, 1987) in the statistical physics literature. Based on scaling arguments and simulations, it was conjectured that the walk (Z_t, t ≥ 0) is recurrent and further super-diffusive in the sense that E Z_t² ∼ t^{4/3}, whereas a simple random walk (S_t, t ≥ 0) has exploration range E S_t² ∼ t. These properties have only been proved rigorously for a simpler variant, the edge-repelling random walk; see (Davis, 1990; Tóth, 1995). A counterpart to the vertex-repelling walk is the vertex-reinforced walk (Pemantle, 1992; Volkov, 2006) defined by Z_{t+1} = Z_t − 1 with probability w(L̃_t)/(w(L̃_t) + w(R̃_t)), and Z_{t+1} = Z_t + 1 with probability w(R̃_t)/(w(L̃_t) + w(R̃_t)). It is well known (Tarrès, 2004; Volkov, 2006) that the vertex-reinforced random walk localizes at a finite number of points for some choices of w(·), e.g., w(n) ∼ n^α with α ≥ 1.
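A minimal simulation of the vertex-repelling walk (2), with an illustrative linear weight w(n) = 1 + n, shows its tendency to keep exploring new sites rather than localize:

```python
import random
from collections import defaultdict

def vertex_repelling_walk(T, alpha=1.0, seed=0):
    """Simulate T steps of the vertex-repelling walk (2): the walk is
    pushed away from whichever neighbouring site it has visited more."""
    rng = random.Random(seed)
    w = lambda n: 1.0 + n**alpha
    visits = defaultdict(int)
    z = 0
    visits[0] = 1
    path = [0]
    for _ in range(T):
        right, left = visits[z + 1], visits[z - 1]
        # repelled by the more-visited neighbour: step right with
        # probability proportional to the LEFT neighbour's weight
        p_right = w(left) / (w(left) + w(right))
        z += 1 if rng.random() < p_right else -1
        visits[z] += 1
        path.append(z)
    return path
```

Plotting `vertex_repelling_walk(T)` against a simple symmetric walk of the same length typically shows a visibly wider exploration range, in line with the conjectured super-diffusivity.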
Our results. We will first show that the vertex-repelling walk never localizes or gets stuck at a finite set of points, in contrast with the vertex-reinforced walk (see Theorem 3.1 below). The non-localization and the (conjectured) super-diffusive properties of the vertex-repelling walk (2) facilitate exploration, and thus the corresponding perturbation scheme (1) makes it more likely to escape from saddle points.
We will then propose a new perturbation mechanism based on the dynamics (1), which can be integrated into the framework of (any) perturbation-based optimization algorithms.In particular, integrating the above-mentioned mechanism into the framework of PGD and PAGD, we propose two new algorithms: perturbed gradient descent adapted to occupation time (PGDOT, Algorithm 1) and its accelerated version, perturbed accelerated gradient descent adapted to occupation time (PAGDOT, Algorithm 2).
We will prove that Algorithm 1 (resp. Algorithm 2) converges to a second-order stationary point at least as fast as PGD (resp. PAGD).
Algorithm 1 Perturbed Gradient Descent Adapted to Occupation Time (Meta Algorithm)

Algorithms 1 and 2 are state-dependent adaptive algorithms, perturbing GD and accelerated gradient descent (AGD) (Nesterov, 1983) non-uniformly according to the history of states.
We will finally corroborate our theoretical analysis by experimental results.In particular, we will demonstrate that Algorithms 1 and 2 escape saddle points faster than not only their counterparts, PGD and PAGD, but also several momentum methods such as Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs) on some well-studied datasets such as MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al., 2009).
Algorithm 2 Perturbed Accelerated Gradient Descent Adapted to Occupation Time (Meta Algorithm)

Notations: Below we collect the notations used throughout this paper. For a finite set S, let #S denote the number of elements in S. For a domain D, let Unif(D) be the uniform distribution on D; e.g., Unif(0, 1) is the uniform distribution on [0, 1]. For a function f : R^d → R, let ∇f and ∇²f denote its gradient and Hessian, and let f* denote its global minimum value. For a square matrix A, let λ_min(A) be its minimum eigenvalue.
The notation ||·|| is used for both the Euclidean norm of a vector and the spectral norm of a matrix. For x = (x_1, ..., x_d) and r > 0, let B_d(x, r) := {y : ||y − x|| ≤ r} be the d-dimensional ball centered at x with radius r, and let C_d(x, r) := {y : ||y − x||_∞ ≤ r} be the d-dimensional cube centered at x with distance r to each of its faces. We use the symbol O(·) to hide only absolute constants which do not depend on any problem parameter.
The rest of the paper is organized as follows. Section 2 provides background on continuous optimization and recalls some existing results. Section 3 presents the main results. Section 4 contains numerical experiments to corroborate our analysis. Section 5 concludes.

Results of GD
We consider non-convex optimization (convex optimization results are recalled in Appendix B). In this case, it is generally difficult to find the global minimum. A popular approach is to consider first-order stationary points instead.
Definition 2.1. Let f : R^d → R be a differentiable function. We say that (i) x is a first-order stationary point of f if ∇f(x) = 0; (ii) x is an ε-first-order stationary point of f if ||∇f(x)|| ≤ ε. We say that a differentiable function f : R^d → R is ℓ-gradient Lipschitz if ||∇f(x) − ∇f(y)|| ≤ ℓ||x − y|| for all x, y.

For gradient Lipschitz functions, GD converges to first-order stationary points, as quantified by the following theorem from (Nesterov, 2004)[Section 1.2.3].
Theorem 2.2. Assume that f : R^d → R is ℓ-gradient Lipschitz. For any ε > 0, if we run GD with step size η = ℓ^{−1}, then the number of iterations to find an ε-first-order stationary point is ℓ(f(x_0) − f*)/ε².

Note that in Theorem 2.2, the time complexity of GD is independent of the dimension d. For a non-convex function, a first-order stationary point can be a local minimum, a saddle point, or a local maximum. The following definition is taken from (Jin et al., 2017)[Definition 4].
Definition 2.3. Let f : R^d → R be a differentiable function. We say that (i) x is a local minimum if x is a first-order stationary point and f(x) ≤ f(y) for all y in some neighborhood of x; (ii) x is a saddle point if x is a first-order stationary point but not a local minimum. Assume further that f is twice differentiable. We say a saddle point x is strict if λ_min(∇²f(x)) < 0.

For a twice differentiable function f, note that λ_min(∇²f(x)) ≤ 0 for any saddle point x. So by assuming a saddle point x to be strict, we rule out the degenerate case λ_min(∇²f(x)) = 0. The next subsection reviews two perturbation-based algorithms that allow jumping out of strict saddle points.

Results of PGD and PAGD
One drawback of GD in non-convex optimization is that it may get stuck at saddle points. (Jin et al., 2017) and (Jin et al., 2018) proposed PGD and PAGD, respectively, to escape saddle points, which we review here. To proceed further, we need some vocabulary regarding the Hessian of the function f: a twice differentiable function f is ρ-Hessian Lipschitz if ||∇²f(x) − ∇²f(y)|| ≤ ρ||x − y|| for all x, y, and a point x is an ε-second-order stationary point of f if ||∇f(x)|| ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε).
To simplify the presentation, assume that all saddle points are strict (Definition 2.3). In this situation, all second-order stationary points are local minima. The basic idea of these two algorithms is as follows. Imagine that we are currently at an iterate x_t which is not an ε-second-order stationary point. There are two scenarios: (i) the gradient ||∇f(x_t)|| is large, and a usual iteration of GD or AGD is enough; (ii) the gradient ||∇f(x_t)|| is small but λ_min(∇²f(x_t)) ≤ −√(ρε) (large negative curvature). In this case, x_t is around a saddle point, and a perturbation ξ is needed to escape from the saddle region: x_t ← x_t + ξ.
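This two-scenario logic can be sketched as follows; the function below is a simplified stand-in with hypothetical thresholds, not the exact algorithm of (Jin et al., 2017):

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.1, g_thres=1e-3, r=1e-2,
                 t_thres=10, steps=300, seed=0):
    """Simplified sketch of the PGD escape logic (hypothetical parameters):
    take a plain GD step while the gradient is large; when it is small
    (a possible saddle region), add a small random perturbation, at most
    once every t_thres iterations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -t_thres
    for t in range(steps):
        g = grad(x)
        if np.linalg.norm(g) <= g_thres and t - last_perturb >= t_thres:
            xi = rng.normal(size=x.shape)                 # random direction
            x = x + r * rng.random() * xi / np.linalg.norm(xi)
            last_perturb = t
        else:
            x = x - eta * g                               # usual GD step
    return x

# f(x, y) = x**2 + y**4 - y**2 has a strict saddle at the origin and
# minima at y = +/- 1/sqrt(2); plain GD from (1, 0) gets stuck at (0, 0).
grad_f = lambda v: np.array([2.0 * v[0], 4.0 * v[1]**3 - 2.0 * v[1]])
x_escaped = perturbed_gd(grad_f, [1.0, 0.0])
```

Here plain GD starting from (1, 0) converges to the saddle (0, 0), whereas the perturbed iterates end up near one of the two minima (0, ±1/√2).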
The main results for PGD (Theorem 3 in (Jin et al., 2017)) and PAGD (Theorem 3 in (Jin et al., 2018)) are stated below, showing that the time complexities of these two algorithms are almost dimension-free (up to log factors).
Theorem 2.5. (Jin et al., 2017) Assume that f : R^d → R is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz. Then there exists c_max > 0 such that for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and c ≤ c_max, PGD outputs an ε-second-order stationary point with probability 1 − δ, terminating within the following number of iterations:

O( ℓ(f(x_0) − f*)/ε² · log⁴( dℓΔ_f/(ε²δ) ) ).

Compared with Theorem 2.2, PGD takes almost the same order of time to find a second-order stationary point as GD does to find a first-order stationary point.
Theorem 2.6. (Jin et al., 2018) Assume that f : R^d → R is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz. Then there exists an absolute constant c_max > 0 such that for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and c ≥ c_max, with probability 1 − δ, one of the iterates x_t of PAGD will be an ε-second-order stationary point within the following number of iterations:

O( ℓ^{1/2}ρ^{1/4}(f(x_0) − f*)/ε^{7/4} · log⁶( dℓΔ_f/(ρεδ) ) ).

Main Results
In this section, we first prove the non-localization property of the vertex-repelling random walk.Then, we formalize the idea of perturbations adapted to occupation time and provide the full version of PGDOT and PAGDOT in Algorithms 3 and 4, respectively.Our main results show that these algorithms converge rapidly to second-order stationary points.

Non-Localization Property of Vertex-Repelling Random Walk
The following theorem suggests that the new perturbation mechanism helps perturbation-based algorithms avoid getting stuck at saddle points, as the vertex-repelling random walk (2) underlying the perturbation scheme (1) does not localize.

Theorem 3.1. Almost surely, the vertex-repelling random walk (2) does not localize at any finite set of points. (The proof is given in Appendix C.)

Perturbed Gradient Descent Adapted to Occupation Time
PGD adds a uniform random perturbation when stuck at saddle points. From the discussion in the introduction, it is more reasonable to perturb with non-uniform noise whose distribution depends on the occupation times. Recall that w : {0, 1, ...} → (0, ∞) is an increasing weight function on the nonnegative integers. The following algorithm adapts PGD to random perturbations depending on the occupation dynamics. We follow the parameter setting of (Jin et al., 2017). Our algorithm performs GD with step size η and receives a perturbation of amplitude r/√d near saddle points, at most once every t_thres iterations. The threshold t_thres ensures that the dynamics of the algorithm is mostly GD. The threshold g_thres determines whether a perturbation is needed, and the threshold f_thres decides when the algorithm terminates.
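A rough sketch of an occupation-time-adapted perturbation near saddle points, applying dynamics (1) coordinate-wise, might look as follows (this illustrates the mechanism and is not the exact Algorithm 3):

```python
import random

def w(n, alpha=5):
    """Weight function of the form used in the paper's experiments: 1 + n^alpha."""
    return 1.0 + n**alpha

def occupation_adapted_perturbation(history, x, r):
    """Sketch (not the paper's exact Algorithm 3): draw each coordinate of
    the perturbation with a sign biased toward the less-visited side, as in
    dynamics (1) applied per coordinate.

    history: list of past iterates (tuples); x: current iterate; r: amplitude.
    """
    perturbed = []
    for i, x_i in enumerate(x):
        L = sum(1 for past in history if past[i] < x_i)   # visits to the left
        R = sum(1 for past in history if past[i] > x_i)   # visits to the right
        p_left = w(R) / (w(L) + w(R))
        step = r * random.random()
        perturbed.append(x_i - step if random.random() < p_left else x_i + step)
    return perturbed
```

Each coordinate is pushed, with high probability, away from the half-space where the iterates have already spent most of their time.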
Algorithm 3 Perturbed Gradient Descent Adapted to Occupation Time: PGDOT(x_0, ℓ, ρ, ε, c, δ, Δ_f)

The next theorem gives the convergence rate of Algorithm 3: PGDOT finds a second-order stationary point in the same number of iterations (up to a constant factor) as PGD does.
Theorem 3.2. Assume that f : R^d → R is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz. Then there exists c_max > 0 such that for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and c ≤ c_max, PGDOT (Algorithm 3) outputs an ε-second-order stationary point with probability 1 − δ, terminating within the following number of iterations:

O( ℓ(f(x_0) − f*)/ε² · log⁴( dℓΔ_f/(ε²δ) ) ).

The proof of Theorem 3.2 is based on a geometric characterization of saddle points, the thin-pancake property (Jin et al., 2017).
In Appendix D, we will discuss this property, and show how it is used to prove Theorem 3.2.

Perturbed Accelerated Gradient Descent Adapted to Occupation Time
Similar to the way we combined our perturbation mechanism with PGD, we can adapt PAGD to this mechanism as well, resulting in the accelerated version of PGDOT (Algorithm 4). We follow the parameter setting of (Jin et al., 2018).
Algorithm 4 Perturbed Accelerated Gradient Descent Adapted to Occupation Time: PAGDOT(x_0, η, θ, γ, s, r, T)

Algorithm 4, similar to PAGD, includes a feature enabling it to reset the momentum and decide whether to exploit the negative curvature when the function becomes "too convex" (see Algorithm 5).
Algorithm 5 Negative Curvature Exploitation: NCE(x_t, v_t, s)

The next theorem gives the convergence rate of Algorithm 4: PAGDOT finds a second-order stationary point in the same number of iterations (up to a constant factor) as PAGD does, and therefore achieves a faster convergence rate than PGD and PGDOT. The proof of Theorem 3.3 is similar to that of Theorem 3.2 (see Appendix D).

Theorem 3.3. Assume that f : R^d → R is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz. Then there exists an absolute constant c_max > 0 such that for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and c ≥ c_max, with probability 1 − δ, one of the iterates x_t of PAGDOT (Algorithm 4) will be an ε-second-order stationary point within the following number of iterations:

O( ℓ^{1/2}ρ^{1/4}(f(x_0) − f*)/ε^{7/4} · log⁶( dℓΔ_f/(ρεδ) ) ).

It is worth mentioning that Algorithms 3 and 4 share some spirit with simulated annealing and with momentum methods such as the heavy ball method (Polyak, 1964). In simulated annealing, the perturbation is time-adapted, while the perturbation in Algorithms 3 and 4 is state-adapted (to the history of states). In the heavy ball method, a momentum term, which is a function of the current and previous states, is explicitly added to damp oscillations and to accelerate along low-curvature directions close to the momentum. In Algorithms 3 and 4, however, no explicit momentum term is added; instead, the perturbation is adapted to the history of states, providing the current state with an explicit direction.

Empirical Results
This section presents empirical results to corroborate the theoretical analysis presented in the previous section.Different machine learning tasks are considered including a nonlinear regression problem adapted from learning time series data, a regularized linear quadratic problem, the phase retrieval problem, and training MLPs on the MNIST and CIFAR-10 datasets.
As shown in these examples, integrating our new perturbation mechanism into the framework of perturbation-based algorithms boosts their performance: PGDOT and PAGDOT escape saddle points or plateaus faster than their counterparts. Specifically, Example 4 shows that in training MLPs on the MNIST and CIFAR-10 datasets, PGDOT and PAGDOT are robust against different initializations and manage to escape saddle points efficiently; in contrast, PGD and PAGD, as well as other popular algorithms such as stochastic gradient descent (SGD), Adam, AMSGrad, and RMSProp, fail to do so. In addition, PAGDOT converges faster than PGDOT in all these examples, which is in line with the theoretical results in Theorems 3.2 and 3.3.
In these experiments, the occupation times are approximated over a sliding window of the iterate history. Here h is a hyperparameter characterizing the occupation time over a small interval, and t_count is another hyperparameter prescribing how long one keeps track of the history of x_t in order to approximate the occupation time with a constant memory cost. We choose the weight function in Algorithms 3 and 4 as w(n) = 1 + n⁵. All other hyperparameters used in the numerical examples are reported in Appendix E.
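Our reading of this windowed approximation can be sketched as follows; `windowed_occupation_counts` and its exact windowing rule are illustrative guesses, with h and t_count playing the roles described in the text:

```python
from collections import deque

def windowed_occupation_counts(history, x_i, h):
    """Sketch (our reading of the text, not the paper's exact formula):
    count only past iterates within distance h of the current coordinate,
    split into left and right occupation counts."""
    L = sum(1 for past in history if x_i - h < past < x_i)
    R = sum(1 for past in history if x_i < past < x_i + h)
    return L, R

# Keeping only the last t_count iterates bounds the memory cost:
t_count = 100
history = deque(maxlen=t_count)   # append each iterate; old ones drop out
```

A `deque` with `maxlen=t_count` discards the oldest iterate automatically, which matches the constant-memory requirement mentioned above.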
Example 1 Figure 2 gives the visualization of the case N = 4, L = 1, together with the training curves of f given by 5 different algorithms when d = 4. The initial values are all the same, and all the algorithms except GD are run 3 times to account for the randomness of the perturbations. We can see that while GD gets stuck at the saddle points, all the other algorithms escape from them. Moreover, the new algorithms PGDOT and PAGDOT outperform their counterparts.

Example 2 We consider a nonlinear regression problem, adapted from learning time series data with a continuous dynamical system (Li et al., 2021). The loss function f is the least-squares error between the fitted function ŷ(s; x) and the target y*(s) over the sample points (see the definition accompanying Figure 3). For the specific regression model, we assume M = 4 and use N = 50 data points with s_i = i/10, i = 0, ..., 49. Figure 3 shows the target function and the fitted function obtained by PGDOT, together with the learning curves of 5 different algorithms. Again, PGDOT and PAGDOT escape the saddle point faster than GD and outperform PGD and PAGD, respectively.

Example 3 The next two non-convex optimization problems are taken from (Wang et al., 2019). The first is a regularized linear-quadratic problem (Reddi et al., 2018a), in which we take N = 10, H = diag([1, −0.1]), and the b_i's are instances of N(0, diag([0.1, 0.001])). The second is the phase retrieval problem (Candès et al., 2013), in which we choose N = 200, x* an instance of N(0, I_d/d), and the a_i's instances of N(0, I_d) with d = 10.
We initialize the regularized linear-quadratic problem with x_0 = 0, and the phase retrieval problem with x_0 sampled from N(0, I_d/(10000d)). Figure 4 presents the learning curves of 5 different algorithms. In both problems, all other algorithms escape saddle points faster than GD, with PGDOT and PAGDOT outperforming their counterparts.

Example 4 (Dauphin et al., 2014a) observed that in training simple MLPs (MLPs with only one hidden layer) on the MNIST and CIFAR-10 datasets, SGD might get stuck at saddle points. Moreover, as demonstrated in (Swirszcz et al., 2016), Adam also gets stuck and performs poorly when a simple MLP with a specific initialization is trained on the MNIST dataset.
Inspired by (Swirszcz et al., 2016), we conduct two sets of experiments on the MNIST and CIFAR-10 datasets, in which we train several simple MLPs using the mini-batch version of our proposed algorithms as well as the mini-batch version of other popular alternatives such as SGD, Adam, AMSGrad, and RMSProp. For both datasets, the batch size is set to 128, and all the images are downsized to 10 × 10.
In the first set of experiments, we train several simple MLPs whose weights and biases are initialized with N(0, 0.01) on the aforementioned datasets. The top two rows in Figure 5 show the training curves of SGD, Adam, PGD, PGDOT, PAGD, and PAGDOT. Here n_hidden is the number of neurons in the hidden layer of the simple MLP. Observe that all algorithms manage to escape saddle points in all cases, with the exception of Adam, which fails on the CIFAR-10 dataset.
For the second set of experiments, we consider several simple MLPs whose weights and biases are initialized with N(−1, 0.01). The bottom two rows in Figure 5 show the training curves of the different algorithms. For both datasets, while SGD and Adam get stuck at the saddle points, the new algorithms PGDOT and PAGDOT escape the saddle points and significantly outperform their counterparts. Note that for the CIFAR-10 dataset, the training curves of SGD, Adam, PGD, and PAGD are almost identical.
It is also worth mentioning that other variants of Adam, such as AMSGrad (Reddi et al., 2018b) and RMSProp, also fail to escape the saddle points during training (see Figure 6 in Appendix F). Moreover, comparing these results with those of the first set of experiments (Figure 5, top two rows), we conclude that the new perturbation mechanism helps the algorithms to be robust against different initializations.

Conclusion
In this paper, we develop a new perturbation mechanism in which the perturbations are adapted to the history of states via the notion of occupation time. This mechanism is integrated into the framework of PGD and PAGD, resulting in two new algorithms: PGDOT and PAGDOT. We prove that PGDOT and PAGDOT converge rapidly to second-order stationary points, a result corroborated by empirical studies ranging from time series analysis and the phase retrieval problem to neural networks.

B. Results of GD in Convex Optimization

A twice differentiable function f : R^d → R is α-strongly convex if λ_min(∇²f(x)) ≥ α for all x ∈ R^d.
The gradient Lipschitz condition controls the amount of decrease in each iteration, and the strong convexity condition guarantees that the unique stationary point is the global minimum. The ratio ℓ/α is often called the condition number of the function f. The following theorem shows the linear convergence of gradient descent to the global minimum x*; see (Bubeck, 2015)[Theorem 3.10] and (Nesterov, 2004)[Theorem 2.1.15].

Theorem B.2. (Bubeck, 2015; Nesterov, 2004) Assume that f : R^d → R is ℓ-gradient Lipschitz and α-strongly convex. For any ε > 0, if we run gradient descent with step size η = ℓ^{−1}, then the number of iterations to be ε-close to x* is (2ℓ/α) log(||x_0 − x*||/ε).
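Theorem B.2 can be checked numerically on a small quadratic; the matrix and starting point below are illustrative choices:

```python
import numpy as np

# On an l-gradient-Lipschitz, alpha-strongly convex quadratic
# f(x) = (1/2) x^T A x, GD with step size 1/l converges linearly.
A = np.diag([10.0, 1.0])          # Hessian eigenvalues: l = 10, alpha = 1
x_star = np.array([0.0, 0.0])     # unique global minimum
x = np.array([1.0, 1.0])
eta = 1.0 / 10.0                  # eta = 1/l
dists = []
for _ in range(50):
    x = x - eta * (A @ x)         # gradient of f is A x
    dists.append(np.linalg.norm(x - x_star))
# the error shrinks by a constant factor (here 1 - alpha/l = 0.9) per step
```

The distances form a geometric sequence, matching the log(1/ε) iteration count in the theorem with condition number ℓ/α = 10.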

C. Proof of Theorem 3.1
Suppose toward a contradiction that, with positive probability, the walk localizes at some finite set of points {k, ..., ℓ}. We focus on the left end k. Let τ_n^k be the time at which the point k is visited for the n-th time. For n sufficiently large, the point k + 1 has been visited at least roughly n times by τ_n^k, while k − 1 has never been visited. So at time τ_n^k, the walk moves from k to k + 1 with probability bounded from above by C/w(n) for some constant C > 0. Consequently, the probability that the walk localizes at {k, ..., ℓ} is at most ∏_{n ≥ n_0} C/w(n) = 0, since w is increasing. This leads to the desired result.
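The vanishing product at the end of the proof can be checked numerically (the constant C below is an arbitrary illustrative choice):

```python
# Sanity check of the product bound: since w is increasing, the partial
# products of min(1, C / w(n)) collapse to 0, so localization has
# probability 0.  C = 5 and a linear weight are illustrative choices.
C = 5.0
w = lambda n: 1.0 + n             # any increasing unbounded weight works here
prod = 1.0
for n in range(1, 10000):
    prod *= min(1.0, C / w(n))
print(prod)  # essentially 0
```

Once n exceeds C, every factor is strictly below 1 and shrinking, so the partial products decay faster than geometrically.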

D. Proof of Theorem 3.2
We show how the thin-pancake property of saddle points is used to prove Theorem 3.2. Recall that an ε-second-order stationary point is a point with a small gradient whose Hessian has no large negative eigenvalue. Let us restate the basic idea of Section 2.2 with the parameters of Algorithm 3 (PGDOT). If we are currently at an iterate x_t which is not an ε-second-order stationary point, there are two cases: (1) the gradient is large, ||∇f(x_t)|| > g_thres; (2) the gradient is small but λ_min(∇²f(x_t)) ≤ −√(ρε). Case (1) is easy to deal with by the following elementary lemma.
Lemma D.1. Assume that f : R^d → R is ℓ-gradient Lipschitz. Then for GD with step size η < ℓ^{−1}, we have f(x_{t+1}) − f(x_t) ≤ −(η/2)||∇f(x_t)||².

Case (2) is more subtle, and the following lemma gives the decay of the function value after a random perturbation as described in Algorithm 3 (PGDOT).
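Lemma D.1 follows from the standard descent computation based on the ℓ-gradient-Lipschitz property; the derivation is sketched here for completeness:

```latex
% one GD step x_{t+1} = x_t - \eta \nabla f(x_t) under \ell-gradient Lipschitzness
f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t \rangle
              + \tfrac{\ell}{2}\,\|x_{t+1} - x_t\|^2
            = f(x_t) - \eta\Big(1 - \tfrac{\ell\eta}{2}\Big)\|\nabla f(x_t)\|^2
            \le f(x_t) - \tfrac{\eta}{2}\,\|\nabla f(x_t)\|^2,
\qquad \text{since } \eta < \ell^{-1}.
```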
Lemma D.2. Assume that f : R^d → R is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz. If ||∇f(x_t)|| ≤ g_thres and λ_min(∇²f(x_t)) ≤ −√(ρε), then adding one perturbation step as in Algorithm 3, followed by t_thres steps of GD with step size η, we have f(x_{t+t_thres}) − f(x_t) ≤ −f_thres with probability at least 1 − (ℓ√d/√(ρε)) e^{−χ}.

(Jin et al., 2017) proved Lemma D.2 for PGD and used it together with Lemma D.1 to prove Theorem 2.5. We use the same argument, with Lemmas D.1 and D.2, to obtain Theorem 3.2 for PGDOT. Now, let us explain how to prove Lemma D.2 via a purely geometric property of saddle points. Consider a point x̃ satisfying ||∇f(x̃)|| ≤ g_thres and λ_min(∇²f(x̃)) ≤ −√(ρε). After adding the perturbation in Algorithm 3, the resulting vector can be viewed as a distribution over the cube C_d(x̃, r/√d). As in (Jin et al., 2017), we call C_d(x̃, r/√d) the perturbation cube, which is divided into two regions: (1) the escape region X_escape, consisting of all points x ∈ C_d(x̃, r/√d) whose function value decreases by at least f_thres after t_thres steps; (2) the stuck region X_stuck, the complement of X_escape in C_d(x̃, r/√d). The key idea is that the stuck region X_stuck looks like a non-flat thin pancake, which has very small volume compared to that of C_d(x̃, r/√d). This claim is formalized by the following lemma, which is a direct corollary of (Jin et al., 2017)

Figure 2. Graph of f (top left) and the landscape of f(x) (top right) with x ∈ R² in the case of N = 4, L = 1, and the performance of different algorithms (bottom) when N = 4, L = 1, d = 4 in Example 1.
where {s_i}_{i=1}^N are N sample points, y*(s) is the target function, and ŷ(s) is the function to fit, of the form ŷ(s; x) = Σ_{m=1}^M (a_m cos(λ_m s) + b_m sin(λ_m s)) e^{w_m s}. Here x = {a_m, b_m, λ_m, w_m}_{m=1}^M and the optimization problem is non-convex. We assume y*(s) = Ai(ω[s − s_0]), where ω = 3.2, s_0 = 3.0, and Ai(s) is the Airy function of the first kind, given by the improper integral Ai(s) = (1/π) ∫_0^∞ cos(u³/3 + su) du.

Figure 3. The target function y*(t) and the fitted function ŷ(t) obtained by PGDOT (left), and the performance of different algorithms (right) in Example 2.

Figure 4. The performance of different algorithms for f_1 (left) and f_2 (right) in Example 3.
[Lemma 11], as C_d(x̃, r/√d) ⊆ B_d(x̃, r):

Lemma D.3. Assume that x̃ satisfies ||∇f(x̃)|| ≤ g_thres and λ_min(∇²f(x̃)) ≤ −√(ρε). Let e_1 be the eigendirection of the smallest eigenvalue of ∇²f(x̃). For any δ < 1/3 and any u, v ∈ C_d(x̃, r/√d), if u − v = µre_1 and µ ≥ δ/(2√d), then at least one of u and v is not in the stuck region X_stuck.