Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem

In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss-landscapes and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototypical highly non-convex example. We observe that for phase retrieval the stochastic variants of gradient descent are able to reach perfect generalization for regions of control parameters where the gradient descent algorithm is not. We apply dynamical mean-field theory from statistical physics to characterize analytically the full trajectories of these algorithms in their continuous-time limit, with a warm start, and for large system sizes. We further unveil several intriguing properties of the landscape and the algorithms such as that the gradient descent can obtain better generalization properties from less informed initializations.


Introduction
Algorithms based on gradient-descent (GD) are the workhorses of many machine learning applications involving the optimization of a high-dimensional non-convex loss function. In particular, stochastic gradient descent (SGD) has proved to be extremely efficient in navigating complex loss landscapes. However, despite its practical success, the theoretical understanding of the reasons behind the good generalization properties of the algorithm remains sparse. Empirical evidence suggests that the interplay between the optimization algorithm and the landscape is crucial to achieve good performances. It has been shown, for instance, that the loss landscape of state-of-the-art deep neural networks is far from simple: adversarial initialization can trap SGD into global minima with poor generalization [1]. Therefore, understanding the dynamics of SGD is paramount in machine learning and optimization.
Investigating the dynamics of stochastic gradient descent and the role of the stochasticity is consequently an active direction of research. While the practical success of SGD compared to GD is rather generally accepted, it is still far from clear what is really the key factor responsible for this. Cases where the superiority of SGD with respect to GD was shown theoretically are sparse, but see e.g. [6,7]. One hurdle that appears in theoretical analysis is how to properly define the continuous limit of SGD. In the limit of learning rate going to zero, SGD is considered to lead to the gradient-flow limit, see e.g. [8], thus the difference with gradient flow disappears. For learning rate kept finite, a line of works characterizes SGD as a discretization of a continuous-time Langevin-type process [9,10,11,8]. The dependence of the variance of the noise on the current-weights and time is, however, not given in a closed form in these works and thus difficult to analyze explicitly. Another line of work challenged the central-limit-theorem assumption of finite noise-variance behind these works by proposing the stochastic noise is heavy-tail distributed [13]. In our work, we instead consider a variant of the stochastic gradient descent called persistent-SGD, as recently defined in [14]. Persistent-SGD has a well define flow limit η → 0 and our analysis thus does not require other assumptions about the nature of stochastic noise.
Learning theory and computer science usually proceed in a manner that makes minimalistic assumptions on the data ditribution. Statistical physics usually takes a complementary way of understanding well prototypical settings that capture the essesce of the question. This is the path we take in this paper and compare the behaviour of gradient descent-based algorithms on a prototypical choice of data and learning model leading to high-dimensional and non-convex landscape. Specifically, we consider the problem of phase retrieval where the task consists in recovering an unknown signal from a set of observationsthe absolute value of the signal's projections onto measurement vectors. This problem appears in a series of applications, including optics [2,3], acoustics [4], and quantum mechanics [5]. We will consider the problem where the measurements are i.i.d. Gaussian vectors and the number of measurements M is only constant times the dimensionality of the signal N, α = M/N. We consider the high-dimensional limit where both the number of training samples and the input dimensions go to infinity, at ratio α of order one, typically 2 − 3. In this work we view the phase retrieval as a prototypical example of a simple single-layer neural network where the measurement vectors correspond to the input samples, and the signal corresponds to the teacher-network weights. The measurements then correspond to the output labels. In the spirit of learning with neural networks we are interested in the corresponding generalization error, i.e. the ability to predict labels for previously unseen samples. We stress that it is not the goal of this work to provide a competitive algorithm for the phase retrieval. In the setting considered in this paper (i.e. iid Gaussian inputs and teacher produced labels) it was conjectured that the approximate message passing algorithm cannot be beaten in the large size limit [16]. Instead, the main goal of this paper is to study the performance of gradient-based algorithms and the loss landscape of the phase retrieval problem serves us as a high-dimensional intrinsically non-convex prototype having multiple spurious minima and only one solution (with a Z 2 symmetry) leading to perfect generalization error.
We note that the landscape of phase retrieval problem is somewhat different than the one of deep neural networks, that are highly overparametrized and present entire regions of solutions with zero training error and a good generalization. Consequences of this difference and thus relevance of the present work for learning with state-of-the art neural networks is left for future work. Instead the present work investigates the performance of gradient-based algorithms in an archetypal non-convex high-dimensional setting providing a benchmark to assess the role played by stochasticity in non-convex optimization problems in general.
Our main contributions can be summarized as follows: • We perform a series of numerical simulations in order to assess the generalization performance of the gradient descent, multi-pass stochastic gradient descent, its persistent-SGD version, and the Langevin algorithm as a function of the control parameters (minibatch size, persistence time, temperature). Our experimental findings reveal that in the considered problem stochasticity is beneficial for generalization. We also shed light on the qualitative difference between the sources of noise in the algorithms.
• We investigate the role of the warm start and we find that gradient descent can be trapped very close to the signal, while perfect recovery can be reached starting from less informed initializations.
• We then apply dynamical mean-field theory (DMFT) from statistical physics to provide an analytic characterization of the full trajectory of the continuous limit of the GD, persistent-SGD and Langevin algorithms in the high-dimensional limit where the number of samples and dimension are both large, but their ratio α = O(1), at times linear in the dimension. We use the theoretical curve as a baseline to show that the observed behavior is not due to finite-size or finite-learning-rate effects.
Further related works -In this paper, we consider the phase retrieval problems with Gaussian measurements and signal in the high-dimensional limit. The loss landscape complexity of this problem was studied using the Kac-Rice method in [15], however, bringing this analysis to concrete results seemed to be technically challenging. Signal recovery in this problem was studied from the information-theoretic point of view and using approximate message passing (AMP) algorithms that are considered optimal among all polynomial algorithms for this case [16,17,18]. In particular it is known that while information-theoretically zero generalization error can be reached for α > 1, the AMP algorithm is able to do so for α > 1.13. Performance of gradient descent for phase retrieval is worse than the one of AMP in terms of sample complexity and also harder to analyze. In practice, one often uses gradient descent initialized spectrally [19], i.e. in the eigenvector corresponding to the leading eigenvalue of a suitable matrix constructed from the labels and the measurement vectors [20]. Such spectral initialization is also motivating our use of warm start that is mimicking it. Concerning randomly initialized gradient descent, [21] showed that gradient descent needs a training set of size ∼ O(Npoly(log N)) to retrieve the hidden signal. Several other works in computer science consider gradient descepnt-type algorithms for phase retrieval requiring O(Npoly(log N)) samples [25]. The analysis carried out in [22] suggests that the randomly-initialized algorithm can achieve perfect generalization with much lower linear sample complexity. Authors of [23] then show that linear (with unspecified large constant) sample complexity is achievable with randomly initialized gradient descent for a suitably chosen loss function. Finally [24] have shown that over-parametrization can bring the sample complexity of randomly initialized gradient descent down to α = 2. While in the present work we will not be considering overparametrization, we are interested in performance of gradient-based algorithms for similarly small sample complexity α. We will be investigating several gradient-based algorithms and judge their performance by the number of samples they require for recovery of the signal. The fewer samples the better. This is why we focus on the regime of α = O(1).
The online SGD for phase retrieval has been studied, e.g., in [26]. Results for (multi-pass) stochastic gradient-descent in phase retrieval are not known up to our best knowledge. A theoretical understanding of the performance of (multi-pass) SGD at small sample complexity requires taking into account the full trajectory of the algorithm which is challenging and done in the present paper.

The model and the data
We study the supervised learning problem of recovering an N−dimensional real-valued vector We consider the signal w (0) to be extracted with the uniform measure on the N-dimensional hyper sphere |w (0) | 2 = N. We take the components of the vectors ξ µ to be i.i.d. Gaussian random variables with zero mean and unit variance. The non-linear measures of the signal vector w (0) are encoded in the labels We note that in applications the complex-valued phase retrieval is more relevant, yet for the purpose of the present paper, which is studying the performance of the gradient-based algorithms, the real-valued version is sufficiently rich. We consider learning with a single-layer neural network by the minimization of the empirical risk where v is a cost function having a global minimum at h µ = h (0) µ and we have defined In what follows we consider a loss of the form: However, the analytic derivation of the DMFT can be carried out for every twice-differentiable function v. Note that the empirical risk depends on the labels y µ only through h (0) µ . We consider a particular regularization of the weights where the training dynamics of w(t) is constrained on the hyper-sphere. In Appendix C, we show that our results hold in a qualitative same manner for the more standard ridge regularization.

The analyzed algorithms
In this section we define the gradient-descent-based algorithms under consideration and their continuous-time limit that will then be studied using dynamical mean-field theory. The discrete dynamics of full-batch gradient descent is given by the weights update: for all i = 1, ...N, where η > 0 is the discrete time step and ∂ 1 v(h; h 0 ) indicates the derivative of the loss function with respect to its first argument. We have introduced a Lagrange multiplier ν(t) to enforce the spherical constraint on the weights at each time step. This is equivalent to a projection on the sphere at each iteration, which is how we implement the numerical simulations. In the following, we analyze different ways to add stochasticity to the dynamics.
Multi-pass stochastic gradient descent dynamics -We study multi-pass stochastic gradient descent, where the samples are reused multiple times during training. The mini-batches are sampled with replacement with size B = bM, b ∈ (0, 1] at each time step. If we introduce a set of binary variables s µ (t) ∈ {0, 1}, µ = 1, ..., M to select which samples are used compute the gradient, then in the large N limit the vanilla-SGD algorithm described above is equivalent to draw independently at each time step. However, the continuous-time limit η → 0 different from the gradient-flow is not well-defined in this case. As done in [14], in order to consider the continuous-time dynamics, we define a discrete-time process for the variables s µ (t) that has a well-defined continuous-time limit. We divide the time interval in finite steps of size η and we define the following persistent version of the stochastic gradient descent algorithm: In this case, each pattern stays out of the training mini-batch for a typical time τ, that we will refer to as the persistence time. The stochastic gradient flow (SGF) dynamics is obtained by taking the η → 0 limit of Eq. (7) for all i = 1, ...N, where we have rescaled the gradient by the fraction of samples in the minibatch. Note that, in this setting, there are two parameters controlling the stochasticity of the algorithm: the mini-batch size b and the persistence time τ. The standard SGD algorithm is recovered from (7) by setting τ = η/b and finite learning rate η. In Appendix D, we show by numerical simulations with decreasing learning rate that the η → 0 limit of the persistent SGD algorithm is different than gradient descent, while this is not the case for standard SGD.
Langevin dynamics -A different kind of stochastic dynamics is provided by the Langevin algorithm at temperature T , whose flow limit is defined by the following system of stochastic differential equations: for all i = 1, ...N. The random vector ς (t) is Gaussian white noise: Note that by setting b = 1 in Eq. (8) or T = 0 in Eq. (9) we recover the full-batch gradient-flow algorithm.
Warm initialization -In order to explore the energy landscape more thoroughly we consider here, next to the usual random initialization, informed/warm initializations. We initialize the weight vector as follows: where m 0 > 0 is (on average) the initial projection of the weight vector onto the signal, i.e., the average magnetization at time t = 0. The components of z are i.i.d. standard Gaussian variables and the coefficient c is such that |w(t = 0)| 2 = N. Note that the warm initialization breaks the Z 2 symmetry of the problem. Therefore, in the following m(t) ∈ (0, 1], ∀t. We stress here that while in learning we are usually concerned with the test error/performance, in the setting considered here (under the spherical constraint) the test error is monotonic in the magnetization, see Appendix B. Thus, in the following we directly use the magnetization as a measure of accuracy.
This warm initialization can be thought of as a proxy for algorithms where gradient-descent (or its variants) is run after the weights have been spectrally initialized, i.e. using the principal eigenvalue of a given pre-processing matrix as initial guess for the weights. Spectrally initialized GD is used in a range of applications, see e.g. [19], as well as studied theoretically, see e.g. [27]. Warm initialization is formally needed to obtain non-trivial results for times linear in the dimension. Indeed, because of the Z 2 symmetry we would need time at least logarithmic in dimension in order to escape the space of weights uncorrelated with the solution. This was referred to as the "escape from mediocrity" in [28].

Characterization of the dynamics
In this section we provide a closed-form characterization of the flow dynamics of the persistent-SGD and Langevin algorithms presented above in the high-dimensional limit. To this end, we apply dynamical mean-field theory from statistical physics [29,30,31]. This analytic framework is useful to study the stochastic evolution of large systems of interacting degrees of freedom [32,33,34]. DMFT has been rigorously proven in some specific cases [35], but not yet in the present one. Here we present the main analytic results, more details are provided in Appendix A. The derivation follows the line of [36,14], for a different data structure and loss function. We also need to include the spherical constraint and Langevin noise in the dynamics. We consider the high-dimensional limit N → ∞, at fixed sample complexity α = M/N, mini-batch fraction b, persistent time τ and temperature T . For simplicity, we regroup the flow dynamics of multi-pass SGD (8) and Langevin (9) in the same equation: The performance of the algorithms as a function of training time is encoded in the magnetization m(t) defined in Eq. (12), that is equal to 1 for perfect recovery of the signal. In the highdimensional limit, we obtain that the evolution of the magnetization is described by the following deterministic differential equation: whereν The brackets · stand for the average over different sources of noise: • the binary variable s(t), distributed as in Eq. (7) for η → 0; • the standard Gaussian variable h 0 ∼ N (0, 1); • the effective stochastic process for the typical gap h(t) defined in Eq. (3).

The evolution of h(t) is given by
with initial condition P(h(0)) = 1 The dynamical noise χ(t) in Eq. (16) is Gaussian distributed, with The expressions for the kernels M C (t,t ) and M R (t,t ) and the auxiliary function δ ν(t) are given by where P(t ) is a linear perturbation applied to h at time t and then set to zero, that we only need to define M R (t,t ). Overall, we obtain that the performance of the algorithms over time is described by a system of integro-differential equations that must be solved numerically in a self-consistent way. As done in [14], we start from a simple guess of the kernels and the auxiliary functions and we compute many realizations of the curve h(t) in Eq. (16). We use these curves to update the kernels and we iterate the procedure until convergence. The details are relegated to Appendix A.1. A first implementation of this procedure was proposed in [37,38]. Once the stochastic process defined in Eq. (16) has reached convergence, we can compute other quantities of interest. For instance, we can track the evolution of the average training loss defined in Eq. (2) in the high-dimensional limit: These equations provide a dimension-independent way to track the performance of the algorithm in the limit of high-dimensions and infinitesimal learning rate as a function of time. Indeed, since the solution of the problem is planted and the measurements are noiseless, in this case zero training loss corresponds to zero generalization error. Note that this formalism allows to study the dynamics of the corresponding algorithms without any approximation on the distribution of the noise introduced by stochasticity. This is at variance with the works that consider SGD as a noisy approximation of GD invoking variants of central limit theorem [9,10,11,8,13].

Results for the dynamics
In this section, we discuss our findings on the dynamics of the gradient-based algorithms under consideration. We compare the results from simulations to the DMFT theoretical prediction. The DMFT is valid in the continuous flow limit and in the high-dimensional limit. The simulations are performed for sizes large enough and learning rate small enough so that this limit is closely approached. This analysis sheds light on how stochasticity helps to navigate the loss landscape and on the impact of the different control parameters, notably the batch size b, temperature T , and persistence time τ, on the test performance.
The trapping landscape - Fig. 1 illustrates the performance of gradient descent starting from three increasing initializations: m 0 = 0.5 (left), m 0 = 0.65 (center), and m 0 = 0.8 (right) at α = 2, i.e. number of samples twice the dimension. In the three lower panels, we plot the magnetization for different seeds -corresponding to different realizations of the noise vector z defined in Eq. (11) -with a dataset, i.e. the inputs and labels, drawn at random and fixed. The evolution of different instances from simulations is thus probing the very same loss-landscape, the figure then highlights the complexity of the landscape. First, we observe that a warm start is not enough to reach perfect recovery. This suggests that the landscape is very rough, with multiple local minima at all heights. Indeed, we see that gradient descent can get stuck even very close to the global minimum at m = 1. From the right panel of the figure, we see that at time t ∼ 10 all seeds initialized with magnetization m 0 = 0.8 have achieved perfect recovery m = 1. However, the left and center panels show that some seeds starting at m 0 < 0.8 and reaching m = 0.8 only at t > 0 can get stuck for long times. Hence we deduce that the topological complexity of the landscape is such that some regions of the weights space can be trapping even if they are closer to the signal than others that do not trap the dynamics. We observe that a more informed initialization does not guarantee a better generalization. This can be further seen comparing the left panel to the central one. Indeed, we find that some seeds initialized at m 0 > 0.6 are stuck at m < 1 at time t ∼ 10, while some seeds starting at m 0 < 0.6 have already reached perfect generalization. Consequently, in this regime of parameters, the full trajectory of the algorithm is crucial to achieve perfect recovery.
In the upper panels of Fig. 1, we compare the average magnetization from numerical simulations at finite system size and finite learning rate (grey dots) to the theoretical prediction (red line) obtained by integrating the DMFT equations derived in the high-dimensional continuous limit. In this case, we generate a new dataset for each simulations in order to remove sampleto-sample fluctuations. We find a very good agreement between asymptotic theory and the   Fig. 2. For each seed, a new instance is generated. For visibility purposes, we plot t + η on the x-axes.
average from simulations already for the used system sizes and learning rates, indicating that the observed behavior is not a feature of finite size or finite learning rate effects. Additional simulations supporting this evidence are left to the Appendix D.
Multi-pass SGD outperforms GD - Fig. 2 shows the average magnetization -defined in Eq. (12) -and the average training loss -defined in Eq. (2) -as a function of time for full-batch gradient descent, multi-pass SGD and its persistent version. In the case of multi-pass SGD, we sample (with replacement) minibatches of size bM at each time step. In Fig. 3, we depict different instances of the dynamics, corresponding to different realizations of the dataset and the noise vector z (Eq. (11)). We find that SGD and persistent-SGD with τ = 1 outperform GD in recovering the hidden signal. Indeed, at time scales at which persistent-SGD has already reached magnetization one and zero loss, gradient descent is stuck in regions of poorer generalization. The average magnetization of SGD lies between the two. Therefore, a finite batch size is beneficial for the performance. Furthermore, the behavior of the curves for different seeds unveils an important role played by the persistence time. Indeed, while the evolution of the magnetization for GD is characterized by long plateaus alternated by sudden jumps, persistent-SGD is not stuck in the same region for long times. Again, the behavior of SGD is intermediate between the two: we see from the central panel of Fig. 3 that the disappearance of the plateaus is a feature of a finite persistence time. These findings suggest that the interplay of the finite batch size and the persistence time is crucial to achieve the optimal performance. Additional simulations supporting this numerical evidence are provided in Appendix D.
The role of the noise - Fig. 4 illustrates the effect of different sources of stochasticity on the generalization performance. In particular, we compare the role played by the white noise at temperature T in the Langevin algorithm to the double source of noise in the SGD algorithm: the finite batch size b and the persistence time τ. In the left panel, we depict the dependence of the SGD algorithm on the batch size, at fixed persistence time. We find that the generalization performance is non-monotonic in the batch size and the optimal value is attained at intermediate b. Therefore, at variance with what observed in deep neural networks trained on real datasets [10,39], in our case we obtain that the optimal batch size is an extensive fraction of the total number of samples. The central panel displays the (median) performance of SGD for different values of the persistence time τ, at fixed batch size. For times t ≤ τ, the samples used to compute the gradient (on average) do not change, and thus the dynamics presents plateaus. However, as soon as t > τ, the mini-batch is refreshed. This results in a sudden increase in performance at times t ∼ τ, that becomes more visible the larger τ. Moreover, we observe a non-monotonic behavior of the performance as a function of τ. On the one hand, increasing τ shifts the final plateau at larger times, delaying the recovery of the signal.
On the other hand, if the persistence time is too small, the dynamics gets trapped close to the signal, displaying plateaus followed by sudden jumps similarly as for GD (see Fig. 3).
There is therefore an intermediate range of persistence times τ for which the performance is the best (better than vanilla SGD). Since the literature often compares the SGD noise to the Langevin noise [8,9,10,11,12] we compare here to the performance achieved by the Langevin algorithm at fixed temperature. The right panel of Fig. 4 depicts the performance of the Langevin algorithm for different values of temperature T . At large times (t = 700 in the figure) the temperature is switched to zero. We find that the best performance is again reached for intermediate values of the temperature T . We underline the qualitative difference between the effective noise introduced by multi-pass SGD and the white noise of Langevin algorithm. The variance of the noise in Langevin is fixed by the temperature, therefore -in order to reach a minimum -an annealing protocol must be implemented and optimized. In contrast, the noise introduced by SGD is automatically reduced during training and it's zero at the global minimum. Therefore, multi-pass SGD has a built-in self annealing protocol, that can be optimized by tuning only two parameters (b and τ) instead of the whole trajectory of the temperature over time.
The analytic characterization Fig. 5 shows the comparison between the average performance of persistent-SGD obtained from numerical simulations (grey symbols) with the prediction derived by integrating the DMFT equations (red line). The left panel depicts the average magnetization, while the right panel displays the average training loss as a function of time.      Fig. 7 investigates the behavior of full-batch gradient descent (full red lines) and persistent SGD (dashed blue lines) starting from random initialization at fixed α = 2.5. Persistent SGD is run at fixed b = 0.5, τ = 2. We show the median magnetization (main plots) and the median loss (insets) as a function of time for increasing values of the dimension: N = 100 (above-left panel), N = 500 (above-right panel), N = 1000 (below-left panel). and N = 2500 (below-right panel). In this case m 0 = 0 and the warm start in the four panels is only given by finite size effects. We clearly see that, at time scales shown here, gradient descent is stuck at a plateau of height decreasing as the dimension N increases. As studied in [22], the recovery transition of gradient descent starting from random initialization for comparable system sizes happens at α ≈ 6, which is few times larger than the value α = 2.5 considered here. However, we observe that already at α = 2.5 the persistent-SGD algorithm can reach perfect recovery for the system sizes under consideration. The time to reach the solution from random initialization is, as expected, compatible with logarithmic increase in the system size. These observations suggest that the recovery transition for stochastic gradient descent starting from random initialization is shifted to lower values of α when compared to gradient descent. This is an interesting direction for future investigations.

Conclusion
In this paper, we have considered the real-valued phase retrieval problem as a paradigmatic highly non-convex optimization problem to test the generalization performance of full-batch gradient descent and some of its stochastic variants: multi-pass SGD, its persistent version, and the Langevin algorithm. We have shown that stochasticity is crucial to achieve perfect recovery of the hidden signal at low sample complexity so that stochastic gradient descent outperforms gradient descent in this task. We have observed intriguing features of the loss profile and illustrated how various sources of noise allow the dynamics to circumvent the traps in the landscape. We have provided an analytic description of the learning curve in the infinite-dimensional and continuous-time limit via the dynamical mean-field theory, showing that the observed behavior is not due to finite size effects or to a finite learning rate. The present work leads to interesting extensions both on the analytic and numerical sides. On the one hand, the characterization of the dynamical evolution of the algorithms via DMFT can be extended to include realistic initializations (e.g., spectral initialization). On the other hand, it would be interesting to test the persistent variant of multi-pass SGD and investigate the role of the persistence time on real datasets and architectures, which we leave for future work.

Acknowledgements
We thank Stefano Sarao Mannelli for useful discussions. This work was supported by ERC under the European Union's Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe, and by "Investissements d'Avenir" LabExPALM (ANR-10-LABX-0039-PALM).

Appendix A. Derivation of DMFT equations
In this section, we provide additional details on the derivation and numerical integration of the theoretical equations describing the learning performance via dynamical mean-field theory (DMFT), presented in Sec. 4 of the main text. The computation is on the line of the one presented in [36,14]. Here we consider a different loss function, i.i.d. Gaussian input data and labels generated by a teacher vector. We have to take into account the spherical constraint and the additional white noise of the Langevin algorithm. We use the Martin-Siggia-Rose-Janssen-deDominicis (MSRJD) path-integral formalism. We start by writing the dynamical partition function Since the dynamical partition function is strictly equal to one, we can safely average its expression on the random patterns ξ µ and on the Langevin noise ς . Following [36] we can use a supersymmetric formalism to proceed in a compact way. In this case the dynamical variables w(t) are merged with their auxiliary fieldsŵ(t) in a superfield involving a couple of Grassmann variables θ a ,θ a [40]: In this way the dynamical partition function (averaged over the Langevni noise) can be written as 4) and the kinetic kernel K (a, b) is implicitly defined in such a way that In particular, we have (a, b), The local action S loc is defined as where s(a) = s(t a ), and we have introduced the dynamical order parameters Performing the Gaussian integration over the superfields w(a), we obtain At this point we can evaluate the integral over Q and m through a saddle point computation. The saddle point equations read We can evaluate the functional derivatives (a, b), where we have defined The average in brackets denotes the average with the following measure Along the lines of [36], we can rewrite the average in Eq. (A.16) as an average over h 0 ∼ N (0, 1), the variables s(t) defined in Eq. (7) of the main text, and an effective stochastic process is an effective Gaussian noise: and we have defined the following auxiliary functions: (A. 19) The expression for the Langrange multiplierν(t) is obtained enforcing the spherical constraint ∑ N i=1 dw 2 i /dt = 0 by applying Itô's formula to Eq. (13) of the main text. The kernels M C (t,t ) and M R (t,t ) are obtained expanding M(a, b): The variable P(t ) indicates a linear perturbation applied on the gap variable h at time t and then set to zero. The kernel M R (t,t ) can be also expressed as where T (t,t ) = δ h(t)/δ P(t ) satisfies t ).
Furthermore, from Eq. (A.13) we get the behavior of the magnetization m(t) as a function of time: where m 0 is defined in Eq. (11) of the main text and Moreover, setting m(t) = 0 one gets a set of equations that coincide with [36]. From the solution Q(a, b) of the saddle point Eq. (A.12), we can obtain the equations for the dynamical correlation function C(t,t ) = ∑ i w i (t)w i (t )/N and the response R(t,t ) = ∑ i δ w i (t)/δ H i (t )/N to a linear perturbation of the weights by an infinitesimal local field H i (t). Indeed, we can write the closure relation Now we can express the overlap explicitly in time and Grassman coordinates We can derive two equations from the scalar and Grassman terms (the terms in θ aθa and θ bθb result in the same contribution): • We use the previous guess to generate multiple realizations of the curve h(t).
• We update the kernels and auxiliary functions, computing the averages over h(t), h 0 , s(t).
We introduce a damping in the update to control the oscillations. We integrate Eq. (A.23) to obtain the magnetization m(t).
• We repeat the above procedure until the kernels and auxiliary functions reach a fixed point.
We use Eq. (A.29) in order to compute the Lagrange multiplierν(t) because we find that it is more stable to fluctuations. We integrate Eq. (A.22) to compute the kernel M R . We typically use a discrete time step dt = 10 −3 − 10 −2 .

Appendix B. Generalization error
In this section, we sketch the computation of the average generalization error in the phase retrieval problem under consideration. Given a previously unseen data point ξ new ∼ N (0, I N ), the generalization error can be defined for a generic error function f : R 2 → R, taking as first argument the true label and as second argument the estimated one. The average generalization error is then: is the true label,ŷ new = 1 √ N ξ new · w is the estimated one, and the weight vector w implicitly depends on the training set {ξ µ } M µ=1 as well as on the hidden signal w (0) . We can introduce Dirac's δ −functions to rewrite where we have substituted the δ −functions with their Fourier representation. We first compute the average over the new sample ξ new , that is independent both of w (0) and w: In the following, we denote By integrating over the conjugate variablesx andẑ, we obtain where z ∼ N (0, q) and x ∼ N (my/q, q 0 − m 2 /q). In the infinite dimensional limit, q 0 , q, and m concentrate to their average value, therefore we simply obtain where now the quantities q 0 , q, m are intended in the infinite dimensional limit. This computation shows that the generalization error depends on the signal and the training set only through q 0 , q, m. In particular, in the spherical case q 0 = q = 1 and the performance depends only on m.
Mean Squared Error As a measure of the error, we can consider for instance the commonlyused mean squared error, here defined as: From equation (B.5), we obtain that the mean squared error between the true label of a new sample and its estimate -in the infinite dimensional limit -is which in the spherical case is a monotonically decreasing function of m. Fig. B1 shows the mean squared error for gradient descent, multi-pass SGD and its persistent version in the case of ridge regularization. The dots mark the average from simulations, while the black line displays the prediction obtained from Eq. (B.8), computed in the average values of q 0 , q and m from simulations at dimension N = 1000. We find a very good agreement between theory and simulations.

Appendix C. Ridge regularization
In this section, we consider a variant of the training algorithm presented in Eqs. (5), (8), and (9) of the main text, where instead of projecting the weights on the hyper sphere |w(t)| 2 = N at each iteration, we apply a ridge regularization of strength λ . The parameter λ is fixed during training and can be tuned a posteriori, e.g., by cross-validation. The flow dynamics given by Eq. (13) of the main text is modified as follows: Note that this change simply amounts to substituting the time-dependent Lagrange multiplier ν(t) with the constant λ . All the other variables are defined in the main text and stay the same. We modify accordingly the initial condition (defined by Eq. (11) of the main text) as follows: where m 0 is the average initial magnetization and z ∼ N (0, I N ). This change is reflected in the initial condition for the effective stochastic process in Eq. (A.17), that becomes : P(h(0)) = e −h(0) 2 /2 / √ 2π. We still consider a teacher on the hyper sphere |w (0) | 2 = N.
Appendix C.1. Results In this section, we discuss the results obtained for ridge regularization. The behavior of the gradient-descent-based algorithms is qualitatively the same as what we have observed for the spherical case in the main text: stochasticity is beneficial for generalization also in the case of ridge regularization and without any regularization. Fig. B1 illustrates the performance of gradient descent, multi-pass SGD and persistent SGD for three different values of regularization strength: λ = 0 (left panel), λ = 0.1 (central panel), λ = 1 (right panel), fixing the values of all the other control parameters. The generalization performance is evaluated by measuring the MSE and the average training loss is shown in the inset. We observe that the effect of ridge regularization is different on SGD and persistent-SGD: while the former benefits from a finite regularization λ = 1, the latter generalizes better at low values of λ . Evaluating the optimal regularization is beyond the scope of this work. Furthermore, the left and central panels of Fig. B1 display a peculiar phenomenon of double descent of the generalization error as a function of time that has also been observed in real data [41]. Fig. B2 depicts the MSE as a function of time for the three algorithms under consideration at fixed regularization (λ = 0.5) and for two different dynamics. In particular, we illustrate the effect of rescaling the gradient by the fraction of samples in the mini batch (b) on the dynamics. In the left panel, the gradient is rescaled by b, while in the right panel we do not rescale it. We observe that, while the rescaling is beneficial for persistent-SGD, SGD performs better without it. At variance with the spherical case considered in the main text, in the case of ridge regularization of fixed strength λ rescaling the gradient by b does not result in a simple rescaling of the learning rate. Instead, the regularization is also affected.

Appendix D. Additional figures
In this section we provide additional figures in support to our observations in Sec. 3 and Sec. 5 of the main text. All the figures illustrate the spherical case treated in the main text. Therefore, the generalization performance is entirely captured by the magnetization. Fig. B3 compares the average magnetization (left panel) and loss (right panel) as a function of training time for gradient-descent, SGD and persistent-SGD for decreasing values of the learning rate. We observe that, in the limit of small learning rate, the learning curves of SGD collapse to the ones of gradient descent. On the contrary, the persistent-SGD algorithm has a well-defined continuous time limit that is different than the one of full batch gradient initialization m 0 = 0.2. On the one hand, we observe that increasing the persistence time gradually diminishes the number of seeds that get stuck at intermediate plateau, resulting in an improved generalization performance. On the other hand, until time t ∼ τ the samples in the mini-batch have not been reshuffled yet (on average). Therefore, for large values of τ the plateaus disappear but the magnetization is stuck at the beginning of the training and only at training time t > τ it has a sudden increase.