Landscape and training regimes in deep learning

Deep learning algorithms are responsible for a technological revolution in a variety of tasks including image recognition and Go playing. Yet, why they work is not understood. Ultimately, they manage to classify data lying in high dimension – a feat generically impossible due to the geometry of high-dimensional space and the associated curse of dimensionality. Understanding what kind of structure, symmetry or invariance makes data such as images learnable is a fundamental challenge. Other puzzles include that (i) learning corresponds to minimizing a loss in high dimension, which is in general not convex and could well get stuck in bad minima. (ii) The predictive power of deep learning increases with the number of fitting parameters, even in a regime where data are perfectly fitted. In this manuscript, we review recent results elucidating (i, ii) and the perspective they offer on the (still unexplained) curse of dimensionality paradox. We base our theoretical discussion on the (h, α) plane where h controls the number of parameters and α the scale of the output of the network at initialization, and provide new systematic measures of performance in that plane for two common image classification datasets. We argue that different learning regimes can be organized into a phase diagram. A line of critical points sharply delimits an under-parametrized phase from an over-parametrized one. In over-parametrized nets, learning can operate in two regimes separated by a smooth cross-over. At large initialization, it corresponds to a kernel method, whereas for small initializations features can be learnt, together with invariants in the data. We review the properties of these different phases, of the transition separating them and some open questions. Our treatment emphasizes analogies with physical systems, scaling arguments and the development of numerical observables to quantitatively test these results empirically.
Practical implications are also discussed, including the benefit of averaging nets with distinct initial weights, or the choice of parameters (h, α) optimizing performance. © 2021 The Author


Introduction
One of the prerequisites of human or artificial intelligence is to make sense of data that often lie in large dimension. A classical case is computer vision, where one seeks to classify the content of a picture [1] whose dimension is the number of pixels. In the supervised setting considered in this review, algorithms are trained from known, labeled data. For example, one is given a training set of one million pictures of cats and dogs, and knows which is which. The goal is to build an algorithm that can learn a rule from these examples, and predict if a new picture presents a cat or a dog. After sixty years of rather moderate progress, machine learning is undergoing a revolution. Deep learning algorithms [2], which are inspired by the organization of our visual cortex, are now remarkably successful at a wide range of tasks including speech [3] and text recognition [4], self-driving cars [5] and beating the best humans at Go [6] or video games [7]. Yet, there is very limited understanding as to why these algorithms work. As explained below, there are several reasons why deep learning in particular, and supervised learning in general, should not work. Most prominently, the curse of dimensionality associated with the geometry of space in large dimension prohibits learning in a generic setting. If high-dimensional data can be learned, then they must have a lot of structure, invariances and symmetries. Understanding the nature of this structure and how it can be harnessed by neural nets with suitable architectures is a challenge of our times.
In part 1.1 of this introduction, we review the basic procedure of supervised learning with deep nets, together with the quantification of its success via a learning curve exponent. In part 1.2, we discuss several reasons making this success surprising. These include the curse of dimensionality mentioned above, the putative presence of bad minima in the loss landscape, and the fact that deep learning typically works in an over-parametrized regime where the number of fitting parameters is much larger than the number of data. In part 1.3, we review two recent ideas seeking to address these paradoxes. First, in the limit where the number of network parameters diverges, deep learning converges to two distinct, well-defined algorithms (lazy training and feature learning), depending on the scale of initialization. In each regime, a global minimum of the loss is found. Second, in the context of image classification, it was hypothesized that deep learning efficiently deals with the curse of dimensionality because neural nets learn to become insensitive to smooth deformations of the image.
These two bodies of work raise various questions, detailed below. In a nutshell: (i) Is there a critical number of parameters at which one enters the over-parametrized regime and bad minima in the loss disappear? What is the nature of this transition, and the geometry of the loss landscape in its vicinity? (ii) Why does performance tend to improve asymptotically with the number of parameters, even past this transition? (iii) Once in the over-parametrized regime, where does the cross-over between lazy training and feature learning take place? Which regime performs better, and how does this depend on the data structure and network architecture? (iv) How are simple invariants learnt in the feature learning regime, and how does this affect the learning curve exponent? The goal of this manuscript, laid out in part 1.4 of this introduction, is both to review recent results addressing these questions in the context of classification with deep nets, and to provide new systematic empirical data unifying these results into a phase diagram.

Presentation of deep learning and quantification of its successes
Signal propagation in some architectures. Deep learning is a fitting procedure in which the functional form used to interpolate the data depends on many parameters and can be represented iteratively. State-of-the-art (SOTA) architectures characterizing this functional form can be extremely complex; here we restrict ourselves to essential properties. We denote by f̃_θ(x) the output of a network corresponding to an input x, parametrized by θ. We reserve the notation f_θ(x) for the prediction model; the relation between the two will be defined in Eq. (9). For fully connected nets (FC) - which, from a theoretical point of view, are the most studied architectures - the output function can be written recursively as:

$$a^{(1)}_\beta = \frac{1}{\sqrt{d}} \sum_\alpha W^{(1)}_{\alpha\beta} x_\alpha + B^{(1)}_\beta, \qquad (1)$$

$$a^{(i+1)}_\beta = \frac{1}{\sqrt{h}} \sum_\alpha W^{(i+1)}_{\alpha\beta}\, \sigma\big(a^{(i)}_\alpha\big) + B^{(i+1)}_\beta, \qquad (2)$$

$$\tilde f_\theta(x) = a^{(L+1)}, \qquad (3)$$

where a^{(i)}_α is the preactivation of neuron α, located at depth i, and L is the total number of layers. We consider networks with a fixed number of hidden layers L, each of size h, the network width. In the following, given that the total number of parameters follows N ∼ Lh², we will use both h and N to refer to the network size. In our notation, the set of parameters θ includes both the weights W^{(i)}_{αβ} and the biases B^{(i)}_α. σ(z) is the non-linear activation function, e.g. the ReLU σ(z) = max(z, 0). As is commonly done, we use the ''LeCun initialization'' of the weights, for which they are scaled by a factor 1/√h. It ensures that the limit of infinite width is not trivial, see e.g. discussions in [10]. We represent the network in Fig. 1.

Fig. 1. In red: neural representation of the data at different depths (illustrative). Neurons in the first layers respond to local, simple features, such as edges. Deeper in the network, neurons respond to more and more high-level features such as cats or human faces. Source: Adapted from [8,9].
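As a concrete and deliberately minimal illustration of this recursion, a forward pass with LeCun scaling can be sketched in a few lines of numpy. The function names, the choice of ReLU and the scalar output are ours, not a prescription of the architectures studied here:

```python
import numpy as np

def init_params(d, h, L, rng):
    """i.i.d. N(0, 1) weights and biases; the 1/sqrt(fan-in) scaling is applied in the forward pass."""
    sizes = [d] + [h] * L + [1]
    return [(rng.standard_normal((m, n)), rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Preactivations a^(i) with 'LeCun' scaling; ReLU on hidden layers, linear output."""
    a = x
    for i, (W, B) in enumerate(params):
        pre = a @ W / np.sqrt(W.shape[0]) + B
        a = np.maximum(pre, 0.0) if i < len(params) - 1 else pre
    return a

rng = np.random.default_rng(0)
params = init_params(d=10, h=64, L=2, rng=rng)
x = rng.standard_normal(10)
print(forward(params, x))      # the network output, an array of shape (1,)
```

Scaling by the fan-in of each layer (1/√d for the first layer, 1/√h afterwards) keeps the preactivations of order one as the width grows.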
Convolutional neural nets (CNNs), inspired by the primate brain, perform much better than FCs for a variety of tasks including image classification. Each hidden layer is composed of a number of channels, each representing the data as a feature map - itself an image. CNNs perform the same operations at distinct locations in the image, and enforce that these operations are local. CNNs are thus more constrained than FCs, and they can be obtained from the latter by imposing that some weights are identical (to enforce translational invariance) and others zero (to enforce locality).
Learning dynamics. Weights are learnt by fitting the training set, which is done by minimizing a loss or cost function L. To simplify notation, we consider a binary classification problem, with a set of P distinct training data denoted {(x_μ, y_μ)}_{μ=1}^{P}.
The vector x_μ is the input, which lives in a d-dimensional space, and y_μ = ±1 is its label. The loss L is a sum over the training set of how well each datum is fitted:

$$\mathcal{L} = \sum_{\mu=1}^{P} \ell\big(y_\mu f(x_\mu)\big), \qquad (4)$$

where f is a model and ℓ is a decreasing function of its argument. Popular choices include the linear (γ = 1) or quadratic (γ = 2) hinge loss ℓ(y_μ f(x_μ)) = max(0, Δ_μ)^γ with Δ_μ ≡ 1 − y_μ f(x_μ), or the cross-entropy ℓ(y_μ f(x_μ)) = log(1 + exp(−y_μ f(x_μ))). The hinge loss has the advantage that bringing the loss of Eq. (4) to zero is equivalent to satisfying a set of constraints Δ_μ < 0, corresponding to fitting the data by a margin of unity. This fact is the basis for the analogy with other satisfiability problems encountered in physics, as discussed below. This choice however does not significantly influence performance for SOTA architectures on usual benchmarks [8].
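For concreteness, the quadratic hinge of Eq. (4) can be evaluated in a few lines (a sketch; variable names and numbers are ours):

```python
import numpy as np

def hinge_loss(f_vals, y, gamma=2):
    """Sum over the training set of max(0, Delta_mu)^gamma, with Delta_mu = 1 - y_mu f(x_mu)."""
    delta = 1.0 - y * f_vals
    return np.sum(np.maximum(delta, 0.0) ** gamma)

y = np.array([1.0, -1.0, 1.0])
f_vals = np.array([2.0, -1.5, 0.5])   # first two data fitted with margin > 1
print(hinge_loss(f_vals, y))          # only the third datum contributes: 0.25
```

Note that any datum fitted beyond the unit margin contributes exactly zero, which is what makes the analogy with constraint-satisfaction problems possible.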
Different procedures can then be used to minimize L, starting from a random initialization of the weights (an important fact to keep in mind). In particular, gradient descent (GD) or stochastic gradient descent (SGD) is commonly used. At each step of GD, L is computed and the parameters are updated following the direction of the negative gradient in the loss landscape:

$$\theta(t+1) = \theta(t) - \eta\, \nabla_\theta \mathcal{L}, \qquad (5)$$

where the learning rate η controls the step size. For SGD, instead of computing the loss on the whole training set as defined in Eq. (4), only a random subset of the terms entering the sum is considered at each minimization time step. Various tricks can improve performance depending on the data considered, including early stopping (minimization is stopped before the loss hits bottom) or weight decay (terms are then added to the loss to prevent the weights from becoming too large).
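A minimal sketch of full-batch GD on the quadratic hinge, for a linear model f(x) = θ · x in a toy over-parametrized setting (d > P) where the loss can reach zero; all sizes and rates here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
P, d = 20, 50                         # over-parametrized toy setting: d > P
X = rng.standard_normal((P, d))
y = np.sign(rng.standard_normal(P))   # random binary labels

theta = np.zeros(d)                   # linear model f(x) = theta . x
eta = 0.005                           # learning rate
for _ in range(2000):
    delta = 1.0 - y * (X @ theta)     # margins Delta_mu
    grad = -2.0 * ((delta > 0) * delta * y) @ X   # gradient of sum_mu max(0, Delta_mu)^2
    theta -= eta * grad
loss = np.sum(np.maximum(1.0 - y * (X @ theta), 0.0) ** 2)
print(loss)                           # ~0: all data fitted with unit margin
```

SGD would simply replace the full sum in the gradient by a random mini-batch of its terms at each step.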

Fig. 2.
A: Curse of dimensionality. In high dimensions, every point lies far away from its neighbors, forbidding classification based on distances alone. B: Sketch of the loss landscape. Starting from a random initialization, the dynamics could get stuck in a bad minimum of the loss. C: Usual behavior (blue line) of the test error as a function of the number of parameters, expected to be minimal when N is of the order of the smallest value N* at which data can be fitted. Green: observations indicate that for deep learning, increasing the number of parameters generally improves performance, even in a regime where data are perfectly fitted.
Performance. Once learning is done, generalization performance can be estimated from a test set (data not used to train, but whose distribution is identical to that of the training set) by computing the probability of misclassifying a new datum, the so-called test error ϵ. How many data are needed to learn a task is characterized by the learning curve ϵ(P). Generally, it is observed that the test error is well described by a power-law decay P^{−β} in the range of training set sizes P available, with an exponent β that depends jointly on the data and the algorithm chosen. In [11], β is reported for SOTA architectures for several tasks: in neural machine translation β ≈ 0.3-0.36; in language modeling β ≈ 0.06-0.09; in speech recognition β ≈ 0.3 and in image classification (ImageNet) β ≈ 0.3-0.5.
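Extracting the exponent β from a learning curve amounts to a linear fit in log-log coordinates; a sketch on synthetic data (the numbers are made up for illustration):

```python
import numpy as np

P = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
eps = 0.9 * P ** -0.35                  # synthetic learning curve with beta = 0.35
slope, _ = np.polyfit(np.log(P), np.log(eps), 1)
beta = -slope                           # minus the log-log slope
print(round(beta, 2))                   # 0.35
```

In practice one measures ϵ at several training set sizes P and fits the slope over the available range, since the power law need not hold asymptotically.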
Deep nets learn a hierarchical representation of the data. Once learning has taken place, it is possible to analyze which features in the data neurons mostly respond to [9,12]. Strikingly, the findings are very similar to what is found in the visual cortex of primates: neurons learn to respond to more and more abstract aspects of the data as the signal progresses through the network, as illustrated in Fig. 1. This observation is believed to be an important aspect of why deep learning works, yet understanding how abstract features are learnt dynamically, and how much they contribute to performance, remains a challenge.

Why deep learning should not work
Curse of dimensionality. General arguments suggest that β should be extremely small - and learning thus essentially impossible - when the dimension d of the data is large, which is generally the case in practice. For example, in a regression task, if the only assumption on the target function is that it is Lipschitz continuous, then the test error cannot be guaranteed to decay faster than with an exponent β ∼ 1/d [13]. This curse of dimensionality [14] stems from the geometrical fact that the distance δ among nearest-neighbor data points decays extremely slowly with the number of points P in large d, as δ ∼ P^{−1/d}, as depicted in Fig. 2.A. Thus, interpolation or classification methods based on distances are expected to be very imprecise for generic data.
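The scaling δ ∼ P^{−1/d} is easy to observe numerically. The sketch below (illustrative sizes) compares how much the mean nearest-neighbor distance among uniform points in [0, 1]^d shrinks when P is multiplied by ten: substantially for d = 2, barely for d = 20:

```python
import numpy as np

def mean_nn_distance(P, d, rng):
    """Mean nearest-neighbor distance among P uniform points in [0, 1]^d."""
    x = rng.random((P, d))
    sq = (x ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T   # squared pairwise distances
    np.fill_diagonal(D2, np.inf)                     # exclude self-distances
    return np.sqrt(np.maximum(D2.min(axis=1), 0.0)).mean()

rng = np.random.default_rng(0)
for d in (2, 20):
    ratio = mean_nn_distance(200, d, rng) / mean_nn_distance(2000, d, rng)
    print(d, ratio)   # ~10^{1/d}: a sizable gain for d = 2, marginal for d = 20
```

Tenfold more data shrinks δ by roughly 10^{1/d}, so for d = 20 the neighborhood of a new point remains essentially empty at any realistic P.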
Bad minima in the loss landscape. Another problem, which made deep learning less popular in the 90's, concerns learning.
The loss function is not guaranteed to be convex, and the parameter space is high-dimensional. What then prevents the learning dynamics from getting stuck in poorly performing minima with high loss, as sketched in Fig. 2.B? In other words, under which conditions can one guarantee that training data are well fitted? Is the loss landscape similar to the energy landscape of glassy systems in physics, where the number of minima is exponential in the number of degrees of freedom [15,16]?
Over-parametrization. Finally, neural nets are often trained in the over-parametrized regime, where the number of parameters N is significantly larger than the number of data points P. In that regime, their capacity is very large: they can fit the training set even if labels are randomized [17]. Statistical learning theory then gives no guarantee that over-fitting does not occur, and that the model learnt has any predictive power. Indeed, from statistics textbooks one expects a bell-shaped learning curve relating the test error to the number of parameters of the model, as depicted in Fig. 2.C. In this view, at small N the hypothesis class of the learning algorithm is too small, leading to a bias in the learnt predictor. At large N, it presents a large variance due to over-fitting noise in the data. A very puzzling aspect of deep learning is that increasing N past the point where all data are perfectly fitted does not destroy predictive power, but instead improves it [18][19][20][21], as shown in Fig. 2.

Current insights on these paradoxes
No bad minima in over-parametrized networks. Recently, it was realized that the landscape of deep learning is not glassy after all, if the number of parameters is sufficient. Theoretical works focusing on nets with a huge number of parameters (and one hidden layer), and empirical works with more realistic architectures [22][23][24][25], support that the loss function is characterized by a connected level set: any two points in parameter space with identical loss are connected by a path along which the loss is constant. Moreover, empirical studies of the curvature of the loss landscape (captured by the spectrum of the Hessian) [26][27][28] and of the SGD dynamics [29,30] reveal that the landscape displays a large number of flat directions, even at its bottom. This is not the case for under-parametrized networks [30]. These observations raise key questions. Is there a phase transition in the landscape geometry as the number of parameters grows? If so, what is its universality class? How does it affect performance?

Implicit regularization and infinite width nets. How can over-parametrized networks be predictive, if they can fit any data? It must imply that the GD or SGD dynamics lead to specific minima of the loss landscape where the function is more regular than for generic minima - a phenomenon coined implicit regularization [18,19]. As sketched in Fig. 2, performance is best as N → ∞, triggering a huge interest in that limit. To which algorithm does GD correspond in that case?

NTK:
The propagation of the input signal through infinite-width FC nets at initialization is now well understood. If the weights - defined in Eqs. (1)-(3) - are initialized as i.i.d. random variables with zero mean and unit variance (a set-up we consider throughout this work), the output function f̃(x) is a Gaussian random process. Its covariance can be computed recursively [31][32][33][34][35][36].
Very recently it was realized that the learning dynamics also simplifies in this limit [10][37][38][39][40][41]. The key insight of [10] is that the output then becomes a linear function of the weights, as sketched in Fig. 3A, B. Physically, very tiny changes of the weights can interfere positively and change the output by O(1), which is sufficient to learn, but not sufficient to change the gradient ∇_θ f̃.
More formally, the gradient flow dynamics (continuous-time limit of GD) at finite time can always be described by the neural tangent kernel (NTK), defined as:

$$\Theta(\theta, x, y) = \nabla_\theta \tilde f(x) \cdot \nabla_\theta \tilde f(y), \qquad (6)$$

where x, y are two inputs and ∇_θ is the gradient with respect to the parameters θ. Specifically, ∂f̃(x)/∂t can be expressed as a linear combination of the Θ(θ, x, x_i); i.e. f̃ evolves within a space of dimension P. In general, Θ depends on θ and thus on time and on the randomness of the initialization. Yet as h → ∞, Θ(θ, x, y) converges to a well-defined limit independent of initialization, and does not vary in time [10]: deep learning in that limit is thus essentially a kernel method.⁴

Feature learning (hydrodynamic) regime: The infinite-width limit can be taken differently, by adding a factor 1/√h in the definition of the output. Then the weights must change significantly for the output to become O(1) and fit the data. This recently discovered limit is called ''mean field'' or ''rich'', but also ''feature learning regime'' in the literature, because the neurons learn how to respond to different aspects of the input data, as in Fig. 1 - whereas, in the NTK limit, the neuron's response evolves infinitesimally. This regime has been studied in several works focusing mostly on one-hidden-layer networks [43][44][45][46][47][48][49], with recent developments for deeper nets, see e.g. [50]. In this setting, the output function for one hidden layer reads:

$$f(x) = \frac{1}{h} \sum_{n=1}^{h} W^{(2)}_n\, \sigma\big(W^{(1)}_n \cdot x + B_n\big). \qquad (7)$$

⁴ Kernel methods are a class of machine learning algorithms that employ a function K(x, x′) - the kernel - as a similarity measure between datapoints. Typically, the predictor function can be written in the form f(x) = Σ_{μ=1}^{P} α_μ K(x, x_μ), where the parameters α_μ are learned and the x_μ are the points in the training set. Support-Vector Machines are an example of such algorithms. We suggest [42] as a reference on kernel methods.
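The NTK can be probed numerically by assembling the gradient of the output with respect to all parameters. Below, a sketch for a one-hidden-layer ReLU net in the 1/√h (NTK) parametrization; the architecture, scalings and names are illustrative:

```python
import numpy as np

def grads(x, W1, W2, B):
    """Gradient of f(x) = (1/sqrt(h)) * sum_n W2_n relu(W1_n . x / sqrt(d) + B_n)
    with respect to all parameters, concatenated into one vector."""
    h, d = W1.shape
    a = W1 @ x / np.sqrt(d) + B
    act, der = np.maximum(a, 0.0), (a > 0).astype(float)
    gW1 = (W2 * der)[:, None] * x[None, :] / np.sqrt(h * d)
    gW2 = act / np.sqrt(h)
    gB = W2 * der / np.sqrt(h)
    return np.concatenate([gW1.ravel(), gW2, gB])

def ntk(x, y, params):
    """Theta(theta, x, y) = grad f(x) . grad f(y)."""
    return grads(x, *params) @ grads(y, *params)

rng = np.random.default_rng(0)
d = 5
x, y = rng.standard_normal(d), rng.standard_normal(d)
for h in (100, 10000):                  # kernel fluctuations shrink as the width grows
    params = (rng.standard_normal((h, d)), rng.standard_normal(h), rng.standard_normal(h))
    print(h, ntk(x, y, params))
```

Repeating the computation over several seeds at increasing h shows the initialization-dependent fluctuations of Θ shrinking toward its deterministic infinite-width limit.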
The law of large numbers can then be invoked to replace Eq. (7) by an integral:

$$f(x) = \int \rho\big(W^{(2)}, W^{(1)}, B\big)\, W^{(2)} \sigma\big(W^{(1)} \cdot x + B\big)\, \mathrm{d}W^{(2)}\, \mathrm{d}W^{(1)}\, \mathrm{d}B, \qquad (8)$$

where ρ is the density of parameters. It is then straightforward to show that gradient flow leads to a dynamics on ρ of the usual type for conserved quantities: it is the divergence of a flux, ∂ρ/∂t = −∇ · J. Here the divergence is computed with respect to the set of weights associated with a single neuron, and the flux follows J = ρΨ(W^{(2)}, W^{(1)}, B; ρ_t), where Ψ is some function that can be expressed in terms of the loss [43]. This formulation is equivalent to the hydrodynamic description of interacting particles in some external potential. Fig. 3.C illustrates that when the number of particles (or neurons) is large, ρ is a more appropriate description than keeping track of all the particle positions (the weights in that case).

Lazy training: The fact that deep learning converges to well-defined algorithms as h diverges explains why performance converges to a well-defined value in that limit (which will depend on the limit considered), as sketched in Fig. 2. Yet the distinction between a regime where the NTK does not change, and one where it evolves and features are learnt, can be made at finite h. Chizat and Bach proposed a model of the form [51]:

$$f_\theta(x) = \alpha\left(\tilde f_\theta(x) - \tilde f_{\theta_0}(x)\right), \qquad (9)$$

where θ_0 denotes the parameters at initialization. α = O(1) corresponds to the NTK initialization and α = O(h^{−1/2}) to the mean-field limit. In the over-parametrized regime (our work below defines it sharply) where data are fitted, infinitesimal changes of the weights are sufficient to learn when α → ∞ at finite h, and the NTK does not evolve in time (but has fluctuations at initialization). This regime is sometimes referred to as lazy training; in our context we will interchangeably denote it the NTK regime.
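The rescaled model of Chizat and Bach can be wrapped around any differentiable predictor; a sketch (the tiny tanh net is an arbitrary illustrative choice):

```python
import numpy as np

def model(theta, x):
    """Any smooth parametric model; a tiny tanh net is used for illustration."""
    W1, w2 = theta
    return w2 @ np.tanh(W1 @ x)

def lazy_predictor(theta, theta0, x, alpha):
    """Rescaled predictor f_theta(x) = alpha * (model(theta, x) - model(theta0, x))."""
    return alpha * (model(theta, x) - model(theta0, x))

rng = np.random.default_rng(0)
theta0 = (rng.standard_normal((8, 3)), rng.standard_normal(8))
x = rng.standard_normal(3)
print(lazy_predictor(theta0, theta0, x, alpha=100.0))   # exactly 0.0 at initialization
# with large alpha, a weight change of relative size ~1/alpha already moves the output:
theta = (theta0[0], theta0[1] * 1.01)
print(lazy_predictor(theta, theta0, x, alpha=100.0))
```

Subtracting the output at initialization guarantees f = 0 before training for any α, so the scale of the learnt function is entirely controlled by α times the weight displacement.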
Overall, these findings are recent breakthroughs in our understanding of neural nets. Yet they raise many questions that are central to this review. If two limits exist, which one best characterizes neural networks that are used in practice? Which limit leads to better performance? How does it depend on the architecture and data structure? Which effect causes the improvement of performance as h increases?
Curse and invariance toward diffeomorphisms. Mallat and Bruna [52,53] proposed that invariance toward smooth deformations of the image, i.e. diffeomorphisms, may allow deep nets to beat the curse of dimensionality. Specifically, consider the case where the input vector x is an image. It can be thought of as a function x(s) describing the intensity at position s, where s ∈ [0, 1]².⁵ A diffeomorphism is a bijective deformation that changes the location s of the pixels to s′ = τ(s). In our review below, it will be useful to introduce the pixel displacement field ξ(s) = τ(s) − s, analogous to the displacement field central to elasticity. We denote by T_τ[x] the image deformed by τ, i.e. T_τ[x](s) = x(τ^{−1}(s)). One expects that smooth diffeomorphisms do not affect the label of an image:

$$f\big(T_\tau[x]\big) \approx f(x) \quad \text{for smooth } \tau. \qquad (10)$$

Mallat and Bruna could handcraft CNNs, the ''scattering transform'', that are insensitive to smooth diffeomorphisms and perform well. They hypothesized that during training, CNNs learn to satisfy Eq. (10). In this view, by becoming insensitive to many aspects of the data irrelevant for the task, CNNs effectively reduce the data dimension and make the problem tractable.
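In one dimension, T_τ[x](s) = x(τ^{−1}(s)) can be implemented by interpolation; a sketch for a small smooth displacement field ξ, for which τ^{−1}(s) ≈ s − ξ(s) (all signals and amplitudes are illustrative):

```python
import numpy as np

def deform(x, xi):
    """T_tau[x](s) = x(tau^{-1}(s)) for a 1-d 'image' x sampled on [0, 1],
    with tau(s) = s + xi(s); for small smooth xi, tau^{-1}(s) ~ s - xi(s)."""
    n = len(x)
    s = np.linspace(0.0, 1.0, n)
    return np.interp(s - xi, s, x)    # resample x at the displaced positions

n = 100
s = np.linspace(0.0, 1.0, n)
x = np.sin(2 * np.pi * s)             # the "image"
xi = 0.02 * np.sin(np.pi * s)         # smooth displacement field, zero at the edges
x_def = deform(x, xi)
print(np.abs(x_def - x).max())        # small: a smooth deformation barely alters the image
```

The pointwise change is bounded by max|x′| · max|ξ|, which is why insensitivity to smooth ξ discards many irrelevant directions in input space at little cost to the label.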
A limitation of this framework is that it says little about how invariants are learnt dynamically. Yet, this lies at the center of the problem. FC nets are more expressive than CNNs, and nothing a priori prevents them from learning these invariants - yet, they presumably do not, since their performance does not compare with that of CNNs. Along this route, an interesting question is how to develop observables to characterize empirically the dynamical emergence of invariants. Attempts have been made using mutual information approaches [54], which display problems in such a deterministic setting [55]; or measures based on the effective dimension of the neural representation of the data [56,57], which are informative but can sometimes lead to counter-intuitive results.⁶ In Section 5 we discuss other observables based on kernel PCA of the NTK.

Organization of the manuscript
To think about the results above and the questions they raise in a unified manner, it is useful to consider Fig. 4. It shows novel empirical data for the performance of neural nets in the plane (h, ᾱ), where h is the width and ᾱ the rescaled initialization amplitude ᾱ = √h α. Specifically, the ensemble test error of fully-connected nets trained with GD on MNIST (a data set of handwritten digits [59]) and CIFAR10 (images of planes, dogs, etc. [60]) is shown - samples from these datasets are reported in Fig. 5. This ensemble test error is computed by preparing, for each point in the phase diagram, M = 15 nets with different initializations. The test error of the mean predictor ⟨f⟩ over the 15 trained nets is then shown.
The black region corresponds to the under-parametrized regime where the loss of individual nets does not converge to zero after learning. As discussed below, its boundary corresponds to a line of critical points, a ''jamming transition'' where the system stops displaying a rough landscape. In the colored over-parametrized regime, the black line indicates a cross-over separating the lazy training regime (where the total change of the NTK of individual nets during learning is small in relative terms) from a feature learning regime (where the NTK evolves significantly). Fig. 6 shows the curves for the test error and the ensemble test error for three values of ᾱ, as h varies. At the jamming transition, the test error displays a peak, as observed in [61]. A similar peak occurs in regression problems [21], where it has a simpler origin and occurs when the number of parameters matches the number of data.

Fig. 4 (caption, fragment). … Eq. (9). The architecture corresponds to a fully-connected neural network with two hidden layers trained with gradient flow [58]. Binary classification was performed for (Left) MNIST binary (odd/even) and (Right) CIFAR10 binary (classes have been regrouped in two sets). The training sets have been reduced to 5000 samples. These data sets were first projected on 10 and 30 PCAs, respectively. This dimensionality reduction of the problem is useful to study the jamming transition⁷ but not necessary to investigate the over-parametrized regime [58].

⁵ This space is in fact discrete due to the finite number of pixels, but this does not alter the discussion.

⁶ The effective dimension is defined from how the distance δ among neighboring points varies with their number P, i.e. δ ∼ P^{−1/d_eff}. Its interpretation can be delicate, as it is observed to sometimes increase as the information propagates through the network, which cannot be the case for the true dimension of the manifold representing the data.
This shape of the test error has since been coined ''double descent'' [62]. Interestingly, as shown in Fig. 6, this phenomenon occurs independently of the value of ᾱ (and thus of the over-parametrized regime one enters past jamming) and vanishes when the ensemble average is taken, as observed in [63] and recently confirmed in [64].
Our goal is to review, in non-technical terms, concepts and arguments justifying such a phase diagram, and to discuss the learning of invariants in this context. In Section 2, we will argue that the boundary of the black region in Fig. 4 is a line of critical points, analogous to the jamming transition occurring in repulsive particles with finite-range interactions. For wider nets, the landscape is not glassy and displays many flat directions. We will provide a simple geometric argument justifying why this transition in deep nets falls into the universality class of ellipsoidal (rather than spherical) particles, which fixes the properties of the landscape (such as the spectrum of the Hessian) near the transition. We will mention an argument à la Landau justifying the cusp in the test error at that point. In Section 3, we will provide a quantitative explanation as to why the test error keeps improving as the width increases past this transition. Increasing the width turns out to eliminate the noise due to the random initialization of the weights, eventually leading to the well-defined algorithms introduced above. Ensemble averaging at finite width efficiently eliminates this source of noise, as well as the double descent, as shown in Fig. 6. In Section 4, we will explain why the cross-over between the lazy and feature learning regimes corresponds asymptotically to a flat line in Fig. 4. We will review observations that lazy training outperforms feature learning for standard data sets of images for fully connected architectures, but not for CNN architectures, corresponding to a larger learning curve exponent β. In Section 5, we review arguably the simplest model in which a neural net learns invariants in the data structure, by considering that labels do not depend on some directions in input space. As observed in real data, two distinct learning curve exponents β in the lazy and feature learning regimes can then be computed. We conclude by discussing open questions.

Loss landscape and jamming transition
Minimizing a loss is very similar to minimizing an energy. Describing the energy landscape of physical systems is a much-studied problem, especially in the context of glasses. Progress was made on this topic in the last fifteen years by considering finite-range interactions, for which bringing the energy to zero is equivalent to satisfying a set of constraints. In that case, the landscape is controlled by a critical point [65][66][67], the so-called jamming transition. It occurs as the particle density φ increases. For φ > φ_c, a gradient descent from a random initial position of the particles gets stuck in one of many meta-stable states, corresponding to a glassy solid as depicted in Fig. 7B, D. For lower densities, the gradient descent reaches zero energy, as sketched in Fig. 7.E: particles can freely move without restoring forces, and the landscape has many flat directions. This corresponds to the situations depicted in Fig. 7A, C.
There are two universality classes for jamming, leading to distinct properties for the curvature of the landscape (i.e. the spectrum of the Hessian) [68][69][70][71][72][73], for the structure of the packing obtained [74][75][76][77] and for the dynamical response to a perturbation [78,79]. Spheres and ellipses fall in distinct classes, as illustrated in Fig. 7. More generally, the jamming transition occurs generically in satisfiability problems with continuous degrees of freedom (it can be defined with discrete degrees of freedom [80], but then differs qualitatively; in particular, the present discussion does not apply to the discrete case). It occurs in the perceptron [81][82][83] but also for deep nets [8,61].

Fig. 7 (caption, fragment). …discontinuously to a finite value, which is unity for spheres but smaller for ellipses. (g, h) This difference has dramatic consequences on the energy landscape, in particular on the spectrum of the Hessian. In both cases, the spectrum becomes non-zero at jamming, but it displays a delta function with finite weight for ellipses (indicating strictly flat directions), followed by a gap with no eigenvalues, followed by a continuous spectrum (h, full line). For spheres, there is no delta function nor gap (g, full line). As one enters the jammed phase, in both cases a characteristic scale λ ∼ √U appears in the spectrum (g and h, dotted lines). Source: From [8].

We will recall below a geometric argument introduced in [8] determining the universality class of the jamming transition. For deep nets, jamming belongs to the universality class of ellipses [8,61]. The spectrum of the loss near the jamming transition displays zero modes, a gap and a continuous part, as measured in Fig. 8.C. Another implication of this analogy concerns the number N_Δ of data whose margin is smaller than unity after learning, which contribute to the loss. These data are conceptually similar to the support vectors central to SVM algorithms.
For particles, N_Δ corresponds to the number of pairs of particles still overlapping after energy minimization. As illustrated in Fig. 8.B, N_Δ/N jumps from zero to a value strictly smaller than one at jamming, where the loss becomes positive, precisely as for ellipses.
Geometric argument fixing the universality class. Here we seek to give a simple intuition of the result of [8] (a more rigorous argument can be found there). We consider continuous satisfiability problems where one seeks to minimize an energy or loss function of the type:

$$U = \frac{1}{2} \sum_{\mu \in m} \Delta_\mu^2, \qquad (11)$$

where the sum runs over the set m of constraints that are not satisfied, corresponding to Δ_μ > 0. The quantities Δ_μ can depend in general on the N degrees of freedom of the system. For systems of particles, U is an energy and Δ_μ is the overlap between a pair μ = (i, j) of particles i and j. For spheres it reads Δ_ij = 2R − r_ij, where R is the particle radius and r_ij = ∥r_i − r_j∥ the distance between particles i and j. In that case, N is the number of particles times the spatial dimension.
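This energy can be minimized directly. The sketch below runs gradient descent on a dilute system of soft disks (below the jamming density, with free boundaries), which therefore reaches a zero-energy, unjammed state; sizes and the learning rate are illustrative:

```python
import numpy as np

def energy_and_grad(pos, R):
    """U = 1/2 * sum over overlapping pairs of Delta_ij^2, with Delta_ij = 2R - r_ij."""
    n = len(pos)
    U, grad = 0.0, np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r = np.linalg.norm(rij)
            delta = 2 * R - r
            if delta > 0:                   # only overlapping pairs contribute
                U += 0.5 * delta ** 2
                g = -delta * rij / r        # gradient of the pair energy w.r.t. pos[i]
                grad[i] += g
                grad[j] -= g
    return U, grad

rng = np.random.default_rng(0)
pos = rng.random((10, 2))                   # 10 disks in the unit square, free boundaries
R = 0.05                                    # dilute system: well below jamming
for _ in range(500):
    _, g = energy_and_grad(pos, R)
    pos -= 0.5 * g                          # plain gradient descent on U
U, _ = energy_and_grad(pos, R)
print(U)                                    # a zero-energy (unjammed) state is reached
```

At higher density (larger R), the same descent would instead get stuck at U > 0, with a finite number of residual overlaps playing the role of the N_Δ unsatisfied constraints.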
For the perceptron or deep nets, U corresponds to a quadratic hinge loss, generally denoted L and defined in Eq. (4) (it is straightforward to extend these arguments to other hinge losses with γ > 1 [8,65]; the case γ = 1 of the linear hinge shares many similarities with those, but also displays interesting differences [84]). As defined in the introduction, in that case ∆µ = 1 − yµ f(xµ) is the margin deficit of datum µ. As the jamming transition is a singular point, it is useful - but not strictly necessary 8 - to think of approaching it from the glassy (large density φ, or under-parametrized) phase. It corresponds to U → 0, as sketched in Fig. 7E, F, implying that ∆µ → 0 ∀µ ∈ m. As argued in [85], for each µ ∈ m the constraint ∆µ = 0 defines a manifold of dimension N − 1. Satisfying N∆ such equations thus generically leads to a manifold of solutions of dimension N − N∆. Imposing that solutions exist thus implies that, at jamming, one has:

N∆ ≤ N. (12)

Note that this argument implicitly assumes that the N∆ constraints are independent, see [8] for more discussion. An opposite bound can be obtained by considerations of stability, by imposing that in a stable minimum the Hessian must be positive definite [68]. The Hessian is an N × N matrix which can be written as:

H_U = Σ_{µ∈m} ∇∆µ (∇∆µ)ᵀ + Σ_{µ∈m} ∆µ H∆µ ≡ H0 + Hp, (13)

where H∆µ is the Hessian of ∆µ, and H0 and Hp correspond to the first and the second sum, respectively. H0 is positive semi-definite, since it is the sum of N∆ positive semi-definite matrices of rank unity; thus rk(H0) ≤ N∆, implying that the null-space of H0 is at least of dimension N − N∆. Hp becomes very small approaching jamming, since the ∆µ's vanish. Let us denote by N− the number of negative eigenvalues of Hp. Requiring that H_U has no negative eigenvalues thus implies:

N∆ ≥ N−. (14)

For spheres [66] (as well as for the perceptron, if a negative margin is used while constraining the norm of the weight vector), Hp is negative definite and N− = N. This result stems from the fact that H∆µ is negative semi-definite.
Indeed H∆µ characterizes the second-order change of the overlap between particles, and is negative because the distance between spheres moving transversely to their relative direction always increases, following Pythagoras' theorem. In that case we thus have N∆ ≥ N. Together with Eq. (12), this leads to N∆ = N: as spheres jam, the number of degrees of freedom and the number of constraints (stemming from contacts) are equal, as empirically observed [65]. This property is often called isostaticity. However, in other problems such as ellipses and deep nets, H∆µ and thus Hp have positive eigenvalues. Indeed the overlap between two ellipses can increase if one of them rotates. Likewise, for a fully-connected ReLU net and random data at initialization, Hp has a symmetric spectrum and N− ≈ N/2. Generically one then expects jamming to occur with N∆ < N, as sketched for ellipses in Fig. 7 and shown empirically for neural nets learning MNIST in Fig. 8. Jamming is then referred to as ''hypostatic''. The associated consequences on the spectrum of the Hessian, shown in the same figures, are derived in specific cases in [70,71,86]: it always presents a delta function at zero and a gap at jamming.
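The rank-counting step of this argument (rk(H0) ≤ N∆, hence a null space of H0 of dimension at least N − N∆) can be checked numerically; a sketch with random Gaussian vectors standing in for the gradients ∇∆µ (purely illustrative values of N and N∆):

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_delta = 10, 4  # degrees of freedom, unsatisfied constraints

# H0 = sum of N_delta rank-one terms grad(Delta_mu) grad(Delta_mu)^T
grads = rng.standard_normal((N_delta, N))
H0 = sum(np.outer(g, g) for g in grads)

rank = np.linalg.matrix_rank(H0)
print(rank)  # at most N_delta = 4; generically exactly 4 for random gradients

# The null space of H0 therefore has dimension at least N - N_delta
null_dim = N - rank
print(null_dim)  # -> 6
```

Near jamming, H_U ≈ H0 on this null space up to the small correction Hp, which is what makes the counting of negative directions of Hp relevant for stability.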
Effect of the number of training data P. The location of the jamming transition N*(P) depends on P. This dependence is linear for random data but sub-linear for structured data [8], as exemplified in Fig. 8.A. Denoting N− = C0 N, and using Eq. (14) together with N∆ ≤ P, we obtain N*(P) ≤ P/C0, guaranteeing convergence to zero loss for a number of parameters linear in P if C0 > 0. We conjecture that C0 remains bounded by a strictly positive value at large N for generic data and architectures used in practice. 9 C0 can be measured a posteriori, leading to the bound N*(P) ≤ P/C0 that corresponds to the dotted line in Fig. 8.A. C0 a priori depends on the choice of architecture, dynamics and data set. Controlling its value a priori is yet out of reach, 10 and would lead to a rather tight guarantee of convergence toward a global minimum of the loss.
Effect of the scale of initialization ᾱ. It is apparent in Fig. 4 that the location of the jamming transition depends on the scale of initialization ᾱ. From this figure, we observe the following general trend: the jamming transition occurs with fewer parameters when the test error of the ensemble-averaged predictor is small. This follows the intuition that an easier rule to learn should require fewer parameters to fit the data.
Effect of the jamming transition on performance. As the jamming transition is approached, the norm ∥fN∥ of the predictor diverges [8], which is responsible for the cusp in the double descent displayed by the test error in Fig. 6. This divergence was first pointed out for regression [21]. Yet for classification, the divergence differs quantitatively and is compatible with an inverse power-law, as illustrated in Fig. 8.D. It can be understood using an argument à la Landau, inspired by results on the perceptron [81,83]. The intuitive idea is that, by increasing the norm of the predictor, one effectively reduces the unit margin required by the hinge loss to fit the data. For N < N*, data cannot be fitted even with a vanishing margin.
Specifically, consider that the norm of the predictor ∥fN∥ is fixed during training, and denote by N*(ϵ) the jamming transition as a function of the margin ϵ. Assume (as occurs for the perceptron) that N*(ϵ) is a smooth function of ϵ, such that N*(ϵ) ≈ N*(0) + N′*(0)ϵ. For N just above N*(0), margins of magnitude unity cannot be fitted at fixed norm of the predictor, but they can if that norm is allowed to increase by a factor N′*(0)/(N − N*(0)). Indeed this effectively reduces the margin by that amount, to some effective value ϵ̄ ≡ ϵ(N − N*(0))/N′*(0). By construction, ϵ̄ satisfies N*(ϵ̄) = N. This argument justifies the observed inverse power-law.
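Restated compactly in formulas (with ϵ = 1 the unit margin of the hinge loss):

```latex
N_*(\epsilon) \approx N_*(0) + N'_*(0)\,\epsilon ,
\qquad
\bar{\epsilon} \equiv \epsilon\,\frac{N - N_*(0)}{N'_*(0)}
\;\;\Rightarrow\;\;
N_*(\bar{\epsilon}) = N ,
\qquad
\|f_N\| \sim \frac{\epsilon}{\bar{\epsilon}} = \frac{N'_*(0)}{N - N_*(0)} .
```

The last relation is the inverse power-law divergence of the predictor norm as N → N*(0) from above.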
Note that any perturbation to gradient flow (such as early stopping, stochastic gradient descent or weight decay) will destroy this divergence. 11 Such regularizations are thus most efficient near jamming [61].

Double descent and the benefits of overparametrization
To avoid bad minima in the loss landscape, it is thus sufficient to crank up N past N*. But then, one would naively expect performance to drop due to overfitting, as illustrated in the right panel of Fig. 2. This is not the case for classification and deep nets, as illustrated in Fig. 6: performance is very poor and displays a cusp right at the jamming transition point N*, and then continuously improves as N → ∞! A scaling theory can be built to explain quantitatively why performance keeps improving with N past that point, and how fast the test error reaches its asymptotic behavior [63]. Essentially, at infinite width learning corresponds to well-defined algorithms as discussed in the introduction, but these algorithms become noisier when the width is finite. Indeed as N ∼ h² increases, the fluctuations of the learnt output function induced by the random initialization of the weights decrease [88]. It follows that ∥fN − ⟨fN⟩∥ ∼ N^(−1/4), as shown in Fig. 8.D, where ⟨fN⟩ is obtained by ensemble-averaging outputs trained with different random initializations. This result holds both in the lazy training [63] and in the feature learning [58] regime, as justified below. The associated increase in test error is quadratic in the fluctuations (this is obvious for a mean square error loss, but is also true for classification under reasonable assumptions [63]), leading to ⟨ϵ(fN)⟩ − ϵ(⟨fN⟩) ∼ N^(−1/2), as observed [63]. This effect is responsible for the double descent, as can be directly checked by considering the test error curve ϵ(⟨fN⟩) of the ensemble-averaged function ⟨fN⟩ in Fig. 6, which does not display a second descent. More generally, as apparent in Fig. 4, at fixed ᾱ = √h α the ensemble-averaged test error varies weakly with h once in the over-parametrized regime.
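The variance-reduction mechanism behind ensemble averaging can be illustrated with a toy model in which each trained predictor is the limiting function plus initialization-dependent noise (a schematic sketch, not an actual training run; the sine target and noise level are arbitrary assumptions):

```python
import numpy as np

x = np.linspace(-1, 1, 200)
f_limit = np.sin(3 * x)  # stands in for the N -> infinity predictor

def trained_predictor(seed, sigma=0.3):
    """Toy model of a finite-width net: limiting function plus
    initialization-dependent fluctuations of scale sigma."""
    noise_rng = np.random.default_rng(seed)
    return f_limit + sigma * noise_rng.standard_normal(x.shape)

single = trained_predictor(0)
ensemble = np.mean([trained_predictor(s) for s in range(100)], axis=0)

err_single = np.mean((single - f_limit) ** 2)
err_ensemble = np.mean((ensemble - f_limit) ** 2)
print(err_single / err_ensemble)  # roughly 100: averaging n nets cuts the variance by n
```

Since the excess test error is quadratic in the fluctuations, averaging n independently initialized nets suppresses this contribution by a factor n, mimicking the benefit of going to larger width.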
Similar fluctuations induced by initial conditions, ⟨∥fN − ⟨fN⟩∥⟩ ∼ N^(−1/4), with identical consequences, occur in the feature learning regime of neural nets [58]. Such fluctuations are certainly expected at initialization for a one-hidden layer, since the density ρ of neuron parameters introduced in Eq. (8) must have finite sampling fluctuations of order δρ ∼ 1/√h ∼ N^(−1/4), as expected from the central limit theorem. Because in the asymptotic regime N → ∞ the value of ρ at initialization affects the learnt function, the magnitude of the fluctuations δρ (and thus of the learnt function fN) induced by the random initialization still scales as N^(−1/4) after learning, as rigorously proven for a one-hidden layer in [91].
Overall, this scaling theory supports that for generic data sets and deep architectures, the second descent results from the noise induced by finite-N effects and initialization on the limiting algorithms reached as N → ∞. In other words, the variance of the learnt predictor is reduced in this limit, as its dependence on the initialization vanishes. One interesting practical implication of this result, apparent in Figs. 4 and 6, is that optimal performance is found by ensemble averaging nets of limited width past their jamming transition.
Since these arguments were proposed, the double descent curve has been analytically computed in the simple case of linear data in the limit of infinite dimension and random features machines [47,92], where it was also shown [93,94] to result from the fluctuations of the kernel that vanish with increasing number of neurons, confirming the present explanation.

[Fig. 9. Left, Center: Ensemble-averaged test error vs. the scaling parameter ᾱ = √h α. It illustrates that the NTK regime (large √h α) tends to outperform the feature learning regime (small √h α) for the FC architecture (Left) but not for CNNs (Center). In the latter case, the training curve exponent β is larger in the feature learning regime (blue curve, β ≈ 1/2) than in the NTK one (orange curve, β ≈ 1/3).]

Disentangling the NTK and feature learning regimes
As discussed in the previous section, wider nets are more predictive. A comparable gain in performance can however be obtained already at intermediate width, by ensemble averaging outputs over multiple initializations of the weights. As apparent in Fig. 4, the rescaled magnitude of initialization ᾱ = α√h, which allows one to cross over between the lazy training and feature learning regimes, does affect performance. Which regime best characterizes deep nets used in practice? Which one performs better? Some predictions of the NTK regime appear to hold in realistic architectures [39], and training nets in the NTK limit can achieve good performance on real datasets [40,95,96]. Yet, in several cases the feature learning regime beats the NTK [51,97], in line with the common idea that building an abstract representation of the data, as sketched in Fig. 1, is useful. Several theoretical works show that the NTK under-performs for specific, simple models of data [14,[98][99][100][101].
In [58], this question was investigated systematically for deep nets in the (α, h) plane, where α is the scale of initialization introduced in [51], as defined in Eq. (9). This study developed numerical methods of adaptive learning rates to follow gradient flow while changing α over ten decades. The main results are as follows: (i) The cross-over between the two regimes occurs when α√h = O(1), as apparent in Fig. 4, extending the result of [51] limited to one-hidden-layer nets. For α√h ≪ 1, ∥∆Θ∥/∥Θ∥ ≫ 1, while the opposite holds true when α√h ≫ 1. Here ∆Θ = Θ(t) − Θ(t = 0) characterizes the evolution of the tangent kernel, and t is the learning time. It is convenient to divide the loss by α² and to consider L̄ ≡ L/α², which can be shown to ensure that the dynamics occurs on a time scale independent of α in the large-α limit [51,58]. This result holds true for the usual choices of loss (cross-entropy, hinge, etc.) at any finite time. It also holds true for t → ∞ if the hinge loss is used, since in that case convergence to zero loss occurs in finite time, independently of h and α.
Here we provide a schematic argument justifying this result; see [58] for a more detailed analysis. The variation of the output f with respect to the pre-activation a (which is of order one at initialization) of a given neuron is of order ∂f/∂a ∼ α/√h. This result is obvious for the last hidden layer (using that the last weights are of order 1/√h), but can be justified recursively at all layers, as discussed in [58] and derived implicitly in the NTK study [10]. For gradient flow, the variation ∆a of the pre-activation due to the evolution of the bias (considering the previous weights leads to a similar scaling) must be of order ∆a ∼ t(∂L̄/∂f)(∂f/∂a) ∼ t/(α√h), using that ∂L̄/∂f ∼ 1/α². Thus ∆a is of order 1/(α√h) at the end of training (since zero hinge loss is reached on a time scale t = O(1)). The NTK regime must thus break down when ∆a ∼ 1, at which point the relation between weights and output becomes non-linear, corresponding to a cross-over at α* ∼ 1/√h for large h. A similar line of thought can be used at intermediate width. When reducing h so as to approach the jamming transition from the over-parametrized phase, the norm of the output (and therefore of the weights and pre-activations) explodes, as reviewed in Section 2. One is then never in the lazy training regime, and the relationship between the variation of the weights and that of the output must become non-linear. Thus, in the vicinity of the jamming transition, networks always lie in the feature learning regime. Consequently, the cross-over lines separating lazy and feature learning must bend up and never cross the jamming line in Fig. 4. We do observe curves qualitatively bending up as expected (yet it is hard to make precise measurements very close to jamming).
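The lazy-regime scaling can be probed in a toy experiment: a one-hidden-layer ReLU net with the centered, α-scaled output f = α(g(w) − g(w₀)) in the convention of [51], trained by full-batch gradient descent on the quadratic hinge loss rescaled by 1/α². All sizes, the task and the learning rate below are illustrative assumptions; in the lazy regime the relative weight change after fitting should shrink as 1/(α√h), so multiplying α by 10 at fixed h should reduce it roughly tenfold.

```python
import numpy as np

def train(alpha, h=200, d=5, P=20, lr=0.2, steps=3000, seed=3):
    """Return the relative weight change after training a one-hidden-layer
    ReLU net f(x) = alpha * (g(w, x) - g(w0, x)) on the rescaled hinge loss."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, d))
    y = np.sign(X[:, 0])                    # toy linearly separable labels
    theta = rng.standard_normal((h, d))     # first layer
    w = rng.standard_normal(h)              # readout, O(1); g carries the 1/sqrt(h)
    theta0, w0 = theta.copy(), w.copy()
    g0 = np.maximum(X @ theta0.T, 0) @ w0 / np.sqrt(h)  # frozen initial output
    for _ in range(steps):
        a = X @ theta.T                     # (P, h) pre-activations
        f = alpha * (np.maximum(a, 0) @ w / np.sqrt(h) - g0)
        deficit = np.maximum(0.0, 1.0 - y * f)
        dLdf = -(y * deficit) / P           # gradient of the quadratic hinge
        mask = (a > 0).astype(float)
        grad_w = alpha / np.sqrt(h) * np.maximum(a, 0).T @ dLdf
        grad_theta = alpha / np.sqrt(h) * w[:, None] * ((dLdf[:, None] * mask).T @ X)
        w -= lr / alpha**2 * grad_w         # lr / alpha^2: trains on L / alpha^2
        theta -= lr / alpha**2 * grad_theta
    change = np.sqrt(np.sum((theta - theta0) ** 2) + np.sum((w - w0) ** 2))
    return change / np.sqrt(np.sum(theta0 ** 2) + np.sum(w0 ** 2))

r10, r100 = train(alpha=10.0), train(alpha=100.0)
print(r10, r100, r10 / r100)  # relative weight change shrinks roughly as 1/alpha
```

Both runs share the same initialization and data, so the function-space dynamics is nearly identical and only the parameter displacement scale changes, as the schematic argument predicts.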
(ii) For fully-connected nets and gradient descent, the NTK regime tends to outperform the feature learning regime, as exemplified in the left panel of Fig. 9. This result was found for a variety of data sets (MNIST, Fashion-MNIST, CIFAR10, etc.), except for MNIST 10 PCA. This is apparent in Fig. 4, where the best performance for MNIST 10 PCA appears to occur in the cross-over region ᾱ ∼ 1.
(iii) For CNN architectures, feature learning outperforms the NTK regime, as shown in the central panel of Fig. 9. It corresponds to a larger training curve exponent β, as apparent in the right panel of Fig. 9. Findings (ii, iii) were recently confirmed by the Google team [64].
These observations raise various questions. What are the advantages and drawbacks of feature learning, and why are its drawbacks more apparent in FC than in CNN architectures? For modern CNN architectures, is the improvement of the learning curve exponent β in the feature learning regime key to understanding how the curse of dimensionality is beaten?
Is it associated with learning invariants in the data? In our opinion, the answers to these questions are not yet known. To start tackling them, in the next section we study a simple model of invariant data for which the improvement of the learning curve exponent β in the feature learning regime can be computed.

Learning simple invariants by compressing irrelevant input dimensions
How can the curse of dimensionality be beaten? A favorable aspect of various data sets such as images is that the data distribution P(x) is very anisotropic, consistent with the notion that the data lie on a manifold of lower dimension. In that case, the distance between neighboring data shrinks faster with growing P, improving kernel methods [102]. The positive effect of a moderate anisotropy can also be shown for kernel classification [103] and regression [101,104]. Yet, even in simple data sets like MNIST or CIFAR10, the intrinsic dimensions remain significant (d_int ≈ 14 and d_int ≈ 30 respectively [102]). This effect helps but cannot resolve the curse of dimensionality for complex data sets: indeed, using Laplace or Gaussian kernels 12 (which are very similar to the NTK of fully-connected nets) leads to β ≈ 0.4 for MNIST (which is decent) but β ≈ 0.1 for CIFAR10, which implies very slow learning as the number of data increases.
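The slow shrinking of inter-point distances in high dimension, at the root of the curse, can be checked in a few lines; a toy estimate with uniform data (nearest-neighbor spacing scales roughly as P^(−1/d)):

```python
import numpy as np

def mean_nn_distance(P, d, seed=6):
    """Mean nearest-neighbor distance among P uniform points in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.random((P, d))
    sq_norm = (X ** 2).sum(axis=1)
    # pairwise squared distances via the Gram matrix (avoids a P x P x d array)
    sq = sq_norm[:, None] + sq_norm[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq, np.inf)              # exclude self-distances
    return np.sqrt(np.maximum(sq, 0.0).min(axis=1)).mean()

for d in (2, 10):
    ratio = mean_nn_distance(200, d) / mean_nn_distance(1600, d)
    print(d, ratio)  # roughly 8**(1/d): spacing shrinks fast in d=2, barely in d=10
```

Multiplying P by 8 shrinks the spacing by a factor of about 2.8 in d = 2 but barely at all in d = 10, which is why isotropic kernel methods need astronomically many samples when the effective dimension is large.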
As discussed in the introduction, another popular idea is that data sets such as images are learnable because they display many invariant transformations that leave the labels unchanged. By becoming insensitive to those, the network essentially reduces the dimension of the data. This view is consistent with the notion that deep learning builds an abstract neural representation of the data, as sketched in Fig. 1. This effect may be responsible for our observation, in the right panel of Fig. 9, that the training curve exponent β is more favorable in the feature learning regime. However, understanding how invariants are built dynamically, and how this affects performance and β, remains a challenge.
Specific models of data can be built [44,92,101,105] in which lazy training does not learn at all, whereas neural nets trained in the feature learning regime succeed. However, these models, which consider the limit d → ∞, P → ∞ with P/d fixed, do not capture the two finite and distinct learning curve exponents β observed in the two regimes.
The stripe model. In [98,103,106], a simple model of invariance is considered. The input is split as x = (x∥, x⊥), where x∥ spans d∥ informative dimensions and x⊥ the remaining d⊥ = d − d∥ ones; the labels depend only on x∥. The uninformative coordinates x⊥ could correspond for example to uninformative pixels near the boundaries of pictures. Clearly, kernel methods whose kernels are designed to depend only on x∥ would perform better on these data than isotropic kernels, since they would operate in d∥ (instead of d) dimensions, see e.g. [101]. How do neural nets discover such an invariance on their own?
For gradient descent with the logistic loss, a one-hidden layer can be shown to correspond to a max-margin classifier in a certain non-Hilbertian space of functions [98]. Dimension-independent guarantees on performance can then be obtained if the data can be separated after projection onto a low-dimensional space. The analysis is rigorous and general, but requires going to extremely long times not used in practice, and does not predict values for β.
In [103,106], the hinge loss is considered. In that case, the dynamics stops after a reasonable time. If the density of data points does not vanish at the interface between labels, and if the latter is sufficiently smooth (e.g. planar or cylindrical), it is found that the test error decays as a power law in both regimes. For the lazy training regime, scaling arguments inspired by electrostatics lead to β_Lazy = d/(3d − 2). Feature learning performs better, as one finds β_Feature = (d + d⊥/2)/(3d − 2). The key effect leading to the improvement in the feature learning regime is illustrated in Fig. 10 for d∥ = 1, the case called the stripe model: due to the absence of gradient in the orthogonal directions, the weights only grow along the d∥ informative dimensions. Thus, for an infinitesimal initialization of the weights (α → 0), the neurons align along the informative coordinates of the data.
Accordingly, in relative terms, they become less sensitive to x⊥, which merely acts as a source of noise. This denoising is limited by the finite size of the training set. Specifically, it is found that Λ ≡ W∥/W⊥ ∼ √P, where W∥ and W⊥ are the characteristic scales of the first layer of weights in the informative and uninformative directions, respectively. This effect is equivalent to a geometric compression of magnitude Λ of the input in the uninformative directions. Indeed, performing this compression by considering the transformed data (x∥, x⊥/Λ), and learning these in the NTK regime, gives very similar performance to learning the original data in the feature learning regime, see the right panel of Fig. 10.
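A toy illustration of why the orthogonal weights stop growing: at small initialization, where all margins are violated, the first gradient step on the first-layer weights points along E[y x], whose x⊥ components are pure sampling noise of order 1/√P. A sketch with assumed stripe-model data y = sign(x∥), d∥ = 1:

```python
import numpy as np

def gradient_anisotropy(P, d=10, trials=200, seed=4):
    """Ratio of the gradient component along the informative coordinate
    x_par = x[0] to the typical component along an uninformative one, for
    stripe-model data y = sign(x_par), averaged over independent draws."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        X = rng.standard_normal((P, d))
        y = np.sign(X[:, 0])
        g = (y[:, None] * X).mean(axis=0)   # gradient direction ~ empirical E[y x]
        ratios.append(abs(g[0]) / np.abs(g[1:]).mean())
    return np.mean(ratios)

ratio_small, ratio_large = gradient_anisotropy(100), gradient_anisotropy(10000)
print(ratio_small, ratio_large)  # grows roughly like sqrt(P)
```

The informative component converges to E[|x∥|] > 0, while each uninformative component averages to zero at rate 1/√P, so the anisotropy of the accumulated weights, and hence Λ, grows as √P.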
Kernel PCA of the NTK reveals a geometric compression of invariants. The stripe model illustrates that neural nets in the feature learning regime can learn to become insensitive to transformations of the data that do not affect the labels. This insensitivity is limited by the sampling noise associated with a finite data set. Does this phenomenon also occur for more modern architectures and for the more subtle invariants characterizing images? Measurements of the intrinsic dimension of the neural representation of data [56,57], or mutual information estimates [54], suggest that it may be so, but as mentioned in the introduction these observables have some limitations.
In the feature learning regime, the NTK evolves in time. This evolution leads to a better kernel for the task considered. Indeed, using a kernel method based on the NTK obtained at the end of training leads to essentially identical performance to the neural net itself [106]. This observation is consistent with previous ones showing that the NTK ''improves'' in time [107,108], in the following sense. For kernels we can always write Θ(x, y) = ψ(x) · ψ(y), where ψ(x) is a vector of features. In the case of the NTK, ψ(x) can be chosen as the gradient of the output f with respect to the weights, see Eq. (6). Kernels tend to perform better [109] if the vector of labels {y(xi)}_{i=1...P} has large coefficients along the first PCAs of the feature vectors {ψ(xi)}_{i=1...P} (an operation called kernel PCA [110]). Such an alignment between the first PCAs of the feature vectors of the NTK and the vector of labels is observed during learning [107,108].

12 A Laplace kernel has the functional form K(x, y) = e^(−∥x−y∥/σ). The bandwidth σ controls the decay speed of K as its argument increases.

[Fig. 10. Left, Center: First-layer weight vectors W(1), weighted by the neurons' second-layer weight W(2), shown at different time points during the learning process (as characterized by the value of the loss in the legend) for the stripe model. It reveals an alignment of the first layer of weight vectors with the informative direction. Training data points are also shown, colored according to their labels. Right: Training curve ϵ(P) of a one-hidden layer in the feature learning regime (blue), the NTK regime (red), and the NTK regime after compressing the data in the perpendicular direction by the factor Λ (dashed red). Note the similarity between this curve and the blue one, supporting that the main effect of feature learning for these data is the geometric compression of uninformative directions in input space. Source: From [106].]
Although there is no general theory as to why such an alignment occurs, the stripe model provides a plausible explanation for this effect. In that case, the improvement must occur because one evolves from an isotropic kernel to an anisotropic one with diminished sensitivity to the uninformative directions x⊥. As a result, the top kernel PCA components become more informative about the label (Fig. 11.A), which projects more onto them (Fig. 11.C). The same result is observed for a CNN trained on MNIST, as shown in Fig. 11.B, D. Overall, this view supports that the improvement of the NTK reveals the geometric compression of uninformative directions in the data.
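A toy version of this kernel-PCA alignment measurement, with a Gaussian kernel standing in for the NTK, assumed stripe-model data, and a median-heuristic bandwidth (all illustrative choices): compressing the uninformative directions makes the label vector project more onto the top kernel PCA components.

```python
import numpy as np

def label_alignment(Lambda, P=400, d=10, k=10, seed=5):
    """Fraction of the label vector's squared norm captured by the top-k
    kernel PCA components of a Gaussian kernel, after compressing the
    uninformative directions of stripe-model data by a factor Lambda."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, d))
    y = np.sign(X[:, 0])
    Xc = X.copy()
    Xc[:, 1:] /= Lambda                      # geometric compression of x_perp
    sq_norm = (Xc ** 2).sum(axis=1)
    sq = sq_norm[:, None] + sq_norm[None, :] - 2.0 * Xc @ Xc.T
    K = np.exp(-np.maximum(sq, 0.0) / (2.0 * np.median(sq)))  # median bandwidth
    K = K - K.mean(axis=0, keepdims=True)
    K = K - K.mean(axis=1, keepdims=True)    # double centering for kernel PCA
    vals, vecs = np.linalg.eigh(K)
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k (orthonormal) components
    proj = top.T @ y
    return np.sum(proj ** 2) / np.sum(y ** 2)

a_iso, a_comp = label_alignment(1.0), label_alignment(10.0)
print(a_iso, a_comp)  # the compressed (anisotropic) kernel aligns better with the labels
```

This mimics the measurement of Fig. 11: the compressed kernel plays the role of the end-of-training NTK, and the increase of the projected fraction is the alignment discussed above.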

Conclusion
Do not be afraid of bad minima! Just crank up the number of parameters until you pass the jamming transition. Beyond that point, bad minima are not encountered, and you can bring the loss to zero. Depending on the network width and on how you scale the value of your function at initialization, deep learning can then behave as a kernel method, or can alternatively learn features. Simple scaling arguments delimit the corresponding phase diagram, and appear to hold for benchmark data sets. Remarkably, these arguments depend very little on the specific data considered. Their practical implication is that ensemble averaging nets with different initializations past the jamming transition can be an effective procedure, as found by different groups [58,63,64].
Taking into account where one stands in this phase diagram is arguably key to understanding outstanding questions on deep learning. This is true, for example, for the role of stochasticity in the dynamics, which appears to improve performance. It is natural that using stochastic gradient descent instead of gradient flow will help near jamming, because the noise will regularize the divergence of the predictor norm, an effect that can already be obtained with early stopping [61]. Likewise, for repulsive particles, temperature regularizes the singularity near the jamming transition [111]. Yet, it would be useful to study its effect on performance in the distinct regions of the phase diagram. It would be particularly interesting to know whether one of the main effects of stochasticity in the over-parametrized limit is to push upward the cross-over between the lazy and feature learning regimes in Fig. 4, thus improving performance in CNNs, where feature learning outperforms lazy training.
Ultimately, the central question left to understand is performance, and how the curse of dimensionality is beaten in deep nets. Even at an empirical level, the idea that invariance toward diffeomorphisms is the central phenomenon behind this success is not established. At the theoretical level, this view has not been combined with the recent improvements in our understanding of learning, which are mostly focused on fully-connected nets. In comparison, the effort to model how CNNs learn features is more modest, even though only those architectures tend to work well in practice. Such studies would arguably require designing simple canonical models of data presenting complex invariants, which are currently scarce.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.