Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer's weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is ρ. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank weight matrices, which should improve generalization. The same analysis predicts the existence of an inherent stochastic gradient descent noise for deep networks. In both cases, we verify our predictions experimentally. We then predict neural collapse and its properties without any specific assumption, unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for problems that are appropriate for sparse deep architectures such as convolutional neural networks. The reason is that compositionally sparse target functions can be approximated well by "sparse" deep networks without incurring the curse of dimensionality.


Introduction
A widely held belief in the last few years has been that the cross-entropy loss is superior to the square loss when training deep networks for classification problems. As such, the attempts at understanding the theory of deep learning have been largely focused on exponential-type losses [1,2], such as the cross-entropy. For these losses, the predictive ability of deep networks depends on the implicit complexity control of gradient descent (GD) algorithms that lead to asymptotic maximization of the classification margin on the training set [1,3,4]. Recently, however, Hui and Belkin [5] have empirically demonstrated that it is possible to achieve a similar level of performance, if not better, using the square loss, paralleling older results for support vector machines [6]. Can a theoretical analysis explain when and why regression should work well for classification? This question was the original motivation for this paper and preliminary versions of it [7,8].
In deep learning binary classification, unlike the case of linear networks, we expect from previous results (in the absence of regularization) several global minima with zero square loss, thus corresponding to interpolating solutions (in general degenerate; see [9,10] and references therein), because of overparametrization. Although all the interpolating solutions are optimal solutions to the regression problem, they will generally correspond to different (normalized) margins and to different expected classification performances. In other words, zero square loss does not by itself imply either a large margin or good classification on a test set. When can we expect the solutions to the regression problem obtained by GD to have a large margin?
We introduce a simplified model of the training procedure that uses square loss, binary classification, gradient flow (GF), and Lagrange multipliers (LMs) for normalizing the weights. With this model, we show that obtaining large-margin interpolating solutions depends on the scale of initialization of the weights close to zero, in the absence of regularization [also called weight decay (WD)]. Assuming convergence, we describe the qualitative dynamics of the deep network's parameters and show that ρ, which is the product of the Frobenius norms of the weight matrices, grows nonmonotonically until a large-margin (that is, small ρ) solution is reached. Assuming that local minima and saddle points can be avoided, this analysis suggests that with WD (or sometimes with just small initialization), GD techniques may yield convergence to a minimum with a ρ biased to be small.
In the presence of WD, perfect interpolation of all data points cannot occur and is replaced by quasi-interpolation of the labels. In the special case of binary classification, in which y_n = ±1, quasi-interpolation is defined as ∀ n: |f(x_n) − y_n| ≤ ϵ, where ϵ > 0 is small. Our experiments and analysis of the dynamics show that in the presence of regularization, there is a weaker dependence on initial conditions, as has been observed in [5]. We show that WD helps stabilize normalization of the weights, in addition to its role in the dynamics of the norm.
We then apply basic bounds on expected error to the solutions provided by stochastic gradient descent (SGD) (for WD λ > 0), which have locally minimum ρ. For normal training set sizes, the bounds are still vacuous but much closer to the test error than previous estimates. This is encouraging because in our setup, large overparametrization, corresponding to interpolation of the training data [11], coexists with a relatively small Rademacher complexity (smaller by several orders of magnitude) because of the sparsity induced by the locality of the convolutional kernel. We then turn to show that the quasi-interpolating solutions satisfy the recently discovered neural collapse (NC) phenomenon [12], assuming SGD with minibatches. According to NC, a dramatic simplification of deep network dynamics takes place: not only do all the margins become very similar to each other, but the last layer classifiers and the penultimate layer features also form the geometrical structure of a simplex equiangular tight frame (ETF). Here, we prove the emergence of NC for the square loss for the networks that we study, without any additional assumption (such as unconstrained features).
Finally, the study of SGD reveals surprising differences relative to GD. In particular, in the presence of regularization, SGD does not converge to a perfect equilibrium: There is always, at least generically, SGD noise. The underlying reason is a rank constraint that depends on the size of the minibatches. This also implies an SGD bias toward low-rank solutions that reinforces a similar bias due to maximization of the margin under normalization (which can be inferred from [13]).

Contributions
The main original contributions in this paper are as follows: • We analyze the dynamics of deep network parameters, their norm, and the margins under GF on the square loss, using Lagrange normalization (LN). We describe the evolution of ρ and the role of WD and normalization in the training dynamics. The analysis in terms of the "polar" coordinates ρ, V_k is new, though many of the observed properties are not. Arguably, our analysis of the bias toward minimum ρ and its dynamics with and without WD is an original contribution.
• Our norm-based generalization bounds for convolutional neural networks (CNNs) are new. We outline in this paper a derivation for the case of nonoverlapping convolutional patches. The extension to the general case follows naturally and will be described in a forthcoming paper. The bounds show that generalization for CNNs can be orders of magnitude better than that for dense networks. In the experiments that we describe, the bounds turn out to be loose but close to nonvacuous. They appear to be much better than the other empirical tests of generalization bounds, all for dense networks, that we know of. The main reason for this, in addition to the relatively simple task (binary classification on CIFAR-10), is the sparsity of the convolutional network, that is, the low dimensionality (or locality) of the kernel.
• We prove that convergence of GD optimization with WD and normalization yields NC for deep networks trained with square loss, in both the binary and the multiclass classification cases. Experiments verify the predictions. Our proof is free of any assumption, unlike other recent papers that depend on the "unconstrained features assumption".
• We prove that training the network using SGD with WD induces a bias toward low-rank weight matrices. As we will describe in a separate paper, low rank can yield better generalization bounds.
• The same theoretical observation that predicts a low-rank bias also predicts the existence of an intrinsic SGD noise in the weight matrices and in the margins.

Related Work
There has been much recent work on the analysis of deep networks and linear models trained using exponential-type losses for classification. The implicit bias of GD toward margin maximizing solutions under exponential-type losses was shown for linear models with separable data in [14] and for deep networks in [1,2,15,16]. Recent interest in using the square loss for classification has been spurred by the experiments in [5], although the practice of using the square loss is much older [6]. Muthukumar et al. [17] recently showed for linear models that interpolating solutions for the square loss are equivalent to the solutions to the hard margin support vector machine problem (see also [7]). Recent work also studied interpolating kernel machines [18,19] that use the square loss for classification.
In the recent past, there have been a number of papers analyzing deep networks trained with the square loss. These include the works of Zhong et al. [20] and Soltanolkotabi et al. [21] that show how to recover the parameters of a neural network by training on data sampled from it. The square loss has also been used in analyzing convergence of training in the neural tangent kernel (NTK) regime [22][23][24]. Detailed analyses of 2-layer neural networks such as [25][26][27] typically use the square loss as an objective function. However, these papers do not specifically consider the task of classification.
A large effort has been spent in understanding generalization in deep networks. The main focus has been solving the puzzle of how overparameterized deep networks (with more parameters than data) are able to generalize. An influential paper [11] showed that overparameterized deep networks that can fit randomly labeled data also generalize well when trained on correctly labeled data. Thus, the training error does not give any information about the test error: There is no uniform convergence of training error to test error. This is related to another property of overparametrization: Standard Vapnik-Chervonenkis bounds are always vacuous when the number of parameters is larger than the number of data points. Although often forgotten, it is, however, well known that another type of bound, on the norm of the parameters, may provide generalization even if there are more parameters than data. This point was made convincingly in [28], which provides norm-based bounds for deep networks.
[The focus of this paper on ρ is directly related.] Bartlett's bounds and related ones [29,30] turn out in practice to be very loose. Empirical studies such as [31] have found little evidence so far that norms and margins correlate well with generalization.
NC [12] is a recently discovered empirical phenomenon that occurs when training deep classifiers using the cross-entropy loss. Since its discovery, there have been a few papers analytically proving its emergence when training deep networks. Mixon et al. [32] show NC in the regime of "unconstrained features". Recent results in [33] perform a more comprehensive analysis of NC in the unconstrained features paradigm. There have been a series of papers analytically showing the emergence of NC when using the cross-entropy loss [34][35][36]. In the study of the emergence of NC when training using the square loss, Ergen and Pilanci [37] (see also [38]) derived it through a convex dual formulation of deep networks. In addition to that, Han et al. [39] and Zhou et al. [40] show the emergence of NC in the unconstrained features regime. Our independent derivation is different from these approaches and shows that NC emerges in the presence of normalization and WD.
Several papers in recent years have studied the relationship between implicit regularization in linear neural networks and rank minimization. A main focus was on the matrix factorization problem, which corresponds to training a depth-2 linear neural network with multiple outputs with respect to the square loss (see references in [13]). Beyond factorization problems, it was shown that in linear networks of output dimension 1, GF with respect to exponential-type loss functions converges to networks where the weight matrix of every layer is of rank 1. However, for nonlinear neural networks, things are less clear. Empirically, several studies (see references in [13]) showed that replacing the weight matrices by low-rank approximations results in only a small drop in accuracy. This suggests that the weight matrices in practice are not too far from being low rank.
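The truncation experiments mentioned above are easy to reproduce in a toy setting. The sketch below (NumPy, with hypothetical matrix sizes, not taken from any of the cited studies) replaces an approximately low-rank matrix with its best rank-r approximation via the singular value decomposition and checks that the relative error stays small:

```python
import numpy as np

def low_rank_approx(W, r):
    # Best rank-r approximation in Frobenius norm (Eckart-Young).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(6)
# A "trained-looking" matrix: rank-2 signal plus small full-rank noise.
W = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
W += 0.01 * rng.standard_normal((20, 20))
W2 = low_rank_approx(W, 2)
rel_err = np.linalg.norm(W - W2) / np.linalg.norm(W)
```

When the weight matrix is close to low rank, as the empirical studies suggest for trained networks, `rel_err` is small, which is why the truncated network loses little accuracy.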

Problem Setup
In this section, we describe the training settings considered in our work. We study training deep neural networks with the rectified linear unit (ReLU) nonlinearity using square loss minimization for classification problems. In the proposed analysis, we apply a specific normalization technique, weight normalization (WN), which is equivalent to using LMs, together with regularization (also called WD), since such mechanisms are commonly used for reliably training deep networks with GD techniques [5,41].

Assumptions
Throughout the theoretical analysis, we make, in some places, simplifying assumptions relative to standard practice in deep neural networks. We mostly consider the case of binary classification, though our analysis of NC includes multiclass classification. We restrict ourselves to the square loss. We consider GD techniques, but we assume different forms of them in various sections of the paper. In the first part, we assume continuous GF instead of GD or SGD. GF is the limit of the discrete GD algorithm with the learning rate being infinitesimally small (we describe an approximation of GD within a GF approach in [8]). SGD is specifically considered and shown to bias rank and to induce asymptotic noise that is unique to it. The analysis of NC is carried out using SGD with small learning rates. Furthermore, we assume WN via an LM term added to the loss function, which normalizes the weight matrices. This is equivalent to WN but is not equivalent to the more commonly used batch normalization (BN).
We also assume throughout that the network is overparameterized, so that there is convergence to global minima with appropriate initialization, parameter values, and data.

Classification with square loss minimization
In this work, we consider square loss minimization for classification along with regularization and WN. We consider a binary classification problem, given a training dataset S = {(x_n, y_n)}_{n=1}^N, where x_n ∈ ℝ^d is the input (normalized such that ∥x_n∥ ≤ 1) and y_n ∈ {±1} is the label. We use deep rectified homogeneous networks with L layers to solve this problem. For simplicity, we consider networks of the form f_W(x) = W_L σ(W_{L−1} ⋯ σ(W_1 x) ⋯), where x ∈ ℝ^d is the input to the network and σ: ℝ → ℝ, σ(x) = max(0, x), is the ReLU activation function that is applied coordinate-wise at each layer. The last layer of the network is linear (see Fig. 1).
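As a concrete illustration, a minimal NumPy sketch of such a deep rectified homogeneous network (with hypothetical layer widths; the last layer is linear, as in the definition above) could read:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(weights, x):
    # f_W(x) = W_L relu(W_{L-1} ... relu(W_1 x) ...):
    # ReLU at every layer except the last, which is linear.
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
dims = [8, 16, 16, 16, 1]          # hypothetical layer widths
weights = [rng.standard_normal((dims[i + 1], dims[i]))
           for i in range(len(dims) - 1)]
x = rng.standard_normal(8)
x /= np.linalg.norm(x)             # inputs are normalized, ||x|| <= 1
out = forward(weights, x)
```

Positive homogeneity in x (and in each layer's weights) is immediate from this form: scaling the input by α > 0 scales the output by α.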
Because of the positive homogeneity of the ReLU [that is, σ(αx) = ασ(x) for all x ∈ ℝ and α > 0], one can reparametrize each weight matrix as W_k = ρ_k V_k with ∥V_k∥ = 1 [we choose the Frobenius norm here], pull out the product of the layer norms as ρ = ∏_k ρ_k, and write f_W(x) = ρ f_V(x) = ρ V_L σ(V_{L−1} ⋯ σ(V_1 x) ⋯). Notice that the two networks, f_W(x) and ρ f_V(x), are equivalent reparameterizations of the same function (if ρ = ∏_k ρ_k), but their optimizations generally differ. We adopt in our definition the convention that the norm ρ_j of a convolutional layer is the norm of its filters and not the norm of its associated Toeplitz matrix. The reason is that this is what our novel bounds for CNNs state (see also section 3.3 in [42,43]). The total ρ calculated in this way is the quantity that enters the generalization bounds of the Generalization: Rademacher Complexity of Convolutional Layers section.
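The reparameterization can be checked numerically. The sketch below (NumPy, hypothetical sizes) pulls the Frobenius norm out of every layer and verifies f_W(x) = ρ f_V(x), which holds because each positive scalar ρ_k passes through the ReLU:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(weights, x):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(1)
dims = [8, 16, 16, 1]                     # hypothetical layer widths
W = [rng.standard_normal((dims[i + 1], dims[i]))
     for i in range(len(dims) - 1)]

# W_k = rho_k V_k with ||V_k||_F = 1; rho is the product of the norms.
rhos = [np.linalg.norm(Wk) for Wk in W]
V = [Wk / rk for Wk, rk in zip(W, rhos)]
rho = float(np.prod(rhos))

x = rng.standard_normal(8)
f_W = forward(W, x)
f_V = forward(V, x)                       # positive homogeneity: f_W = rho * f_V
```

The same scalar ρ can instead be attached as a single trainable top-layer parameter, as in Fig. 1B.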
In practice, certain normalization techniques are used to train neural networks, usually either BN or, less frequently, WN. BN consists of standardizing the output of the units in each layer to have zero mean and unit variance with respect to the training set. WN normalizes the weight matrices (section 10 in [4]). In our analysis, we model normalization by normalizing the weight matrices using an LM term added to the loss function. This approach is equivalent to WN.
In the presence of normalization, we assume that all layers are normalized, except for the last one, via the added LM. Thus, the weight matrices {V_k}_{k=1}^L are constrained by the LM term to be close to, and eventually converge to, unit-norm matrices (in fact, to fixed-norm matrices); notice that normalizing V_L and then multiplying the output by ρ is equivalent to letting W_L = ρV_L be unnormalized. Thus, f_V is the network that, at convergence, has L − 1 normalized layers (see Fig. 1B).
We can write the Lagrangian corresponding to the minimization of the regularized loss function under the constraints ∥V_k∥² = 1 in the following manner:

L = (1/N) Σ_{n=1}^N (ρ f_V(x_n) − y_n)² + λρ² + Σ_k ν_k (∥V_k∥² − 1),

where the ν_k values are the LMs and λ > 0 is a predefined parameter.
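In code, this regularized Lagrangian could be sketched as follows (a reconstruction in our notation; `f_vals` stands in for the outputs f_V(x_n) of the normalized network, and the numerical values are hypothetical):

```python
import numpy as np

def lagrangian(rho, f_vals, y, lam, nus, Vs):
    # (1/N) sum_n (rho f_V(x_n) - y_n)^2  +  lam rho^2
    #   +  sum_k nu_k (||V_k||_F^2 - 1)
    data_term = np.mean((rho * f_vals - y) ** 2)
    reg_term = lam * rho ** 2
    lm_term = sum(nu * (np.linalg.norm(V) ** 2 - 1.0)
                  for nu, V in zip(nus, Vs))
    return data_term + reg_term + lm_term

rng = np.random.default_rng(0)
N = 16
f_vals = 0.002 * rng.standard_normal(N)   # stand-ins for f_V(x_n)
y = rng.choice([-1.0, 1.0], size=N)
# Unit-Frobenius-norm matrices: the LM term then vanishes.
Vs = [A / np.linalg.norm(A)
      for A in (rng.standard_normal((4, 4)) for _ in range(2))]
L = lagrangian(rho=500.0, f_vals=f_vals, y=y, lam=1e-4,
               nus=[0.1, 0.1], Vs=Vs)
```

When the constraints are satisfied, the LM term vanishes and only the data term and the WD term λρ² remain.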

Separability and margins
Two important aspects of classification are separability and margins. For a given sample (x, y) (train or test sample) and model f_W, we say that f_W correctly classifies x if y f_W(x) > 0. For a given dataset S = {(x_n, y_n)}_{n=1}^N, separability is defined as the condition in which all training samples are classified correctly, ∀ n ∈ [N]: f̄_n > 0, where f̄_n = y_n f_V(x_n). Furthermore, when Σ_{n=1}^N f̄_n > 0, we say that average separability is satisfied. The minimum of the loss L for λ = 0 is usually zero under our assumption of overparametrization. This corresponds to separability.
Notice that if f_W is a zero-loss solution of the regression problem, then ∀ n: f_W(x_n) = y_n, which is equivalent to ρ f̄_n = 1, where we call f̄_n = y_n f_V(x_n) the margin for x_n. [Notice that the term "margin" is usually defined as min_{n∈[N]} f̄_n. Instead, we use the term "margin for x_n" to distinguish our definition from the usual one.] Multiplying both sides of ρ f_V(x_n) = y_n by y_n and summing over n ∈ [N] gives ρ Σ_n f̄_n = N. Thus, the norm ρ of a minimizer is inversely proportional to its average margin μ = (1/N) Σ_n f̄_n in the limit of λ = 0, with ρ = 1/μ. It is also useful to define the margin variance σ² via M = (1/N) Σ_n f̄_n² = μ² + σ²; notice that both M and σ² are nonnegative.
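The moment identity M = μ² + σ² is just the variance decomposition and can be checked directly. In the toy sketch below (NumPy, hypothetical margin values), `margins` plays the role of the f̄_n:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
y = rng.choice([-1.0, 1.0], size=N)
f = 0.002 * rng.random(N) * y      # hypothetical f_V outputs, sign-matched to y
margins = y * f                    # per-sample margins f̄_n = y_n f_V(x_n)
mu = margins.mean()                # average margin
M = (margins ** 2).mean()          # second moment
sigma2 = margins.var()             # margin variance: M = mu^2 + sigma^2
rho_interp = 1.0 / mu              # rho an equal-margin interpolant would need
```

With margins on the order of 10⁻³, the ρ needed for interpolation is correspondingly large, consistent with the discussion of the dynamics later in the paper.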

Interpolation and quasi-interpolation
Assume that the weights V_k are normalized at convergence. Then:

Lemma 1. For λ = 0, there are solutions that interpolate all data points with the same margin and achieve zero loss. For λ > 0, there are no solutions that have the same margins and interpolate. However, there are solutions with the same margins that quasi-interpolate and are critical points of the gradient.

Proof. Consider the loss

L = (1/N) Σ_{n=1}^N (ρ f̄_n − 1)² + λρ² = ρ²M − 2ρμ + 1 + λρ².

For λ = 0, a zero of the loss, L = 0, implies ∀ n ∈ [N]: ρ f̄_n = 1, that is, f̄_n = 1/ρ. However, for λ > 0, the assumption that all f̄_n values are equal yields M = μ² and, thus, L = ρ²μ² − 2ρμ + 1 + λρ². Setting L = 0 gives a second-order equation in ρ that does not have real-valued solutions for λ > 0. Thus, in the presence of regularization, there exist no solutions that have the same margin for all points and reach zero empirical loss. However, solutions that have the same margin for all points and correspond to a zero gradient with respect to ρ do exist. To see this, assume σ = 0 and set the gradient of L with respect to ρ equal to zero, yielding ρμ² − μ + λρ = 0. This gives ρ = μ/(μ² + λ). This solution yields ρμ = μ²/(μ² + λ) < 1, which corresponds to noninterpolating solutions.
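Both claims of the lemma are easy to verify numerically for equal margins (σ = 0). The sketch below (hypothetical values of μ and λ) checks that the quadratic ρ²μ² − 2ρμ + 1 + λρ² has no real roots for λ > 0 and that the critical point ρ* = μ/(μ² + λ) satisfies ρ*μ < 1:

```python
import numpy as np

mu, lam = 0.002, 1e-4       # hypothetical average margin and weight decay

# Equal-margin loss: L(rho) = (mu^2 + lam) rho^2 - 2 mu rho + 1.
coeffs = [mu ** 2 + lam, -2 * mu, 1.0]
roots = np.roots(coeffs)               # complex for lam > 0: no zero-loss solution

rho_star = mu / (mu ** 2 + lam)        # critical point of dL/drho = 0
margin_product = rho_star * mu         # quasi-interpolation: rho* mu < 1
```

The discriminant of the quadratic is −4λ, so for any λ > 0 the zero-loss condition has no real solution, while the critical point always exists and falls short of interpolation.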
The Neural collapse section shows that the margins [which are never interpolating; interpolation is equivalent to ρ y_n f(x_n) = 1] tend to become equal to each other during convergence, as predicted by the lemma.

Experiments
We performed binary classification experiments using the standard CIFAR-10 dataset [44]. Image samples with class labels 1 and 2 were extracted for the binary classification task. The total numbers of training and test data points are 10,000 and 2,000, respectively. The model architecture in Fig. 1B contains 4 convolutional layers and 2 fully connected layers with hidden sizes of 1,024 and 2. The numbers of channels for the 4 convolutional layers are 32, 64, 128, and 128, and the filter size is 3 × 3. The first fully connected layer has 3,200 × 1,024 = 3,276,800 weights, and the very last layer has 1,024 × 2 = 2,048 weights. At the top layer of our model, there is a learnable parameter ρ (Fig. 1B). In our experiments, instead of using LMs, we used the equivalent WN algorithm (see the proof of the equivalence in [2]), freezing the weights of the WN parameter "g" [45] and normalizing the {V_k}_{k=1}^{L−1} matrices at each layer with respect to their Frobenius norm, while the top layer's norm is denoted by ρ and is the only parameter entering the regularization term (see Eq. 11).
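The frozen-g weight-normalization step amounts to renormalizing each V_k to unit Frobenius norm after every update. A minimal sketch (NumPy, not the actual training code; the gradient is a random stand-in) is:

```python
import numpy as np

def project_unit_frobenius(V):
    # Frozen-g weight normalization: after every update, V_k is rescaled
    # back to unit Frobenius norm (the LM constraint ||V_k||_F = 1).
    return V / np.linalg.norm(V)

rng = np.random.default_rng(3)
V = rng.standard_normal((4, 4))
grad = rng.standard_normal((4, 4))   # stand-in for the gradient of the loss
lr = 0.1
V = project_unit_frobenius(V - lr * grad)
```

Only ρ remains unconstrained, so WD acts on it alone, matching the setup described above.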

Landscape of the empirical risk
As a next step, we establish key properties of the loss landscape. The landscape of the empirical loss contains a set of degenerate zero-loss global minima (for λ = 0) that, under certain overparametrization assumptions, may be connected in a single zero-loss degenerate valley for ρ ≥ ρ_0. Figure 2 shows a landscape that has a saddle at ρ = 0 and then goes to zero loss (the zero crossing level, that is, the coastline) for different values of ρ (look at the boundary of the mountain). As we will see in our analysis of the GF, the descent from ρ = 0 can encounter local minima and saddles with nonzero loss. Furthermore, although the valley of zero loss may be connected, the point of absolute minimum ρ may be unreachable by GF from another point of zero loss, even in the presence of λ > 0, because of the possible nonconvex profile of the coastline (see inset of Fig. 2).
If we assume overparameterized networks with d ≫ N, where d is the number of parameters and N is the number of data points, the study of Cooper [10] proved that the global minima of the unregularized loss function are highly degenerate. [This result is also what one expects from the Bezout theorem for a deep polynomial network. As mentioned in T. Tao's blog, "from the general 'soft' theory of algebraic geometry, we know that the algebraic set V is a union of finitely many algebraic varieties, each of dimension at least d − N, with none of these components contained in any other. In particular, in the underdetermined case N < d, there are no zero-dimensional components of V, and, thus, V is either empty or infinite" (see references in [46]).]

Theorem 1 ([46], informal). Assume an overparameterized neural network f_W with smooth ReLU activation functions and square loss. Then, the minimizers W* achieve zero loss and are highly degenerate with dimension d − N.
Furthermore, for "large" overparametrization, all the global minima-associated with interpolating solutions-are connected within a unique, large valley. The argument is based on Theorem 5.1 of [47]: , informal). If the first layer of the network has at least 2N neurons, where N is the number of training data, and if the number of neurons in each subsequent layer decreases, then every sublevel set of the loss is connected.
In particular, the theorem implies that zero-square-loss minima with different values of ρ are connected. A connected single valley of zero loss does not, however, guarantee that SGD with WD will converge to the global minimum, whose loss is now greater than zero, independently of initial conditions.
For large ρ, we expect many solutions. The existence of several solutions for large ρ is based on the following intuition: The last linear layer is enough, if the layer before the linear classifier has more units than the number of training points, to provide solutions for a given set of random weights in the previous layers (for large ρ and small f̄_n). This also means that the intermediate layers do not need to change much under GD in the iterations immediately after initialization. The emerging picture is a landscape in which there are no zero-loss minima for ρ smaller than a certain minimum ρ, which is network and data dependent. With increasing ρ from ρ = 0, there will be a continuous set of zero-square-loss degenerate minima, with the minimizer representing an interpolating (for λ = 0) or almost interpolating solution (for λ > 0). We expect that λ > 0 results in a "pull" toward the minimum ρ_0 within the local degenerate minimum of the loss.

Fig. 2. A speculative view of the landscape of the unregularized loss, that is, for λ = 0. Think of the loss as a mountain emerging from the water, with zero loss being the water level. ρ is the radial distance from the center of the mountain as shown in the inset, whereas the V_k can be thought of as multidimensional angles in this "polar" coordinate system. There are global degenerate valleys for ρ ≥ ρ_0 with V_1 and V_2 weights of unit norm. The coastline of the loss marks the boundary of the zero-loss degenerate minimum, where L = 0, in the high-dimensional space of ρ and V_k, ∀ k = 1, ⋯, L. The degenerate global minimum is shown here as a connected valley outside the coastline. The red arrow marks the minimum loss with minimum ρ. Notice that, depending on the shape of the multidimensional valley, regularization with a term λρ² added to the loss biases the solution toward small ρ but does not guarantee convergence to the minimum-ρ solution, unlike the case of a linear network.

Landscape for λ > 0
In the case of λρ 2 > 0, the landscape may become a Morse-Bott or Morse function with shallow almost zero-loss minima. The question is open because the regularizer is not the sum of squares.

GF equations
The GF equations are as follows (see also [8]):

ρ̇ = −∂L/∂ρ = (2/N) Σ_n (y_n − ρ f_n) f_n − 2λρ = 2(μ − ρM − λρ)

V̇_k = −∂L/∂V_k = (2ρ/N) Σ_n (y_n − ρ f_n) ∂f_n/∂V_k − 2ν_k V_k

In the second equation, we can use the unit-norm constraint on the ∥V_k∥ to determine the LMs ν_k, using the following structural property of the gradient:

Lemma 2 (Lemma 2.1 of [48]). Since f_V is 1-homogeneous with respect to the weights of each layer, ⟨∂f_V(x)/∂V_k, V_k⟩ = f_V(x).

The constraint ∥V_k∥² = 1 implies ⟨V_k, V̇_k⟩ = 0, which, using the lemma above, gives ν_k = ρ(μ − ρM). Thus, the GF is the dynamical system defined by the two equations above with this choice of the ν_k. Hence, at critical points (when ρ̇ = 0 and V̇_k = 0), using the definitions of μ and M, we obtain μ = ρ(M + λ), that is, ρ = μ/(M + λ), and the value of the loss is L = 1 − ρ²(M + λ). Thus, the gap to interpolation due to λ > 0 is the difference between this critical value and the zero loss attainable for λ = 0. Notice that since the V_k values are bounded functions, they must take their maximum and minimum values on their compact domain, the sphere, because of the extreme value theorem. In addition, notice that for normalized V_k, V_kᵀ V̇_k = 0 always holds; that is, V_k can only rotate. If V̇_k = 0, then the weights V_k are given by V_k = (ρ/(N ν_k)) Σ_n ε_n y_n ∂f_n/∂V_k, where ε_n = 1 − ρ f̄_n. [This overdetermined system of equations, with as many equations as weights, can also be used to reconstruct the training set from the V_k, the y_n, and the f̄_n.]
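A discretized sketch of this constrained flow follows (NumPy; the ρ equation is our reconstruction ρ̇ = 2(μ − ρ(M + λ)), and the LM term is implemented as the removal of the radial component of the gradient, so that the weights can only rotate on the unit sphere):

```python
import numpy as np

def tangent_step(V, grad, lr):
    # The LM nu_k cancels the radial component of the gradient, so the
    # flow keeps ||V_k||_F = 1 and V_k can only rotate: <V_k, dV_k> = 0.
    radial = np.sum(grad * V) * V            # component of grad along V
    dV = -(grad - radial)
    V_new = V + lr * dV
    return V_new / np.linalg.norm(V_new)     # correct first-order norm drift

def rho_dot(rho, margins, lam):
    # Reconstructed rho flow: rho_dot = 2 (mu - rho (M + lam)),
    # with mu, M the first two moments of the margins.
    mu = margins.mean()
    M = (margins ** 2).mean()
    return 2.0 * (mu - rho * (M + lam))
```

At ρ = μ/(M + λ), the ρ flow vanishes, matching the critical-point condition.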

Convergence
A favorable property of optimization of the square loss is the convergence of the relevant parameters. With GD, the loss function cannot increase, while the trainable parameters may potentially diverge. A typical scenario of this kind happens with cross-entropy minimization, where the weights typically tend to infinity. In light of the theorems in the Landscape of the empirical risk section, we could hypothetically think of training dynamics in which the loss function's value L(ρ, {V_k}_{k=1}^L) decreases, while ρ oscillates periodically within some interval. As we show next, this is impossible when the loss function's value converges to zero: If the loss converges to zero, then ρ̇ → 0 and V̇_k → 0.

Proof. Note that if lim_{t→∞} L(ρ, {V_k}_{k=1}^L) = 0, then, for all n ∈ [N], we have (ρ f_n − y_n)² → 0. In particular, ρ f_n → y_n and ρ f̄_n → 1. Hence, we conclude that μρ → 1. Therefore, by Lemma 4, ρ̇ → 0. We note that ρ → 0 would imply ρ f_n → 0, which contradicts ρ f_n → y_n, since the labels y_n are nonzero. To see why V̇_k → 0, we recall that V̇_k is a sum of terms that are each proportional to y_n − ρ f_n, which tend to zero.

So far, we have assumed convergence of GF, GD, or SGD to zero loss. Convergence does not seem too far-fetched given overparametrization and the associated high degeneracy of the global minima (see the Landscape of the empirical risk section and the theorems there). Proofs of convergence of descent methods have, however, been lacking until a recent paper [49] presented a new criterion for convergence of GD and used it to show that GD with proper initialization converges to a global minimum. The result has technical limitations that are likely to be lifted in the future: It assumes that the activation function is smooth, that the input dimension is greater than or equal to the number of data points, and that the descent method is GF or GD.

Qualitative dynamics
We consider the dynamics of the model in Fig. 1B. During training, the norm of each layer's weight matrix is kept constant by the LM constraint that is applied to all layers but the last one, thus leaving ρ at the top free to change depending on the dynamics. Recall that ∀ n ∈ [N]: 0 ≤ |f̄_n| ≤ 1, because the assumption ∥x∥ ≤ 1 yields ∥f_V(x)∥ ≤ 1, taking into account the definition of the ReLU and the fact that matrix norms are submultiplicative. Depending on the number of layers, the maximum margin that the network can achieve for a given dataset is usually much smaller than the upper bound 1, because the weight matrices have unit norm and the bound ≤1 is conservative. Thus, to guarantee interpolation, namely, ρ f̄_n = 1, ρ must be substantially larger than 1. For instance, in the experiments plotted in this paper, the maximal f̄_n is ≈0.002, and, thus, the ρ needed for interpolation (for λ = 0) is on the order of 500. We assume then that for a given dataset, there is a maximal value of f̄_n that allows interpolation. Correspondingly, there is a minimum value of ρ that we call, as mentioned earlier, ρ_0. We now provide some intuition for the dynamics of the model. Notice that ρ(t) = 0 and f_V(x) = 0 (if all weights are zero) are unstable critical points. A small perturbation will either result in ρ̇ < 0 with ρ going back to zero or in ρ growing if the average margin is sufficiently positive, that is, μ > λρ > 0.
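The bound |f̄_n| ≤ 1 follows because the ReLU is 1-Lipschitz and the spectral norm of each layer is bounded by its Frobenius norm. The sketch below samples random unit-Frobenius-norm layers (hypothetical sizes) and confirms that the outputs stay below 1 and, in fact, far below it, as the conservativeness of the bound suggests:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(5)
dims = [8, 16, 16, 1]                      # hypothetical layer widths
V = []
for i in range(len(dims) - 1):
    A = rng.standard_normal((dims[i + 1], dims[i]))
    V.append(A / np.linalg.norm(A))        # unit Frobenius norm per layer

max_out = 0.0
for _ in range(200):
    x = rng.standard_normal(8)
    x /= np.linalg.norm(x)                 # ||x|| = 1
    h = x
    for Vk in V[:-1]:
        # ReLU is 1-Lipschitz and ||Vk x|| <= ||Vk||_F ||x||, so ||h|| never grows.
        h = relu(Vk @ h)
    f = float(V[-1] @ h)
    max_out = max(max_out, abs(f))
```

The observed maximum is typically well under 1, mirroring the small margins (≈0.002) reported in the experiments.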

Small ρ initialization
First, we consider the case where the neural network is initialized with a smallish ρ, that is, ρ < ρ_0. Assume then that at some time t, μ > 0, that is, average separability holds. Notice that if the f̄_n values were zero-mean random variables, then there would be a 50% chance for average separability to hold. Then, Eq. 5 shows that ρ̇ > 0. If full separability takes place, that is, ∀ n: f̄_n > 0, then ρ̇ remains positive at least until ρ = 1. This is because, for λ = 0, Eq. 5 implies that ρ̇ ≥ 2μ(1 − ρ), since M ≤ μ when all margins satisfy 0 ≤ f̄_n ≤ 1. In general, assuming eventual convergence, ρ may grow nonmonotonically, that is, there may be oscillations in ρ for "short" intervals, until it converges to ρ_0.
To see this, consider the following lemma, which gives a representation of the loss function in terms of ρ, ρ̇, and μ.

The square loss can be written as L = 1 − ρμ − (1/2)ρρ̇. To see this, use Eq. 5 to substitute ρM = μ − ρ̇/2 − λρ into L = ρ²M − 2ρμ + 1 + λρ², which yields L = 1 − ρμ − (1/2)ρρ̇, as desired.
Following this lemma, if ρ̇ becomes negative during training, then the average margin μ must increase, since GD can only decrease L, never increase it. In particular, this implies that ρ̇ cannot be negative for long periods of time. Notice that short periods of decreasing ρ are "good", since they increase the average margin.
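The lemma's identity can be sanity-checked numerically, assuming, in our notation, the flow ρ̇ = 2(μ − ρM − λρ) and the loss L = ρ²M − 2ρμ + 1 + λρ² (the numerical values below are random and hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = rng.random()                         # average margin (hypothetical)
sigma2 = 0.1 * rng.random()               # margin variance
M = mu ** 2 + sigma2                      # second moment, M = mu^2 + sigma^2
lam = 0.01 * rng.random()
rho = 5 * rng.random()
rho_dot = 2.0 * (mu - rho * M - lam * rho)          # the rho flow
loss_direct = rho ** 2 * M - 2 * rho * mu + 1 + lam * rho ** 2
loss_lemma = 1 - rho * mu - 0.5 * rho * rho_dot     # lemma's representation
```

The two expressions agree identically, independently of the values of μ, M, λ, and ρ, since the substitution is exact.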
If ρ̇ turns negative, then it means that it has crossed ρ̇ = 0. This may be a critical point for the system if the values of V_k corresponding to V̇_k = 0 are compatible (since the matrices {V_k}_{k=1}^L determine the value of f_n). We assume that this critical point, either a local minimum or a saddle, can be avoided by the randomness of SGD or by an algorithm that restarts optimization when a critical point is reached for which L > 0.
Thus, ρ grows (nonmonotonically) until it reaches an equilibrium value close to ρ_0. Recall that for λ = 0, this corresponds to a degenerate global minimum L = 0, usually resulting in a large attractive basin in the loss landscape. For λ = 0, a zero value of the loss (L = 0) implies interpolation, ρ y_n f_n = 1 for all n: Thus, all the f_n have the same value, that is, all the margins are the same.
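The small-ρ regime above can be illustrated with a minimal numerical sketch. The margin statistics μ and M are treated here as hypothetical frozen constants (in a real network they evolve with the weights); the point is only that ρ grows to the equilibrium μ/(M + λ):

```python
import numpy as np

# Euler integration of the rho dynamics, rho_dot = 2 * (mu - rho * (M + lam)),
# assuming (hypothetically) that the margin statistics mu (mean margin) and
# M (second moment, with mu**2 <= M <= mu) stay frozen.  In a real network
# they evolve with the V_k; here rho simply relaxes to mu / (M + lam).
mu, M, lam = 0.5, 0.3, 0.01   # toy values consistent with margins in [0, 1]
rho, dt = 1e-3, 0.01          # small-rho initialization
for _ in range(10_000):
    rho_dot = 2.0 * (mu - rho * (M + lam))
    rho += dt * rho_dot

rho_eq = mu / (M + lam)  # predicted equilibrium value (the rho_0 of the text)
```

With these toy constants the iteration converges monotonically; the same sketch with ρ initialized above the equilibrium shows ρ̇ < 0, matching the large-ρ discussion that follows.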

Large ρ initialization
If we initialize a network with large norm ρ > ρ_0, then Eq. 1 shows that ρ̇ < 0. This implies that the norm of the network will decrease until, eventually, an equilibrium is reached. In fact, since ρ ≫ 1, it is likely that there exists an interpolating (or near-interpolating) solution with a ρ that is very close to the initialization. In fact, for large ρ, it is usually empirically possible to find a set of weights V_L such that ρ y_n f_n ≈ 1. To understand why this may be true, recall that if there are at least N units in the top layer of the network (layer L) with given activities and ρ ≫ ρ_0, then there exist values of V_L that yield interpolation due to Theorem 2. In other words, it is easy for the network to interpolate with small values of f_n. These large-ρ, small-f_n solutions are reminiscent of the NTK solutions [24], where the parameters do not move too far from their initialization. A formal version of the same argument is based on the following result.
We now assume that the network, in the absence of WD, has converged to an interpolating solution such that ∀n ∈ [N]: f̄_n = f* = 1. Further assume that the classifier V_L and the last-layer features h are aligned, that is, y_n h(x_n) ∝ V_Lᵀ for all n, where the vector h(x_n) denotes the activities of the units in the last layer. Then, perturbing V_L by a component orthogonal to the features h(x_n) yields a network f that is still an interpolating solution, corresponding to a critical point of the gradient but with a larger ρ.

Proof. Consider the margins of the network
f̄_n = y_n V_L h(x_n). Since the classifier weights and the last-layer features are aligned (as may happen for λ → 0), we have that y_n h(x_n) ∝ V_Lᵀ and, thus, f̄_n = ∥V_L∥ ∥h(x_n)∥ = ∥h(x_n)∥. We also have from the interpolating condition that f̄_n = f* = 1, which means ∥h(x_n)∥ = 1. Adding to V_L any perturbation orthogonal to the span of the h(x_n) leaves every product V_L h(x_n), and hence every margin, unchanged. Putting all this together, we have f̄_n = 1 after the perturbation, which concludes the proof.
Thus, if a network exists providing an interpolating solution with a minimum ρ and V_L ∝ h, then there exist networks that differ only in the last layer V_L and are also interpolating but with larger ρ. As a consequence, there is a continuum of solutions that differ only in the weights V_L of the last layer.
Of course, there may be interpolating solutions corresponding to different sets of weights in layers below L, to which the above statement does not apply. These observations suggest that there is a valley of minimizers for increasing ρ, starting from a zero-loss minimizer that has the NC property (see Neural Collapse).
In Fig. 3, we show the dynamics of ρ alongside train loss and test error, with and without WD in the top and bottom rows of Fig. 3, respectively. L decreases as μ increases and σ decreases. The figures show that in our experiments, the large margins of some of the data points decrease during GD, contributing to a decrease in σ. Furthermore, Eq. 11 suggests that for small ρ, the term dominating the decrease in L is −2ρμ. For larger ρ, the term ρ²M = ρ²(σ² + μ²) becomes important: Eventually, L decreases because σ² decreases. The regularization term, for standard small values of λ, is relevant only in the final phase, when ρ is on the order of ρ_0. For λ = 0, the loss at the global equilibrium (which happens at ρ = ρ_0) is L = 1 − μ²/M = σ²/(σ² + μ²). To sum up, starting from small initialization, gradient techniques will explore critical points with ρ growing from zero. Thus, quasi-interpolating solutions with small ρ (corresponding to large-margin solutions) may be found before the many large-ρ quasi-interpolating solutions that have worse margins (see Fig. 3, top and bottom rows). These dynamics can take place even in the absence of regularization; however, λ > 0 makes the process more robust and biases it toward small ρ.

Generalization: Rademacher Complexity of Convolutional Layers

Classical Rademacher bounds
In this section, we analyze the test performance of the learned neural network. Following the standard learning setting, we assume that there is some underlying distribution P of labeled samples (x, y) and that the training data D = {(x_i, y_i)}_{i=1}^N consist of N independent and identically distributed samples from P. The model f_W is assumed to perfectly fit the training samples, i.e., f_W(x_i) = y_i = ±1.
We would like to upper bound the classification error err(f_W) ≔ P_{(x,y)∼P}[sign(f_W(x)) ≠ y] of the learned function f_W in terms of the number of samples N and the norm ρ of f_W. This analysis is based on the following data-dependent measure of the complexity of a class of functions.
Definition (Rademacher complexity). Let ℍ be a set of real-valued functions h: X → ℝ defined over a set X. Given a fixed sample S = (x_1, …, x_m) ∈ X^m, the empirical Rademacher complexity of ℍ is defined as R_S(ℍ) ≔ E_σ[sup_{h∈ℍ} (1/m) Σ_{i=1}^m σ_i h(x_i)]. The expectation is taken over σ = (σ_1, …, σ_m), where the σ_i ∈ {±1} are independent and identically distributed and uniformly distributed.
The Rademacher complexity measures the ability of a class of functions to fit noise. The empirical Rademacher complexity has the added advantage that it is data dependent and can be measured from finite samples.
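As an illustration of the definition, for the simple class {x ↦ ⟨w, x⟩ : ∥w∥ ≤ 1} the supremum can be evaluated in closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo and compared with the classical bound max_i ∥x_i∥/√m. The sample sizes and data below are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # put samples on the unit sphere

# For H = {x -> <w, x> : ||w|| <= 1}, the sup in the definition is attained
# at w = u/||u|| with u = (1/m) * sum_i sigma_i x_i, so the sup equals ||u||.
draws = []
for _ in range(2_000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    draws.append(np.linalg.norm(sigma @ X) / m)
rad_hat = float(np.mean(draws))

bound = 1.0 / np.sqrt(m)   # classical bound: max_i ||x_i|| / sqrt(m)
```

The estimate sits just below the 1/√m bound, and it is computed entirely from the finite sample, which is the data-dependent advantage mentioned above.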
Theorem 3. Let D = {(x_i, y_i)}_{i=1}^N be a dataset of independent and identically distributed samples drawn from P. Then, with probability at least 1 − δ over the selection of D, for any f_W that perfectly fits the data (i.e., f_W(x_i) = y_i), we have err(f_W) ≤ 2(ρ + 1) R_N(F̃) + 3 √(log(2(⌊ρ⌋ + 1)(⌊ρ⌋ + 2)/δ)/(2N)), where F̃ denotes the set of normalized networks.

Proof. Let t ∈ ℕ ∪ {0}, and let F_t be the set of networks f_W with ∏_{i=1}^L ∥W_i∥ ≤ t + 1.
We consider the ramp loss function ℓ_ramp(ŷ, y) ≔ min(1, max(0, 1 − ŷy)). By Theorem 3.3 in [50], for any t ∈ ℕ ∪ {0}, with probability at least 1 − δ/((t + 1)(t + 2)), for any function f_W ∈ F_t, we have E_{(x,y)∼P}[ℓ_ramp(f_W(x), y)] ≤ (1/N) Σ_{i=1}^N ℓ_ramp(f_W(x_i), y_i) + 2 R_N(F_t) + 3 √(log(2(t + 1)(t + 2)/δ)/(2N)). We note that for any function f_W for which f_W(x_i) = y_i = ±1, we have ℓ_ramp(f_W(x_i), y_i) = 0. In addition, for any function f_W and pair (x, y), I[sign(f_W(x)) ≠ y] ≤ ℓ_ramp(f_W(x), y). Therefore, we conclude that, with probability at least 1 − δ/((t + 1)(t + 2)), for any interpolating function f_W ∈ F_t, we have err(f_W) ≤ 2 R_N(F_t) + 3 √(log(2(t + 1)(t + 2)/δ)/(2N)). We notice that, by the homogeneity of ReLU neural networks, R_N(F_t) ≤ (t + 1) · R_N(F̃). By a union bound over all t ∈ ℕ ∪ {0} (using Σ_{t≥0} 1/((t + 1)(t + 2)) = 1), the bound holds uniformly for all t ∈ ℕ ∪ {0} and f_W ∈ F_t with probability at least 1 − δ. For each f_W with ∏_{i=1}^L ∥W_i∥ = ρ, we can apply the bound with t = ⌊ρ⌋, since f_W ∈ F_t, and obtain the desired bound. The above theorem provides an upper bound on the classification error of the trained network f_W that perfectly fits the training samples. The upper bound decomposes into 2 main terms: The first term is proportional to the norm ρ of the trained model and to the Rademacher complexity R_N(F̃) of the set of normalized neural networks, and the second term scales as √(log(ρ/δ)/N). As shown in Theorem 1 in [51], R_N(F̃) is upper bounded by (√(2 log(2) L) + 1)/√N, assuming that the samples are taken from the d-dimensional ball B^d of radius 1. The overall bound is then (assuming zero training error) err(f_W) ≲ ρ (√(2 log(2) L) + 1)/√N + √(log(ρ/δ)/N). We note that while the mentioned bound on R_N(F̃) depends on the architecture of the network, it does not depend in an explicit way on the training set. However, as shown in Eq. 6 in [51], the bound may be improved further if the stable ranks of the weight matrices are low, which happens when the weight matrices have low rank. In practice, the value of R_N(F̃) depends not only on the network architecture (e.g., convolutional) but also on the underlying optimization (e.g., L_2 versus L_1) and on the data (e.g., rank).

Relative generalization
We now consider 2 solutions with zero empirical loss of the square loss regression problem, obtained with the same ReLU deep network and corresponding to 2 different minima with 2 different ρ values. Let us call them g_a(x) = ρ_a f_a(x) and g_b(x) = ρ_b f_b(x).
Using the notation of this paper, the functions f_a and f_b correspond to networks with normalized weight matrices at each layer. Let us assume that ρ_a < ρ_b. We now use Eq. 16 and the fact that the empirical loss is the same (zero) for both functions to write bounds of the form L_0(f_a) ≤ 2(ρ_a + 1) R_N(F̃) + 3 √(log(2(⌊ρ_a⌋ + 1)(⌊ρ_a⌋ + 2)/δ)/(2N)) and L_0(f_b) ≤ 2(ρ_b + 1) R_N(F̃) + 3 √(log(2(⌊ρ_b⌋ + 1)(⌊ρ_b⌋ + 2)/δ)/(2N)). Thus, the upper bound for the expected error L_0(f_a) is better than the bound for L_0(f_b). Of course, this is just an upper bound. As a consequence, this result does not guarantee that a solution with smaller ρ will always have a smaller expected error than a solution with larger ρ.
Notice that this generalization claim is just a relative claim about different solutions obtained with the same network trained on the same training set. Figure 4 shows clearly that increasing the percentage of random labels increases the ρ that is needed to maintain interpolation (thus decreasing the margin) and that, at the same time, the test error increases, as expected. This monotonic relation between margin and accuracy at test seems to break down for small differences in margin, as shown in Fig. 5, although the significance of the effect is unclear. Of course, this kind of behavior is not inconsistent with an upper bound.

Novel bounds for sparse networks
In the Classical Rademacher bounds section, we describe generic bounds on the Rademacher complexity of deep neural networks. In these cases, ρ measures the product of the Frobenius norms of the network's weight matrices in each layer. For convolutional networks, however, the operation in each layer is computed with a kernel, described by a vector w, that acts on each patch of the input separately. Therefore, a convolutional layer is represented by a Toeplitz matrix W whose blocks are each given by w. A naive application of Eq. 16 to convolutional networks gives a large bound, because the Frobenius norm of the Toeplitz matrix equals the norm of the kernel multiplied by the square root of the number of patches.
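The relation between the kernel norm and the Frobenius norm of the associated Toeplitz matrix is easy to verify numerically in the nonoverlapping case; the sizes below are arbitrary:

```python
import numpy as np

# Build the (block-)Toeplitz matrix of a 1D convolution with kernel w and
# nonoverlapping patches of size K, acting on inputs of dimension D = P * K.
K, P = 2, 8
D = K * P
rng = np.random.default_rng(1)
w = rng.normal(size=K)

W = np.zeros((P, D))
for i in range(P):
    W[i, i * K:(i + 1) * K] = w   # row i applies the kernel w to patch i

# The matrix acts patchwise: (W x)_i = <w, x_patch_i>.
x = rng.normal(size=D)
patchwise = np.array([w @ x[i * K:(i + 1) * K] for i in range(P)])

frob_W = np.linalg.norm(W)        # Frobenius norm of the full Toeplitz matrix
kernel_norm = np.linalg.norm(w)   # norm of the kernel alone
# For nonoverlapping patches, ||W||_F = sqrt(P) * ||w||.
```

This √P factor per layer is exactly what the kernel-based bounds of this section avoid.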
In this section, we provide an informal analysis of the Rademacher complexity, showing that it can be reduced by exploiting the first of the following 2 properties of convolutional layers: (a) the locality of the convolutional kernels and (b) weight sharing. These properties allow us to bound the Rademacher complexity by taking the products of the norms of the kernels w instead of the norms of the associated Toeplitz matrices W. Here, we outline the results, with more precise statements and proofs to be published separately.
We consider the case of one-dimensional convolutional networks with nonoverlapping patches and one channel per layer. For simplicity, we assume that the input of the network lies in ℝ^d, with d = 2^L, and that the stride and the kernel size of each layer are 2. The analysis can be easily extended to kernels of different sizes. This means that the network h(x) can be represented as a binary tree, where the first layer combines the input coordinates in pairs (x_{2i−1}, x_{2i}), the second layer combines the resulting pairs of units, and so on, up to the output neuron. This means that we can write the ith row of the Toeplitz matrix of the lth layer as (0, …, 0, W_l, 0, …, 0), where the kernel W_l ∈ ℝ² occupies coordinates 2i − 1 and 2i. We define a set F of neural networks of this form, where each layer is followed by a ReLU activation function and ∏_{l=1}^L ∥W_l∥ ≤ ρ.

Theorem 4.
Let  be the set of binary-tree-structured neural networks over ℝ d , with d = 2 L for some natural number L. Let X = {x 1 , …, x N } ⊂ ℝ d be a set of samples. Then, Proof sketch. First, we rewrite the Rademacher complexity in the following manner: Next, by the proof of Lemma 1 in [51], we obtain that By applying this peeling process L times, we obtain the following inequality: where the factor 2 L − 1 is obtained because the last layer is linear (see [52]). We note that a better bound can achieved when using the reduction introduced in [51], which would give a factor of √ 2 log (2)L + 1 instead of 2 L − 1 . Thus, one ends up with a bound scaling as the product of the norms of the kernel at each layer. The constants may change depending on the architecture, the number of patches, the size of the patches, and their overlap.
This special nonoverlapping case can be extended to the general convolutional case. In fact, a proof of the following conjecture will be provided in [53].

Conjecture 1. If a convolutional layer has overlap among its patches, then the nonoverlap bound, in which ρ is the product of the norms of the kernels at each layer, is multiplied at each layer by a factor of order √(K/(K − O)), where K is the size of the kernel (number of components) and O is the size of the overlap.
Sketch proof. Call P the number of patches and O the overlap. With no overlap, PK = D, where D is the dimensionality of the input to the layer. In general, P = (D − O)/(K − O). It follows that a layer with the most overlap (O = K − 1) can add at most a factor < ∥x∥√K to the bound. Notice that we assume that each component of the x_i, averaged across i, has bounded norm.

The bound is surprisingly small
In this section, we have derived bounds for convolutional networks that may potentially be orders of magnitude smaller than similar bounds for dense networks. We note that a naive application of Corollary 2 in [29] to the network that we used in Theorem 4 would require treating the network as if it were a dense network. In this case, the bound would be proportional to the product of the norms of each of the Toeplitz matrices of the network. Since layer l has P_l = 2^{L−l} nonoverlapping patches, the total bound becomes proportional to ρ ∏_{l=1}^L √(P_l) = ρ 2^{L(L−1)/4}, which is much larger than the bound we obtained earlier. The key point is that the Rademacher bounds achievable for sparse networks are much smaller than for dense networks. This suggests that convolutional networks with local kernels may generalize much better than dense networks, which is consistent in spirit with approximation theory results (compositionally sparse target functions can be approximated by sparse networks without incurring the curse of dimensionality, whereas generic functions cannot be approximated by dense networks without the curse). They also confirm the empirical success of convolutional networks compared to densely connected networks.
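A quick computation, under the same assumptions as Theorem 4 (nonoverlapping patches of size 2, unit-norm kernels), illustrates the size of the gap between the 2 products:

```python
import numpy as np

# Gap between the sparse (product of kernel norms) and dense (product of
# Toeplitz Frobenius norms) complexity measures for the binary-tree network
# of Theorem 4.  Layer l has P_l = 2**(L - l) nonoverlapping patches of size
# 2, and for nonoverlapping patches ||W_l||_F = sqrt(P_l) * ||w_l||.
L = 10                                  # input dimension d = 2**L = 1024
rng = np.random.default_rng(2)
kernels = [rng.normal(size=2) for _ in range(L)]
kernels = [w / np.linalg.norm(w) for w in kernels]   # unit-norm kernels

sparse_product = np.prod([np.linalg.norm(w) for w in kernels])   # equals 1
dense_product = np.prod(
    [np.sqrt(2 ** (L - l)) * np.linalg.norm(w)
     for l, w in enumerate(kernels, start=1)]
)
# The ratio is prod_l sqrt(2**(L - l)) = 2**(L * (L - 1) / 4).
```

For L = 10 the ratio is 2^22.5, i.e., roughly 6 orders of magnitude, which is the sense in which the kernel-norm bound is "surprisingly small".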
It is also important to observe that the bounds we obtained may be nonvacuous in the overparameterized case, unlike Vapnik-Chervonenkis bounds, which depend on the number of weights and are therefore always vacuous in overparameterized situations. With our norm-based bounds, it is, in principle, possible to have overparametrization and interpolation simultaneously with nonvacuous generalization bounds: This is suggested by Fig. 6. Figure 7 shows the case of a 3-layer convolutional network with a total of ≈20,000 parameters.

Neural Collapse
A recent paper [12] described 4 empirical properties of the terminal phase of training (TPT) of deep networks using the cross-entropy loss function. TPT begins at the epoch where the training error first vanishes. During TPT, the training error stays effectively zero, while the training loss progressively decreases. Direct empirical measurements expose an inductive bias that the authors call NC, involving 4 interconnected phenomena. Informally, (NC1) cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class means. (NC2) The class means collapse to the vertices of a simplex equiangular tight frame (ETF). (NC3) Up to rescaling, the last-layer classifiers collapse to the class means or, in other words, to the simplex ETF (i.e., to a self-dual configuration). (NC4) For a given activation, the classifier's decision collapses to simply choosing whichever class has the closest train class mean (i.e., the nearest class center decision rule).
We now formally define the 4 NC conditions. We consider a network f_W(x) = W_L h(x), where h(x) ∈ ℝ^p denotes the last-layer feature embedding of the network and W_L ∈ ℝ^{C×p} contains the parameters of the classifier. The network is trained on a C-class classification problem on a balanced dataset D = {(x_cn, y_cn)}_{n=1,c=1}^{N,C} with N samples per class. We can compute the per-class mean of the last-layer features, μ_c ≔ (1/N) Σ_{n=1}^N h(x_cn), and the global mean of all features, μ_G ≔ (1/C) Σ_{c=1}^C μ_c. Furthermore, the second-order statistics of the last-layer features are computed as Σ_W ≔ (1/CN) Σ_{c=1}^C Σ_{n=1}^N (h(x_cn) − μ_c)(h(x_cn) − μ_c)ᵀ and Σ_B ≔ (1/C) Σ_{c=1}^C (μ_c − μ_G)(μ_c − μ_G)ᵀ. Here, Σ_W measures the within-class covariance of the features, Σ_B is the between-class covariance, and Σ_T is the total covariance of the features (Σ_T = Σ_W + Σ_B).
We can now list the formal conditions for NC: • NC1 (variability collapse). Variability collapse states that the variance of the feature embeddings of samples from the same class tends to zero or, formally, Tr(Σ_W) → 0. • NC2 (convergence to simplex ETF). |∥μ_c − μ_G∥_2 − ∥μ_{c′} − μ_G∥_2| → 0, that is, the centered class means of the last-layer features become equinorm. Moreover, if we define μ̃_c ≔ (μ_c − μ_G)/∥μ_c − μ_G∥_2, then we have ⟨μ̃_c, μ̃_{c′}⟩ → −1/(C − 1) for c ≠ c′, that is, the centered class means are also equiangular. Together with the equinorm condition, this also implies that Σ_c μ̃_c = 0, i.e., the centered features lie on a simplex. • NC3 (self-duality). If we collect the centered class means into a matrix M = [μ_c − μ_G], then we have W_Lᵀ/∥W_L∥_F − M/∥M∥_F → 0. • NC4 (nearest class center rule). For a given activation h, the classifier's decision satisfies argmax_c ⟨w_c, h⟩ = argmin_c ∥h − μ_c∥_2, where w_c is the cth row of W_L.
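The NC conditions above can be checked directly on synthetic, fully collapsed features; the following sketch builds class means forming a simplex ETF (sizes are arbitrary toy choices) and verifies the NC1 and NC2 statistics numerically:

```python
import numpy as np

# Synthetic check of NC1-NC2: build class means that form a simplex ETF and
# give every sample exactly its class mean as a feature (full collapse).
C, p, N = 4, 16, 50
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(p, C)))   # p x C with orthonormal columns
M = U @ (np.eye(C) - np.ones((C, C)) / C)      # columns are centered class means
H = np.repeat(M.T, N, axis=0)                  # (C*N, p) collapsed features
labels = np.repeat(np.arange(C), N)

mu_c = np.stack([H[labels == c].mean(axis=0) for c in range(C)])
mu_G = H.mean(axis=0)
trace_Sigma_W = sum(
    float((h - mu_c[c]) @ (h - mu_c[c])) for c, h in zip(labels, H)
) / (C * N)                                     # NC1: within-class variance
centered = mu_c - mu_G
norms = np.linalg.norm(centered, axis=1)        # NC2: equinorm
tilde = centered / norms[:, None]
cosines = tilde @ tilde.T                       # NC2: equiangularity
```

For collapsed features, Tr(Σ_W) vanishes, the centered means are equinorm, and every off-diagonal cosine equals −1/(C − 1), as the definitions require.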

Related Work on NC
Since the empirical observation of NC was made in [12], a number of papers have studied the phenomenon in the so-called unconstrained features regime [32-34,39,40]. The basic assumption underlying these proofs is that the features of a deep network at the last layer can essentially be treated as free optimization variables, which converts the problem of finding the parameters of a deep network that minimize the training loss into a matrix factorization problem of factoring the one-hot class labels Y ∈ ℝ^{C×CN} into the last-layer weights W ∈ ℝ^{C×p} and the last-layer features H ∈ ℝ^{p×CN}. In the case of the squared loss, the problem that they study is min_{W,H} (1/2)∥WH − Y∥_F² plus norm regularization on W and H. In this section, we show instead that we can predict the existence of NC and its properties as a consequence of our analysis of the dynamics of SGD on deep binary classifiers trained on the square loss function with LN and WD, without any additional assumption. We first consider the case of binary classification and show that NC follows from the analysis of the dynamics of the square loss in the previous sections. The loss function is the same one defined in Eq. 1, and we consider minimization using SGD with a batch size of 1. After establishing NC in this familiar setting, we consider the multiclass setting, where we derive the conditions of NC from an analysis of the squared loss function with WD and WN.

Binary classification
We prove in this section that NC follows from the following property of the landscape of the squared loss that we analyzed in the previous section. Property 1 [symmetric quasi-interpolation (binary classification)]. Consider a binary classification problem with inputs in a feature space X and label space {+1, −1}. A classifier f_W: X → ℝ symmetrically quasi-interpolates a training dataset D = {(x_n, y_n)}_{n=1}^N if, for all training examples, f̄_W(x_n) = y_n f_W(x_n) = 1 − ε, where ε is the interpolation gap. We prove first that the property follows, without any assumption at convergence, from our previous analysis of the landscape of the squared loss for binary classification. Lemma 6. An overparameterized deep ReLU network for binary classification trained to convergence under the squared loss in the presence of WD and WN satisfies the symmetric quasi-interpolation property. Furthermore, the gap to interpolation of the regularized network is ε = λ/(μ̄² + λ), where μ̄ is the common margin of the normalized network. We recall the definitions, made earlier in the "Classification with square loss minimization" section, of the margin f̄_i = y_i f_i and of the first- and second-order sample statistics of the margin, μ = (1/N) Σ_i f̄_i and M = (1/N) Σ_i f̄_i². We consider deep networks that are sufficiently overparameterized to attain 100% accuracy on the training dataset, which means f̄_i > 0. Since the weights {V_k}_{k=1}^L of the deep network are normalized and the data x_i lie within the unit norm ball, we have that |f̄_i| ≤ 1. Although f̄_i could take values close to 1, the typically observed values of f̄_i in our experiments are approximately 5 × 10⁻³. For our purposes, it suffices to note that there exists a maximum possible margin μ̄_max, such that 0 < f̄_i ≤ μ̄_max for all training examples, for a given dataset and network architecture.
Using these definitions, we can rewrite the deep network training problem as min ρ²(M + λ) − 2ρμ + 1. All critical points (including minima) need to satisfy ∂L/∂ρ = 0, from which we get ρ = μ/(M + λ). If we plug this back into the loss, then our minimization problem becomes min 1 − μ²/(M + λ). Hence, to minimize the loss, we have to find {V_k}_{k=1}^L that maximize μ² and minimize σ². Since we assumed that the network is expressive enough to attain any value, the loss is minimized when σ² = 0 and μ = μ̄_max. Thus, all training examples have the same margin.
If σ² → 0, then all margins tend to the same value, f̄_i → μ̄, and the optimum value of ρ is ρ = μ̄/(μ̄² + λ). This means that the gap to interpolation is ε = 1 − ρμ̄ = λ/(μ̄² + λ). The prediction σ → 0 has empirical support: We show in Fig. 8 that all the margins converge to be roughly equal. Once within-class variability disappears, the last-layer features collapse to their mean for all training samples. The outputs and margins then also collapse to the same value. We can see this in the left plot of Fig. 10, where all of the margin histograms are concentrated around a single value. We visualize the evolution of the training margins over the training epochs in Fig. 8, which shows that the margin distribution concentrates over time. At the final epoch, the margin distribution (colored in yellow) is much narrower than at any intermediate epoch. Notice that our analysis of the origin of the SGD noise shows that "strict" NC1 never happens with SGD, in the sense that the margins are never, not even asymptotically, exactly equal to each other but just very close.
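The closed forms for ρ and ε above can be checked against a brute-force minimization of the reduced loss ρ²μ̄² − 2ρμ̄ + 1 + λρ² (the σ = 0 case); μ̄ and λ below are hypothetical values of a realistic order:

```python
import numpy as np

# With all margins equal to mu_bar (sigma = 0), the loss reduces to
# L(rho) = rho**2 * mu_bar**2 - 2 * rho * mu_bar + 1 + lam * rho**2.
# Compare brute-force minimization over rho with the closed forms.
mu_bar, lam = 0.005, 1e-6
rho_star = mu_bar / (mu_bar**2 + lam)        # closed-form minimizer
eps = lam / (mu_bar**2 + lam)                # closed-form interpolation gap

rhos = np.linspace(0.0, 2 * rho_star, 200_001)
loss = rhos**2 * mu_bar**2 - 2 * rhos * mu_bar + 1 + lam * rhos**2
rho_hat = rhos[np.argmin(loss)]              # numerical minimizer
gap_hat = 1 - rho_hat * mu_bar               # numerical interpolation gap
```

Note that for λ → 0 the gap vanishes and ρ → 1/μ̄, recovering exact interpolation, while any λ > 0 leaves the small symmetric gap 1 − ε used throughout this section.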
We now prove that NC follows from Property 1.

Theorem 5.
Let  = x n ,y n N n=1 be a dataset. Let (ρ, V) be the parameters of a ReLU network f, such that V L has converged when training using SGD with batches of size 1 on the square loss with LN + WD. Let Consider points of convergence of SGD that satisfy Property 1. Those points also satisfy the conditions of NC as described below.
Proof. The update equations for SGD on the square loss function with LN + WD follow from Eq. 4 applied to single-sample batches. We can apply the unit norm constraints ∥V_k∥ = 1 and ignore terms of order O(η²); this gives the SGD update for V_L, which must vanish at convergence for every single-sample batch. Using Property 1, we can see that, for every training sample in class y_n = 1, h(x_n) = (1 − ε)V_Lᵀ and that, for every training sample in class y_n = −1, h(x_n) = −(1 − ε)V_Lᵀ. This shows that within-class variability has collapsed and that all last-layer features collapse to their class mean, which is the condition for NC1. We can also see that μ_+ = −μ_−, which is the condition for NC2 when there are 2 vectors in the simplex ETF. The SGD convergence condition also tells us that V_L ∝ μ_+ and V_L ∝ μ_−, which gives us the NC3 condition. NC4 then follows from NC1 to NC2, as shown by theorems in [12].
In the previous section, we proved the emergence of NC in the case of a binary classifier with scalar outputs, to be consistent with our framework in Problem Setup. The phenomenon of NC was, however, defined in [12] for the case of multiclass classification with deep networks. In this section, we describe how NC emerges in this setting from the minimization of the squared loss with WN and WD regularization. We also show in Fig. 9 that our networks show NC, similar to experiments reported in [12]. We consider a classification problem with C classes with a balanced training dataset D = {(x_cn, y_cn)}_{n=1,c=1}^{N,C}. The labels are represented by the unit vectors {e_c}_{c=1}^C in ℝ^C. Since we consider deep homogeneous networks that do not have bias vectors, we center the one-hot labels and scale them so that they have maximum output 1. We denote the resulting labels (for a class-balanced dataset) as ẽ_c = (C/(C − 1))(e_c − (1/C)1), where the cth coordinate is 1. We consider a deep ReLU network f_W: ℝ^d → ℝ^C. However, we stick to the normalized reparameterization of the deep ReLU network as f_W = ρ f_V. We train this normalized network with SGD on the square loss with LMs and WD. This architecture differs from the one considered in the Gradient dynamics section in that it has C outputs instead of a scalar output. Let the output of the network be f_V(x) = (f_V^{(1)}(x), …, f_V^{(C)}(x))ᵀ and the target vectors be the ẽ_c. We will also follow the notation of [12] and use h: ℝ^d → ℝ^p to denote the last-layer features of the deep network. This means that f_V^{(c)}(x) = V_L^{(c)} h(x), where V_L^{(c)} is the cth row of V_L. The squared loss function with WD is written as L = (1/CN) Σ_{c=1}^C Σ_{n=1}^N ∥ρ f_V(x_cn) − ẽ_c∥² + λρ². Property 2 [symmetric quasi-interpolation (multiclass classification)]. Consider a C-class classification problem with inputs in a feature space X and label space ℝ^C. A classifier f: X → ℝ^C symmetrically quasi-interpolates a training dataset D if, for all training examples, ρ f_V(x_cn) = (1 − ε)ẽ_c. Similar to the binary classification case, we show that this property arises from an analysis of the squared loss landscape for multiclass classification. Lemma 7.
An overparameterized deep ReLU classifier trained to convergence under the squared loss in the presence of WD and WN satisfies the symmetric quasi-interpolation property. Proof. Consider the regularized square loss L = (1/CN) Σ_{c,n} ∥ρ f_V(x_cn) − ẽ_c∥² + λρ². In the multiclass case, we define the first-order statistic of the output of the normalized network as μ = (1/CN) Σ_{c,n} ⟨f_V(x_cn), ẽ_c⟩ and the second-order statistic as M = (1/CN) Σ_{c,n} ∥f_V(x_cn)∥². We consider deep networks that are overparameterized enough to attain 100% accuracy on the training dataset, which means ⟨f_V(x_cn), ẽ_c⟩ > 0. Since the weights {V_k}_{k=1}^L of the deep network are normalized and the data x_cn lie within the unit norm ball, we also have that ∥f_V(x_cn)∥ ≤ 1. However, similar to the binary case, we observe that the norm of f_V(x_cn) takes values on the order of 10⁻³.
Using these definitions, we can rewrite the deep network training problem as min ρ²(M + λ) − 2ρμ + const. All critical points (including minima) need to satisfy ∂L/∂ρ = 0, from which we get ρ = μ/(M + λ). If we plug this back into the loss, then our minimization problem becomes equivalent to finding {V_k}_{k=1}^L that maximize μ²/(M + λ). Since the network is expressive enough to attain any value and the norm of f_V(x_cn) is bounded, we see that the loss is minimized when μ² is maximized, that is, when f_V(x_cn) ∝ ẽ_c for all training examples.

Fig. 9. NC occurs during training for binary classification. This figure is similar to other published results on NC, such as, for instance, [12] for the case of exponential-type loss functions. The key conditions for NC are: (a) NC1, variability collapse, which is measured by Tr(Σ_W Σ_B⁻¹), where Σ_W and Σ_B are the within- and between-class covariances; (b) NC2, equinorm and equiangularity of the mean features {μ_c} and classifiers {W_c}. We measure the equinorm condition by the standard deviation of the norms of the means (in red) and classifiers (in blue) across classes, divided by the average of the norms, and the equiangularity condition by the standard deviation of the inner products of the normalized means (in red) and the normalized classifiers (in blue), divided by the average inner product (this figure is similar to Fig. 4 in [12]; notice the small scale of the fluctuations); and (c) NC3, self-duality, or the distance between the normalized classifiers and mean features. This network was trained on 2 classes of CIFAR10 with WN and WD = 5 × 10⁻⁴ and a learning rate of 0.067, for 750 epochs with a stepped learning rate decay schedule.
We now consider the optimization of the squared loss on deep networks with WN and WD. At each time point t, the optimization process selects a random class-balanced batch B′. Consider points of convergence of SGD that satisfy Property 2. Then, those points also satisfy the conditions of NC as described below.
Proof. Let B′ = {(x_cn, ŷ_cn)} be such a balanced batch. We use SGD, where, at each time t, the batch B′ is drawn at random from D, to study the time evolution of the normalized parameters V_L in the limit η → 0.
We can apply the unit norm constraints ∥V_k∥ = 1 and ignore all terms that are O(η²) to compute V_L(t). This means that the (stochastic) gradient of the loss with respect to the last layer V_L, and with respect to each classifier vector V_L^{(c)}, with LN can be written down explicitly (we drop the time index t for clarity). Let us analyze the equilibrium parameters at the last layer, considering each classifier vector V_L^{(c)} of V_L separately. Using Property 2 and considering solutions that achieve symmetric quasi-interpolation, with ρ f_V(x_cn) = (1 − ε)ẽ_c, the gradient with respect to V_L^{(c)} must vanish on every batch. In addition, consider a second batch B″ that differs from B′ by only one sample, x′_cn instead of x_cn, from class c. By applying the equilibrium condition to both B′ and B″, we obtain h(x_cn) = h(x′_cn), which proves NC1. Thus, at equilibrium, with quasi-interpolation of the training labels, each V_L^{(c)} is proportional to the corresponding class feature mean. From the SGD equations, we can also see that at equilibrium, with quasi-interpolation, all classifier vectors in the last layer (V_L^{(c)} and, hence, μ_c − μ_G) have the same norm. From the quasi-interpolation of the correct class label, we obtain the value of ⟨V_L^{(c)}, h(x_cn)⟩, and, from the quasi-interpolation of the incorrect class labels, we obtain the value of ⟨V_L^{(c′)}, h(x_cn)⟩ for c′ ≠ c.
Plugging in the previous results, and using the fact that all the norms ∥V_L^{(c)}∥_2 are equal, we obtain that the normalized pairwise inner products of the classifier vectors are all equal to −1/(C − 1). This completes the proof that the normalized classifier parameters form an ETF. Moreover, since V_L^{(c)} ∝ μ_c − μ_G and all the proportionality constants are independent of c, we obtain Σ_c V_L^{(c)} = 0 and, at the same time, the NC3 (self-duality) condition. This completes the proof of the NC2 condition. NC4 then follows from NC1 to NC2, as shown by theorems in [12].
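The centered and rescaled labels ẽ_c used in this section themselves form a simplex ETF, which can be verified directly (the explicit formula below is the standard centered one-hot construction consistent with "the cth coordinate is 1"):

```python
import numpy as np

# Centered, rescaled one-hot labels: e_tilde_c = (C/(C-1)) * (e_c - (1/C) * 1).
# Check that the cth coordinate is 1, that the C vectors sum to zero, and that
# their normalized pairwise inner products equal -1/(C-1): a simplex ETF.
C = 10
E_tilde = (C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)  # rows: e_tilde_c

max_coord = float(E_tilde.max())            # should be 1 (the cth coordinate)
total = E_tilde.sum(axis=0)                 # sum over the C label vectors
G = E_tilde @ E_tilde.T                     # Gram matrix
cos = G / np.sqrt(np.outer(np.diag(G), np.diag(G)))
```

This is the same geometry the proof derives for the classifier vectors V_L^{(c)} and the centered class means.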

Remarks
• The analyses of the loss landscape and the qualitative dynamics under the square loss in the Qualitative dynamics and Landscape of the empirical risk sections imply that all quasi-interpolating solutions with ρ ≥ ρ_0 and λ > 0 that satisfy assumption 2 yield NC and have its 4 properties.
• SGD is a necessary requirement in our proof of NC1.
• Our analysis implies that there is no direct relation between NC and generalization. In fact, a careful look at our derivation suggests that NC1 to NC4 should take place for any quasi-interpolating solution (in the square loss case), including solutions that do not have a large margin. In particular, our analysis predicts NC for datasets with fully random labels, a prediction that has been experimentally verified.

SGD Bias toward Low-Rank Weight Matrices and Intrinsic SGD Noise
In the previous sections, we assumed that ρ and the V_k are trained using GF. In this section, we consider a slightly different setting, where SGD is applied instead of GF. Specifically, the V_k and ρ are first initialized and then iteratively updated simultaneously, where, at each step, B′ is selected uniformly as a subset of D of size B, η > 0 is the learning rate, and ν_k is computed according to Eq. 4 with D replaced by B′.

Low-rank bias
An intriguing argument for low-rank weight matrices is the following observation, which follows from Eq. 5 (see also [7]). Lemma 8 shows that, in practice, SGD cannot achieve zero gradient for all the minibatches of size smaller than N because, otherwise, all the weight matrices would have a very low rank that is incompatible, for generic datasets, with quasi-interpolation.
Lemma 8. Let f_W be a neural network. Assume that we iteratively train ρ and {V_k}_{k=1}^L using the process described above with WD λ > 0. Suppose that training converges, that is, the gradient of L_{B′}(ρ, {V_k}_{k=1}^L) vanishes for every minibatch B′ and for all k ∈ [L]. Then, the ranks of the matrices V_k are at most 2. Proof.
We assume V_l ∈ ℝ^{d_{l+1} × d_l} and ∥V_l∥ = 1 for all l ∈ [L]. We would like to show that, for a single input x, the matrix ∂f_V(x)/∂V_k is of rank ≤ 1. We note that for any given vector v ∈ ℝ^d, we have σ(v) = diag(σ′(v)) · v (where σ is the ReLU activation function). Therefore, for any input vector x ∈ ℝ^d, the output of f_V can be written as f_V(x) = V_L D_{L−1}(x; V) V_{L−1} ⋯ D_1(x; V) V_1 x, where D_l(x; V) ≔ diag(σ′(u_l(x; V))) and u_l(x; V) is the preactivation vector of the lth layer. We denote by u_{l,i}(x; V) the ith coordinate of the vector u_l(x; V). We note that the u_l(x; V) are continuous functions of V. Therefore, assuming that none of the coordinates u_{l,i}(x; V) are zero, there exists a sufficiently small ball around V in which u_{l,i}(x; V) does not change its sign. Hence, within this ball, σ′(u_{l,i}(x; V)) is constant. We define the sets A ≔ {V | ∀l ≤ L: ∥V_l∥ = 1} and A_{l,i} ≔ {V ∈ A | u_{l,i}(x; V) = 0}. We note that, as long as x ≠ 0, the set A_{l,i} is negligible within A. Since there is a finite set of indices l, i, the set ⋃_{l,i} A_{l,i} is also negligible within A. Let V be a set of matrices for which none of the coordinates u_{l,i}(x; V) are zero. Then, the matrices {D_l(x; V)}_{l=1}^{L−1} are constant in a neighborhood of V, and therefore, their derivatives with respect to V_k are zero. Let aᵀ ≔ V_L D_{L−1}(x; V) V_{L−1} ⋯ V_{k+1} D_k(x; V) and b ≔ D_{k−1}(x; V) V_{k−1} ⋯ V_1 x. Since the derivatives of the D_l with respect to V_k are zero, by applying ∂(aᵀ X b)/∂X = abᵀ, we have that ∂f_V(x)/∂V_k = abᵀ is a matrix of rank at most 1.
Therefore, for any input x n ≠ 0, with measure 1, is a matrix of rank at most 1.
Since interpolation is impossible when training with λ > 0, there exists at least one n ∈ [N] for which f n ≠ 1. We consider 2 batches  ′ i and  ′ j of size B that differ by sample, (x i , y i ) and (x j , y j ). We have Assume that there exists a pair i, j ∈ [N] for which In this case, we obtain that for all i, j ∈ [N], we have Since the network cannot perfectly fit the dataset when trained with λ > 0, we obtain that there exists i ∈ [N] for which 1 − f i ≠ 0. Since f i ≠ 0 for all i ∈ [N], this implies that α ≠ 0. We conclude that V k is proportional to U, which is of rank ≤1.
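The key step of the argument, that the per-sample gradient ∂f_V(x)/∂V_k is an outer product and hence has rank at most 1, is easy to check numerically. The sketch below uses a hypothetical three-layer ReLU network (sizes and names are our own) and compares the analytic outer-product form against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three-layer ReLU network with scalar output: f(x) = v3 . relu(V2 relu(V1 x)).
d, h = 6, 7
V1 = rng.standard_normal((h, d))
V2 = rng.standard_normal((h, h))
v3 = rng.standard_normal(h)
x  = rng.standard_normal(d)

def f(V2_):
    return v3 @ np.maximum(V2_ @ np.maximum(V1 @ x, 0.0), 0.0)

# Analytic gradient w.r.t. the middle matrix V2: with h1 = relu(V1 x) and
# D2 = diag(1[V2 h1 > 0]), df/dV2 = outer(D2 v3, h1) -- an outer product, rank <= 1.
h1 = np.maximum(V1 @ x, 0.0)
d2 = (V2 @ h1 > 0).astype(float)
G  = np.outer(v3 * d2, h1)

# Finite-difference check that G is indeed the gradient (f is piecewise linear
# in the entries of V2, so central differences are essentially exact away from kinks).
eps = 1e-6
G_fd = np.zeros_like(V2)
for i in range(h):
    for j in range(h):
        E = np.zeros_like(V2); E[i, j] = eps
        G_fd[i, j] = (f(V2 + E) - f(V2 - E)) / (2 * eps)

assert np.allclose(G, G_fd, atol=1e-4)
assert np.linalg.matrix_rank(G) <= 1
```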
All GD methods try to converge to points in parameter space that have zero or very small gradient; in other words, they try to minimize the norm of the updates ∥ΔV_k∥ for all k. Assuming separability, ϵ_n = 1 − f_n > 0 for all n. Equation 10 then implies that the norm of the SGD updates at layer k should reflect, asymptotically, the rank of V_k.
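One immediate consequence of the rank-1 structure of the per-sample gradients (our extrapolation from the proof above, not a claim made in the text) is that a minibatch update at layer k is a sum of B rank-1 terms ϵ_n a_n b_n⊤, so its rank is at most the batch size B:

```python
import numpy as np

rng = np.random.default_rng(2)

# A minibatch update at layer k is a sum of per-sample rank-1 terms
# eps_n * outer(a_n, b_n), so its rank is at most B (the batch size).
d, B = 10, 3
update = sum(
    rng.standard_normal() * np.outer(rng.standard_normal(d), rng.standard_normal(d))
    for _ in range(B)
)
assert np.linalg.matrix_rank(update) <= B  # generically exactly B
```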

Is low-rank bias related to generalization?
An obvious question is whether a deep ReLU network that fits the data generalizes better than another one if the rank of its weight matrices is lower. The following result is stated in [8]:

Theorem 7. Let f_V be a normalized neural network trained with SGD under the square loss in the presence of WN. Assume that the weight matrix V_k of dimensionality (n, n) has rank r < n. Then, its contribution to the Rademacher complexity of the network is √(r/n) (instead of 1, as in the typical bound).
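A heuristic way to see where a √(r/n)-type saving could come from (a sketch under our own simplification, not the proof of Theorem 7): the relevant Rademacher correlation at a layer can be written as ⟨V, G⟩ for a random matrix G, and a rank-r matrix with unit Frobenius norm can align with only the top r singular directions of G, whereas a full-rank matrix captures all n of them:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
G = rng.standard_normal((n, n))   # stand-in for the random correlation matrix
s = np.linalg.svd(G, compute_uv=False)

def sup_corr(r):
    # sup over {||V||_F <= 1, rank(V) <= r} of <V, G> is attained by aligning V
    # with the top-r singular directions of G: sqrt(sum of top-r squared sing. values).
    return np.sqrt((s[:r] ** 2).sum())

full = sup_corr(n)   # = ||G||_F, the rank-unconstrained value
low  = sup_corr(5)   # strictly smaller: the low-rank class is less expressive
assert low <= full
```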

Origin of SGD noise
Lemma 8 shows that there cannot be convergence to a unique set of weights {V_k}_{k=1}^L that satisfies equilibrium for all minibatches. More details of the argument are given in [54,55]. When λ = 0, interpolation of all data points is expected: In this case, the GD equilibrium can be reached without any constraint on the weights. This is also the situation in which SGD noise is expected to essentially disappear: Compare the histograms on the left- and right-hand sides of Fig. 10. Thus, during training, the solution {V_k}_{k=1}^L is not the same for all samples: There is no convergence to a unique solution but instead fluctuation between solutions during training. The absence of convergence to a unique solution is not surprising for SGD when the landscape is not convex.
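The mechanism can be isolated in the simplest overparameterized model, linear regression, where both the interpolating and the regularized solutions are available in closed form (a toy sketch with our own sizes; in the deep-network case, the closed forms are replaced by SGD equilibria). With λ = 0, the residuals vanish at the solution and so does every minibatch gradient; with λ > 0, the residuals, and hence the SGD noise, persist:

```python
import numpy as np

rng = np.random.default_rng(4)

# Overparameterized linear regression: N samples, d > N features.
N, d, lam = 8, 20, 0.1
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def batch_grad(w, idx, lam):
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx) + lam * w

# lam = 0: the minimum-norm interpolant has zero residuals, so EVERY
# minibatch gradient vanishes -- no SGD noise at the solution.
w_interp = np.linalg.pinv(X) @ y
g0 = [np.linalg.norm(batch_grad(w_interp, [i, j], 0.0)) for i, j in [(0, 1), (2, 3)]]

# lam > 0: the ridge solution leaves nonzero residuals, so minibatch
# gradients do not vanish even though the full-batch gradient is zero.
w_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)
g1 = [np.linalg.norm(batch_grad(w_ridge, [i, j], lam)) for i, j in [(0, 1), (2, 3)]]

assert max(g0) < 1e-8   # all minibatch gradients vanish at interpolation
assert max(g1) > 1e-4   # persistent per-batch gradients under WD
```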

Summary
The dynamics of GF
In this paper, we have considered a model of the dynamics of, first, GF and, then, stochastic GD in overparameterized ReLU neural networks trained for square-loss minimization. Under the assumption of convergence to zero-loss minima, we have shown that solutions have a bias toward small ρ, defined as the product of the Frobenius norms of each layer's (unnormalized) weight matrix. We assume that during training, there is normalization using an LM of each layer weight matrix but the last one, together with WD with regularization parameter λ. Without WD, the best solution would be the interpolating solution with minimum ρ, which may be achieved with appropriate initial conditions.

Remarks
• The bias toward small-ρ solutions induced by regularization with λ > 0 may be replaced, when λ = 0, by an implicit bias induced by small initialization. With appropriate parameter values, small initialization allows convergence to the first quasi-interpolating solution encountered as ρ increases from ≈0 to ρ_0. For λ = 0, we have empirically observed solutions with large ρ that are suboptimal and probably similar to the NTK regime.
• A puzzle that remains open is why BN leads to better solutions than LN and WN, despite the similarities between them. WN, like LN, is easier to formalize mathematically, which is the main reason for the role it plays in this paper.

Generalization and bounds
Building on our analysis of the dynamics of ρ, we derive new norm-based generalization bounds for CNNs for the special case of nonoverlapping convolutional patches. These bounds show (a) that generalization for CNNs can be orders of magnitude better than for dense networks and (b) that these bounds can be empirically loose but nonvacuous despite overparametrization.

Remarks
• For λ > 0, a main property of the minimizers that upper bounds their expected error is ρ, which is the inverse of the margin: We prove that among all the quasi-interpolating solutions, the ones associated with smaller ρ have better bounds on the expected classification error.
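In symbols (a restatement under the paper's normalization, with f̄ denoting the normalized network):

```latex
f(x) \;=\; \rho\, \bar f(x), \qquad
y_n f(x_n) \approx 1 \;\;\Longrightarrow\;\;
\min_{n}\, y_n \bar f(x_n) \;\approx\; \frac{1}{\rho},
```

so that, among quasi-interpolating solutions, smaller ρ corresponds to a larger margin of the normalized network and hence a better bound on the expected error.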
• The situation here is somewhat similar to the linear case: For overparameterized networks, the best solution in terms of generalization is the minimum norm solution toward which GD is biased.
• Large margin is usually associated with good generalization [56]; at the same time, however, it is also broadly recognized that margin alone does not fully account for generalization in deep nets [28,31,57]. Margin, in fact, provides an upper bound on the generalization error, as shown in Generalization: Rademacher Complexity of Convolutional Layers. A larger margin gives a better upper bound on the generalization error for the same network trained on the same data. We have empirically verified this property by varying the margin using different degrees of random labels in a binary classification task. While training gives perfect classification and zero square loss, the margin on the training set decreases, and the test error increases, with the increase in the percentage of random labels. Of course, large margin in our theoretical analysis is associated with regularization that helps minimize ρ. Since ρ is the product of the Frobenius norms, its minimization is directly related to minimizing a Bayes prior [58], which is itself directly related to minimum description length principles.
• We do not believe that flat minima directly affect generalization. As we described in the Interpolation and quasi-interpolation section, degenerate minima correspond to solutions that have zero empirical loss (for λ = 0). Minimizing the empirical loss is an (almost) necessary condition for good generalization. It is not, however, sufficient, since minimization of the expected error also requires a solution with low complexity.
• The upper bound given in Generalization: Rademacher Complexity of Convolutional Layers, however, does not by itself explain the details of the generalization behavior that we observe for different initializations (see Fig. 4), where small differences in margin are actually anticorrelated with small differences in test error. We conjecture that margin (related to ρ) together with sparsity may be sufficient to explain generalization.

Neural collapse
Another consequence of our analysis is a proof of NC for deep networks trained with the square loss in the binary classification case, without any specific assumption. In particular, we prove that training the network using SGD with WD induces a bias toward low-rank weight matrices and yields SGD noise in the weight matrices and in the margins, which makes exact convergence impossible, even asymptotically.

Remarks
• A natural question is whether NC is related to solutions with good generalization. Our analysis suggests that this is not the case, at least not directly: NC is a property of the dynamics, independent of the size of the margin that provides an upper bound on the expected error. In fact, our prediction of NC for randomly labeled CIFAR10 was originally confirmed in then-preliminary experiments by our collaborators (Papyan et al. [12]) and more recently in other papers (see, for instance, [33]).
• Margins, however, do converge to each other, but only within a small ϵ, implying that the first condition for NC [12] is satisfied only in this approximate sense. This is equivalent to saying that SGD does not converge to a unique solution with zero gradient for all data points.

Conclusion
Finally, we would like to emphasize that the analysis of this paper supports the idea that the advantage of deep networks relative to other standard classifiers is greater for the problems to which sparse architectures such as CNNs can be applied. The reason is that CNNs reflect the function graph of target functions that are compositionally sparse and, thus, can be approximated well by sparse networks without incurring the curse of dimensionality. Despite overparametrization, compositionally sparse networks can then generalize well.