Universal characteristics of deep neural network loss surfaces from random matrix theory

This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.


Introduction
The success of large deep neural networks (DNNs) optimised in surprisingly naïve ways, such as stochastic gradient descent, has spawned a considerable amount of interest in characterising their loss surfaces. The loss of a neural network is simply a real valued function that measures the network's performance on a task with respect to some reference data set. Neural network are defined in terms of parameters, or weights, of which there are typically a very large number. Fixing the data set, or even the data generating distribution, a neural network's loss can be seen as a surface in high dimensions parametrised by the network's weights.
It was first observed 1 in a sequence of experimental and theoretical papers [CHM + 15,SGAL14] that the loss surfaces of DNNs can be connected with spherical multi-spin glasses from statistical physics [MPV87]. Mathematically, multi-spin glasses are just certain random multivariate polynomials, or equivalently Gaussian processes with specific covariance functions. These works were very influential, providing the first steps towards explaining why DNNs can be trained at all, given the apparent intractability of their optimisation from the viewpoint of classical optimisation theory.
The key insight from the spin glass connection was that DNN loss surfaces are indeed very complicated and filled with many local optima, but that, in the high-dimensional limit, the local optima are arranged favourably, so that simple gradient based optimisation methods can be expected to 1 Note that early connections between neural networks and spin systems were presented in [Gar88].
converge to local optima close to the global minimum. This work has been extended in several directions. [BKMN21] extended the scope of the results to more general neural networks, and in so doing uncovered some of the limits of the spin glass model's explanatory power. Departing from the direct connection between glassy systems and neural networks, several authors have used similar high-dimensional random models as toy models for high-dimensional optimisation. Among these works are several that focus on the existence of spurious minima and the problem of recovering signals from high dimensional signal-in-noise models [RABC19, MBAB20, MBC + 19]. Similarly, using spin glasses as surrogates for the complex, high-dimensional, random loss surfaces of DNNs, [BKMN22] extended the spin glass analysis to study the non-standard loss surfaces of generative adversarial networks [GPAM + 14].
A significant feature common amongst all the work mentioned so far is random matrix theory [Meh04, LXT + 18, AGZ10]. When studying the number and configuration of local optima of high dimensional random functions, Kac-Rice formulae are an essential mathematical tool, and random matrices arise as Hessians in these formulae. For spin glasses, this path was first trodden in the theoretical physics literature by Fyodorov [Fyo04,Fyo05] and made rigorous in later work [AAČ13]. Specifically in the case of [CHM + 15] and [AAČ13], which lays its mathematical foundation, very detailed calculations can be completed. This is possible because the random matrices that appear belong to the Gaussian Orthogonal Ensemble (GOE), one of the canonical ensembles of random matrix theory. Almost anything one could want to know of the GOE is known and many powerful tools of random matrix theory can be applied to it. In particular, the full joint distribution of its eigenvalues and large deviations results for its eigenvalues and its spectral density, all of which are required by [AAČ13,CHM + 15]. Works such as [BKMN21,BKMN22] show how detailed calculations can be completed beyond the standard spin glass case, however these results still depend on important properties of the GOE, as the Hessians in those cases are closely related to the GOE. In a recent work, [GZR20] showed how valuable practical insights about DNN optimisation can be obtained by considering the outliers in the spectrum on the loss surface Hessian. Once again, this work relies on special properties from random matrix theory, namely the Wigner semicircle law for Wigner matrices and outlier results for rotationally invariant matrices [BGN11].
Challenging the above-mentioned works, an experimental line of work has demonstrated convincingly that special RMT ensembles like the GOE do not appear to be present in DNNs [Pap18,Gra20,BGK22], for example as their Hessians. In addition, there have been challenges in the literature to the practical relevance of spin glass loss surface results for DNNs [BJSG + 19]. In this context, it is natural to question the relevance to DNNs of many of the results discussed above. Indeed, does random matrix theory actually provide insight into real DNNs, or are its powerful tools merely applicable to toy models too divorced from real DNNs to be of any value? Random matrix theory itself may hold the answer to this question, in particular its notion of universality. Broadly speaking, universality refers to the phenomenon that certain properties of special random matrix ensembles (such as the GOE) remain true for more general random matrices that share some key feature with the special ensembles. For example, the Wigner semicircle is the limiting spectral density of the Gaussian Wigner ensembles, i.e. matrices with Gaussian entries, independent up two symmetry (symmetric real matrices, Hermitian complex matrices) [Meh04]. The Gaussian case is the simplest to prove, and there are various powerful tools not available in the non-Gaussian case, however the Wigner semicircle has been established as the limiting spectral density for Wigner ma-trices with quite general distributions on their entries [AGZ10,Tao12]. While surprisingly general is some sense, the Wigner semicircle relies on independence (up to symmetry) of matrix entries, a condition which is not typically satisfied in real systems. The limiting form of the spectral density of a random matrix ensemble is a macroscopic property, i.e. the matrix is normalised such that the average distance between adjacent eigenvalues is on the order of 1/ √ N , where N is the matrix size. At the opposite end of the scale is the microscopic, where the normalisation is such that eigenvalues are spaced on a scale of order 1; at this scale, random matrices display a remarkable universality. For example, any real symmetric matrix has a set of orthonormal eigenvectors and so the set of all real symmetric matrices is closed under conjugation by orthogonal matrices. Wigner conjectured that certain properties of GOE matrices hold for very general random matrices that share the same (orthogonal) symmetry class, namely symmetric random matrices (the same is true of Hermitian random matrices and the unitary symmetry class). The spacings between adjacent eigenvalues should follow a certain explicit distribution, the Wigner surmise, and the eigenvectors should be delocalised, i.e. the entries should all be of the same order as the matrix size grows. Both of these properties are true for the GOE and can be proved straightforwardly with quite elementary techniques. These properties and more have been proved in a series of works over the last decade or-so, of which a good review is [EY17]. Crucial in these results is the notion of a local law for random matrices. The technical statement of local laws is given later in the paper, but roughly they assert that the spectrum of a random matrix is, with very high probability, close to the deterministic spectrum defined by its limiting spectral density (e.g. the semicircle law for Wigner matrices). Techniques vary by ensemble, but generally a local law for a random matrix ensemble provides the control required to demonstrate that certain matrix statistics are essentially invariant under the evolution of the Dyson Brownian motion. In the case of real symmetric matrices, the Dyson Brownian motion converges in finite time to the GOE, hence the statistics preserved under the Dyson Brownian motion must match the GOE. The n-point correlation functions of eigenvalues are one such preserved quantity, from which follows, amongst other properties, the Wigner surmise on adjacent spacings.
At the macroscopic scale, there are results relevant to neural networks, for example [PSG18,Pas20] consider random neural networks with Gaussian weights and establish results that are generalised to arbitrary distributions with optimal conditions, so demonstrating universality. On the microscopic scale, [BGK22] provided the first experimental demonstration of the presence of universal local random matrix statistics deep neural networks, specifically in the Hessians and Gauss-Newton matrices of their loss surfaces. This work has recently been extended to the weight matrices of neural networks [TSR22]. This paper explores the consequences of random matrix universality in deep neural networks. Our main mathematical result is a significant generalisation of the Hessian spectral outlier result recently presented by [GZR20]. This generalisation removes any need for GOE or Wigner forms of the Hessian and instead leverages much more universal properties of the eigenvectors and eigenvalues of random matrices which we argue are quite likely to hold for real networks. Our results make concrete predictions about the outliers of DNN Hessians which we compare with experiments on several real-world DNNs. These experiments provide indirect evidence of the presence of universal random matrix statistics in the Hessians of large DNNs, which is noteworthy as certainly these DNNs are far too large to permit exact eigendecomposition of their Hessians as in [BGK22]. Along a similar line, we show how local random matrix laws in DNNs can dramatically simplify the dynamics of certain gradient descent optimisation algorithms and may be in part responsible for their success. Finally, we highlight another aspect of random matrix universality relevant to DNN loss surfaces. Recent work [ABM21] has shown that the so-called 'self averaging' property of random matrix determinants is very much more universal than previously thought. The self-averaging of random matrix determinants has been used in the spin glass literature both rigorously and nonrigorously (e.g. [Fyo04,Fyo05,AAČ13,BKMN21,BKMN22] inter alia) and is the key property that produces the exponentially large/small number of local optima repeatedly observed. We argue that insights into the geometry of DNN loss surfaces can be conjectured from quite general assumptions about the Hessian and gradient noise and from the general self-averaging effect of random matrix determinants.
The paper is structured as follows. Section 2 introduces our random matrix theory model for DNN Hessians, derives results for their outliers and compares with experimental results. Section 3 presents the proof of our main result for addition of random matrices, combining quantum unique ergodicity and the supersymmetric method. Section 4 proves an extension of the random matrix determinant results of [ABM21] and presents insights into DNN loss surfaces from complexity calculations and random matrix determinants. Section 5 describes the role of random matrix local laws in certain popular DNN optimisation algorithms. Section 6 concludes the paper.
Notation. We adopt the following conventions throughout • For a probability measure µ, g µ is its Stieljtes transform.
• µ ν denotes the free additive convolution between probability measures µ and ν.
• Hats denote empirical quantities unless stated otherwise. For exampleμ X is the empirical spectral measure of a matrix X.
• r(µ), l(µ) denote the right and left edges of the support of a probability measure µ respectively.

The model
Given a loss function L : Y ×Y → R, a data generating distribution P data supported on X ×Y and a neural network f w : X → Y parametrised by w ∈ R N , its batch Hessian is given by ∼ P data (2.1) and its true Hessian is given by Both H batch and H true are N × N matrix functions of w; H batch is random but H true is deterministic. Only in very specific cases and under strong simplifying assumptions can one hope to obtain the distribution of H batch or the value of H true from L, P data and f w . Inspired by the success of many random matrix theory applications, e.g. in Physics, we will instead seek to capture the essential features of deep neural network Hessians in a sufficiently general random matrix model.
We introduce the following objects: • A sequence (in N ) of random real symmetric N × N matrices X. X possesses a limiting spectral probability measure µ, i.e. if λ 1 , . . . , λ N are the eigenvalues of X then weakly almost surely. We further assume that µ has compact support and admits and smooth density with respect to Lebesgue measure.
for fixed integers p, q. We assume the existence of limiting measure ν such that, weakly, where ν is a compactly supported probability measure. The remaining eigenvalues satisfy ν is also assumed to be of the form ν = εη + (1 − η)δ 0 where η is a compactly supported probability measure which admits a density with respect to Lebesgue measure.
With these definitions, we construct the following model for the Hessian: where b is the batch size. We have dropped the subscript on H batch for brevity. Note that H takes the place of the batch Hessian and A taken the place of the true Hessian. s(b)X takes the place of the random noise introduced by sampling a finite batch at which to evaluate the Hessian. s(b) is an overall scaling induced in X by the batch-wise averaging.
This model is almost completely general. Note that we allow the distribution of X and the value of A to depend on the position in weight space w. The only restrictions imposed by the model are 1. the existence of ν; 2. the position of θ i , θ j relative to the support of ν; 3. ν may only possess an atom at 0; 4. the fixed number of θ i , θ j ; 5. the existence of µ; 6. the existence of the scaling s(b) in batch size.
All of the above restrictions are discussed later in the section. Finally, we must introduce some properties of the noise model X in order to make any progress. We introduce the assumption that the eigenvectors of X obey quantum unique ergodicity (QUE). The precise meaning of this assumption and a thorough justification and motivation is given later in this section. For now it suffices to say that QUE roughly means that the eigenvectors of X are delocalised or that they behave roughly like the rows (or columns) of a uniform random N × N orthogonal matrix (i.e. a matrix with Haar measure). QUE is known to hold for standard ensembles in random matrix theory, such as quite general Wigner matrices, Wishart matrices, adjacency matrices of certain random graphs etc. Moreover, as discussed further section 2.5 below, QUE can be thought of as a property of quite general random matrix models.

Quantum unique ergodicity
It is well known that the eigenvectors of quite general random matrices display a universal property of delocalisation, namely for any component u k of an eigenvector u. Universal delocalisation was conjectured by Wigner along with the Wigner surmise for adjacent eigenvalue spacing. Both of these properties, and the more phenomenon of universal correlation functions on the microscopic scale have since been rigorously established for quite a variety of matrix models e.g. [EY17,EY12,EKS19]. [BY17] show that the eigenvectors of generalised Wigner matrices obey Quantum unique ergodicity, a particular form of delocalisiation, stronger than the above statement. Specifically, they are shown to be approximately Gaussian in the following sense.
for large enough N , where N j are i.i.d. standard normal random variables, (u k ) N k=1 are the normalised eigenvectors, P is any polynomial in n variables and ε > 0.

Batch Hessian outliers
Let {λ i } be the eigenvalue of H. To set the context of our results, let us first simplify and suppose momentarily that s = 1 and, instead of mere QUE, X has eigenvectors distributed with Haar measure, and A is fixed rank, i.e. ξ i = 0 ∀i, then the results of [BGN11] would apply and give λ j a.s.
where ω is the subordination function such that g µ SC τ (z) = g τ (ω(z)). These two expressions coincide when i.e. only when B is in fact of negligible rank as N → ∞. This apparent contradiction is resolved by the observation that the proof in [BGN11] in fact relies implicitly on an isotropic local law. Note in particular section 4.1, which translated to our context, would Returning to our main thread, we have satisfied the conditions on X required to invoke Theorem 3.4 and so conclude that where ω is the subordination function such that g µ b ν (z) = g ν (ω(z)) and µ b is the limiting spectral measure of s(b)X. The reasoning found in [CDM16] then applies regarding the outliers of H. Indeed, suppose that λ is an outlier of H, i.e. λ is an eigenvalue of H contained in R\supp(µ ν). Necessarilyĝ H possesses a singularity at λ, and soĝ A must have a singularity at ω(λ). For this singularity to persist for all N ,ppp ω(λ) must coincide with one of the outliers of A which, unlike the bulk eigenvalues ξ j , remain fixed for all N . Therefore we have the following expressions for the outliers of H: (2.14) We now consider ε to be small and analyse these outlier locations as a perturbation in ε. Firstly note that (2.17) We now must take care in computing Recall that the R-transform of a measure is defined as a formal power series [AGZ10] where k n is the n-th cumulant of the measure. It is known [AGZ10] that k n = C n where the functional inverse of the Stieljtes transform of the measure is given by the formal power series Now let m n be the n-th moment of µ and similarly let m (b) n be the n-th moment of µ b , so formally ir . from which we deduce k (b) n = s(b) n k n , which establishes that R µ b (z) = s(b)R µ (s(b)z). Recalling (2.17) we find (2.24) The form of ν gives and so we can expand to give Note that (2.26) reduces to outliers of the form The problem of determining the support of µ b ν is difficult and almost certainly analytically intractable, with [BES20] containing the most advanced results in that direction. However overall, we have a model for deep neural network Hessians with a spectrum consisting, with high-probability, of a compactly supported bulk µ b ν and a set of outliers given by (2.26) (and similarly for θ j ) subject to (2.14).
(2.26) is a generalised form of the result used in [GZR20]. We have the power series where m (η) n are the moments of η and k (µ) n are the cumulants of µ. In the case that the spikes θ j are large enough, we approximate by truncating these power series to give where the approximation is more precise for larger b and smaller ε and we have used the fact that the first cumulant of any measure matches the first moment. One could consider for instance a power law for s(b), i.e.
for some υ > 0. In the case that µ is a semicircle, then all cumulants apart from the second vanish, so setting ε = 0 recovers exactly where σ is the radius of the semicircle. To make the link with [GZR20] obvious, we can take υ = 1/2 and µ to be the semicircle, so giving where we have truncated O(ε) term. We present an argument in favour of the υ = 1/2 power law below, but we allow for general υ when comparing to experimental data.
Remark 2.1. It is quite possible for µ's density to have a sharp spike at the origin, or even for µ to contain a δ atom at 0, as observed empirically in the spectra of deep neural network Hessians.

Experimental results
The random matrix Hessian model introduced above is quite general and abstract. Necessarily the measures µ and η must be allowed to be quite general as it is well established experimentally [Pap18, Gra20, BGK22] that real-world deep neural network Hessians have spectral bulks that are not familiar as being any standard canonical examples from random matrix theory. That being said, the approximate form in (2.30) gives quite a specific form for the Hessian outliers. In particular, the constants m 2. the true Hessian is approximately low-rank (as measured by ε) and has a finite number of outliers; 3. the Hessian noise model X has QUE.
In view of this third point, agreement with (2.30) provides an indirect test for the presence of universal random matrix statistics in deep neural network Hessians.
We can use Lanczos power methods [MS06] to compute good approximations to the top few outliers in the batch Hessian spectra of deep neural networks [GZR20]. Indeed the so-called Pearlmutter trick [Pea94] enable efficient numerical computation of Hessian-vector products, which is all that one requires for power methods. Over a range of batch sizes, we compute the top 5 outliers of the batch Hessian for 10 different batch seeds. We repeat this procedure at every 25 epochs throughout the training of VGG16 and WideResNet28 × 10 image classifiers on the CIFAR100 dataset and at every epoch during the training of a simple multi-layer perceptron network on the MNIST dataset. By the end of training each of the models have high test accuracy, specifically the VGG16 architecture which does not use batch normalisation, has a test accuracy of ≈ 75%, whereas the WideResNet28 × 10 has a test accuracy of ≈ 80%. The MLP has a test set accuracy of ≈ 95%.

Remark 2.2.
There is a subtlety with regard to obtaining the top outliers using the Lanczos power method. Indeed, since Lanczos provides, in some sense, an approximation to the whole spectrum of a matrix, truncating at m iterations for a N × N matrix cannot produce good approximations to all of the m top eigenvalues. In reality, experimental results [Pap18,GWG19] show that, for deep neural networks, and using sufficiently many iterations (m), the top r eigenvalues may be recovered, for r m. We display some spectral plots of the full Lanczos results in the Appendix which demonstrate clearly a large number of outliers, and clearly more than 5. As a result, we can have confidence that our numerical procedure is indeed recovering approximations to the top few eigenvalues required for our experiments.
be the top i-th empirical outlier (so i = 1 is the top outlier) for the j-th batch seed and a batch size of b for the model at epoch e. To compare the experimental results to our theoretical model, we propose the following form: where β (e) > 0 (as the second cumulant of a any measure of non-negative) and θ (i,e) > θ (i+1,e) > 0 for all i, e. The parameters α (e) , β (e) , γ (e) and θ (i,e) need to be fit to the data, which could be done with standard black-box optimisation to minimise squared error in (2.33), however we propose an alternative approach which reduces the number of free parameters and hence should regularise the optimisation problem. Observe that (2.33) is linear in the parameters α (e) , β (e) , γ (e) so, neglecting the positivity constraint on β (e) , we can in fact solve exactly for optimal values. Firstly let us definē λ to be the empirical mean of λ over the batch seed index j. Each epoch will be treated entirely separately, so let us drop the e superscripts to streamline the notation. We are then seeking to optimise α, β, γ, θ (i) to minimise (2.34) Now make the following definitions (2.36) Finally we can define the vector n-dimensional Y by flattening the matrix (y ib ) ib , and the 3 × n matrix X by stacking the vectors x ib and then flattening of the i, b indices. That done, we have have a standard linear regression problem with design matrix X and parameters w. For fixed θ, the global minimum of E is then attained at parameters where the dependence on the parameters θ is through Y and X as above. We thus have and can plug these values back in to (2.34) to obtain an optimisation problem only over the θ (i) . There is no closed form solution for the optimal θ (i) for this problem, so we fit them using gradient descent. The various settings and hyperparameters of this optimisation were tuned by hand to give convergence and are detailed in the Appendix Section Appendix B.3. To address the real constraint β > 0, we add a penalty term to the loss (2.34) which penalises values of θ (i) leading to negative values of β. The constraint θ (i) > θ (i+1) > 0 is implemented using a simple differentiable transformation detailed in the Appendix Section Appendix B.2.. Finally, the exponent υ is selected by fitting the parameters for each υ in {−0.1, −0.2, . . . , −0.9} and taking the value with the minimum mean squared error E.
The above process results in 12 fits for VGG and Resnet and 10 for MLP (one per epoch). For each of these, we have a theoretical fit for each of the 5 top outliers as a function of batch size which can be compared graphically to the data, resulting in (2 × 12 + 10) × 5 = 170 plots. Rather than try to display them all, we will select a small subset that illustrates the key features. Figure 1 shows results for the Resnet at epochs 0 (initialisation), 25, 250 and 300 (end of training) and outliers 1, 3 and 5. Between the three models, the Resnet shows consistently the best agreement between the data and the parametric form (2.33). The agreement is excellent at epoch 0 but quickly degrades to that seen in the second row of Figure 1, which is representative of the early and middle epochs for the Resnet. Towards the end of training the Resnet returns to good agreement between theory and data, as demonstrated in the third and fourth rows of Figure 1 at epochs 250 and 300 respectively.
The VGG16 also has excellent agreement between theory and data at epoch 0, and thereafter is similar to the early epochs of the Resnet, i.e. reasonable, but not excellent, until around epoch 225 where the agreement starts to degrade significantly until the almost complete failure at epoch 300 shown in the first row of Figure 2. The MLP has the worst agreement between theory and data, having again excellent agreement at epoch 0, but really quite poor agreement even by epoch 1, as shown in the second row of Figure 2.
The experimental results show an ordering Resnet > VGG > MLP, in terms of how well the random matrix theory loss surface predictions explain the Hessian outliers. We conjecture that this relates to the difficulty of the loss surfaces. Resnets are generally believed to have smoother, simpler loss surfaces [LXT + 18] and be easier to train than other architectures, indeed the residual connections were originally introduced for precisely this reason. The VGG is generally more sensitive to training set-up, requiring well-tuned hyperparameters to avoid unstable or unsuccessful training [GB22]. The MLP is perhaps too small to benefit from high-dimensional highly over-parametrised effects.
The parameter values obtained for all models over all epochs are shown in Figure 3, with a column for each model. There are several interesting features to draw out of these plots, however note that we cannot meaningfully interpret the parameters for the MLP beyond epoch 0, as the agreement with (2.33) is so poor. Firstly consider the parameter m (µ) 1 , which is interpreted as the first moment (i.e. mean) of the spectral density of the noise matrix X. Note that for all models this parameter starts close to 0 and generally grows with training epochs, noting that the right hand side of Figure 3b at the higher epochs should be ignored owing to the bad fit discussed above. m (µ) 1 = 0 is significant, as it is seen in the case of the a symmetric measure µ, such as the Wigner semicircle used by [GZR20]. It is interesting also to observe that εm (η) 1 remains small for all epochs particularly compared to m 2 . This is consistent with the derivation of (2.33), which relies on ε being small, however we emphasise that this was not imposed as a numerical constraint but arises naturally from the data. Recall that the magnitude of εm (η) 1 measures the extent of the deviation of A from being exactly low rank, so its small but non-zero values suggest that it is indeed important to allow for the true Hessian to have non-zero rank in the N → ∞ limit. Finally, we comment that the best exponent is generally not υ = 1/2. Again, the results from the Resnet are the most reliable and they appear to show that the batch scaling, as characterised by υ, is not constant throughout training, particularly comparing epoch 0 and epoch 300, say.

Justification and motivation of QUE
Work on random matrix universality has shown that a local law is the key ingredient in the establishing universal local statistics of eigenvalues [EYY12,EY17] and universal delocalisation of eigenvectors [BY17]. There are several forms of local law, but all make control, with high probability, the error between the (random) matrix Green's function G(z) = (z −X) −1 and certain deterministic equivalents. In all cases we use the set for ω ∈ (0, 1) and the local law statements holds for all (large) D > 0 and (small) ξ > 0 and for all large enough N . The averaged local law states: (2.39) The isotropic local law states: (2.40) The anisotropic local law states: where Π(·) is an N × N deterministic matrix function on C. The entrywise local law states: (2.42) The anisotropic local law is a stronger version of the entrywise local law. The anisotropic local law is a more general version of the isotropic local law, which can be recovered in the isotropic case by taking Π = g µ I. The entrywise local law can also be applied in the isotropic case by taking Π = g µ I. The averaged local law is weaker than all of the other laws. General Wigner matrices are known to obey isotropic local semi-circle laws [ES17]. Anisotropic local laws are known for general deformations of Wigner matrices and general covariance matrices [KY17] as well as quite general classes of correlated random matrices [EKS19]. As mentioned above, quantum unique ergodicity was proved for general Wigner matrices in [BY17]. It appears that the key ingredient in the proof of QUE (2.9) in [BY17] (2.44) This can be generalised to (2.46) where V is a potential function. Note that the eigenvector dynamics are unaffected by the presence of the potential V , so we expect to be able to generalise the proof of [KY17] to any random matrix ensemble with an isotropic local law by defining the potential V so that the invariant ensemble with distribution Z −1 e −N Tr V (X) dX has equilibrium measure µ (Z is a normalisation constant). We show how to construct such a V from µ in Section Appendix A.
The arguments so far suffice to justify a generalisation of the "dynamical step" in the arguments of [BY17], so it remains to consider the "comparison step". The dynamical step establishes QUE for the matrix ensemble with a small Gaussian perturbation, but in the comparison step one must establish that the perturbation can be removed without breaking QUE. To our knowledge no such argument has been articulated beyond generalized Wigner matrices, with the independence of entries and comparable scale of variances being critical to the arguments given by [BY17]. Our guiding intuition is that QUE of the form (2.9) is a general property of random matrices and can reasonably be expected to hold in most, if not all, cases in which there is a local law and universal local eigenvalue statistics are observed. At present, we are not able to state a precise result establishing QUE in sufficient generality to be relevant for this work, so we shall take it as an assumption. Assumption 2.3. Let X be an ensemble of N × N real symmetric random matrices. Assume that X admits a limiting spectral measure is µ with Stieljtes transform m. Suppose that the isotropic local law (2.40) holds for X with µ. Then there is some set such that with |I| = n, for any polynomial P in n indeterminates, there exists some ε(P ) > 0 such that for large enough N we have (2.47) Note that the isotropic local law in Assumption 2.3 can be obtained from the weaker entrywise law (2.42) as in Theorem 2.14 of [AEK + 14] provided there exists a C > 0 such that E|X ij | 2 CN −1 for all i, j and there exists C p > 0 such that E| √ N X ij | p C p for all i, j and integer p > 0.

Remark 2.4. In [BY17] the restriction I ⊂ T N is given for the explicit set
for some 0 < δ < 1. In the case of generalised Wigner matrices, this restriction on the indices has since been shown to be unnecessary [BL22,Ben20,BL21,BL21]. In our context, we could simply take as an assumption all results holds with T N = [N ], however our results can in fact be proved using only the above assumption that |T c N | = o(N ), so we shall retain this weaker form of the assumptions.
This section is not intended to prove QUE from explicit known properties of deep neural network Hessians, but rather to provide justification for it as a reasonable modeling assumption in the noise model for Hessians defined in section 2.1. We have shown how QUE can be obtained from an isotropic (or entrywise) local law beyond the Wigner case. It is important to go beyond Wigner or any other standard random matrix ensemble, as we have observed above that the standard macroscopic spectral densities of random matrix theory such as the semicircle law are not observed in practice. That said, we are not aware of any results establishing QUE in the more general case of anistropic local laws, and this appears to be a very significant technical challenge. We must finally address why a local law assumption, isotropic or otherwise, may be reasonable for the noise matrix X in our Hessian model. Over the last decade or so, universal local statistics of random matrices in the form of k-point correlation function on the appropriate microscopic scale have been established for a litany of random matrix ensembles. An immediate consequence of such results is that, on the scale of unit mean eigenvalue spacing, Wigner's surmise holds, depending only on the symmetry class (orthogonal, unitary or symplecitic). As with the QUE proof discussed above, the key ingredient in these proofs, as part of the "three step strategy" [EY17], is establishing a local law. The theoretical picture that has emerged is that when universal local eigenvalue statistics are observed in random matrices, it is due to the mechanism of short time scale relaxation of local statistics under Dyson Brownian Motion made possible by a local law. It has been observed that universal local eigenvalue statistics do indeed appear to be present in the Hessian of real, albeit quite small, deep neural networks [BGK22]. Given all of this context, we propose that a local law assumption of some kind is reasonable for deep neural network Hessians and not particularly restrictive. As we have shown, if we are willing to make the genuinely restrictive assumption of an isotropic local law for the Hessian noise model, then QUE follows. However an anistropic local law is arguably more plausible as we expect deep neural networks Hessians to contain a good deal of dependence between entries, and such correlations are know to generically lead to anisotropic local laws [EKS19].

Motivation of true Hessian structure
In this section we revisit and motivate the assumptions made about the Hessian in Section 2.1. Firstly note that one can always define A = EH batch and it is natural then to associate A with the true Hessian H true . In light of (2.1), it is natural to expect some fixed form of the law for H batch − A for any batch size, but with an overall scaling s(b), which must naturally be decreasing in b. Next we address the assumptions made about the spectrum of A. The first assumption one might think to make is that A has fixed rank relative to N , with spectrum consisting only of the spikes θ i , θ j . Behind such an assumption is the intuition that the data distribution does not depend on N and so, in the over-parametrised limit N → ∞, the overwhelming majority of directions in weight space are unimportant. The form we take for A in the above is a strict generalisation of the fixed rank assumption; A still has a fixed number of spiked directions, but the parameter ε controls the rank of A. Since any experimental investigation is necessarily limited to N < ∞, the generalisation to N > 0 is particularly important. Compact support of the measures µ and η is consistent with experimental observations of deep neural network Hessian spectra.

The batch size scaling
Our experimental results considered s(b) = b −1/2 , a choice which we now justify. From (2.1) we have where X (i) are i.i.d. samples from the law of X. Suppose that the entries X ij were Gaussian, with Cov(X ij , X kl ) = Σ ij,kl . Then Z = X (2.50) In the case of centred X, one then obtains Note that this does not quite match the case described in Section 2.1, since we do not there assume EX = 0, however we take this a rough justification for s(b) = b −1/2 as an ansatz. Moreover, numerical experimentation with s(b) = b −q for values of q > 0 shows that q = 1/2 gives roughly the best fit to the data.

Intermediate results on QUE
This section establishes some intermediate results that follow from assuming QUE for the eigenvectors of a matrix. They will be crucial for our application in the following section.

Lemma 3.1. Consider a real orthogonal
are the eigenvectors of a real random symmetric matrix with QUE. Let P be a fixed N × N real orthogonal matrix. Let V = U P and denote the rows Proof. Take any unit vector q, then for any k = 1, . . . , N But P q 2 = q 2 = 1 since P is orthogonal, so the statement of QUE for {u i } N i=1 transfers directly to {v i } N i=1 thanks to the supremum of all unit q.

Lemma 3.2. Consider a real orthogonal N × N matrix U with rows {u
are the eigenvectors of a real random symmetric matrix with QUE. Let 0 (q) = i 1{q i = 0} count the non-zero elements of a vector with respect to a fixed orthonormal basis {e i } N i=1 . For any fixed integer s > 0, define the set Then the columns {u i } N i=1 of U satisfy a weaker form of QUE (for any fixed n, s > 0): We will denote this form of QUE as QUE.
Proof. Take some q ∈ V s . Then there exists some J ⊂ T N with |J| = s and non-zero {q k } k∈J such that q j e T k u j but then the coefficients q j can be absorbed into the definition of the general polynomial in the statement (2.9) of QUE for {u i } N i=1 , which completes the proof, noting that the sum only includes indices contained in T N owing to the definition of V s .

Lemma 3.3. Fix some real numbers {y
Fix also a diagonal matrix Λ and an orthonormal set of vectors {v i } N i=1 that satisfies QUE. Then there exists an ε > 0 and η i ∈ C N with such that for any integer l > 0 where the g i are i.i.d. Gaussians N (0, I N ).
Proof. Let {e i } N i=1 be the standard orthonormal basis from above. Then The ellipsis represents the similar terms where further of the j 1 , . . . , j r are in T c N . For j ∈ T c N the terms are excluded from the statement of QUE, however we can still bound them crudely. Indeed Note that this is error term is surely far from optimal, but is sufficient here. Overall we can now say (3.12) We can apply QUE to the terms in square parentheses to give ε 1 , . . . , ε r > 0 such that (3.13) We can obtain a single error bound by setting ε = min i ε i , where clearly ε > 0 and then write where η 2 i k j k ∈ [−1, 1]. To further include the indices j ∈ T c N , we extended the expression (3.14) to all j k by saying but by comparing with (3.6) we can rewrite as . . . , η iN ).

Main result
Theorem 3.4. Let X be an N ×N real symmetric random matrix and let D be an N ×N symmetric matrix (deterministic or random). Letμ X ,μ D be the empirical spectral measures of the sequence of matrices X, D and assume there exist deterministic limit measures µ X , µ D . Assume that X has QUE, i.e. 2.3. Assume also theμ X concentrates in the sense that where τ > 0 and f is some positive increasing function. Then H = X + D has a limiting spectral measure and it is given by the free convolution µ X µ D .

Remark 3.5. A condition like (3.19) is required so that the
which is of course just a deterministic version of (3.19). [ABM21] introduce the extra condition around concentration of Lipschitz traces: for all δ > 0, Lipschitz f and N large enough, where ζ, c ζ > 0 are some constants. As shown in the proof of Theorem 1.2, this condition is sufficient to obtain for any t > 0 and for large enough N . Note that [ABM21]  Proof. We shall denote use the notation Recall the supersymmetric approach to calculating the expected trace of the resolvent of a random matrix ensemble: with φ ∈ C N and χ, χ * being N -long vectors of anti-commuting variables. Independence of X and D gives (3.30) E D simply means integration against a delta-function density if D is deterministic. Let us introduce some notation: for N × N matrices K, Φ X (K) = E X e −i Tr XK , and similarly Φ D . We also define a new matrix ensembleX (λ 1 , . . . , λ N ) are equal in distribution to the eigenvalues of X and O is an entirely independent Haar-distributed orthogonal matrix. Now and so we need to analyse the error term E(z).
Now consider X = U T ΛU where the rows of U are the eigenvectors {u i } i of X. Say also that K = Q T Y Q for diagonal Y = (y 1 , . . . , y r , 0, . . . , 0), where we note that K has fixed rank, by construction. Then but Lemma 3.1 establishes that the rows of U Q T obey QUE, since the rows of U do. Further, Lemma 3.2 then establishes that the columns of U Q T obey QUE as required by Lemma 3.3. Let {v i } be those columns, then we have The expectation over X can be split into eigenvalues and conditional eigenvectors (3.33) We can simply bound for any n, but clearly since, whatever the distribution of U | Λ, the integral is over a compact group (the orthogonal group O(N )) and the integrand has no singularities. Therefore, by the dominated convergence theorem and in precisely the same way (3.37) Recalling (3.32) we now have (3.38) and similarly (3.39) where thev i are defined in the obvious way fromX. We would now apply QUE, but to do so we must insist that E Λ is taken over the ordered eigenvalues of X. Having fixed that convention, lemma 3.3 can be applied to the terms in (3.38). The terms in ΦX can be treated similarly. This results in The exponential has infinite radius of convergence, so we may re-order the terms in the sums to give cancellation Here ε > 0 and η i , Simplifying, we obtain and so For any fixed δ > 0, ε can be reduced if necessary so that ε < δ and then for sufficiently large N we obtain, say, Thence we can write η T i Λη i = Tr |Λ|ξ i for ξ i ∈ [−2, 2], and similarlyη T i Λη i = Tr |Λ|ξ i . Now so we can apply Laplace's method to the empirical spectral measureμ X to obtain where the o(1) terms do not depend on the y i and where we have defined We have thus established that Remark 3.6. We have also constructed a non-rigorous argument for Theorem 3.4 where the supersymmetric approach is replaced by the replica method. This approach simplifies some of the analysis but at the expense of being less convincing.

Experimental validation
We consider the following matrix ensembles: All of the GOE n , U W ig n , ΓW ig n have the same limiting spectral measure, namely µ SC , the semi-circle of radius √ 2. U W ish n , W ish n have a Marcenko-Pastur limiting spectral measure µ M P , and the constant √ 12 is chosen so that the parameters of the MP measure match those of a Gaussian Wishart matrix W ish n . GOE n , W ish n are the only ensembles whose eigenvectors are Haar distributed, but all ensembles obey a local law in the sense above. It is known that the sum of GOE n and any of the other ensembles will have limiting spectral measure given by the free additive convolution of µ SC and the other ensemble's measure (so either µ SC µ M P or µ SC µ SC ). Our result implies that the same holds for addition of the non-invariant ensembles. Sampling from the above ensembles is simple, so we can easily generate spectral histograms from multiple independent matrix samples for large n. µ SC µ SC is just another semi-circle measure but with radius 2. µ SC µ M P can be computed in the usual manner with R-transforms and is given by the solution to the polynomial i.e. Say the cubic has roots {r 1 , r 2 + is 2 , r 2 − is 2 } for s 2 0, then the density of µ SC µ M P at z is s 2 /π. This can all be solved numerically. The resulting plots are in Figure 4 and clearly show agreement between the free convolutions and sampled spectral histograms. We can also test the result in another more complicated case. Consider the case of random d-regular graphs on N vertices. Say M ∼ Reg N,d is the distribution of the adjacency matrix of µ SC and compare to spectral histograms as above. Alternatively we can investigate agreement with µ (d) KM µ SC indirectly by sampling and comparing spectra from say Reg N,d +U W ig N and also from Reg N,d +GOE N . The latter case will certainly yield the distribution µ (d) KM µ SC since the GOE matrices are freely independent from the adjacency matrices. Figure shows a q-q plot for samples from these two matrix distributions and demonstrates near-perfect agreement, thus showing that indeed the spectrum of Reg N,d +U W ig N is indeed described by µ (d) KM µ SC . We reached the same conclusion when repeating the above experiment with U W ish N +Reg N,d and W ish n + Reg N,d .

Extension of a key result and prevalence of minima
Let's recall Theorem 4.5 from [ABM21]. H N (u) is our random matrix ensemble with some parametrisation u ∈ R m and its limiting spectral measure is µ ∞ (u). Define So G −ε is the event that µ ∞ (u) is close to being supported only on (0, ∞). Let l(u), r(u) be te left and right edges respectively of the support of µ ∞ (u). • For every R > 0 and every ε > 0, we have • Several other assumptions detailed in [ABM21].
Then for any α > 0 and any fixed p ∈ N, we have We claim the following extension Corollary 4.2. Under the same assumptions as the above theorem and for any integer sequence (4.4) Proof. Firstly note that so it suffices to establish a complementary upper bound. The proof in of Theorem 4.5 in [ABM21] establishes an upper bound using which holds for all ε > 0, so our proof is complete if we can prove the analogous result (4.7) As in [ABM21], let f ε be some 1 2 -Lipschitz function satisfying ε and also (4.9) We have for some η > 0, then we obtain d BL (μ H N (u) , µ ∞ (u)) η. Then applying (4.2) yields the result (4.7).
(4.11) can be satisfied if So, given ε > 0, we can take N large enough such that, say, k(N ) N < ε 4 . By taking η < ε 2 128 we obtain and so (4.12) is satisfied. Now finally (4.2) can be applied (with η in place of ε) and so we conclude (4.7).
Overall we see that the superexponential BL condition (4.2) is actually strong enough to deal with any o(N ) index not just index-0. This matches the GOE (or generally invariant ensemble) case, in which the terms with 1{i(H N (u)) = k} are suppressed compared to the exact minima terms 1{i(H N (u)) = 0}.

The dichotomy of rough and smooth regions
Recall the batch loss from Section 2.1: ∼ P data . (4.14) As with the Hessian in Section 2.1, we use the model L ≡ L batch (w) = L true (w) + s(b)V (w), where V is a random function R N → R. Now let us define the complexity for sets This is simply the number of stationary points of the training loss in the region B of weight space. A Kac-Rice formula applied to ∇L gives where φ w is the density of ∇V at w. A rigorous justification of this integral formula would, for example, have to satisfy the conditions of the results of [AT + 07]. This is likely to be extremely difficult in any generality, though is much simplified in the case of Gaussian V (and X) -see [AT + 07] Theorem 12.1.1 or [BKMN21] Theorem 4.4. Hereafter, we shall take (4.16) as assumed. The next step is to make use of strong self-averaging of the random matrix determinants. Again, we are unable to establish this rigorously at present, but note that this property has been proved in some generality by [ABM21], although we are unable to satisfy all the conditions of those results in any generality here. Self-averaging and using the addition results above gives where µ b , ν depend in principle on w. We are concerned with N −1 log EC N , and in particular its sign, which determines the complexity of the loss surface in B: positive ↔ exponentially many (in N ) critical points, negative ↔ exponentially few (i.e. none). The natural next step is to apply the Laplace method with large parameter N to determine the leading order term in EC N , however the integral is clearly not of the right form. Extra assumptions on φ w and ∇L true could be introduced, e.g. that they can be expressed as functions of only a finite number of combinations of coordinates of w.
Suppose that φ w has its mode at 0, for any w, which is arguably a natural property, reflecting in a sense that the gradient noise has no preferred direction in R N . The sharp spike at the origin in the spectral density of deep neural network Hessians suggests that generically (4.17) We claim it is reasonable to expect the gradient (and Hessian) variance to be increasing in w 2 . Indeed, consider the general form of a multi-layer perceptron neural network: where all of the weight matrices W (l) and bias vectors b (l) combine to give the weight vector w. Viewing x as a random variable, making f a random function of w, we expect from the above that the variance in f w is generally increasing in w 2 , and so therefore similarly with L batch .
Overall it follows that φ w (−s(b) −1 ∇L true ) is generally decreasing in ∇L true , but the maximum value at φ w (0) is decreasing in w 2 . The picture is therefore that the loss surface is simple and without critical points in regions for which ∇L true is far from 0. In neighbourhoods of ∇L true = 0, the loss surface may become complex, with exponentially many critical points, however if w 2 is too large then the loss surface may still be without critical points. In addition, the effect of larger batch size (and hence larger s(b) −1 ) is to simplify the surface. These considerations indicate that deep neural network loss surfaces are simplified by over-parametrisation, leading to the spike in the Hessian spectrum and thus (4.17). The simple fact that neural networks' construction leads gradient noise variance to increase with w 2 has the effect of simplifying the loss landscape far from the origin of weight space, and even precluding the existence of any critical points of the batch loss.

Implications for curvature from local laws
Consider a general stochastic gradient update rule with curvature-adjusted preconditioning: where recall that L(w) is the batch loss, viewed as a random function on weight space. B t is some preconditioning matrix which in practice would be chosen to somehow approximate the curvature of L. Such methods are discussed at length in [Mar16] and also describe some of the most successful optimisation algorithms used in practice, such as Adam [KB14]. The most natural choice for B t is B t = ∇ 2 L(w t ), namely the Hessian of the loss surface. In practice, it is standard to include a damping parameter δ > 0 in B t , avoid divergences when inverting. Moreover, typically B t will be constructed to be some positive semi-definite approximation to the curvature such as the generalised Gauss Newton matrix [Mar16], or the diagonal gradient variance form used in Adam [KB14]. Let us now suppose that B t = B t (δ) =Ĥ t + δ, whereĤ t is some chosen positive semi-definite curvature approximation and δ > 0. We can now identify B t (δ) −1 as in fact the Green's function ofĤ t , i.e.
But G t is precisely the object used in the statement of a local law on forĤ t . Note that ∇L(w t ) is a random vector and howeverĤ t is constructed, it will generally be a random matrix and dependent on ∇L(w t ) in some manner that is far too complicated to handle analytically. As we have discussed at length hitherto, we conjecture that a local law is reasonable assumption to make on random matrices arising in deep neural networks. In particular [BGK22] demonstrated universal local random matrix theory statistics not just for Hessians of deep networks but also for Generalised Gauss-Newton matrices. Our aim here is to demonstrate how a local law onĤ t dramatically simplifies the statistics of (5.1). Note that some recent work [WHS22] has also made use of random matrix local laws to simplify the calculation of test loss for neural networks.
A local law onĤ t takes the precise form (for any ξ, D > 0 µ is the limiting spectral measure ofĤ t and, crucially, Π is a deterministic matrix. We will use the following standard notation to re-express (5.3) and the probabilistic statement, valid for all ξ, D > 0 is implicit in the symbol ≺. In fact, we will need the local law outside the spectral support, i.e. at z = x + iη where x ∈ R\supp(µ). In that case Ψ N (z) is replaced by 1 N (η+κ) where κ is the distance of x from supp(µ) on the real axis, i.e.
For δ > 0 this becomes for δ > 0 and now any u, v. Applying this to (5.1) gives Consider any u with u 2 = α, then we obtain Thus with high probability, for large N , we can replace (5.1) by incurring only a small error, provided that Note that the only random variable in (5.10) is ∇L(w t ). If we now consider the case ∇L(w t ) = ∇L(w t ) + g(w t ) for deterministicL, then and so the noise in the parameter update is entirely determined by the gradient noise. Moreover note the linear dependence on g in (5.12). For example, a Gaussian model for g immediately yields a Gaussian form in (5.12), and e.g. if Eg = 0, then (5.13) A common choice in practice forĤ is a diagonal matrix, e.g. the diagonal positive definite curvature approximation employed by Adam [KB14]. In such cases,Ĥ is best viewed as an approximation to the eigenvalues of some positive definite curvature approximation. The next result establishes that a local law assumption on a general curvature approximation matrix can be expected to transfer to an analogous result on a diagonal matrix of ts eigenvalues. (5.14) Then D obeys the local law where κ is the distance of x from supp(µ). Naturally, we can redefine D i = λ σi for any permutation σ ∈ S N and the analogous statement replacing q j [µ] with q σ(j) will hold.
Proof. As in [EY17], the local law (5.5), (5.6) is sufficient to obtain rigidity of the eigenvalues in the bulk, i.e. for any ε, D > 0 (5.16) Then we have . (5.17) For z = x + iη and x at a distance κ > 0 from supp(µ) which yields the result.
With this result in hand, we get the generic update rule akin to (5.12), with high probability where we emphasise again that the π j are deterministic and the only stochastic term is the gradient noise g(w t ).
Implications for preconditioned stochastic gradient descent.. The key insight from this section is that generic random matrix theory effects present in preconditioning matrices of large neural networks can be expected to drastically simplify the optimisation dynamics due to high-probability concentration of the pre-conditioning matrices around deterministic equivalents, nullifying the statistical interaction between the pre-conditioning matrices and gradient noise. Moreover, with this interpretation, the damping constant typically added to curvature estimate matrices is more than a simple numerical convenience: it is essential to yield the aforementioned concentration results.
As an example of the kind of analysis that the above makes possible, consider the results of [GBW + 21]. The authors consider a Gaussian process model for the noise in the loss surface, resulting in tractable analysis for convergence of stochastic gradient descent in the presence of statistical dependence between gradient noise in different iterations. Such a model implies a specific form of the loss surface Hessian and its statistical dependence on the gradient noise. This situation is a generalisation of the spin glass model exploited in various works [CHM + 15,BKMN21,BKMN22], except that in those cases the Hessian can be shown to be independent of the gradients. Absent the very special conditions that lead to independence, one expects the analysis to be intractable, hence why the authors in [GBW + 21] restrict to SGD without preconditioning, or simply assume a high probability concentration on a deterministic equivalent. To make this discussion more concrete, consider a model L = L true + V where V is a Gaussian process with mean 0 and covariance function where k is some decreasing function and q some increasing function. The discussion at the end of the previous section suggests that the covariance function for loss noise should not be modelled as stationary, hence the inclusion of the q term. For convenience define ∆ = 1 2 ( x − x 2 2 ) and S = 1 2 ( x 2 2 + x 2 2 ). Then it is a short exercise in differentiation to obtain and moreover Hence we see that the gradients of L and its Hessian are statistically dependent by virtue of the nonstationary structure of V . Putting aside issues of positive definite pre-conditioning matrices, and taking δ such that (∇ 2 L + δ) −1 exists (almost surely) for large N , it is clear that the distribution of (∇ 2 L+δ) −1 ∂V will be complicated and non-Gaussian. This example concretely illustrates our point: even in almost the simplest case, where the gradient noise is Gaussian, the pre-conditioned gradients are generically considerably more complicated and non-Gaussian. Moreover, centred Gaussian noise on gradient is transformed into generically non-centred noise by pre-conditioning. Continuing the differentiation above, it is elementary to obtain the covariance structure of the Hessian ∇ 2 V , though the expressions are not instructive. Crucially, however, the Hessian is Gaussian and the covariance of any of its entries is O(1) (in large N ), so the conditions in Example 2.12 of [EKS19] apply to yield an optimal local law on the Hessian, which in turn yields the above high-probability concentration of (∇ 2 L + δ) −1 provided that δ is large enough.

Conclusion
In this paper we have considered several aspects of so-called universal random matrix theory behaviour in deep neural networks. Motivated by prior experimental results, we have introduced a model for the Hessians of DNNs that is more general than any previously considered and, we argue, actually flexible enough to capture the Hessians observed in real-world DNNs. Our model is built using random matrix theory assumptions that are more general than those previously considered and may be expected to hold in quite some generality. By proving a new result for the addition of random matrices, using a novel combination of quantum unique ergodicity and the supersymmetric method, we have derived expressions for the spectral outliers of our model. Using Lanczos approximation to the outliers of large, practical DNNs, we have compared our expressions for spectral outliers to data and demonstrated strong agreement for some DNNs. As well as corroborating our model, this analysis presents indirect evidence of the presence of universal local random matrix statistics in DNNs, extending earlier experimental results. Our analysis also highlights a possibly interesting distinction between some DNN architectures, as Resnet architectures appear to better agree with our theory than other architectures and Resnets have been previously observed to have better-behaved loss surfaces than many other architectures.
We also presented quite general arguments regarding the number of local optima of DNN loss surfaces and how 'rough' or 'smooth' such surfaces are. Our arguments build on a rich history of complexity calculations in the statistical physics and mathematics literature but, rather than performing detailed calculations in some specific, highly simplified toy model, we instead present general insights based on minimal assumptions. Finally we highlight an important area where random matrix local laws, an essential aspect of universality, may very directly influence the performance of certain popular optimisation algorithms for DNNs. Indeed, we explain how numerical damping, combined with random matrix local laws, can act to drastically simplify the training dynamics of large DNNs.
Overall it is our hope that this paper demonstrates the relevance of random matrix theory to deep neural networks beyond highly simplified toy models. Moreover, we have shown how quite general and universal properties of random matrices can be fruitfully employed to derive practical, observable properties of DNN spectra. This work leaves several challenges for future research. All of our work relies on either local laws for e.g. DNN Hessians, or on matrix determinant self-averaging results. Despite the considerable progress towards establishing local laws for random matrices over the last decade or-so, it appears that establishing any such laws for, say, the Hessians of any DNNs is quite out of reach. We expect that the first progress in this direction will come from considering DNNs with random i.i.d. weights and perhaps simple activation functions. Based on the success of recent works on random DNNs [PS20], we conjecture that the Gram matrices of random DNN Jacobians may be the simplest place to establish a local law, adding to the nascent strand of nonlinear random matrix theory [PW17, BP19,PS20]. We also believe that there is more to be gained in further studies of forms of random matrix universality in DNNs. For example, our ideas may lead to tractable analysis of popular optimisation algorithms such as Adam [KB14] as the problem is essentially reduced to deriving a local law for the gradient pre-conditioning matrix and dealing with the gradient noise.
Observe that if all y i , z i ∈ supp(µ) then S V [µ](y i ) = S V [µ](y i ) = c and so E V [µ ] = E V [µ]. Without loss of generality therefore, we take y i / ∈ supp(µ) and z i ∈ supp(µ), whence Lemma Appendix A.2. Consider a probability measure µ on R with compact support, absolutely continuous with respect to the Lebesgue measure. Then there exists a potential V : R → R which yields a well-defined invariant distribution on real symmetric matrices for which the equilibrium measure is µ.
Proof. (A.2) can be integrated to obtain V and the condition S V [µ] = c (a constant) on supp(µ) determines V uniquely on supp(µ). Next observe that, for y ∈ R\supp(µ) there exists some constant R > 0 such that |x − y| R + |y| and so log |x − y| |y| + R. Therefore So the condition S V [µ](y) c for y ∈ R\supp(µ ∞ ) can be satisfied by taking V (y) = (y − y 0 ) 2 + b for some constants y 0 , b, which we need not record. There is sufficient freedom in b, y 0 left to make V continuous on R. Since V is constructed with Gaussian tails, it can certainly be used to define a legitimate invariant ensemble of real symmetric random matrices. By Lemma Appendix A.1, the V we have constructed has equilibrium measure µ.

Appendix B. Experimental details
This section gives full details of the experimental set-up and analysis for the outlier experiments.
Appendix B.1. Architectures and training of models. We use the GPU powered Lanczos quadrature algorithm [GPW + 18,MS06], with the Pearlmutter trick [Pea94] for Hessian vector products, using the PyTorch [PGC + 17] implementation of both Stochastic Lanczos Quadrature and the Pearlmutter. We then train a 16 Layer VGG CNN [SZ14] with P = 15291300 parameters and the 28 Layer Wide Residual Network [ZK16,HZRS16] architectures on the CIFAR-100 dataset (45,000 training samples and 5,000 validation samples) using SGD. We use the following learning rate schedule: We use a learning rate ratio r = 0.01 and a total number of epochs budgeted T = 300. We further use momentum set to ρ = 0.9, a weight decay coefficient of 0.0005 and data-augmentation on PyTorch [PGC + 17].

Appendix B.2. Implementation of constraints
As mentioned in the main text, one of the three weights of the linear model fit in the outlier analysis, β, is constrained to be positive, as it corresponds to a second cumulant, i.e. a variance, of a probability measure. Recall that the linear model's parameters are solved exactly as functions of the unknown θ (i) , and these parameters are in turn optimised using gradient descent. β is unconstrained during the linear solve, but its value is determined by the θ (i) , so to impose the constraint β > 0 we add to the mean squared error loss the term β = 1000 max(0, −β) (B.2) which penalises negative β values and is minimised at any non-negative value. The factor 1000 was roughly tuned by hand to give consistently positive values for β.
There is also the constraint that θ (i) > θ (i+1) > 0 for all i. This is imposed simply using a re-parametrisation. We introduce unconstrained raw value t (i) taking values in R and define θ (i) = i j=1 log(1 + exp(t (i) )), then the gradient descent optimisation is simply performed over the t (i) .

Appendix B.3. Fitting of outlier model
We optimise the mean squared error with respect to the raw parameters t (i) using 200 iterations of Adam [KB14] with a learning rate of 0.2. The learning rate was chosen heuristically by increasing in steps until training became unstable. The number of iterations was chosen heuristically as being comfortably sufficient to obtain convergence of Adam. The raw parameters t (i) were initialised by drawing independently from a standard Gaussian. The t (i) were initialised and trained using the above method 20 times and the values with the lowest mean squared error were chosen.