Large Deviation Analysis of Function Sensitivity in Random Deep Neural Networks

Mean field theory has been successfully used to analyze deep neural networks (DNN) in the infinite size limit. Given the finite size of realistic DNN, we utilize large deviation theory and path integral analysis to study the deviation of functions represented by DNN from their typical mean field solutions. The parameter perturbations investigated include weight sparsification (dilution) and binarization, which are commonly used in model simplification, for both ReLU and sign activation functions. We find that random networks with ReLU activation are more robust to parameter perturbations than their counterparts with sign activation, which arguably reflects the simplicity of the functions they generate.


Introduction
Learning machines realized by deep neural networks (DNN) have achieved impressive success in performing various machine learning tasks, such as speech recognition, image classification and natural language processing [1]. While DNN typically have numerous parameters and their training comes at a high computational cost, their applications have also been extended to devices with limited memory or computational resources, such as mobile devices, thanks to compressed networks and reduced parameter precision [2]. Most supervised learning scenarios involve DNN representing some input-output mapping, learned on the basis of input-output example patterns. DNN parameter estimation (training) aims at obtaining a network that approximates well the underlying mapping. Despite their profound engineering success, a comprehensive understanding of the intrinsic working mechanism [3,4] and the generalization ability [5,6,7,8] of DNN is still lacking. The difficulty in analyzing DNN is due to the recursive nonlinear mapping between layers they implement and the coupling to data and learning dynamics.
A recent line of research utilizes the mean field theory in statistical physics to investigate various DNN characteristics, such as expressive power [9], Gaussian process-like behaviors of wide DNN [10,11,12], dynamical stability in layer propagation and its impact on weight initialization [13,14,15] and function similarity and entropy in the function space [16]. By assuming large layer-width and random weights, such techniques harness the specific type of nonlinearity used and the many degrees of freedom to provide valuable analytical insights. The Gaussian process perspective of infinitely wide DNN also facilitates the analysis of training dynamics and generalization by employing established kernel methods [17,18].
To study the entropy of functions realized by DNN [16], we adopted similar assumptions but employed the generating functional analysis [19,20], which is more general and can be applied to sparse and weight-correlated networks. This analysis shows that the function error incurred by weight perturbations grows exponentially for DNN with sign activation functions, while networks with the ReLU activation function are more robust to perturbations. We have also found that ReLU activation induces correlations among variables in random convolutional networks [16]. The robustness of random networks with ReLU activation is related to the simplicity of the functions they compute [21,22], which may converge to a constant function in the large depth and width limit [15], although, in principle, they admit high capacity with arbitrary weights. However, DNN used in practice are of finite size and depth; it is therefore essential to analyze the deviation of finite-size systems from the typical mean field behavior, and to characterize the rate of convergence with increasing size. A recent study along these lines [23] investigates the deviation in performance of finite size neural networks with a single hidden layer from the Gaussian process behavior.
In this work, we adopt the large deviation approach and the path integral formalism of [16] to derive the deviation of the function sensitivity of finite systems from their infinite-system counterparts, which is applicable to a range of DNN structures. We analyze the effect of sparsifying (diluting) and binarizing DNN weights, commonly used for model simplification [24,25,26,27]. Although the dependence on data and training is not considered, the analysis of random DNN provides valuable insights and baseline comparisons. We also investigate the sensitivity of functions to input perturbations [9,13], which is related to function complexity and generalization [28,29,21,22]. The paper is organized as follows. In Secs. 2 and 3, we introduce the random DNN model and review the basic results of the generating functional analysis, respectively. In Secs. 4 and 5, we derive the large deviation of function sensitivity to weight and input perturbations, respectively, based on the path integral formalism. Finally, in Sec. 6, we discuss the results and their implications.

The model
Following [16], we consider two coupled fully-connected DNN. One of them serves as the reference function under consideration, and the other as its perturbed counterpart, either in the weights or in the input variables. As shown in Fig. 1, each network consists of L + 1 layers; layer l has N_l neurons, which can be layer dependent.

Figure 1. The reference and perturbed fully-connected DNN, parameterized by {ŵ^l} (black edges) and {w^l} (blue edges), respectively. Each layer l has N_l = α_l N nodes.

The reference network is parameterized by the weight variables {ŵ^l}_{l=1}^{L}‡, while the perturbed network is parameterized by {w^l}_{l=1}^{L}. In general, variables with a circumflex are associated with the reference network. In the following, w^l represents the N_l × N_{l−1} weight matrix at layer l, and w^l_i represents the N_{l−1}-dimensional weight vector of the ith perceptron at layer l. Denoting the input dimension as N = N_0, we assume the sizes of all layers scale linearly with N as N_l = α_l N.
A deterministic feed-forward network is defined by the recursive mapping

h^l_i = (1/√N_{l−1}) Σ_j w^l_{ij} s^{l−1}_j,    s^l_i = φ^l(h^l_i),    (1)

where {w^l_{ij}} are the weights, h^l_i and s^l_i are the pre- and post-activation field and variable, respectively, and φ^l(·) is the activation/transfer function at layer l. The scaling factor of 1/√N_{l−1} in Eq. (1) is introduced for normalization. We primarily focus on networks with either sign [φ_s(x) = sgn(x)] or ReLU [φ_r(x) = max(x, 0)] activation functions in the hidden layers, and consider binary input and output variables s^0_i, s^L_i ∈ {1, −1} by applying the sign activation function at the output layer, s^L_i = sgn(h^L_i), for a fair comparison across architectures. The resulting feed-forward DNN implements a Boolean mapping f : {1, −1}^{N_0} → {1, −1}^{N_L}, where each output node s^L_i(s^0) computes a Boolean function. In the following, we call the two architectures sign-DNN and relu-DNN respectively, keeping in mind that the sign activation function is always applied in the output layer.
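As a concrete illustration, the recursive mapping and the Boolean output convention can be sketched in a few lines of numpy; the depth, width and weight scale below are arbitrary illustrative choices (σ_w = √2 is the ReLU-stabilizing scale used later in the text):

```python
import numpy as np

def forward(weights, s0):
    """Propagate an input through a deterministic feed-forward DNN.

    weights: list of N_l x N_{l-1} weight matrices, one per layer;
    hidden layers use ReLU here, and the sign function is always
    applied at the output layer, yielding a Boolean mapping.
    """
    s = s0
    for l, W in enumerate(weights):
        h = W @ s / np.sqrt(len(s))        # 1/sqrt(N_{l-1}) normalization
        if l == len(weights) - 1:
            s = np.sign(h)                 # sign activation at the output layer
        else:
            s = np.maximum(h, 0.0)         # ReLU hidden activation
    return s

rng = np.random.default_rng(0)
N, L = 200, 4
# reference network: i.i.d. Gaussian weights with sigma_w = sqrt(2),
# which keeps the ReLU activation magnitude stable across layers
w_hat = [rng.normal(0.0, np.sqrt(2.0), size=(N, N)) for _ in range(L)]
s0 = rng.choice([-1.0, 1.0], size=N)       # binary input s^0
sL = forward(w_hat, s0)                    # binary output s^L in {-1, 1}
```

Swapping `np.maximum(h, 0.0)` for `np.sign(h)` turns this into the sign-DNN variant.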
To facilitate a path integral calculation, we consider stochastic dynamics between successive layers. For a layer with sign activation function, the activation s^l_i is disturbed by thermal noise, taking the value ±1 with a probability controlled by an inverse temperature β, while for the ReLU activation function, s^l_i is disturbed by additive Gaussian noise. In the limit β → ∞, we recover the deterministic model. The evolution of the two systems follows the joint distribution of Eq. (5), obtained by chaining the layer-wise transition probabilities of the two networks. To probe the difference between the functions implemented by the two networks, we feed the same single input s^0 = ŝ^0 into the two systems, such that s^0_i = ŝ^0_i for all i, and study the resulting output difference due to parameter perturbation. For continuous weight variables, one useful choice of weight perturbation is

w^l_{ij} = √(1 − (η^l)²) ŵ^l_{ij} + η^l δw^l_{ij},    (6)

which ensures that w^l_{ij} has the same variance as ŵ^l_{ij} as long as δw^l_{ij} follows the same distribution as ŵ^l_{ij}, and effectively rotates the high dimensional vector ŵ^l_i by an angle θ^l = sin^{−1} η^l, as demonstrated schematically in Fig. 2.

‡ The usual bias variables are omitted for simplicity, but they can be easily accommodated within the current framework.
In probing the sensitivity of a function due to input perturbations, the weights of two networks are kept the same w =ŵ and a fixed fraction of input variables are flipped randomly. The resulting output difference of the two systems reflects the sensitivity and complexity of the underlying DNN.

Generating functional analysis for typical behavior
Viewing the weights {ŵ^l_{ij}, w^l_{ij}} as quenched random variables, a generating functional analysis has been proposed [16] to derive the typical behavior of DNN. It starts with computing the disorder-averaged generating functional, where the average E_{ŝ,s} is taken with respect to the joint probability of Eq. (5). Assume the layer widths are the same, N_l = N for all l. Upon averaging over the disorder ŵ, w, the generating functional can be expressed through a set of macroscopic order parameters, such as the overlaps q^l = (1/N_l) Σ_i ŝ^l_i s^l_i and magnetizations m̂^l = (1/N_l) Σ_i ŝ^l_i, m^l = (1/N_l) Σ_i s^l_i, where Q is the conjugate variable of the order parameter q. In the large system size limit N → ∞, the generating functional Γ is dominated by the saddle point of the potential function Ψ(q, Q, ...). This gives rise to typical overlaps that dominate in probability, which facilitates analytical studies of random DNN. Assume the weight perturbation follows the form of Eq. (6), and that both weights and perturbations are mutually independent and follow a Gaussian distribution, w^l_{ij}, δw^l_{ij} ∼ N(0, σ_w²). It is found that for a layer with sign activation function in the limit β → ∞, the overlap evolves as [16]

q^l = (2/π) sin^{−1}( √(1 − (η^l)²) q^{l−1} ).    (9)

Similarly, for the ReLU activation function in the deterministic limit, if the weight standard deviation is chosen as σ_w = √2, the magnitude of the activations remains stable and the overlap evolves as

q^l = (1/π) [ √(1 − (ρ^l)²) + ρ^l (π − cos^{−1} ρ^l) ],  with  ρ^l = √(1 − (η^l)²) q^{l−1},    (10)

while the output layer L follows Eq. (9) due to the use of the sign activation function. The restriction s^0 = ŝ^0 leads to q^0 = 1 in both cases.
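The overlap recursions above can be iterated numerically. The sketch below assumes the standard Gaussian-field kernels for sign (arcsine) and for ReLU with σ_w = √2 (arccosine), with the perturbation entering through the effective correlation factor √(1 − (η^l)²); for L = 4 and correlation factor √(1/2) it reproduces the mean field values q^L_mf ≈ 0.047 (sign) and q^L_mf ≈ 0.266 (ReLU) quoted later in the text:

```python
import numpy as np

def sign_layer(q, cos_theta):
    # overlap map for a sign-activation layer: the fields (h_hat, h) are
    # jointly Gaussian with correlation coefficient rho = cos(theta) * q
    rho = cos_theta * q
    return (2.0 / np.pi) * np.arcsin(rho)

def relu_layer(q, cos_theta):
    # normalized overlap map for a ReLU layer with sigma_w = sqrt(2)
    # (arccosine kernel; the activation magnitude stays stable across layers)
    rho = cos_theta * q
    return (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi

def typical_output_overlap(L, cos_theta, hidden):
    q = 1.0                              # s^0 = s_hat^0 gives q^0 = 1
    for _ in range(L - 1):
        q = hidden(q, cos_theta)         # hidden layers 1..L-1
    return sign_layer(q, cos_theta)      # sign activation at the output layer

# a rotation with cos(theta) = sqrt(1/2), equivalent to dilution with p = 1/2
ct = np.sqrt(0.5)
q_sign = typical_output_overlap(4, ct, sign_layer)   # ~ 0.047
q_relu = typical_output_overlap(4, ct, relu_layer)   # ~ 0.266
```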

Large deviations in parameter sensitivity of functions
The generating functional analysis above gives the typical behavior of random DNN in the limit N → ∞. However, practical DNN always have finite sizes. Therefore, it is worthwhile to understand the deviation from the most probable behavior at finite N. In the following, we adopt a large deviation analysis to tackle this problem. An introduction to large deviation theory and its applications in statistical mechanics can be found in [30]. In essence, a continuous observable O in a system of size N (assumed to be large) is said to satisfy the large deviation principle if the probability density of O scales as P_N(O = x) ≃ e^{−N I(x)}, where I(x) is the rate function of the observable. The probability is then concentrated at the minimum of the rate function, x* = argmin_x I(x), in large systems, and the profile of I(x) quantifies the fluctuations of the observable. In this work, the overlap of the output layer, q^L := (1/N_L) Σ_i ŝ^L_i s^L_i, is the focus of our study. The path integral techniques adopted in the generating functional framework [16] can be adapted to the large deviation analysis. We start with computing the probability density P(q^L)§, where the operation Tr_{ŝ,s} is understood as an integration or summation depending on the nature of the variables, and the input distribution enforces the constraint s^0 = ŝ^0. To deal with the nonlinearity of the pre-activation fields in the conditional probability, we introduce auxiliary fields {x̂^l_i, x^l_i} through the integral representation of the delta function, which allows us to express the quenched random variables ŵ^l_{ij} and w^l_{ij} linearly in the exponents. Assuming self-averaging [31], we exchange the order of summation and integration, and first carry out the average over the disorder variables.
Specifically, we consider the weights of the reference network to be independent and Gaussian distributed, ŵ^l_{ij} ∼ N(0, σ_w²), as before, and consider three types of perturbation: (i) rotation of the weight vector ŵ^l_i following Eq. (6); (ii) sparsification of the weight matrix ŵ^l (15), obtained by randomly dropping connections with probability p^l and rescaling the remaining weights by 1/√(1 − p^l) to preserve the overall weight strength; (iii) binarization of the weights, w^l_{ij} = σ_w sgn(ŵ^l_{ij}) (16), where σ_w is introduced to keep the variance of w^l_{ij} the same as that of ŵ^l_{ij}.

§ Here we assume q^L = (1/N_L) Σ_{i=1}^{N_L} ŝ^L_i s^L_i to be a continuous variable by considering large N_L. Alternatively, one can view q^L as a discrete variable by definition (since the outputs are binary variables), in which case δ(·) should be understood as the Kronecker delta function.

Macroscopic order parameters
For perturbation of type (i), the disorder average of the third line of Eq. (14) yields the expression of Eq. (17). To decouple Eqs. (14) and (17) over sites, we introduce three sets of order parameters by inserting the corresponding integral identities, and express the output constraint in its integral representation. Upon introducing these macroscopic order parameters, Eq. (17) becomes site-factorized. The probability density in Eq. (14) then involves N_l identical integrations and summations at each layer l, which can be performed individually [16], where we have integrated out the auxiliary fields {x̂^l, x^l} and introduced the field doublet H^l := [ĥ^l, h^l]^⊤. We further write P(q^L) in saddle point form, where −NΦ(Q, q, V̂, v̂, V, v | q^L) equals the logarithm of the integrand in Eq. (21). Similar to the analysis in [16], the probability density P(q^L) is dominated by the saddle point (Q*, q*, ...) of the potential function Φ(...) in the large N limit (N_l = α_l N with α_l constant), where I(q^L) = Φ(Q*, q*, ...|q^L) is the desired rate function. While this set-up is based on computing the deviation in function similarity for a single input, q^L = (1/N_L) Σ_i ŝ^L_i s^L_i, one may argue that a robust estimation requires testing on more than one input, e.g., through the multi-pattern overlap q̄^L = (1/M) Σ_{µ=1}^{M} q^{L,µ}, where M is the number of independent patterns used. Assuming that the representations of different patterns are uncorrelated, we show in Appendix C that for small M, the rate function Ī(q̄^L) of q̄^L is approximately related to the single input case through the simple scaling

Ī(q̄^L) ≈ M I(q̄^L).    (25)

This assumption is valid for sign-DNN but not for relu-DNN. We also confirm this scaling relation by numerical experiments (see below and Appendix C).
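The Gaussian-fluctuation argument behind this scaling can be illustrated with a surrogate numerical experiment; the fluctuation variance `s2` below is an arbitrary stand-in, not a computed quantity:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, trials = 64, 4, 100000
q_mf, s2 = 0.047, 1.0    # q_mf from the sign-DNN example; s2 is an arbitrary choice

# surrogate per-pattern overlaps with Gaussian fluctuations of variance s2/N,
# mimicking the small-fluctuation regime of the large deviation principle
q_single = rng.normal(q_mf, np.sqrt(s2 / N), size=trials)
q_multi = rng.normal(q_mf, np.sqrt(s2 / N), size=(trials, M)).mean(axis=1)

# curvature of the quadratic rate function recovered from sample variances:
# I''(q_mf) = 1 / (N var), so the M-pattern curvature is M times larger
curv_single = 1.0 / (N * q_single.var())
curv_multi = 1.0 / (N * q_multi.var())
print(curv_multi / curv_single)   # close to M = 4
```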

Unifying three types of weight perturbations
The other two types of perturbation can be treated similarly. For network sparsification (15), the disorder average of Eq. (14) has the following form in the large N_l limit (see Appendix A for details), which has the same form as Eq. (17) upon identifying (η^l)² with p^l. Introducing the same order parameters, we obtain the covariance of the fields ĥ^l and h^l in the corresponding form. Hence, diluting connections with probability p^l at layer l in a random DNN corresponds to rotating each of the weight vectors ŵ^l_i by an angle θ^l = sin^{−1} √(p^l).
Similarly, for network binarization in Eq. (16), the disorder average of Eq. (14) yields (see Appendix B for details) a covariance matrix of the fields ĥ^l and h^l of the corresponding form. Comparing to the type (i) perturbation, one finds that binarizing the weight elements in a random DNN corresponds to rotating each of the weight vectors ŵ^l_i by a fixed angle θ^l = cos^{−1} √(2/π) ≈ 37°. This phenomenon has been observed in [32] and is linked to the practical success of binary DNN. It is argued in [32] that 37° is a very small angle in high dimensional spaces, where two randomly sampled vectors are typically orthogonal to each other; therefore weight binarization approximately preserves the directions of the high dimensional weight vectors, which contributes to the success of binary DNN.
Therefore, we establish that the three types of perturbations on random DNN can be unified in the same framework developed in Sec. 4.1.
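Both equivalences are easy to check empirically on a single high dimensional weight vector; the sketch below measures the angle between a Gaussian vector and its diluted and binarized versions (σ_w = 1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100000, 0.5
w = rng.normal(0.0, 1.0, size=N)          # a reference weight vector

def angle_deg(a, b):
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(cos))

# (ii) dilution: drop each weight with probability p, rescale by 1/sqrt(1-p)
mask = rng.random(N) >= p
w_dil = np.where(mask, w, 0.0) / np.sqrt(1.0 - p)

# (iii) binarization: w -> sigma_w * sgn(w), here with sigma_w = 1
w_bin = np.sign(w)

print(angle_deg(w, w_dil))   # ~ asin(sqrt(p))    = 45 degrees for p = 1/2
print(angle_deg(w, w_bin))   # ~ acos(sqrt(2/pi)) ~ 37 degrees
```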

Saddle point equations
For networks with a generic activation function, the large deviation potential function Φ(...) can be expressed in terms of the order parameters and their conjugates. Setting the derivatives with respect to the conjugate order parameters, ∂Φ/∂iV̂^l, ∂Φ/∂iV^l, ∂Φ/∂iQ^l, to zero yields the saddle point equations, in which M^l(ŝ^l, s^l, ĥ^l, h^l) bears the meaning of an effective measure [33]. Notice that q^L is an input parameter imposing a nonlinear end point constraint on iQ^L, which differs from the generating functional calculation of typical behaviors [16], where q^L is a dynamical variable and iQ^L = 0 at the saddle point.
Setting ∂Φ/∂q^l to zero yields the saddle point equations for the conjugate order parameters iQ^l. Similar relations hold for iV̂^l and iV^l. While the conjugate order parameters {V̂^l, V^l, Q^l} are defined on the real axis, they can be extended to the complex plane and evaluated on the imaginary axis in the saddle point approximation, in which case {iV̂^l, iV^l, iQ^l} are real variables. Other observables can be computed by resorting to the effective measure M^l once the saddle point is obtained; e.g., the mean activations are given by Eq. (37) [33]. Since the covariance matrix Σ^l(q^{l−1}, ...) depends on the order parameters of layer l − 1, the effective measure M^l at layer l depends on the order parameters {q^{l−1}, ...} of the previous layer, while it depends on the conjugate order parameters {iQ^l, ...} of the current layer. We then observe that the order parameters {q^l, ...} propagate forward in layers, while the conjugate parameters {iQ^l, ...}, encoding the randomness leading to the desired deviation, propagate backward, which resembles the structure of optimal control problems [34]. Therefore, we solve the saddle point equations in a forward-backward iterative manner until convergence. Another feature to notice in Eq. (36) is the dependence of the saddle point solution on the layer-shape parameters {α^l}, which do not play a role in the mean field solutions, where all the conjugate order parameters {iQ^l, ...} vanish [16].

Explicit solutions for sign and ReLU activation functions
For networks with the sign activation function, the order parameters satisfy v̂^l = v^l = 1, such that the only meaningful order parameters are {q^l, Q^l}. The potential function Φ can be computed analytically, and the corresponding saddle point equations follow. Note that q^L in Eq. (40) is an input parameter. For networks with the ReLU activation function, the potential function Φ also admits an explicit expression, involving the 2 × 2 matrices A^l, B^l, C^l. The saddle point equations then admit a closed-form expression accordingly.

Large deviations in input sensitivity of functions
In probing the sensitivity of a function to the flipping of input variables, the weights of the two networks take the same values, w = ŵ, which is achieved by setting η^l = 0 in Eq. (6). We constrain the input s^0 of the perturbed system to have a predefined overlap q^0 (or Hamming distance N_0(1 − q^0)/2) with the input ŝ^0 of the reference system. The sensitivity of the output overlap to input perturbations is investigated through the conditional probability P(q^L | q^0) = P(q^L, q^0)/P(q^0). Without loss of generality, we choose a factorized input distribution P(ŝ^0, s^0) = Π_i P(ŝ^0_i, s^0_i), while the delta function involving q^0 in Eq. (44) constrains the systems to have the desired input correlation. The probability of the input overlap, P(q^0), can be computed by a saddle point approximation in the large N_0 limit, with the corresponding potential function defined in Eq. (46) and the saddle point solution iQ^{0*} given in Eq. (47). The computation of the joint probability P(q^L, q^0) is analogous to that of P(q^L) in the earlier sections.
The saddle point of iQ^0 satisfies iQ^{0*} = −tanh^{−1}(q^0), which coincides with that of P(q^0) in Eq. (47). The conditional distribution therefore satisfies a large deviation principle, where the saddle point solution {Q*, q*, ...} has the same form as in Sec. 4.3, except that q^0 = 1 in Eq. (33) is replaced by the predefined value q^0 under investigation.

Weight sparsification
We first consider the effect of weight perturbation by sparsifying connections as in Eq. (15). For a concrete example, we consider DNN with L = 4, uniform layer width α^l = 1 and disconnection probability p^l = 1/2, for which we compute the large deviation rate function I(q^L) = Φ(Q*, q*, ...|q^L). The results are shown in Fig. 3(a) and (b), which exhibit a perfect match between theory and simulation. The most probable q^L, located at the minimum of Φ, corresponds to the mean field solution, where q^L_mf ≈ 0.047 for sign-DNN and q^L_mf ≈ 0.266 for relu-DNN. However, finite systems have a non-zero probability of admitting a higher value of q^L due to fluctuations. We can compute the probability from the rate function by P(q^L) = exp(−NΦ*(q^L))/Z and estimate the tail probability of output mismatch. As an example, we consider N = 64 and find that P(q^L > 1/2) ≈ 0.055% for sign-DNN and P(q^L > 1/2) ≈ 3.8% for relu-DNN, which is non-negligible, especially for ReLU activation.¶ In Fig. 3(c), we also demonstrate that the approximation of the rate function Ī(q̄^L) of the output overlap q̄^L, estimated for M patterns by employing Eq. (25), is accurate for DNN with sign activation, while the approximation does not hold for deep ReLU networks (see Appendix C). Therefore, in sign-DNN, the probability of finding a perturbed DNN agreeing with the reference DNN on all M patterns decays exponentially with M (at least for small M values). This may not be the case in relu-DNN, which requires further exploration in a future study.
In Fig. 3(d), we compare the mean field output overlaps q^L_mf of DNN with sign and ReLU activations for different system depths and disconnection probabilities p^l. It is shown that relu-DNN are more robust to weight sparsification, as expected; the perturbed relu-DNN retain residual correlations with the reference networks even after removing 90% of the weights. The robustness of relu-DNN to weight dilution was also observed and theoretically analysed in [35]. Finally, we remark that our scenario is different from the practical methods used to prune networks trained on specific data, where particular heuristic rules have been developed to disconnect weights instead of the random removal used here. The success of weight pruning in practice highlights the weight redundancy in real trained networks [24,35], but may also be influenced by properties of the data used and training methods. Such behaviour is absent in random networks with random data, as indicated in the inset of Fig. 3(d), where even a small dilution probability can deteriorate the overlap. Additional modelling considerations are needed to address practical scenarios.

Weight binarization
We then consider the effect of perturbation by binarization of the weight variables as in Eq. (16). Also here we consider uniform layer width α^l = 1. The results are shown in Fig. 4; see in particular Fig. 4(d). For finite N_L, the output overlap is a discrete variable, q^L ∈ {1, 1 − 2/N_L, 1 − 4/N_L, ..., −1}, so it is convenient to consider the discretized probability distribution of q^L, Prob(q^L) = P(q^L)Δq^L = exp(−NΦ*(q^L))Δq^L/Z, where the normalization constant is computed as Z = Σ_k exp(−NΦ*(q^L_k))Δq^L, the summation runs over all possible values of q^L and Δq^L = 2/N_L. Although we could not find the saddle point solution of Φ(...|q^L) in the vicinity of q^L = −1 for relu-DNN (see Fig. 3(b)), the contribution from that region to the cumulative probability of the overlap is negligible.

¶ Notice that such an estimation is obtained by the saddle point approximation in Eq. (22) and by keeping the leading order contribution, which may be slightly biased for small N.
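The discretized distribution and a tail estimate of this kind can be sketched as follows, using a hypothetical quadratic rate-function profile as a stand-in for the numerically obtained Φ*:

```python
import numpy as np

N = 64                                   # system size, with N_L = N output nodes
q = np.arange(-N, N + 1, 2) / N          # discrete overlap values q^L
dq = 2.0 / N                             # spacing Delta q^L

# stand-in rate function: a hypothetical quadratic profile around the mean
# field solution (the paper obtains Phi* numerically from the saddle point)
q_mf = 0.266
Phi = 0.5 * (q - q_mf) ** 2

weights = np.exp(-N * Phi) * dq
prob = weights / weights.sum()           # discretized distribution Prob(q^L)

tail = prob[q > 0.5].sum()               # estimate of P(q^L > 1/2)
print(tail)
```

With the true Φ* in place of the quadratic stand-in, the same few lines reproduce the tail probabilities quoted above.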

Sensitivity to input perturbation
We have shown that relu-DNN with random weights are robust to parameter perturbations such as weight sparsification and weight binarization, which is a desired property for better generalization. On the other hand, such network ensembles typically represent simple functions as studied in [21,22]. The simplicity of the functions generated is one reason accounting for the observed robustness to parameter perturbation.
To probe the function complexity, we study the function sensitivity under input perturbation while keeping w = ŵ [28]. Flipping n input variables corresponds to the input overlap q^0 = 1 − 2n/N_0. In Fig. 5(a) and (b) we depict the overlap q^L_mf of the final output as a function of the input overlap q^0 (keeping in mind that we always apply the sign activation in the output layer). While the outputs become more de-correlated in deeper layers of sign-DNN, the relu-DNN induce correlations at deeper layers. Therefore, random relu-DNN tend to forget the input structure at deeper layers, generating increasingly simpler functions that are robust to parameter perturbation. This phenomenon has been noticed in the Gaussian process-like analyses of DNN [10,11,12].
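This input-sensitivity experiment can be reproduced by direct Monte Carlo simulation of the two networks with shared weights; the width, depth, input overlap and trial count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L, trials = 300, 4, 30

def propagate(weights, s0, hidden):
    # recursive mapping, with the sign function applied at the output layer
    s = s0
    for l, W in enumerate(weights):
        h = W @ s / np.sqrt(len(s))
        s = np.sign(h) if l == len(weights) - 1 else hidden(h)
    return s

def mean_output_overlap(hidden, q0):
    n_flip = int(round(N * (1.0 - q0) / 2))   # q^0 = 1 - 2n/N_0
    overlaps = []
    for _ in range(trials):
        w = [rng.normal(0.0, np.sqrt(2.0), size=(N, N)) for _ in range(L)]
        s0 = rng.choice([-1.0, 1.0], size=N)
        s0_pert = s0.copy()
        s0_pert[rng.choice(N, size=n_flip, replace=False)] *= -1
        overlaps.append(np.mean(propagate(w, s0, hidden)
                                * propagate(w, s0_pert, hidden)))
    return float(np.mean(overlaps))

q_sign = mean_output_overlap(np.sign, q0=0.8)
q_relu = mean_output_overlap(lambda x: np.maximum(x, 0.0), q0=0.8)
print(q_sign, q_relu)   # relu-DNN retain a substantially higher output overlap
```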
In [16], we investigated the effect of weight correlation, where the covariance matrix of the weights is composed of the identity matrix I and the all-one matrix J, with correlation strength c. We found that DNN with ReLU activation functions and negative weight correlation c < 0 are more sensitive to parameter perturbation. Here we examine the sensitivity of relu-DNN to input perturbation by employing the same results developed in [16]. In Fig. 5(c) and (d), we depict the mean field output overlap q^L_mf as a function of the input overlap q^0. It is observed that negative weight correlation corresponds to a higher sensitivity to input perturbation, indicating that relu-DNN with negatively correlated weights generate more complex functions than those with random or positively correlated weights. We conjecture that negative weight correlation develops in very deep ReLU networks when they are trained to perform complex tasks where a high expressive power is needed, a phenomenon that has been observed in [36].

In Fig. 6, we further investigate deviations from the typical behaviors in the presence of input perturbations for the specific example with L = 4, α^l = 1. The rate functions Φ(q^L) depicted in Fig. 6(a) and (b) dictate the rate of convergence to the typical behaviors with increasing N by the large deviation principle, for sign and ReLU activations respectively. In Fig. 6(c), we observe that the rate functions have similar trends in the vicinity of the mean field solution q^L_mf for different levels of input perturbation (corresponding to different q^0) in sign-DNN, while they are more distinctive in relu-DNN, as seen in Fig. 6(d). In relu-DNN, a smaller input perturbation (larger q^0) leads to a smaller variance of q^L around q^L_mf. The rate function of relu-DNN is also more asymmetric around q^L_mf, suggesting that large deviations will be observed more often below q^L_mf than above it.
This indicates that random relu-DNN of finite size may produce functions that are slightly more complex than expected from the mean field solutions, which remains to be verified.
We also examine the dominant trajectories across layers leading to particular deviations by monitoring the correlations of activations between the two systems across layers. The relevant quantity is the correlation coefficient

ρ^l = (q^l − m̂^l m^l) / √( (v̂^l − (m̂^l)²)(v^l − (m^l)²) ),

where the mean activations m̂^l and m^l are computed by Eq. (37). We find that sign-DNN satisfy m̂^l = m^l = 0 and v̂^l = v^l = 1, such that ρ^l = q^l in this case. The results are shown in Fig. 6(e) and (f), which suggest that the deviations of q^L from the typical value q^L_mf are mainly contributed by deviations at the later layers. Lastly, we investigate the effect of the DNN architecture on the deviation. In particular, we consider a single bottleneck at a hidden layer l′ (0 < l′ < L) with α^{l′} = 1/8, while all other layers satisfy α^l = 1 for l ≠ l′. Placing the bottleneck at a later layer introduces a higher variability of the output overlap q^L, as seen from the smaller values of the rate function in Fig. 7; this effect is prominent in sign-DNN, while it is much less noticeable in relu-DNN.

Discussion
By utilizing large deviation theory coupled with path integral analysis, we derive the sensitivity of finite size random DNN under parameter and input perturbations. Random DNN with sign or ReLU activation functions are shown to satisfy the large deviation principle, where the rate functions govern an exponential decay of the deviation from the mean field behaviors as the size of the system increases. We also investigate the effects of weight sparsification and binarization of random DNN, and uncover their equivalence to a rotation of the weight vectors in high dimension. Random DNN with ReLU activation function are found to be robust to these parameter perturbations, which is attributed to the low complexity of the corresponding function mappings. Randomly initializing the weights of ReLU DNN places a prior on simple functions, while such networks retain the capacity to compute more complex functions with specifically trained weights. The next important question is how the networks adapt to perform complex tasks through the training process.
Appendix A. Disorder average for weight sparsification

For weight sparsification in Eq. (15), the disorder average in Eq. (14) can be computed in an analogous manner, where we have made use of the large N_l approximation.

Appendix B. Disorder average for weight binarization
For weight binarization in Eq. (16), the disorder average in Eq. (14) can be computed similarly, where the large N_l approximation has been employed.

Appendix C. Large deviation in the multiple-pattern scenario
Consider the function similarity estimated over multiple patterns, q̄^L = (1/M) Σ_{µ=1}^{M} q^{L,µ}, where ŝ^{L,µ}_i(ŝ^{0,µ}) is the ith output of the reference network for the µth input ŝ^{0,µ}, drawn independently and identically from the input distribution P(s^0). In the small fluctuation regime, where each q^{L,µ} is close to the mean field solution q^L_mf, we have I(q^{L,µ}) ≈ (1/2) I″(q^L_mf)(q^{L,µ} − q^L_mf)² (both I(q^L_mf) and I′(q^L_mf) vanish [30]), i.e., P(q^{L,µ}) can be approximated by a Gaussian density with variance 1/(N I″(q^L_mf)). Since the M inputs are independent, we assume the outputs are also approximately independent (which holds for sign-DNN but not necessarily for relu-DNN, since the ReLU nonlinearity can induce correlations among variables), such that the variance of q̄^L is 1/(MN I″(q^L_mf)). Therefore, in the vicinity of q^L_mf, the corresponding rate function differs from the single pattern one by a factor of M. More formally, one can directly compute the probability density P(q̄^L), as in Eq. (C.4), which can be factorized over sites as before. However, we now have O(LM²) order parameters, while there are only O(L) order parameters in the single pattern case. To further simplify the calculation, we assume a symmetric structure of the cross-pattern overlaps at the saddle point, q^{l,µν} = q^{l,∥} δ_{µν} + q^{l,⊥}(1 − δ_{µν}), where q^{l,∥} and q^{l,⊥} are the diagonal and off-diagonal matrix elements, respectively. Under this assumption, one can in principle evaluate the integral in Eq. (C.4), but the resulting calculation becomes rather involved. Alternatively, since the M input patterns are independent, we expect the diagonal elements of the matrix q^{l,µν} to be larger than the off-diagonal elements (a sum of correlated variables vs a sum of weakly correlated ones). In particular, for sign activation we expect q^{l,∥} ∼ O(1) and q^{l,⊥} ∼ O(1/√N_l), since q^{l,⊥} involves a summation over weakly correlated positive and negative numbers.
We therefore approximate the summation Σ_{µν}[...] in the exponential of Eq. (C.5) by its diagonal part Σ_{µ=ν}[...], which yields M N_l uncoupled identical integrals at each layer l. This eventually leads to the rate function of the multiple-pattern overlap q̄^L as Ī(q̄^L) ≈ MΦ(Q*, q*, ...|q̄^L), where Φ(Q*, q*, ...|q^L) is the rate function of the single-pattern overlap q^L. While the off-diagonal elements of q^{l,µν} have smaller values, there are more of them (M(M − 1) off-diagonal terms compared to M diagonal terms in the summation Σ_{µν}[...] in the exponential of Eq. (C.5)), so we expect the above approximation to hold only for small M. The above argument may fail for the ReLU activation, since ŝ^{l,µ}_j, s^{l,µ}_j are always non-negative, and therefore q^{l,⊥} ∼ O(1). In Fig. C1, we compare the approximate theoretical result Ī(q̄^L) ≈ MΦ(Q*, q*, ...|q̄^L) to numerical simulations in the scenario of weight sparsification with disconnection probability p^l = 1/2. We observe a good match between the two approaches for sign-DNN, validating the de-correlation assumption for M patterns. For relu-DNN, the theory gives a good prediction for shallow networks with L = 2, but deteriorates for deeper networks; this suggests the importance of the cross-pattern order parameters q^{l,⊥} in this case, whose detailed treatment is beyond the scope of this work.

Figure C1. The rate function Ī(q̄^L) of the output overlap q̄^L defined for M patterns, for DNN with different activation functions and system depths, in the scenario of weight sparsification with disconnection probability p^l = 1/2. Solid lines correspond to theoretical results and dashed lines with circle markers to estimates from simulation.