Decomposing neural networks as mappings of correlation functions

Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions. This allows us to identify essential statistics in the data, as well as different information representations that can be used by neural networks. Applied to an XOR task and to MNIST, we show that correlations up to second order predominantly capture the information processing in the internal layers, while the input layer also extracts higher-order correlations from the data. This analysis provides a quantitative and explainable perspective on classification.


I. INTRODUCTION
Recent years have shown a great success of deep neural networks in solving a wide range of tasks, from image recognition [1] to playing Go [2]. One major branch is supervised learning, where input-output mappings are learned from examples. In many common problems the target output values are given by a finite set, defining a classification task [3]. The objective then is to minimize an error measure between the correct class label and the prediction made by the neural network with respect to the joint probability distribution of data samples and class labels [4]. Thus, training dynamics, and consequently the solution strategy implemented by the network, depend on this probability distribution and the information it encodes. In this view, a network implements a transformation of the input distribution with the objective to concentrate the output distribution around the assigned target values for each class. How such a transformation is achieved and how the network training depends on the statistics of the presented data is, however, still mostly unknown.
To render the decision-making process of neural networks transparent, a profound understanding of their functional principles and of the extraction of meaningful features from given data is required. Over the past years the discrepancy between success in applications and limited understanding has led to an increased interest also in the theoretical community [4][5][6][7][8][9]. An important line of theoretical research investigates ensembles of neural networks in the limit of infinite width, for which the central limit theorem implies an exact equivalence to Gaussian processes (GPs) [10][11][12][13]. While this approach is informative with respect to how relations between data samples are transformed by the network, it does not reveal how the internal structure of data samples is processed. As an example, for image classification the Gaussian process view takes into account the relation between all corresponding pixels x^α_i, x^β_i of any pair of images α, β in the form of a scalar product ∑_i x^α_i x^β_i. Even though the data statistics shape the eigenfunctions of the GP's covariance matrix [14, Sec. 4.3], it is not obvious which role is played by the structure within individual images, determined e.g. by correlations between pixel values x^α_i and x^α_j. In particular, due to the rotational symmetry of the scalar product, the GP view gives identical results when image pixels are shuffled consistently across all images. However, the internal structure of data samples clearly contains important information that may be employed to solve a given task. The focus of the present study is to investigate how this information is extracted from the data and utilized by the network to perform classification.

* ki.fischer@fz-juelich.de
Other approaches [15][16][17], similarly to GPs, focus on ensembles of networks with randomly drawn weights. In contrast, here we study how particular realizations of trained and untrained networks process different statistical features of the data. Thus, we shift the perspective from distributions over network parameters to distributions over the data. In particular, we describe the input-output mapping implemented by deep neural networks in terms of correlation functions. To trace the transformation of correlation functions across layers of neural networks, we make use of methods from statistical field theory [18][19][20][21] and obtain recursive relations in a perturbative manner by means of Feynman diagrams. Our results yield a characterization of the network as a non-linear mapping of correlation functions, where each layer exchanges information between different statistical orders. Re-expressing the loss function in terms of data correlations allows us to study their role in the training process, and to link the transformation of data correlations to the solution strategies found by the network. For the particular example of the mean squared error loss function, we show that network training relies exclusively on the first two cumulants of the output (mean and covariance), while these, in turn, are predominantly determined by the means and covariances of network activations in previous layers. Furthermore, we show that corrections from higher-order correlations to mean and covariance, which are readily computable with the proposed generic field-theoretical framework, are of greatest importance in the first layer, where these corrections effectuate the information flow from higher-order correlations to mean and covariance.

arXiv:2202.04925v2 [cond-mat.dis-nn] 1 Dec 2022
The structure of this study is as follows: Section II provides theoretical background on the definition and architecture of deep neural networks (Section II A), on empirical risk minimization in the context of classification (Section II B), and on field-theoretical descriptions of probability distributions in terms of cumulants and their generating function (Section II C). In Section III we decompose the network mapping into correlation functions, tracing their transformations backwards through the network. We start by relating the loss to the first- and second-order correlations of the network outputs (Section III A), then discuss the mapping of correlations by individual hidden layers (Section III B), and end with the extraction of data correlations by the input layer (Section III C). Section IV applies these theoretical tools to several example data sets. We start with an adaptation of the XOR problem, where the input statistics are fully known and selectively presented to the network to study different encoding and processing schemes of class identities (Section IV B). We proceed with an application to the MNIST data set [22], where we show that classification performance is largely based on the transformation of means and covariances across layers (Section IV C). Finally, we showcase the importance of higher-order correlations and their extraction in the input layer by constructing a data set where information on class identity is only encoded in correlations of third and higher order (Section IV D). In Section V we discuss our results and provide an outlook.

A. Feed-forward network architecture
We consider fully-connected neural networks with L layers of N_l neurons each, and one additional linear readout layer, as shown in Fig. 1a. Each layer l = 1, ..., L consists of an affine transformation parameterized by a weight matrix W^l ∈ R^{N_l × N_{l−1}} and a bias vector b^l ∈ R^{N_l}. This step is followed by the pointwise application of a non-linear activation function φ, yielding

z^l = W^l y^{l−1} + b^l ,     (1)
y^l = φ(z^l) .     (2)

Here y^0 = x ∈ R^{N_0} denotes the input data of dimension N_0. The readout layer produces the network output y ∈ R^{d_out}, specifically y_i = z^{L+1}_i. The network mapping y = g(x; θ) is given by iterating over network layers and is characterized by the parameters θ := {W^l, b^l}_{l=1,...,L+1}.
We initialize all network parameters randomly before training from i.i.d. centered Gaussians, W^l_ij ∼ N(0, σ_w²/N_{l−1}) and b^l_i ∼ N(0, σ_b²). The scaling of the weight variance σ_w²/N_{l−1} is chosen such that the covariance of z^l (see Eq. (1)) is independent of the layer width.
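As a concrete illustration, the initialization scheme above can be sketched in NumPy; `sigma_w`, `sigma_b`, and the layer sizes here are placeholder choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, sigma_w=1.0, sigma_b=0.1):
    """i.i.d. centered Gaussian initialization; the weight variance is
    scaled by 1/n_in so that pre-activation statistics are width-independent."""
    W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
    b = rng.normal(0.0, sigma_b, size=n_out)
    return W, b

# The variance of z = W y + b stays ~ sigma_w^2 + sigma_b^2 as n_in grows.
for n_in in (100, 1000):
    W, b = init_layer(n_in, 200)
    z = W @ np.ones(n_in) + b   # unit-scale activations from the previous layer
    print(n_in, np.var(z))
```

Without the 1/n_in scaling, the pre-activation variance would grow linearly with the width of the previous layer.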

B. Learning theory: empirical risk minimization
The fundamental assumption underlying classification¹ is the existence of a joint distribution p(x, t) of data samples x and class labels t that is the same for training and evaluation [3]. By Bayes' theorem, the distribution of the input data can be treated as a mixture model p(x) = ∑_t p(t) p(x|t). The network's task is then to implement a mapping g: x ↦ y that minimizes the expectation of a loss ℓ(y, t) between the network outputs y = g(x; θ) and the labels t.
This mapping, in turn, induces a mapping of the probability distributions for each label t,

p(y|t; θ) = ∫ dx δ(y − g(x; θ)) p(x|t) ,     (3)

where δ(·) refers to the Dirac delta distribution. The unconditioned output distribution is then the weighted sum p(y) = ∑_t p(t) p(y|t; θ). Ideally, the network output y matches the true label t, so that the target distribution is given by p(y|t) = δ(y − t).
Training algorithms seek to minimize the expected loss or risk functional [23]

R(θ) = ∑_t p(t) ⟨ℓ(y, t)⟩_{y|t;θ} ,     (4)

where the expectation value ⟨·⟩_{y|t;θ} is taken with regard to the class-conditional output distributions p(y|t; θ). In general, neither the mixture components of the input distribution p(x|t) nor the induced class-conditional output distributions p(y|t; θ) are known. Instead, the expected loss is replaced by the empirical loss or risk

R_emp(θ) = (1/D) ∑_{d=1}^{D} ℓ( g(x^(d); θ), t^(d) ) ,     (5)

evaluated for a training set {(x^(d), t^(d))}_d, with D being its size and d the respective sample index. The empirical risk minimization principle then assumes the following: the mapping g(·; θ*) that minimizes the empirical risk, θ* = argmin_θ R_emp(θ), yields an expected risk R(θ*) that is close to its minimum min_θ R(θ) [23].
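As a minimal sketch of the empirical risk above, with a toy one-parameter model rather than the architecture of Section II A:

```python
import numpy as np

def empirical_risk_mse(g, theta, xs, ts):
    """Empirical risk: average loss over the training set, standing in
    for the (unknown) expected risk."""
    return np.mean([np.sum((g(x, theta) - t) ** 2) for x, t in zip(xs, ts)])

# Toy check with a linear one-parameter model g(x; theta) = theta * x.
g = lambda x, theta: theta * x
xs = np.array([1.0, 2.0, 3.0])
ts = np.array([2.0, 4.0, 6.0])
print(empirical_risk_mse(g, 2.0, xs, ts))  # 0.0: theta = 2 fits the data exactly
```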

C. Parameterization of probability distributions in terms of cumulants
This section contains the framework to track data correlations (cumulants) of arbitrary order through the network. Large parts of the main text deal with the first two orders, the Gaussian approximation. Readers who want to obtain an overview of the main results may skip the remainder of this section at first read. This part, however, becomes essential when including non-Gaussian corrections and as a means to obtain an intuitive picture of how non-linear transformations couple the different statistical orders.
Neural networks can be regarded as complex systems that generate many interactions between data components. A common approach to investigate such systems is by studying generating functions of moments or cumulants rather than the probability distributions themselves. Cumulants often provide a more convenient parameterization of probability distributions as they are additive with respect to addition of independent variables, leading to simpler expressions for the transformation of statistics across layers. Such approaches are common in statistical physics and in mathematical statistics.
The network mapping g: x ↦ y relates the cumulant generating function of the network outputs y to the statistics of the input x:

G_{y|t;θ}(j) = ln ⟨ exp(jᵀ y) ⟩_{y|t;θ} = ln ⟨ exp( jᵀ g(x; θ) ) ⟩_{x|t} .     (6)

The cumulant generating function is considered per class t, as the data statistics are expected to differ between classes. The class-conditional output cumulant of order n, denoted by G^{(n)}_{y|t;θ}, is then defined as

G^{(n)}_{y|t;θ, (r_1,...,r_n)} = ∂ⁿ G_{y|t;θ}(j) / ∂j_{r_1} ⋯ ∂j_{r_n} |_{j=0} .     (7)

Evaluating Eq. (7) would in principle allow one to relate G^{(n)}_{y|t;θ} to the input cumulants G^{(n')}_{x|t}. However, one intricacy is that the network mapping g(x; θ) is given via the iterations in Eq. (2). Their iterative non-linear nature makes deep neural networks powerful as universal function approximators, but complicates their analysis in terms of data processing. Yet, we can study the transformation of cumulants from input to output by considering layers individually.
Since the pre-activations z^l are determined by affine linear transformations, the cumulant generating function of the pre-activations z^l in layer l is trivially related to the cumulant generating function of the post-activations y^{l−1} of layer l−1 as

G_{z^l}(j) = jᵀ b^l + G_{y^{l−1}}( (W^l)ᵀ j ) ,     (8)

yielding for the first-order cumulant (n = 1)

G^{(1)}_{z^l} = W^l G^{(1)}_{y^{l−1}} + b^l ,     (9)

and for second- and higher-order cumulants (n ≥ 2)

G^{(n)}_{z^l, (r_1,...,r_n)} = ∑_{s_1,...,s_n} W^l_{r_1 s_1} ⋯ W^l_{r_n s_n} G^{(n)}_{y^{l−1}, (s_1,...,s_n)} .     (10)

Each index s_i is hence contracted with one factor W^l_{r_k s_i} to produce the index r_k of the resulting cumulant. Consequently, cumulants of pre-activations z^l are linear tensor transformations of cumulants of post-activations y^{l−1} of the same order. The non-linear activation function φ in each layer l then relates the pre-activations z^l to the corresponding post-activations y^l:

G_{y^l}(j) = ln ⟨ exp( jᵀ φ(z^l) ) ⟩_{z^l} .     (11)

This cumulant generating function of the post-activations y^l cannot, in general, be computed exactly. One common approximation technique is a perturbative expansion [21], which we here recast in the following way: by replacing φ(z^l) in Eq. (11) with its Taylor expansion and treating the non-linear terms (m > 1) as perturbations, we can construct cumulants G^{(n)}_{y^l} as series of Feynman diagrams composed of the graphical elements shown in Table I. For example, for the mean of the first layer we get diagrams whose expressions read (for φ(0) = 0)

µ_{y¹, i} = φ^{(1)}(0) µ_{z¹, i} + (1/2) φ^{(2)}(0) [ µ_{z¹, i}² + G^{(2)}_{z¹, ii} ] + ... .
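The statement that the affine step maps cumulants linearly, order by order, can be verified numerically for the first two orders; a small NumPy sketch with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Affine layer z = W y + b: cumulants transform order by order.
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
y = rng.normal(size=(200_000, 4)) @ rng.normal(size=(4, 4))  # correlated "activations"
z = y @ W.T + b

G1_y, G2_y = y.mean(axis=0), np.cov(y.T)
# First cumulant: G1_z = W G1_y + b; second cumulant: G2_z = W G2_y W^T.
print(np.allclose(z.mean(axis=0), W @ G1_y + b))   # True: exact identity on samples
print(np.allclose(np.cov(z.T), W @ G2_y @ W.T))    # True: exact identity on samples
```

For the first two orders the identity holds exactly on samples, not just in expectation, since both sides are the same linear function of the data.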
We find that in general these expressions involve two types of factors, which we represent with two types of vertices: empty circles with internal lines, representing cumulants G^{(n)}_{z^l} of the pre-activations z^l, and hatched circles with one external line j, which stem from the Taylor coefficients φ^{(m)}(0)/m!. For constructing a cumulant G^{(n)}_{y^l} of the post-activations y^l of order n, we need to determine all diagrams with n external lines. External lines occur on cumulant vertices as well as on hatched vertices. Furthermore, they always need to be connected to a cumulant vertex, but cannot be connected to one another. Finally, due to the linked cluster theorem, only connected diagrams need to be considered, since others do not contribute to cumulants. When evaluating the generated diagrams, all permutations of the indices (r_1, ..., r_n) for both internal and external lines need to be taken into account. Symmetries within diagrams result in their repeated occurrence, which is reflected in combinatorial pre-factors (for more details, see [21]).
Using this perturbative approach for determining the cumulants G (n) y l of the post-activations y l has two main advantages: First, it provides a principled way to go beyond Gaussian statistics and include higher-order cumulants. Second, the availability of a diagrammatic language allows us to graphically represent the information transfer from cumulants G (n) z l of the pre-activations z l to cumulants G (m) y l of the post-activations y l . The diagrammatic representation introduced above assumes that the activation function φ can be expanded as a Taylor series. For non-differentiable functions such as ReLU, this approach can be adapted by using a Gram-Charlier expansion of the probability distribution p(z l ). The expectation value in Eq. (11) then becomes a sum of Gaussian integrals, which can be calculated either analytically (see Appendix C for ReLU as an example) or numerically.

III. DECOMPOSING DEEP NEURAL NETWORKS INTO CORRELATION FUNCTIONS
Analyzing how deep networks process data is difficult due to their iterative, parameter-dependent definition. Statistical learning theory studies the expected error [24], thus shifting from the transformation of data samples to that of data distributions. We follow this idea here by studying how data correlations of the input are iteratively transformed by deep networks, as illustrated in Fig. 1, and how they shape the expected loss.

A. Data correlations drive network training
We here discuss the dependence of the expected loss in Eq. (4) on the data correlations. In general, the expected risk R(θ) is a function of the class labels t and the class-conditional cumulants G^{(n)}_{y|t;θ} of arbitrary orders n. However, for the often employed mean squared error ℓ_MSE(y, t) = ‖y − t‖², R(θ) only depends on the mean µ^t_y and covariance Σ^t_y of the outputs of each class t as

R_MSE(θ) = ∑_t p(t) [ ‖µ^t_y − t‖² + Tr Σ^t_y ] .     (13)

Training therefore aims to match class means and labels, while minimizing the variance of each class's output.² In this case, the first- and second-order cumulants (mean and covariance) of the last layer alone drive network training, thus singling these out as the relevant statistics. This result has two implications: 1.) In deep feed-forward networks, only non-Gaussian statistics that appear in network layers before the final layer can contribute to the learned information processing, by influencing the first two cumulants in the final layer. 2.) If networks produce non-Gaussian statistics in the final layer, these do not serve a functional role per se; rather, they may arise as a by-product of earlier layers operating on higher-order statistics.
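The decomposition of the expected MSE loss into a mean-matching term and a variance term can be checked by Monte Carlo; a sketch with an arbitrary Gaussian output distribution:

```python
import numpy as np

rng = np.random.default_rng(2)

# <||y - t||^2> = ||mu - t||^2 + Tr(Sigma): only mean and covariance enter.
mu, t = np.array([1.0, -0.5]), np.array([1.0, 0.0])
A = np.array([[0.6, 0.2], [0.1, 0.4]])
Sigma = A @ A.T

y = mu + rng.normal(size=(1_000_000, 2)) @ A.T   # samples with mean mu, cov Sigma
mc = np.mean(np.sum((y - t) ** 2, axis=1))
exact = np.sum((mu - t) ** 2) + np.trace(Sigma)
print(mc, exact)  # agree up to Monte Carlo error
```

The identity holds for any output distribution with these first two cumulants, which is exactly why higher-order output statistics drop out of the MSE risk.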
Thus, understanding the network mapping reduces to understanding how the Gaussian statistics (µ^t_y, Σ^t_y) of the output arise from the presented data distribution across multiple network layers. Network training, and the resulting information processing within the network, is therefore directly linked to how data correlations are transformed by the network.

B. Propagation of data correlations within the network
To understand how the extraction of information from the input and its internal processing shape the first- and second-order cumulants of the output, we follow these two quantities backwards through the network. According to Eq. (8)-Eq. (10), the affine transformation in each layer implies for the pre-activations z^l

µ_{z^l} = W^l µ_{y^{l−1}} + b^l ,   Σ_{z^l} = W^l Σ_{y^{l−1}} (W^l)ᵀ ,     (14)

showing that the two quantities are transformed independently of each other in this step (Fig. 1). In general, the non-linear activation function φ: z^l ↦ y^l makes the statistics of the post-activations y^l dependent on cumulants of arbitrary orders in the pre-activations z^l (cf. Section II C). However, due to the central limit theorem, initializing the weights independently causes the affine transformation y^{l−1} ↦ z^l to mainly pass on the Gaussian part of the statistics, since higher-order cumulants

G^{(n)}_{z^l, (i_1,...,i_n)} = ⟨⟨ z^l_{i_1} z^l_{i_2} ⋯ z^l_{i_n} ⟩⟩ ∼ O( (N_{l−1})^{−n/2+1} )

are suppressed by the layer width N_{l−1} for n > 2. In Appendix B we derive sufficient conditions under which the Gaussian approximation remains valid also for wide trained networks. In brief, we find that it suffices to have a natural scaling of the weights w ∼ O(N^{−1/2}) as well as an approximate orthogonal decomposition of the sending layer's covariance matrix by the row vectors of the connectivity to the next layer, Eq. (B7). These conditions are in particular different from those of the lazy (kernel or neural tangent kernel) regimes, where weights only change marginally. Under these conditions, in the limit of infinitely wide networks, expectations over pre-activations ⟨·⟩_{z^l} can be taken with respect to Gaussian distributions z^l ∼ N(µ_{z^l}, Σ_{z^l}), and we obtain that the mean and covariance of the post-activations are non-linear functions of only the mean and covariance of the pre-activations:

µ_{y^l} = f_µ(µ_{z^l}, Σ_{z^l}) ,   Σ_{y^l} = f_Σ(µ_{z^l}, Σ_{z^l}) .     (17)

These functions mediate interactions between first- and second-order cumulants. Applying this argument iteratively to the network layers l = L+1, L, ..., 2, it follows that the information processing in the internal network layers is largely determined by an iterated, non-linear mapping of mean and covariance. The interaction functions f_µ and f_Σ can be calculated numerically for arbitrary activation functions; in particular, φ need not be differentiable. Analytic expressions can be obtained for various activation functions φ; we provide expressions for φ = ReLU and φ(z) = z + α z² in Appendix C, Table II. The latter, minimally non-linear activation function yields especially interpretable interaction functions, constructed from diagrams whose expressions read

µ_{y^l, i} = µ_{z^l, i} + α ( µ_{z^l, i}² + Σ_{z^l, ii} ) ,     (18a)
Σ_{y^l, ij} = Σ_{z^l, ij} + 2α ( µ_{z^l, i} + µ_{z^l, j} ) Σ_{z^l, ij} + 2α² Σ_{z^l, ij}² + 4α² µ_{z^l, i} Σ_{z^l, ij} µ_{z^l, j} .     (18b)

The diagram contributing the term 2α(µ_{z^l, i} + µ_{z^l, j}) Σ_{z^l, ij} corresponds to an expression containing two terms; these result from the permutation of the indices (i, j) (see Section II C). Training introduces correlations between weights, thus violating the independence assumption of the central limit theorem. Also, the sufficient conditions for the Gaussian approximation to be consistent (Appendix B) are not necessary conditions; for example, pairs of neurons may be perfectly correlated without violating a Gaussian description. We will therefore show in the following that, empirically, the first- and second-order cumulants provide a useful approximation for the information propagation within the network.
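For the quadratic activation φ(z) = z + α z², the interaction functions f_µ and f_Σ under the Gaussian approximation can be written out and checked against sampling; a sketch (the expressions follow from Wick's theorem for Gaussian z, and all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def f_mu(mu, Sigma, alpha):
    # Mean of y = z + alpha z^2 for Gaussian z ~ N(mu, Sigma)
    return mu + alpha * (mu ** 2 + np.diag(Sigma))

def f_Sigma(mu, Sigma, alpha):
    # Covariance of y = z + alpha z^2 for Gaussian z (via Wick's theorem)
    return (Sigma
            + 2 * alpha * (mu[:, None] + mu[None, :]) * Sigma
            + 2 * alpha ** 2 * Sigma ** 2
            + 4 * alpha ** 2 * np.outer(mu, mu) * Sigma)

alpha = 0.5
mu = np.array([0.3, -0.2])
A = np.array([[0.5, 0.1], [0.2, 0.4]])
Sigma = A @ A.T

z = mu + rng.normal(size=(1_000_000, 2)) @ A.T
y = z + alpha * z ** 2
print(np.abs(y.mean(axis=0) - f_mu(mu, Sigma, alpha)).max())   # ~ Monte Carlo error
print(np.abs(np.cov(y.T) - f_Sigma(mu, Sigma, alpha)).max())   # ~ Monte Carlo error
```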

C. Information extraction in the input layer
So far we have studied the internal network layers. Here, we discuss the role of the input layer in extracting information from higher-order correlations of the input data. Since the pre-activations of this layer, z¹ = W¹ x + b¹, inherit the cumulants of the input data directly, higher-order cumulants are suppressed only by the input dimension N_0 and need to be taken into account for smaller input dimension N_0. In consequence, cumulants of multiple orders n contribute to the mean and covariance of the post-activations y¹:

µ_{y¹} = h_µ( { G^{(n)}_{z¹} }_n ) ,   Σ_{y¹} = h_Σ( { G^{(n)}_{z¹} }_n ) .     (19)

These mean and covariance are then passed on through the entire network. The interaction functions h_µ and h_Σ can be systematically approximated for any activation function, either by the diagrammatic techniques discussed in Section II C in the case of differentiable functions, or alternatively by a Gram-Charlier expansion for non-differentiable functions (see Appendix C for ReLU as an example). Analytically simple and exact expressions can be computed for the quadratic non-linearity (see Eq. (18a)); in this case, the expression for the mean does not get any contribution from G^{(n>2)}_{z¹}, while the covariance gets additional contributions from third- and fourth-order input correlations:

Σ_{y¹, ij} = Σ_{y¹, ij}|_Gauss + α ( G^{(3)}_{z¹, iij} + G^{(3)}_{z¹, ijj} ) + α² ( 2 µ_{z¹, i} G^{(3)}_{z¹, ijj} + 2 µ_{z¹, j} G^{(3)}_{z¹, iij} + G^{(4)}_{z¹, iijj} ) ,

where Σ_{y¹, ij}|_Gauss denotes the Gaussian contribution from Section III B. As in the previous section, there are two diagrams that each correspond to an expression containing multiple terms; these terms result from the permutation of the indices (i, j) (see Section II C).
The cumulants of the pre-activations G^{(n)}_{z¹} are linked to the cumulants of the input data G^{(n)}_x by a mapping between corresponding orders n, given by Eq. (9) and Eq. (10) with y⁰ = x. Thus, the input layer effectively extracts information from higher-order correlations of the input data G^{(n≥3)}_x and transfers it to the mean and covariance of its post-activations.
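The order-preserving extraction can be illustrated for the third cumulant: for a single unit z = w·x with independent inputs, the third cumulant picks up one weight factor per order. A sketch with non-Gaussian inputs (centered exponentials, whose third cumulant is 2):

```python
import numpy as np

rng = np.random.default_rng(4)

# For z = w . x with independent inputs x_i, cumulants transform with one
# weight factor per order: G3_z = sum_i w_i^3 G3_{x,i}.
w = np.array([0.8, -0.5, 0.3])
x = rng.exponential(size=(1_000_000, 3)) - 1.0   # centered; third cumulant = 2
z = x @ w

G3_mc = np.mean((z - z.mean()) ** 3)             # third cumulant estimate
G3_th = 2.0 * np.sum(w ** 3)                     # linear tensor transform
print(G3_mc, G3_th)  # agree up to Monte Carlo error
```

With correlated inputs the same logic holds, with the sum running over the full third-order tensor G^{(3)}_{x, (ijk)}.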

D. Statistical model of a feed-forward network
Putting together all previous sections, we introduce the statistical model corresponding to a given network model. This model represents the information processing performed by the network in terms of the data correlations {G^{(n)}_x}_n. By iterating Eq. (14), Eq. (19), and Eq. (17), respectively, across layers, one obtains the mean and covariance of the network output y = g(x; θ) as functions of the statistics of x:

µ^t_y = µ^t_y( { G^{(n), t}_x }_n ; θ ) ,     (20)
Σ^t_y = Σ^t_y( { G^{(n), t}_x }_n ; θ ) .     (21)

Since the network decomposes into a mapping for each class label t, one obtains the distribution of the network output as a Gaussian mixture p(y) = ∑_t p(t) N(µ^t_y, Σ^t_y)(y). The parameters (µ^t_y, Σ^t_y) are determined by the propagation of the data correlations {G^{(n), t}_x}_{n, t} through the network, Eq. (20)-(21). Note that these are generally not exact due to the Gaussian approximation of the pre-activations z^l at each intermediate layer.
In the following, we call the mapping

( { G^{(n), t}_x }_{n, t} ; θ ) ↦ { µ^t_y , Σ^t_y }_t     (22)

the statistical model of the network. One important feature is that the statistical model shares the parameter structure θ = {W^l, b^l}_{l=1,...,L+1} with the corresponding network model. In consequence, there is a one-to-one correspondence between the statistical model Eq. (22) and the network model g: (x; θ) ↦ y for a fixed set of parameters θ.
Beyond empirically comparing these two models, the statistical model can be used to assess the relevance of the data correlations {G^{(n), t}_x}_{n, t} for solving a particular task. We have shown in Section III A that the expected mean squared error loss R_MSE({µ^t_y, Σ^t_y}_t) is given by Eq. (13), so that it depends solely on the mean and covariance of the output. By the statistical model Eq. (22), the mean squared error R_MSE can thus be approximated as a function of the data correlations {G^{(n), t}_x}_{n, t} and the network parameters θ:

R_MSE(θ) ≈ R_MSE( { G^{(n), t}_x }_{n, t} ; θ ) .     (23)

Minimizing this loss then yields optimal parameters θ* for the statistical model. The corresponding network model g(x; θ*) then depends on the given set of data correlations {G^{(n), t}_x}_{n, t}, allowing the investigation of their relevance in solving a particular network task.
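A minimal numerical sketch of such a statistical model for the quadratic activation φ(z) = z + α z² (layer sizes, weights, and input statistics are arbitrary, not taken from the experiments): mean and covariance are iterated through affine and non-linear steps and compared with the sampled output of the actual network.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.5                                    # phi(z) = z + alpha z^2

def f_mu(mu, S):                               # Gaussian mean of phi(z)
    return mu + alpha * (mu ** 2 + np.diag(S))

def f_Sigma(mu, S):                            # Gaussian covariance of phi(z)
    return (S + 2 * alpha * (mu[:, None] + mu[None, :]) * S
            + 2 * alpha ** 2 * S ** 2 + 4 * alpha ** 2 * np.outer(mu, mu) * S)

def stat_model(mu, S, params):
    """Iterate mean and covariance through the network: affine step followed
    by the non-linear interaction functions; linear readout at the end."""
    for W, b in params[:-1]:
        mu, S = W @ mu + b, W @ S @ W.T
        mu, S = f_mu(mu, S), f_Sigma(mu, S)
    W, b = params[-1]
    return W @ mu + b, W @ S @ W.T

sizes = [2, 20, 20, 1]
params = [(rng.normal(0, 1 / np.sqrt(m), (n, m)), rng.normal(0, 0.1, n))
          for m, n in zip(sizes[:-1], sizes[1:])]
mu_x, S_x = np.array([0.5, -0.3]), 0.05 * np.eye(2)

# Monte Carlo through the actual network for one Gaussian input class
h = mu_x + np.sqrt(0.05) * rng.normal(size=(500_000, 2))
for W, b in params[:-1]:
    h = h @ W.T + b                            # affine step
    h = h + alpha * h ** 2                     # quadratic activation
W, b = params[-1]
y = h @ W.T + b

mu_y, S_y = stat_model(mu_x, S_x, params)
print(mu_y.item(), y.mean())                   # close
print(S_y.item(), y.var())                     # close up to 1/N corrections
```

The residual deviation reflects the Gaussian approximation of the pre-activations at each hidden layer, which improves with the hidden width.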

IV. EXPERIMENTAL RESULTS
We now apply the developed methods to the XOR problem and the MNIST dataset. We use the network architecture defined in Section II A with fixed network width N l = N for l ≥ 1 and either the ReLU activation function or a minimal non-linearity, namely the quadratic activation function φ(z) = z + α z 2 with α = 0.5.

A. Training details
For the initialization of network parameters θ, we use the i.i.d. Gaussian scheme described in Section II A. Following the standard procedure, networks are trained by optimizing the empirical risk per data batch {(x^(b), t^(b))}_b of the MSE loss,

R_emp(θ) = (1/B) ∑_{b=1}^{B} ‖ g(x^(b); θ) − t^(b) ‖² ,

with batch size B set to 10 on XOR and 100 on MNIST. For optimization, we use Adam [26,27] with learning rate 10⁻³, momenta β₁ = 0.9 and β₂ = 0.999, ε = 10⁻⁸, and weight decay λ = 0. The choice of the optimizer does not affect the above presented derivations. Network implementations were done in PyTorch [28].

B. Multiple information encodings of the XOR problem
We first study an adaptation of the XOR problem as a non-linearly separable Gaussian mixture distribution. We make use of two conceptual advantages of this XOR task: First, knowing the exact input distribution allows us to focus on the internal information processing within the network. Second, the fact that each class is itself a mixture distribution allows us to trace the class-conditional correlations in two alternative forms, corresponding to two different statistical representations of class membership, which isolate different statistics of the input: respectively, the mean and the covariance. We find that while the task can be solved for both representations, they correspond to different local minima of the empirical loss landscape.

Problem setup as a Gaussian mixture
Our adaptation of the XOR problem uses real-valued instead of binary inputs and describes the input distribution as a Gaussian mixture of four components, illustrated in Fig. 2a. For the class label t = +1, we choose the mean values µ^{t=+1, ±}_x of its two components ± in opposite quadrants of the input plane, and correspondingly for t = −1 in the remaining two quadrants. Covariances are isotropic throughout, Σ^{t, ±}_x = 0.05 I, and each sample is assigned its class label based on the mixture component it is drawn from. From the geometry of the problem it follows that the optimal decision boundaries coincide with the axes in data space (Fig. 2a), allowing us to calculate the optimal performance P_opt = 97.5%. We use training and test data sets of sizes n_train = 10⁵ and n_test = 10⁴, respectively.
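The mixture can be sketched as a data generator; the component means (±m, ±m) and the resulting accuracy are illustrative choices, not the exact values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(6)

def xor_mixture(n, var=0.05, m=1.0):
    """XOR as a four-component Gaussian mixture: class t=+1 in quadrants
    I and III, class t=-1 in quadrants II and IV (means are illustrative)."""
    centers = m * np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
    labels = np.array([+1, +1, -1, -1])
    k = rng.integers(0, 4, size=n)
    x = centers[k] + np.sqrt(var) * rng.normal(size=(n, 2))
    return x, labels[k]

x, t = xor_mixture(10_000)
# Optimal decision boundaries are the coordinate axes: sign(x1 * x2) predicts t.
print(np.mean(np.sign(x[:, 0] * x[:, 1]) == t))
```

Because the mixture components overlap across the axes, the optimal accuracy depends on the ratio of the component separation to the isotropic standard deviation.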

Accuracy of internal information processing in terms of correlation functions
Given the exact input distribution for this problem, we trace the transformation of mean and covariance predicted by Eq. (20) and Eq. (21) for each mixture component (t, ±) separately, obtaining the output statistics (µ^{t, ±}_y, Σ^{t, ±}_y) as functions of the input statistics (µ^{t, ±}_x, Σ^{t, ±}_x), the network parameters θ, and the choice of activation function φ. In Fig. 2 we compare this theoretical result to an empirical estimate of the output distribution p_emp.(y), given as a histogram obtained from the test data. We test the validity of the statistical model both for an untrained network with random weight initialization (Fig. 2b) and for a trained network (Fig. 2c).
The untrained network produces an output distribution of complex shape, composed of superimposed close-to-Gaussian distributions, each corresponding to one component, as shown in Fig. 2b. Training the network reshapes the output distribution such that the class-conditional distributions p(y|t) become well separated by the threshold at y = 0, as shown in Fig. 2c. The overlap between these two distributions around the threshold corresponds to the classification error.

[Figure caption: Histograms over the network output y. On average, the trained networks achieve performance values of P = 97.00% ± 0.05% compared to P_opt = 97.5%. Networks were trained to perform the XOR task described in Section IV B 1. Other parameters: φ = ReLU.]

Qualitatively, theory and simulation agree well for both random and trained networks. These results apply for different activation functions φ (see Fig. 8 in Appendix D for φ(z) = z + α z²).
To quantify the alignment of theory and simulation, we compute the Kullback-Leibler divergence D_KL between the empirical estimate p_emp. and the theoretical prediction p_theo.. We average D_KL(p_emp. ‖ p_theo.) across 50 different network realizations, for random (Fig. 3a) and trained (Fig. 3b) networks. In both cases, the deviation between theory and simulation is generally small, but increases mildly with the network depth L as approximation errors accumulate across network layers. For random networks, the deviations are generally small, with a slight decrease for wider networks, in agreement with the central limit theorem as discussed in Section III B. For trained networks, with thus correlated parameters, there is an overall increase of deviations between theory and simulation. Nonetheless, this increase remains modest, showing that the theory continues to be applicable for networks with trained, and thus non-random, parameters. Again, these results apply for different activation functions φ (see Fig. 9 in Appendix D for φ(z) = z + α z²). When evaluating the expressions for ReLU, one needs to take care with the numerics due to the error functions that appear.
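A discretized estimate of D_KL(p_emp. ‖ p_theo.) from a histogram can be sketched as follows (the binning and the sanity-check distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def kl_hist_vs_theory(samples, theo_pdf, bins=100):
    """Discretized Kullback-Leibler divergence D_KL(p_emp || p_theo) between
    a sample histogram and a theoretical density over the sampled range."""
    p, edges = np.histogram(samples, bins=bins, density=True)
    widths = np.diff(edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = theo_pdf(centers)
    mask = p > 0   # empty bins contribute zero to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]) * widths[mask])

# Sanity check: matching samples give D_KL ~ 0, mismatched samples a large value.
gauss = lambda y: np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)
d_match = kl_hist_vs_theory(rng.normal(size=200_000), gauss)
d_shift = kl_hist_vs_theory(rng.normal(2.0, 1.0, size=200_000), gauss)
print(d_match, d_shift)  # small vs. large
```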

Different information coding paradigms and their relations
In the previous section, we have shown that the mapping implemented by the network can be described as a mapping of correlation functions (see Eq. (22)). On the level of data correlations, it directly follows that the network's expressivity with respect to a given task depends on two properties: (1) the ability of the network architecture to implement a desired mapping of data correlations from its input to its output; (2) the way in which information about class membership is represented by data correlations in the input.
A complete study of the first property would be provided by fully describing the space of possible mappings, which is challenging in general. However, the forward mapping of cumulants we have obtained in Section III B allows us to probe this space experimentally, and it provides a path to more systematic studies of network expressivity; see our remarks on statistical receptive fields in Section V.
In this section, we study the second property by investigating two different information representations: (A) the class membership is represented by different means between classes, while the covariances and all higher-order cumulants are identical; (B) the class membership is represented by different covariances between classes, while the means and all higher-order cumulants are identical. Accordingly, these two representations are called mean coding (A) and covariance coding (B) in the following. While each of these two settings confines the class membership to one particular cumulant order, the more general case is that class membership is represented by various orders of statistical moments. In that case, the network may make use of this duplicate information to maximize performance.
To be able to compare these settings, in either case we train models on a single task defined via a single data distribution, but present different statistical representations of the data. We use the statistical model corresponding to the network described in Section III D, limiting input correlations to mean and covariance by setting higher-order cumulants to zero. We take the binary XOR problem (see Section IV B 1), which can be cast into either information representation in a natural way: For mean coding (A), we provide to the network both the class labels t and the specific mixture component ± from which a sample was drawn, yielding four sets of statistics {µ^m_x, Σ^m_x}_{m=(t,±)} with different means but identical covariances. For covariance coding (B), only the class label t is provided to the network, yielding two sets of statistics {µ^m_x, Σ^m_x}_{m=(t)}, for which the covariances Σ^{t=±1}_x differ between the two classes, while their means are the same (see Fig. 4a). In both cases, all higher-order cumulants of the component distributions m = (t, ±) and m = (t), respectively, are set to zero. Note that for mean coding each class distribution is itself still a mixture of two components; the two sets of statistics, for m = (t, ±) and m = (t) respectively, define different statistical models for mean and covariance coding.
We compare these two statistical representations A and B of the network to the network trained directly on batches of samples; the latter we refer to as sample coding in the following. Sample coding can be considered as the case where potentially all statistical moments of the data are accessible to the network. Our goal is to address the following questions: First, which statistical representation most closely matches the information representation used by a network trained on data samples? Second, is there a difference in performance between information representations; in particular, can the network equivalently use the information provided by either mean or covariance coding? Finally, does the network make use of duplicate information in different cumulant orders to improve performance in the case of sample coding?
To answer these questions, we optimize models until convergence using either representation. We then switch to a different representation, continuing optimization for the same number of steps, and observe the stability of the previously found solution. Each experimental setup is repeated with 10^2 different weight initializations. Results are shown in Fig. 4 for three different coding combinations, where we plot both the loss and the magnitude of change ∥∆θ∥_2^2 of the model parameters. We find that after initial optimization all three models correspond to networks with at least P = 91% performance, so training converges in all cases and the networks implement viable solutions before the switch. Thus, the behavior after the switch indicates how the found solution is affected by changing the statistical representation. Furthermore, we observe that immediately after the switch from covariance to mean coding, ∥∆θ∥_2^2 jumps to values similar to the initial training steps (Fig. 4b). This indicates a near complete change of the model, which suggests that mean and covariance coding induce fundamentally different solutions. In contrast, the jump is modest when switching from covariance to sample coding (Fig. 4c), and non-existent when switching from mean to sample coding (Fig. 4d), suggesting that those different solutions coexist in the true loss landscape of the network model. Thus, we find that the network utilizes the presented information in different ways for the two representations, as expected based on the information flow in these networks derived in Section III B.
In particular, the case of covariance coding highlights the importance of a non-linear activation function when the discriminating information is not contained in the class means. Since classification is based on different mean values in the network output, the difference in covariance between the classes needs to be transferred to the mean. This information transfer is mediated by the non-linearity φ; for the case φ(z) = z + α z^2 used in Fig. 4, the transfer function from covariances to means is particularly simple: the mean of the post-activations is µ_i + α (µ_i^2 + Σ_ii). Here, only diagonal entries of the covariance enter, while the input covariances Σ_x^{t=±1} differ in their off-diagonal entries. The information transfer from off-diagonal to diagonal entries is mediated by the affine transformation (see Eq. (14)) prior to the activation function. In this way, we can track how information flows into the mean as it is transformed by successive network layers.
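For the quadratic activation φ(z) = z + α z^2, this covariance-to-mean transfer can be checked directly: for Gaussian z with mean µ and covariance Σ, ⟨φ(z_i)⟩ = µ_i + α (µ_i^2 + Σ_ii), so only diagonal entries of Σ enter. A minimal numerical sketch (the covariance values here are illustrative, not the ones of the task):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
mu = np.zeros(2)
# Two covariances differing only in the sign of the off-diagonal entry
# (illustrative values, not the ones used in the task)
Sigma_p = np.array([[0.3, 0.25], [0.25, 0.3]])
Sigma_m = np.array([[0.3, -0.25], [-0.25, 0.3]])

def post_activation_mean(Sigma, n=400_000):
    """Monte-Carlo estimate of <phi(z)> for phi(z) = z + alpha*z**2."""
    z = rng.multivariate_normal(mu, Sigma, size=n)
    return (z + alpha * z**2).mean(axis=0)

# Analytic prediction: only diagonal entries of Sigma enter the mean
mean_pred = mu + alpha * (mu**2 + np.diag(Sigma_p))

m_p = post_activation_mean(Sigma_p)
m_m = post_activation_mean(Sigma_m)
# Both classes yield the same post-activation mean: the off-diagonal
# difference only becomes visible after an affine layer mixes components.
print(m_p, m_m, mean_pred)
```

The two estimated means agree with the analytic value although the covariances differ, which is precisely why an affine layer is needed before φ can expose the off-diagonal information.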
In summary, we find that for this task the network can effectively utilize the information presented by either mean or covariance coding, with both representations leading to different solutions of comparable performance. Sample coding tends to yield solutions similar to mean coding, implying that the network makes use of duplicate information present in higher-order moments of the data samples from each class.

C. Essential data correlations of the MNIST data set
We consider in this section the MNIST data set [22], consisting of 10 classes of 28 × 28 images. This data set is highly structured: if one approximates each class by a multivariate Gaussian, the resulting samples are already visually recognizable (Fig. 5a,b; see Appendix E for further details). Our goal is to use the theory developed in previous sections to quantify this observation, in a manner that generalizes to different data sets and different sets of input cumulants. We also argue that truncation of cumulants in the input layer has the largest impact, and in the process validate our theory on a nontrivial task.
Concretely, we proceed as follows: when optimizing the parameters θ* of the statistical model, we restrict the data statistics to a particular set of cumulants {G_x^(n)}_{n=1,...,n_max} and compare the achieved performance to that of the network y = g(x; θ*) trained on samples. The difference in performance is then indicative of the importance of the cumulants we kept. We employ one-hot encoding, making the network output d_out = 10 dimensional.
As a baseline, we first train network models on both the MNIST data set (Fig. 5a) and the corresponding Gaussian samples (Fig. 5b). The latter case limits the information that can be extracted by the input layer to the class-conditional means µ_x^t and covariances Σ_x^t. In both cases, networks are trained with the standard empirical loss (Eq. (25)); in particular, this allows inner network layers to make use of cumulants of any order. With respect to classification performance, we find that training on Gaussian samples yields a performance that is lower by ∆P ≃ 2.4% ± 0.7% (Fig. 5c): a difference we can ascribe to the removal of higher-order cumulants in the data distribution. Based on the modest magnitude of this difference, we conclude that data mean and covariance are already highly informative for these data and account for about P ≈ 91%.
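The class-conditional Gaussian approximation used as a baseline can be sketched in a few lines; here a hypothetical stand-in data set replaces MNIST, and the surrogate samples retain only the empirical mean and covariance of each class:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a labeled image data set:
# two classes of 64-dimensional vectors (MNIST loading omitted)
data = {t: rng.normal(loc=float(t), scale=1.0, size=(5000, 64)) for t in (0, 1)}

# Class-conditional Gaussian approximation: empirical mean and covariance
stats = {t: (x.mean(axis=0), np.cov(x, rowvar=False)) for t, x in data.items()}

# Surrogate samples retaining only first- and second-order statistics;
# all higher-order cumulants of the surrogate distribution are zero
surrogate = {t: rng.multivariate_normal(mu, Sigma, size=5000)
             for t, (mu, Sigma) in stats.items()}

for t in (0, 1):
    dev = np.abs(surrogate[t].mean(axis=0) - stats[t][0]).max()
    print(t, dev)
```

Training one network on `data` and another on `surrogate` then isolates the performance contribution of cumulants beyond second order.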
We next train the corresponding statistical model (Eq. (22)) on the Gaussian approximation of MNIST (Fig. 5b). Compared to the network model trained on the Gaussian samples corresponding to the same data distribution, we find only slightly lower performance, by about 0.9% ± 0.4% (Fig. 5c), suggesting that the statistical model given by Eq. (22) is a good representation for the information processing in internal network layers. The fact that most of the performance drop with respect to standard training on MNIST is due to the Gaussian approximation of the input data indicates the importance of processing higher-order cumulants by the input layer. In the next section, we show with an illustrative example how these can be included in the theory.

D. Including higher-order correlation functions in the input layer
So far we have studied class-conditional means µ_x^t and covariances Σ_x^t of the input data; however, these two statistics may not always be informative. It is in fact easy to construct a low-dimensional task with two classes t = ±1 where both class-conditional means and covariances of the data are identical, µ_x^{t=+1} = µ_x^{t=-1} and Σ_x^{t=+1} = Σ_x^{t=-1}, thereby conveying no information regarding the class membership (Fig. 6a,b). Classification in such cases must therefore rely on higher-order statistics. For the example in Fig. 6, since the third-order cumulants differ between classes (G_x^{(3),t=+1} ≠ G_x^{(3),t=-1}), we expect their inclusion in the statistical model to be sufficient for solving the task. We here demonstrate that such higher-order cumulants can indeed be treated by our approach; in particular, we validate the statement made in Section III C that it suffices to consider higher-order cumulants in only the first layer.
The input distribution for this task is defined as a Gaussian mixture of four components, illustrated in Fig. 6a,b (details in Appendix F). As expected, training the network model yields near-optimal performance values, while a statistical model g_stat({G_x^(n)}_{n=1,2}, θ) that considers only the class-conditional means µ_x^t and covariances Σ_x^t fails to solve the task, yielding chance-level performance (Fig. 6c). This performance gap is nearly bridged when we include the higher-order input cumulants in the statistical model g_stat({G_x^(n)}_{n=1,2,3,4}, θ). The activation function φ allows information in G_x^(3) to be transferred to lower-order cumulants, which are then processed by subsequent layers in the manner described in previous sections, facilitating different means G_y^(1) in the output of the statistical model.
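A one-dimensional analogue of such a task (illustrative only; the task in the text is a two-dimensional Gaussian mixture) can be built by mirroring a standardized skewed variable, so that the two classes share mean and variance exactly and differ only in their third cumulant:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Class t=+1: the standardized skewed variable u = (v**2 - 1)/sqrt(2), v ~ N(0,1);
# class t=-1: its mirror image -u. Means and variances of the two classes match,
# while the third cumulants are +-2*sqrt(2).
v = rng.standard_normal(n)
u = (v**2 - 1) / np.sqrt(2)
x_plus, x_minus = u, -u

def k3(x):
    """Empirical third cumulant (= third central moment)."""
    xc = x - x.mean()
    return (xc**3).mean()

print(x_plus.mean(), x_minus.mean())  # both close to 0
print(x_plus.var(), x_minus.var())    # both close to 1
print(k3(x_plus), k3(x_minus))        # close to +-2*sqrt(2)
```

A statistical model restricted to mean and covariance cannot separate these classes, while one that includes the third-order cumulant can, mirroring the situation in Fig. 6.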

E. High dimensionality of input data justifies Gaussian description of fully-connected deep networks
We study the CIFAR-10 data set [29], consisting of 10 classes of 32 × 32 images with 3 color channels. Compared to MNIST, we expect two antagonistic effects. On the one hand, since images within one class of CIFAR-10 are significantly more heterogeneous, we expect the class-conditional distributions to be more complex, and consequently to require higher-order cumulants to accurately represent their statistical structure. On the other hand, due to the larger input dimensionality N_0 = 3072 compared to N_0 = 784 for MNIST, higher-order cumulants are more strongly suppressed in the input layer (see Section III B). To check how these two effects interplay in feed-forward networks, we employ the methods presented in previous sections to restrict training to certain cumulants, similarly to Section IV C.
We train network models on the CIFAR-10 data set and compare these to the statistical model trained on the Gaussian approximation of CIFAR-10 (Fig. 7). In both cases, performance is evaluated on the CIFAR-10 test data set. We find that network models trained on data samples achieve performance values of P = 34.8% ± 1.4%. In contrast to MNIST, the statistical model trained on the Gaussian statistics consistently achieves higher performance values of P = 37.6% ± 1.3%. These results are directly linked to the two aforementioned effects: they indicate that due to the large input dimensionality, networks predominantly process only the Gaussian statistics; the statistical model therefore continues to provide a good representation of the network. Moreover, estimates of the Gaussian statistics are more accurate in the statistical model (averaged over the full training set of 50,000 images) compared to training on data samples (averaged over mini-batches of 100 images), possibly explaining the slightly higher performance values. Importantly, although the achieved performance values are far below values reported for other architectures such as convolutional ResNets [30], they are representative of fully-connected feed-forward networks [12]. The difference between the architectures lies in the extracted statistical information. For high-dimensional input data, the theory presented here predicts that fully-connected feed-forward networks are limited to Gaussian statistics, which can only partly capture the statistical structure of more complex data sets such as CIFAR-10. Hence, the presented decomposition of a network in terms of cumulants allows us to relate the power of network architectures to the processing of statistical information contained in the data.
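The suppression of higher-order cumulants with input dimensionality can be illustrated with the exact identity κ₃(Σ_i w_i x_i) = Σ_i w_i^3 κ₃(x_i) for independent inputs: with read-out weights of magnitude N^{-1/2}, the third cumulant of a projection decays as N^{-1/2}. A sketch with hypothetical skewed inputs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
kappa3_x = 2 * np.sqrt(2)  # third cumulant of u = (v**2 - 1)/sqrt(2), v ~ N(0,1)

def k3(z):
    """Empirical third cumulant (= third central moment)."""
    zc = z - z.mean()
    return (zc**3).mean()

results = {}
for N in (64, 256):
    v = rng.standard_normal((n, N))
    x = (v**2 - 1) / np.sqrt(2)      # iid standardized skewed inputs
    w = np.full(N, N**-0.5)          # read-out weights of magnitude N**-1/2
    z = x @ w
    # Exact identity for independent inputs:
    # kappa3(z) = sum_i w_i**3 * kappa3(x_i) = kappa3_x / sqrt(N)
    results[N] = k3(z)
    print(N, results[N], kappa3_x / np.sqrt(N))
```

Quadrupling the input dimension halves the third cumulant of the projection, in line with the stronger suppression of non-Gaussian statistics for CIFAR-10 than for MNIST.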

V. DISCUSSION
The question of how neural networks process data is fundamentally the question of how information is encoded in the data distribution and subsequently transformed by the network. We here present an analytical approach, based on methods from statistical physics, to study the mapping of data distributions implemented by deep feed-forward neural networks: we parameterize the data distribution in terms of correlation functions and derive their successive transformations across layers. We show that the initial network layer effectuates the extraction of information from higher-order correlations in the data; for subsequent layers, a restriction to first- and second-order correlation functions (mean and covariance) already captures the main properties of the network computation. This reduction of the bulk of the network to a non-linear mapping of a few correlation functions provides an attractive view for further analyses. It relies on the assumption of sufficiently wide layers to apply the central limit theorem, but, in practice, we find that the approximations are useful even for narrow networks.
We validate these results for different data sets. We first investigate an adaptation of the XOR problem that is purely based on first- and second-order cumulants. Despite the non-linear transformations in each layer giving rise to higher-order correlations, the network solutions to this task can largely be described in terms of transformations solely between mean and covariance of each class. We then consider the MNIST database: we show that network solutions based on empirical estimates for mean and covariance of each class capture a large amount of the variability within the data set, but still exhibit a non-negligible performance gap in comparison to solutions based on the actual data set. We discuss how this performance difference results from the omission of higher-order correlations. We then introduce an example task where higher-order correlations exclusively encode class membership, which allows us to explore their role in isolation. Finally, we show that for high-dimensional input data such as CIFAR-10, the first layer of fully-connected networks predominantly extracts the Gaussian statistics. As a consequence, the information processing in these networks is well described by the Gaussian theory.
Limitations The dimensionality N_0 of the data may limit the applicability of the presented approach to low orders n, since cumulants of order n are tensors with N_0^n entries. We note, however, that there exist methods to ease the computational cost of higher-order cumulants in large dimensions: for example, one can make use of the inherent symmetries in these tensors, as well as in the theory itself. The application of such methods to our framework remains a point for future work. A parameterization of a probability distribution in terms of cumulants, moreover, needs to be chosen such that it maintains positivity of the probability density function. Conserving this property implies constraints for truncating cumulant orders, which require further investigation.
The presented framework and its perturbative methods naturally apply to polynomial approximations of activation functions. Although networks with polynomial non-linearity are, in principle, not capable of universal function approximation [31-33], this is not an issue for the classification tasks we consider. To obtain illustrative analytical expressions for the mixing of correlation functions, we chose to demonstrate the approach with a quadratic activation function. Non-polynomial and even non-differentiable activation functions can, however, also be dealt with in our framework using Gram-Charlier expansions, as detailed for the ReLU activation in Appendix C. While we here mostly focus on the mean and covariance, we also show how to generalize the results to higher-order cumulants.
Relation to kernel limit of deep networks In this paper we study individual networks with specific parameters θ. There is a complementary approach that studies ensembles of (infinitely) wide networks with random parameters: Poole et al. [15] expose a relation between the Lyapunov exponents and the depth to which information propagates in randomly initialized deep networks. They find the regime close to chaos beneficial for information propagation. We similarly find that the depth scale of information propagation controls the propagation of the Gaussian statistics across data samples studied in the current work, if network parameters are drawn randomly (see Appendix G, in particular Fig. 10). Furthermore, random network parameters are central to studying training as Bayesian inference [34]: independent Gaussian priors on the network parameters render Bayesian inference exact on the resulting Gaussian process [7,12,35,36]. The works [9,37-39] use methods similar to ours to compute finite-width corrections and corrections arising from training with stochastic gradient descent. These approaches consider distributions over network parameters θ. The statistics of the data in this view enter in the form of the pairwise overlaps Σ_i x_i x'_i between pairs of patterns x and x'. In the large data limit, the data statistics can moreover be described by a density p(x), whose properties shape the eigenfunctions φ_i of the kernel k [14, Sec. 4.3]. In contrast, in the present work we study the transformation of an input distribution p(x) by a network with fixed parameters θ. The focus on individual networks rather than ensembles allows us to directly take into account the internal statistical structure of data samples, for example in the form of the means µ_{x,i} and covariances Σ_{x,ij} for individual pixels i and j in images.
Related works Describing data and network activity in terms of correlations was initially explored by Deco and Brauer [40] for the particular architecture of volume-preserving networks. They derived expressions for the output in terms of its correlations, as well as training rules that aim to decorrelate given input data. The work we present here differs in that our goal is not to impose a specific statistical structure on the network output, but to relate the correlations of the input and output distributions and thereby obtain a description of the information processing within the network.
While we do show that these distributions are not exactly Gaussian, that the networks can utilize higher-order correlations in the hidden layers, and how these contributions could in principle be computed, we focus mostly on self-consistently tracking the distributions in Gaussian approximation. This is because, as we show, this approximation is tractable while staying accurate also for trained networks, capturing the majority of the test accuracy in our examples. That a Gaussian approximation is surprisingly effective has also been argued in a recent line of works using teacher-student models with realistic data structure [41-43]. We also derive conditions under which a Gaussian approximation of the activity in the inner layers of a deep network is consistent in the limit of wide layers: scaling of weight amplitudes w_ij ∝ N^{-1/2}, weak pairwise correlations c_ij = O(N^{-1}), as well as an approximate pairwise orthogonal decomposition of the previous layer's covariance matrix by the row vectors of the following layer. Under these conditions we show that cumulants of order higher than two are at most O(N^{-1/2}). Our approach is inspired by and analogous to the Gaussian equivalence property proposed by [41]; in particular, we also use as the central argument an expansion of higher-order cumulants caused by weak pairwise correlations. Our result differs, though, in that we treat layered networks instead of the random feature maps embedding a low-dimensional manifold considered in [41]. Other works based on a Gaussian approximation of the representation in each layer are: [44] using general deep networks, [45] focusing on ResNets, and [46] considering the case of GANs. Going beyond random weights, other works study dimensionality reduction and decorrelation in both random deep networks and trained deep belief networks [47], explicitly analyzing the effects of weak correlations among weights [48].
Finally, a pedagogical text focusing on field-theory for deep neural networks has recently been published [49].
Outlook Tracing transformations of data correlations through layers of a neural network allows the investigation of mechanisms for both information encoding and processing; in this manner, it presents a handle towards interpretability of deep networks. The availability of tractable expressions describing the transformations of data correlations within neural networks is therefore an interesting prospect for future work seeking to dissect how networks learn and perform tasks. In this context, the theory we propose assumes data statistics of the input distribution p(x) to be known and exposes how statistical features of the data are transformed to generate the output, with the goal of shedding light onto the networks' functioning principles.
Another natural application of the proposed framework is the identification of essential correlations in the data. In that scenario, we do not need the exact distribution p(x), but only sufficiently accurate estimates of some statistics of x that can be obtained from the training data. By manipulating the information available to the model during training, we expose different information encodings the network can employ to solve the same task. We believe this approach could be used to identify data statistics required to solve a given task.
More complex data sets, such as CIFAR-10, require richer network architectures than fully-connected feed-forward networks to achieve high performance. For example, applying the presented approach to ResNet-50 [50] would require the extension to convolutional network layers and skip connections. However, since these are equivalent to linear layers with weight matrices of a particular shape [13], they can straightforwardly be included in the framework.
Another future direction targets expressivity of deep networks: by reversely tracing the data correlations through the network, from target to data, one may ask which input distributions are mapped to a given output distribution -in effect constructing layer-resolved, statistical receptive fields for each target. Expressing these receptive fields in terms of data correlations may also be useful for studying how the complexity of data distributions is reduced by deep neural networks.

ACKNOWLEDGMENTS
We are grateful to Claudia Merger and Anno Kurth for helpful discussions. We thank Peter Bouss for feedback on an earlier version of the manuscript. This work was partly supported by the German Federal Ministry for Education and Research (01IS19077A and 01IS19077B), the Excellence Initiative of the German federal and state governments (ERS PF-JARA-SDS005), and the Helmholtz Association Initiative and Networking Fund under project number SO-092 (Advanced Computing Architectures, ACA).

APPENDIX A HIGHER-ORDER CUMULANTS OF POST-ACTIVATIONS CAUSED BY WEAKLY CORRELATED PRE-ACTIVATIONS
We study how weak correlations between pre-activations affect higher-order cumulants of the post-activations. Assume pre-activations x, y are zero-mean Gaussian distributed and weakly correlated, and let φ be a piece-wise differentiable activation function. Let the covariance matrix of x and y be C = ((a, c), (c, a)). For brevity, we denote ⟨○⟩ := ⟨○⟩_{(x,y)∼N(0,C)}. Then by Price's theorem [51-53], ∂⟨f(x) g(y)⟩/∂c = ⟨f′(x) g′(y)⟩. This can be used to expand ⟨f(x) g(y)⟩ for small c as ⟨f(x) g(y)⟩ = ⟨f(x)⟩ ⟨g(y)⟩ + c ⟨f′(x)⟩ ⟨g′(y)⟩ + O(c^2). This expression corresponds to Eq. (A4) in [41], but Goldt et al. use a different approach than Price's theorem. In the expression in [41] one needs to use ⟨u f(u)⟩ = ⟨f′(u)⟩, which holds since they assume ⟨u^2⟩ = 1.
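The leading-order expansion stated above is easy to verify numerically; a Monte-Carlo sketch with f = g = tanh (an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000_000
a, c = 1.0, 0.05

# Zero-mean Gaussian pair (x, y) with covariance matrix [[a, c], [c, a]]
L = np.linalg.cholesky(np.array([[a, c], [c, a]]))
x, y = L @ rng.standard_normal((2, n))

def f(z):
    return np.tanh(z)

def fprime(z):
    return 1.0 - np.tanh(z)**2

# Leading-order expansion from Price's theorem:
# <f(x) f(y)> = <f(x)><f(y)> + c <f'(x)><f'(y)> + O(c**2)
lhs = (f(x) * f(y)).mean()
rhs = f(x).mean() * f(y).mean() + c * fprime(x).mean() * fprime(y).mean()
print(lhs, rhs)
```

For this odd activation the O(c^2) correction also vanishes by symmetry, so the agreement is accurate to within Monte-Carlo noise already at moderate c.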
Next, we consider the centered variables f̃(x) := f(x) − ⟨f(x)⟩ and correspondingly g̃(y) := g(y) − ⟨g(y)⟩, for which one gets ⟨f̃(x) g̃(y)⟩ = c ⟨f′(x)⟩ ⟨g′(y)⟩ + O(c^2). We may generalize this property to expectation values of more than two functions f̃_1, . . . , f̃_n. By the marginalization property of Gaussian distributions, the joint distribution of any subset of the x_i is Gaussian distributed, too, where the covariance matrix is the corresponding sector of the matrix C_ij = ⟨⟨x_i x_j⟩⟩. Therefore, for any i we define the centered function f̃_i(x_i) := f_i(x_i) − ⟨f_i(x_i)⟩. Applying (A1) to an arbitrarily chosen pair (i, j) yields the leading term (A3); taking the expectation also across the remaining variables, the remaining factors ∏_{k∉{i,j}} f̃_k(x_k) can be expanded in a similar manner, where all remaining indices k ∉ {i, j} need to be paired; only such pairings yield non-zero contributions. Together one therefore has ⟨f̃_1(x_1) ⋯ f̃_n(x_n)⟩ = Σ_{σ∈Π} C_{σ1σ2} ⟨f′_{σ1}⟩⟨f′_{σ2}⟩ ⋯ C_{σ(n−1)σn} ⟨f′_{σ(n−1)}⟩⟨f′_{σn}⟩ + higher orders, (A4) where Σ_{σ∈Π} sums over all disjoint pairings of indices. (This expression corresponds to A16 in [41], apart from minor typos: the factors b_i seem to be missing, p should be m, and we interpret the upper case of their A16 to be meant as b_1⋯b_m Σ_{σ∈Π} m_{σ1σ2} ⋯ m_{σ(m−1)σm}.) This expression is also consistent with Wick's theorem, to which it needs to reduce in the case of an identity mapping f(x) = x. The expansion (A4) holds for arbitrary n; for any n, the result is correct up to terms of order O(c^{n/2}). All cumulants of order n ≥ 3 thus vanish at the given order O(c^{n/2}). This can be exemplified at fourth order (dropping the arguments x for brevity): the fourth cumulant is the fourth moment minus all products of second moments, ⟨⟨f̃_1 f̃_2 f̃_3 f̃_4⟩⟩ = ⟨f̃_1 f̃_2 f̃_3 f̃_4⟩ − ⟨f̃_1 f̃_2⟩⟨f̃_3 f̃_4⟩ − ⟨f̃_1 f̃_3⟩⟨f̃_2 f̃_4⟩ − ⟨f̃_1 f̃_4⟩⟨f̃_2 f̃_3⟩. (A5) Expanding the fourth moment according to (A4), and each of the three negative terms on the right hand side of (A5) with help of (A4), one finds that they cancel precisely at the given accuracy O(c^3). Analogous results hold at any even order (odd orders vanish for the centered variables), so that we find ⟨⟨f̃_1 ⋯ f̃_n⟩⟩ = O(c^{n/2}) for n ≥ 3 with all indices distinct. (A6) We also need to consider the case that indices in (A5) repeat, for example ⟨⟨f̃_1 f̃_1 f̃_2 f̃_2⟩⟩.
In general, assume we have r distinct indices j_1, . . . , j_r among the n indices and want to compute the n-th cumulant for n > r. Within a set of repeated variables, correlations are of order O(1) instead of O(c); in the expansion (A4), variables with repeated indices must therefore be treated as a single variable. For the given example, define g_i := f̃_i^2, i = 1, 2, and centered variables g̃_i := g_i − ⟨g_i⟩. One then has with (A4) that ⟨g̃_1 g̃_2⟩ = O(c), (A7) which is also the second cumulant, because the g̃ are centered. We then expand the fourth cumulant with repeated indices analogously to (A5). Evaluating the fourth moment in the first line, using the definitions of g and g̃ above as well as (A7), and combining with (A8), one finds ⟨⟨f̃_1 f̃_1 f̃_2 f̃_2⟩⟩ = O(c), where we dropped all terms of order O(c^2), such as ⟨f̃_1 f̃_2⟩⟨f̃_1 f̃_2⟩ = O(c_12^2), and used that ⟨f̃_1 f̃_1⟩⟨f̃_2 f̃_2⟩ = ⟨g_1⟩⟨g_2⟩. Now consider that r is odd, such as in ⟨⟨f̃_1 f̃_1 f̃_2 f̃_3⟩⟩. Evaluating the fourth moment and applying it to (A9), the terms ∝ c cancel exactly. These two examples show the structure of the expansion: if we have r distinct indices j_1, . . . , j_r, the n-th cumulant for n > r of these variables will be of the order in c that equals the number of pairs needed to join all distinct indices. So together (r even or odd) we get ⟨⟨f̃_{j_1} ⋯ f̃_{j_n}⟩⟩ = O(c^{⌈r/2⌉}). (A10) This expression describes the scaling of the higher-order cumulants of post-activations with weak correlations of the pre-activations.

APPENDIX B WEAKLY-CORRELATED GAUSSIAN NETWORK MAPPING
The application of a non-linear activation function φ in each network layer generates higher-order cumulants from Gaussian distributed pre-activations, as discussed in Section III B. We here derive conditions under which higher-order cumulants G^(n)_{z^l} of the pre-activations z^l beyond mean and covariance scale down with the layer width N in expectation. Consequently, these become negligible for wide networks where N ≫ 1.
We apply the considerations in the previous Appendix A of weakly correlated Gaussian variables to the network mapping. Pre-activations in layer l+1 are given by z^{l+1} = W^{l+1} y^l + b^{l+1}, produced from post-activations y^l = φ(z^l). Assume the pre-activations z^l_i are Gaussian and weakly correlated, to order ⟨⟨z̃^l_i z̃^l_j⟩⟩ = O(N^{-1}) for i ≠ j. (B3) We want to derive conditions under which it then follows that also pre-activations z^{l+1}_i in the next layer have this property. By induction through the layer index, one then has established a condition under which the neglect of non-Gaussian cumulants in the inner layers of the network is justified. To this end, we define centered variables z̃ := z − ⟨z⟩ as well as ỹ := y − ⟨y⟩. The variance of pre-activations should be of order unity, because one aims to explore the dynamic range of the gain function, which we assume to be of order unity (we use the dynamic range of the gain function to define our scale). It then follows that also the post-activations have a variance of order unity, ⟨⟨(ỹ^l_a)^2⟩⟩ = O(1). Such conditions are typically also enforced by batch normalization. If the y^l_a were uncorrelated, the variance of the pre-activations in the next layer would be ⟨⟨(z̃^{l+1}_i)^2⟩⟩ = Σ_a (W^{l+1}_{ia})^2 ⟨⟨(ỹ^l_a)^2⟩⟩. For the variances on both sides to be of order unity, we need that W^{l+1}_{ia} = O(N^{-1/2}). (B4) To have low correlations in the next layer, one needs to demand that Σ_{ab} W^{l+1}_{ia} C_{ab} W^{l+1}_{jb} = O(N^{-1}) for i ≠ j, (B7) where C_{ab} = ⟨⟨ỹ^l_a ỹ^l_b⟩⟩. This can be interpreted as demanding that different rows W^{l+1}_{i○} and W^{l+1}_{j○} project out mutually nearly orthogonal sub-spaces of the space of principal components of C. This means that different neurons i and j each specialize on sub-spaces that have little mutual overlap. Now consider higher-order correlations. It follows from (A6) and from the condition of weak pairwise correlations (B3) that for n ≥ 3 and distinct indices, ⟨⟨ỹ^l_{j_1} ⋯ ỹ^l_{j_n}⟩⟩ = O(N^{-n/2}). (B8) The cumulants of the pre-activations z^{l+1}_i are given by those of the post-activations as ⟨⟨(z̃^{l+1}_i)^n⟩⟩ = Σ_{j_1,...,j_n} W^{l+1}_{ij_1} ⋯ W^{l+1}_{ij_n} ⟨⟨ỹ^l_{j_1} ⋯ ỹ^l_{j_n}⟩⟩. (B9) We now distinguish three cases. 1.) Diagonal contributions: First consider the special case where all indices j_1 = . . . = j_n are identical.
One then gets a contribution to (B9) at order n ≥ 3 of Σ_j (W^{l+1}_{ij})^n ⟨⟨(ỹ^l_j)^n⟩⟩ = O(N^{1-n/2}). For n ≥ 3 this is suppressed by a large layer width N. 2.) Off-diagonal contributions with all distinct indices: Next consider the off-diagonal terms, where all sending neurons' indices are unequal, j_1 ≠ j_2 ≠ . . . ≠ j_n, so that we can use (B8). For n odd, the contributions vanish, because then (B8) vanishes. For n even, the sum Σ_{j_1≠j_2≠...≠j_n} runs over O(N^n) terms. For these contributions to be suppressed for n ≥ 3 with increasing network size, we thus need to demand that the order of the pairwise correlations is small enough; the resulting contribution is then suppressed with network size also for large orders n.
3.) Off-diagonal contributions with two or more equal indices: Now consider terms for which a subset of j_a, j_b, j_c, . . . assume the same value. Let the number of disjoint indices j_1 ≠ j_2 ≠ . . . ≠ j_r be r < n. Each pair of equal indices can be seen as the appearance of one Kronecker δ_{j_a j_b}, which eliminates one summation Σ_{j=1}^N, hence one factor N less. But at the same time, by (A8), the moments are also increased, ⟨∏_{k=1}^n f̃_{j_k}(z^l_{j_k})⟩ = O(c^{⌈r/2⌉}). Together, we get a contribution that, using r < n in the last step and upper-bounding by the worst case n = r + 1 (n odd), is suppressed with network size, too. In summary, we have shown that the Gaussian approximation with weak pairwise correlations of order at most O(N^{-1}) is consistently maintained in the limit of wide networks N ≫ 1 if synaptic amplitudes scale as (B4) and if the rows of the connectivity W^l in each layer l in addition obey the approximate orthonormality condition (B7). From a functional perspective the latter condition makes sense, because it assures that the N neurons in each layer are used effectively to represent the entire variability present in the previous layer, avoiding redundancy among neurons.
Finally, we note that these conditions are fulfilled for the common random initialization with independent weights and biases b_i ∼ N(0, σ_b^2). This choice of initialization precisely preserves the magnitude of the covariance within the network; since the resulting covariance in the next layer is approximately diagonal, the calculations simplify significantly in this case. The above considerations, in contrast, also include conditions for trained networks, where correlations among weights cause correlations between pairs of pre-activations (z^l_i, z^l_j).
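The scaling condition (B4) can be checked numerically: with weights of magnitude O(N^{-1/2}) acting on post-activations of unit variance, the pre-activation variance stays of order one independently of the layer width. A sketch with random Gaussian weights:

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 20_000

vars_by_width = {}
for N in (64, 256, 512):
    # Post-activations with order-one variance (stand-in for y^l)
    y = rng.standard_normal((n_samples, N))
    # Weights W_ij ~ N(0, 1/N), i.e. of magnitude O(N**-1/2), cf. (B4)
    W = rng.standard_normal((N, N)) / np.sqrt(N)
    # Pre-activations of the next layer: their variance stays of order one
    z = y @ W.T
    vars_by_width[N] = z.var()
    print(N, vars_by_width[N])
```

Without the 1/√N scaling the variance would grow linearly with N and the dynamic range of the gain function would not be preserved across layers.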

APPENDIX C INTERACTION FUNCTIONS FOR DIFFERENT ACTIVATION FUNCTIONS
In Section III B, we derived the interaction functions resulting from the non-linearity φ. We here consider networks with the ReLU activation function φ(z) = max(0, z). Taking the distribution of pre-activations z^l to be Gaussian with mean µ_{z^l} and covariance Σ_{z^l}, the mean post-activations are obtained by Gaussian integration of φ. For the covariance of post-activations, we distinguish the cases i = j and i ≠ j, starting with the former by calculating its second moment. Fig. 2 in the main text illustrates information propagation in networks with ReLU activations. For completeness, we include here as Supplemental Fig. 8 the analogous illustration for a network with quadratic activations; for Fig. 3, we include the results for networks with quadratic activation function in Supplemental Fig. 9.
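The Gaussian integral for the ReLU mean evaluates to the standard closed form ⟨max(0, z)⟩ = µ Φ(µ/σ) + σ 𝒩(µ/σ) for z ∼ N(µ, σ^2), with Φ the standard normal distribution function and 𝒩 its density; a quick numerical cross-check:

```python
import math
import numpy as np

rng = np.random.default_rng(6)

def relu_mean(mu, sigma):
    """<max(0, z)> for z ~ N(mu, sigma**2):
    mu * Phi(mu/sigma) + sigma * N(mu/sigma)."""
    t = mu / sigma
    Phi = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))   # standard normal CDF
    dens = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return mu * Phi + sigma * dens

mu, sigma = 0.3, 0.8
z = rng.normal(mu, sigma, size=2_000_000)
mc = np.maximum(0.0, z).mean()
print(relu_mean(mu, sigma), mc)
```

This closed form is what makes the ReLU tractable in the present framework despite its non-differentiability at zero.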

APPENDIX E DATA SAMPLE GENERATION FOR MNIST BASED ON GAUSSIAN APPROXIMATION OF INPUT DISTRIBUTION
In Section IV C of the main text, we discuss training networks to solve MNIST using R_emp,MSE (Eq. (25)) with data samples drawn from the Gaussian approximation of the input distribution. For this Gaussian approximation, means µ̂_x^t and covariances Σ̂_x^t for each class t are estimated empirically from the training data set, where we flattened the 28×28 images into 784-dimensional vectors. Due to a lack of variability in some pixel values at the image edges, the resulting covariances Σ̂_x^t are not positive definite, but only positive semi-definite.
To account for the zero eigenvalues of the covariance, data samples are generated based on a principal component analysis of the covariance matrix. For each class t, we decompose the covariance matrix as Σ̂^t_x = V D V^⊤, with V containing the unit-length eigenvectors v_i and D = diag(λ_1, …, λ_{N_0}) containing the corresponding eigenvalues λ_i of Σ̂^t_x, which we assume to be ordered according to their size, λ_1 ≥ ⋯ ≥ λ_{N_0} ≥ 0. We set a threshold ϑ_PCA > 0 that defines a subspace U spanned by the eigenvectors {v_i}_{i=1,…,N_PCA} for which λ_i > ϑ_PCA. Data samples x̂^(d)_U are then generated with respect to this subspace U and projected back to the input space R^{N_0}. For all experiments in Section IV C of the main text, we choose ϑ_PCA = 10^{-2}, corresponding to N_PCA between 103 and 234 for the different classes t. Since for all classes t the magnitude of the largest eigenvalue is of order 1, this choice of ϑ_PCA ensures including relevant eigenvectors while excluding noise due to finite numerical precision.

[Displaced figure caption: On average, the trained networks achieve performance values of P = 96.52% ± 0.15% compared to P_opt = 97.5%. Networks were trained to perform the XOR task described in Section IV B 1. Other parameters: φ(z) = z + α z².]
The MNIST training data set contains 60,000 samples; to allow for a fair comparison between training on Gaussian samples and on the original images, we generated a training data set of the same size, D = 60,000 Gaussian samples.
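The sampling procedure of this appendix can be sketched as follows. The code uses synthetic low-rank data as a stand-in for the flattened MNIST images (dimensions and data are hypothetical), but implements the same steps: estimate mean and covariance, keep the eigenvectors with λ_i > ϑ_PCA, draw Gaussian coefficients in the subspace U, and project back to the input space.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the flattened images of one class
# (in the paper: 784-dimensional MNIST vectors of class t)
N0, D = 50, 5000
X = rng.normal(size=(D, 5)) @ rng.normal(size=(5, N0))  # low-rank structure
X += 0.01 * rng.normal(size=(D, N0))                    # small pixel noise
X[:, :10] = 0.0  # constant pixels -> zero eigenvalues, as at the image edges

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# PCA of the covariance; eigh returns eigenvalues in ascending order
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]        # descending: lam_1 >= ... >= lam_N0

theta_pca = 1e-2                      # threshold as in the text
n_pca = int(np.sum(lam > theta_pca))  # dimension of the subspace U

# Draw Gaussian coefficients in U and project back to the input space
c = rng.normal(size=(3, n_pca)) * np.sqrt(lam[:n_pca])
x_hat = mu + c @ V[:, :n_pca].T
print(n_pca, x_hat.shape)
```

Because the eigenvectors with non-zero eigenvalue vanish on the constant pixels, the generated samples are exactly zero there, just like the original data.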

APPENDIX F PROBLEM SETUP FOR INCLUSION OF HIGHER-ORDER STATISTICS IN THE MAIN TEXT
The problem studied in Section IV D of the main text is constructed as follows. We define two classes, t = ±1, each composed of two Gaussian components, + and −, with distinct means. Covariances are isotropic throughout. The outer components (t = −1, −) and (t = +1, +) are weighted by p_outer = 1/8, while the inner components (t = −1, +) and (t = +1, −) are weighted by p_inner = 3/8, as illustrated in Fig. 6a,b of the main text. A data sample x^(d) is assigned a target label t^(d) ∈ {±1} based on the mixture component it is drawn from. The distribution parameters are chosen such that the class-conditional means and covariances of the data are identical for the two classes, while the third-order correlations differ, being ±0.75 δ_ij δ_jk δ_ki δ_i1 for the index combination (i, j, k).
We use training and test data sets of size D = 10 4 .
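A one-dimensional caricature of this construction (a sketch with hypothetical numerical values, not the parameters of Section IV D) illustrates how mirrored mixtures with weights 1/8 and 3/8 can share mean and variance while differing in their third-order statistics. Within each class the component probabilities are 1/4 (outer) and 3/4 (inner); choosing means ±3 and ∓1 makes the class mean vanish.

```python
import numpy as np

rng = np.random.default_rng(3)
s = 0.5        # isotropic component standard deviation (hypothetical value)
D = 200_000

def sample_class(sign, n):
    """Two-component 1D Gaussian mixture; within the class, the outer
    component (mean sign*3) has weight 1/4, the inner (mean -sign*1) 3/4."""
    outer = rng.random(n) < 0.25
    mu = np.where(outer, sign * 3.0, -sign * 1.0)
    return mu + s * rng.normal(size=n)

xp = sample_class(+1, D)   # class t = +1
xm = sample_class(-1, D)   # class t = -1

for x in (xp, xm):
    m = x.mean()
    print(m, x.var(), ((x - m) ** 3).mean())
# Means (~0) and variances (~3 + s^2) agree between the classes,
# while the third central moments are ~ +6 and ~ -6.
```

Exactly as in the full problem, a classifier restricted to first- and second-order statistics cannot separate these classes; the information lies in the third-order cumulant.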

APPENDIX G DEPTH SCALES OF INFORMATION PROPAGATION
We here discuss the relation between the presented work and Poole et al. [15]. To formalize this relation, we define z^{l,d}_{θ,k} as the pre-activation of neuron k in layer l for a given data sample x^(d) in a network with parameters θ. Poole et al. study ensembles of networks across random realizations of the network parameters θ. The family of distributions they study, expressed in terms of pre-activations, is thus one distribution jointly for all pre-activations {z^d_k}_{d=1,…,D; k=1,…,N} for a given set of D data samples x^(d) and for each given layer l. In the limit of wide networks, they find that this distribution factorizes across different neuron indices, so that z^{l,d}_i and z^{l,d′}_k are independent for i ≠ k. Further, these variables are centered Gaussian, so that a single covariance matrix is sufficient to describe their statistics. In this limit, it is therefore sufficient to study the joint statistics of pre-activations corresponding to all pairs of inputs d, d′. Correlation functions in their work thus a priori quantify fluctuations across realizations of network parameters. Their mean-field theory for deep feed-forward networks is identical to the classical mean-field theory of random recurrent networks [56], because for recurrent networks with discrete-time updates the equal-time statistics is identical to the equal-layer statistics of a deep network [57].
In the presented work, instead, we study individual networks defined by one fixed set of parameters θ across the distribution p(x) of data samples x^(d). Correlations in our work thus quantify the variability of the network state across different data points. Formally, the family of distributions we study is, for each given l and θ, one joint distribution of all neurons k. Importantly, the distribution is across the ensemble of data points d.
One formal difference is thus the expectation across θ in (G1) versus the expectation over d in (G3). Wide networks, however, tend to be self-averaging. This means that the ensemble across parameters θ studied by Poole et al. concentrates on a single typical behavior, which one finds in any of its (likely) individual realizations. Formally, the empirical distribution of (z_k, z′_k) across neurons k for any random choice of parameters θ takes on the same form as the ensemble distribution, so that (G2) for large N approaches the empirical average over neuron activations. A way to show this is by a saddle-point approximation of the moment-generating function after the disorder average across θ [e.g., 21, 53, 57–59].
To derive the result by Poole et al. [15] or Molgedey et al. [56] in our notation, we start with the expression for the pre-activations z^{l+1}_i = ∑_k W^{l+1}_{ik} y^l_k + b^{l+1}_i (see (1)). For Gaussian distributed W^{l+1}_{ik} and b^{l+1}_i, the pre-activations for one fixed data sample x^(d) (suppressing the superscript d for brevity in the following) become Gaussian as well, with mean and covariance determined by S_{z^{l+1}} = σ_w² ⟨φ(z^l) φ(z^l)⟩_{z^l ∼ N(0, S_{z^l})} + σ_b², where we used the mapping by the activation function (15). To determine the statistics ⟨y^l_k⟩_{W,b} and ⟨y^l_k y^l_k⟩_{W,b}, we simultaneously performed an average over weights and biases in layers l′ ≤ l. These statistics are identical across neurons, so we write ⟨y^l_k⟩_{W,b} = ⟨y^l⟩_{W,b} and ⟨y^l_k y^l_k⟩_{W,b} = ⟨y^l y^l⟩_{W,b}. Eqs. (G4) and (G5) show that correlations among different neurons vanish on average across networks.
The covariance between pre-activations of a pair of networks for two different inputs x^(d) and x^(d′) analogously follows from the distribution N(0, {S_{z^{l,d} z^{l,d′}}}), which is meant as the Gaussian distribution of the pair (z^{l,d}, z^{l,d′}) with covariance matrix

( S_{z^{l,d}}            S_{z^{l,d} z^{l,d′}} )
( S_{z^{l,d′} z^{l,d}}   S_{z^{l,d′}}         ).
We may make a connection to our results by considering a pair of inputs x^(d) and x^(d′). These inputs are presented to the network as y^{0,d} and y^{0,d′}. The theory by Poole et al. yields a measure of the overlap O^l_{dd′} of network states after l layers as the solution of the joint iterative equations derived above. In the limit of wide networks, this overlap becomes self-averaging, so it concentrates around its mean value across realizations. To show the simplest possible link to the depth scales studied by Poole et al., we consider the case of a deep untrained network. The statistics of z^l and y^l decay to a fixed point, so that we can consider the auto-statistics to become constant for large enough l, where A_0 is the stationary solution of (G6) and C_0 the stationary solution of (G7). Now consider pairs of inputs (y^{0,d}, y^{0,d′}) for which the statistics of pre-activations differ only little from this fixed-point statistics: assume that any data point has variance S_{z^l} = A_0 and that for any pair (d, d′) of data points the covariance of pre-activations in the first layer can be written as C_0 + δC^1_{dd′}, where δC^1_{dd′} ≪ C_0. Based on these assumptions, we now compute how δC^l_{dd′} decays with l.
Linearizing the iteration (G7), one obtains for the propagation of δC^l_{dd′} the relation δC^{l+1}_{dd′} = σ_w² ⟨φ′(z) φ′(z′)⟩ δC^l_{dd′}. Here, we made use of Price's theorem [51–53, Appendix A],

∂⟨φ(z) φ(z′)⟩ / ∂Σ_{zz′} = ⟨φ′(z) φ′(z′)⟩,

where φ′ = dφ/dz and Σ_{zz′} is the covariance of z and z′. For stationary statistics across layers (G10) and under the homogeneity assumption across data samples (G9), the factor σ_w² ⟨φ′ φ′⟩ is the same in every layer. One then obtains an exponential evolution with layer index, with a depth scale ξ = −1 / ln(σ_w² ⟨φ′ φ′⟩). In particular, at the transition to chaos, namely at the point in parameter space (σ_w, σ_b) for which σ_w² ⟨φ′ φ′⟩ = 1, the depth scale diverges. The overlap of activations (G8) shows the same depth scale, because its variation is linearly related to δC^l_{dd′} as δC^l_{dd′} = σ_w² δO^{l−1}_{dd′}. Both the covariance of pre-activations S_{z^{l,d} z^{l,d′}} and the overlaps of activations O^l_{dd′} = δO^l_{dd′} + O_0 therefore decay to fixed points, respectively C_0 and O_0, which are related by O_0 = σ_w^{−2} (C_0 − σ_b²). We can relate these results to our work on single networks by re-expressing the overlap O^l_{dd′} in terms of the probability distribution across different data samples. From (G8), one obtains the expression (G18) in terms of the means μ_{y^l,k}, where μ_{y^l,k} denotes the mean post-activation of neuron k in layer l, taken over the ensemble of all data points d. This is obtained by iterating Eqs. (14) and (17). We show in Fig. 10 that the predictions of (G18) are indeed consistent with the depth scale obtained from Poole et al.'s theory (G13). Moreover, since we derived our theory for single networks, (G18) also captures variability due to particular network realizations. Interestingly, while for network ensembles the depth scale ξ describes the evolution of the second moments, expression (G18) shows that for single networks ξ describes the evolution of the squared means across data samples.
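The fixed-point iteration and the resulting depth scale ξ = −1/ln(σ_w² ⟨φ′ φ′⟩) can be sketched numerically. The following uses a tanh non-linearity and an arbitrarily chosen point (σ_w, σ_b) in the ordered phase (the values are illustrative, not from the paper); Gaussian expectations are evaluated by Gauss-Hermite quadrature.

```python
import numpy as np

# Gauss-Hermite nodes for Gaussian expectations E[f(z)], z ~ N(0, q)
t, w = np.polynomial.hermite.hermgauss(101)
w = w / np.sqrt(np.pi)           # normalize for N(0,1) via z = sqrt(2)*t

def gauss_exp(f, q):
    z = np.sqrt(2.0 * q) * t
    return np.sum(w * f(z))

sigma_w, sigma_b = 0.9, 0.1      # illustrative point in the ordered phase

# Iterate the variance map q_{l+1} = sigma_w^2 <phi(z)^2> + sigma_b^2
q = 1.0
for _ in range(200):
    q = sigma_w**2 * gauss_exp(lambda z: np.tanh(z)**2, q) + sigma_b**2

# Slope of the linearized covariance map at the fixed point (Price's theorem)
chi = sigma_w**2 * gauss_exp(lambda z: (1.0 - np.tanh(z)**2)**2, q)
xi = -1.0 / np.log(chi)          # depth scale; diverges as chi -> 1
print(q, chi, xi)
```

Moving (σ_w, σ_b) toward the transition σ_w² ⟨φ′ φ′⟩ = 1 makes χ approach one from below, so the printed depth scale ξ grows without bound, reproducing the divergence at the edge of chaos.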