Entropy and mutual information in models of deep neural networks

We examine a class of deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) We show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.

The successes of deep learning methods have spurred efforts towards quantitative modeling of the performance of deep neural networks. In particular, an information-theoretic approach linking generalization capabilities to compression has been receiving increasing interest. The intuition behind the study of mutual informations in latent variable models dates back to the information bottleneck (IB) theory of [1]. Although recently reformulated in the context of deep learning [2], verifying its relevance in practice requires the computation of mutual informations for high-dimensional variables, a notoriously hard problem. Thus, pioneering works in this direction focused either on small network models with discrete (or binned continuous) activations [3], or on linear networks [4,5].
In the present paper we follow a different direction, and build on recent results from statistical physics [6,7] and information theory [8,9] to propose, in Section 1, a formula to compute information-theoretic quantities for a class of deep neural network models. The models we approach, described in Section 2, are non-linear feed-forward neural networks trained on synthetic datasets with constrained weights. Such networks capture some of the key properties of the deep learning setting that are usually difficult to include in tractable frameworks: non-linearities, arbitrarily large width and depth, and correlations in the input data. We demonstrate the proposed method in a series of numerical experiments in Section 3. First observations suggest a rather complex picture, where the role of compression in the generalization ability of deep neural networks is yet to be elucidated.

Multi-layer model and main theoretical results
A stochastic multi-layer model-We consider a model of multi-layer stochastic feed-forward neural network where each element x_i of the input layer x ∈ R^{n_0} is distributed independently as P_0(x_i), while hidden units t_{ℓ,i} at each successive layer t_ℓ ∈ R^{n_ℓ} (vectors are column vectors) come from P_ℓ(t_{ℓ,i} | W_{ℓ,i} t_{ℓ−1}), with t_0 ≡ x and W_{ℓ,i} denoting the i-th row of the matrix of weights W_ℓ ∈ R^{n_ℓ × n_{ℓ−1}}. The model is thus specified by a set of weight matrices {W_ℓ}_{ℓ=1}^{L}, distributions {P_ℓ}_{ℓ=1}^{L} which encode possible non-linearities and stochastic noise applied to the hidden-layer variables, and P_0 which generates the visible variables. In particular, for a non-linearity t_{ℓ,i} = ϕ_ℓ(h, ξ_{ℓ,i}), where ξ_{ℓ,i} ∼ P_ξ(·) is the stochastic noise (independent for each i), we have P_ℓ(t_{ℓ,i} | W_{ℓ,i} t_{ℓ−1}) = ∫ dP_ξ(ξ_{ℓ,i}) δ(t_{ℓ,i} − ϕ_ℓ(W_{ℓ,i} t_{ℓ−1}, ξ_{ℓ,i})). Model (1) thus describes a Markov chain which we denote by X → T_1 → T_2 → · · · → T_L, with T_ℓ = ϕ_ℓ(W_ℓ T_{ℓ−1}; ξ_ℓ) and the activation function ϕ_ℓ applied componentwise. Replica formula-We shall work in the asymptotic high-dimensional statistics regime where all α̃_ℓ ≡ n_ℓ/n_0 are of order one while n_0 → ∞, and make the important assumption that all matrices W_ℓ are orthogonally-invariant random matrices independent from each other; in other words, each matrix W_ℓ ∈ R^{n_ℓ × n_{ℓ−1}} can be decomposed as a product of three matrices, W_ℓ = U_ℓ S_ℓ V_ℓ, where U_ℓ ∈ O(n_ℓ) and V_ℓ ∈ O(n_{ℓ−1}) are independently sampled from the Haar measure, and S_ℓ is a diagonal matrix of singular values. The main technical tool we use is a formula for the entropies of the hidden variables, H(T_ℓ) = −E_{T_ℓ} ln P_{T_ℓ}(t_ℓ), and the mutual information between adjacent layers, I(T_ℓ; T_{ℓ−1}) = H(T_ℓ) + E_{T_ℓ,T_{ℓ−1}} ln P_{T_ℓ|T_{ℓ−1}}(t_ℓ|t_{ℓ−1}), based on the heuristic replica method [10,11,6,7]: Claim 1 (Replica formula). Assume model (1) with L layers in the high-dimensional limit with componentwise activation functions and weight matrices generated from the ensemble described above, and denote by λ_{W_k} the eigenvalues of W_k^⊤ W_k.
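The generative process of model (1) and the orthogonally-invariant weight ensemble are straightforward to sample. The NumPy sketch below is purely illustrative (it is not the package of [12]; all function names are ours): it draws Haar-distributed orthogonal factors via a sign-corrected QR decomposition and runs the Markov chain with a componentwise activation and optional pre-activation Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    """Sample from the Haar measure on O(n) via a sign-corrected QR
    decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def orth_invariant_weight(n_out, n_in, singular_values):
    """Build W = U S V with U, V Haar-orthogonal and S diagonal,
    as assumed by the replica formula."""
    u, v = haar_orthogonal(n_out), haar_orthogonal(n_in)
    s = np.zeros((n_out, n_in))
    k = min(n_out, n_in)
    s[np.arange(k), np.arange(k)] = singular_values
    return u @ s @ v

def forward_chain(x, weights, phi, noise_std=0.0):
    """Run the Markov chain X -> T_1 -> ... -> T_L: linear map,
    optional pre-activation Gaussian noise, then componentwise phi."""
    layers = []
    t = x
    for w in weights:
        h = w @ t + noise_std * rng.standard_normal(w.shape[0])
        t = phi(h)
        layers.append(t)
    return layers
```

The QR sign correction ensures the orthogonal factor is genuinely Haar-distributed rather than biased by the QR convention.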
Then for any ℓ ∈ {1, . . . , L} the normalized entropy of T_ℓ is given by the minimum among all stationary points of the replica potential (2), which depends on ℓ-dimensional vectors A, V, Ã, Ṽ, and is written in terms of mutual informations I and conditional entropies H of scalar variables as in (3), where α_k = n_k/n_{k−1}, α̃_k = n_k/n_0, ρ_k = ∫ dP_{k−1}(t) t², ρ̃_k = (E_{λ_{W_k}} λ_{W_k}) ρ_k/α_k, and ξ_k ∼ N(0, 1) for k = 0, . . . , ℓ. In the computation of the conditional entropies in (3), the scalar t_k-variables are generated from P(t_0) = P_0(t_0) and from (4), where ξ̃ and ẑ are independent N(0, 1) random variables. Finally, the function F_{W_k}(x) depends on the distribution of the eigenvalues λ_{W_k} following (5). The computation of the entropy in the large-dimensional limit, a computationally difficult task, has thus been reduced to an extremization of a function of 4ℓ variables, which requires evaluating single or bidimensional integrals. This extremization can be done efficiently by means of a fixed-point iteration starting from different initial conditions, as detailed in the Supplementary Material. Moreover, a user-friendly Python package is provided [12], which performs the computation for different choices of prior P_0, activations ϕ_ℓ and spectra λ_{W_ℓ}. Finally, the mutual information between successive layers I(T_ℓ; T_{ℓ−1}) can be obtained from the entropy following the evaluation of an additional bidimensional integral, see Section 1.6.1 of the Supplementary Material.
Our approach in the derivation of (3) builds on recent progress in statistical estimation and information theory for generalized linear models, following the application of methods from the statistical physics of disordered systems [10,11] to communication [13], statistics [14] and machine learning problems [15,16]. In particular, we use advanced mean-field theory [17] and the heuristic replica method [10,6], along with its recent extension to multi-layer estimation [7,8], in order to derive the above formula (3). This derivation is lengthy and thus given in the Supplementary Material. In a related contribution, Reeves [9] proposed a formula for the mutual information in the multi-layer setting, using heuristic information-theoretic arguments. Like ours, it exhibits layer-wise additivity, and the two formulas are conjectured to be equivalent.
Rigorous statement-We recall the assumptions under which the replica formula of Claim 1 is conjectured to be exact: (i) weight matrices are drawn from an ensemble of random orthogonally-invariant matrices, (ii) matrices at different layers are statistically independent and (iii) layers have a large dimension and respective sizes of adjacent layers are such that weight matrices have aspect ratios {α_k, α̃_k}_{k=1}^{ℓ} of order one. While we could not prove the replica prediction in full generality, we stress that it comes with multiple credentials: (i) for a Gaussian prior P_0 and Gaussian distributions P_ℓ, it corresponds to the exact analytical solution when weight matrices are independent of each other (see Section 1.6.2 of the Supplementary Material). (ii) In the single-layer case with a Gaussian weight matrix, it reduces to formula (13) in the Supplementary Material, which has been recently rigorously proven for (almost) all activation functions ϕ [18]. (iii) In the case of Gaussian distributions P_ℓ, it has also been proven for a large ensemble of random matrices [19], and (iv) it is consistent with all the results of the AMP [20,21,22] and VAMP [23] algorithms, and their multi-layer versions [7,8], known to perform well for these estimation problems.
In order to go beyond results for the single-layer problem and heuristic arguments, we prove Claim 1 for the more involved multi-layer case, assuming Gaussian i.i.d. matrices and two non-linear layers: Theorem 1 (Two-layer Gaussian replica formula). Suppose (H1) the input units distribution P_0 is separable and has bounded support; (H2) the activations ϕ_1 and ϕ_2 corresponding to P_1(t_{1,i}|W_{1,i}x) and P_2(t_{2,i}|W_{2,i}t_1) are bounded C² functions with bounded first and second derivatives w.r.t. their first argument; and (H3) the weight matrices W_1, W_2 have Gaussian i.i.d. entries. Then for model (1) with two layers (L = 2) the high-dimensional limit of the entropy verifies Claim 1.
The theorem, which settles the conjecture presented in [7], is proven using the adaptive interpolation method of [24,18] in a multi-layer setting, as first developed in [25]. The lengthy proof, presented in detail in the Supplementary Material, is of independent interest; it adds further credentials to the replica formula and offers a clear direction for further developments. Note that, following the same approximation arguments as in [18], where the proof is given for the single-layer case, hypothesis (H1) can be relaxed to the existence of the second moment of the prior, (H2) can be dropped, and (H3) can be extended to matrices with i.i.d. entries of zero mean, O(1/n_0) variance and finite third moment.

Tractable models for deep learning
The multi-layer model presented above can be leveraged to simulate two prototypical settings of deep supervised learning on synthetic datasets, amenable to the tractable replica computation of entropies and mutual informations.
The first scenario is the so-called teacher-student setting, with i.i.d. input data (see Figure 1, left). Here, we assume that the input x is distributed according to a separable prior distribution P_X(x) = Π_i P_0(x_i), factorized in the components of x, and the corresponding label y is given by applying a mapping x → y, called the teacher. After generating a train and test set in this manner, we perform the training of a deep neural network, the student, on the synthetic dataset. In this case, the data themselves have a simple structure given by P_0.
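As an illustration, a teacher-student dataset with a linear teacher (the setting later used for the learning experiments of Section 3.2) can be generated in a few lines. The helper below is our own sketch, not part of the provided packages; the default sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def teacher_student_data(n_samples, n_in=1500, n_out=4, noise_var=0.01):
    """Synthetic regression dataset in the teacher-student scenario:
    i.i.d. Gaussian inputs (separable prior P_0 = N(0, 1)) and labels
    from a fixed linear teacher, Y = W_teach X + eps."""
    w_teach = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
    x = rng.standard_normal((n_samples, n_in))       # rows are samples
    eps = rng.normal(0.0, np.sqrt(noise_var), size=(n_samples, n_out))
    y = x @ w_teach.T + eps
    return x, y, w_teach
```

With the 1/n_in variance scaling of the teacher weights, the label components have variance close to 1 + noise_var regardless of the input dimension.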
In contrast, the second scenario allows for generative models (see Figure 1, right) that create more structure, and that are reminiscent of the generative-recognition pair of models of a Variational Autoencoder (VAE). A code vector y is sampled from a separable prior distribution P_Y(y) = Π_i P_0(y_i) and a corresponding data point x is generated by a possibly stochastic neural network, the generative model. This setting makes it possible to create input data x featuring correlations, unlike the teacher-student scenario. The studied supervised learning task then consists in training a deep neural net, the recognition model, to recover the code y from x.
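Analogously, here is a minimal sketch of the generative data pipeline with a single-layer linear generative model, following the setup later used in Section 3.3 (including its label convention y = sign(Y_1)). Again, the function name and defaults are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def generative_data(n_samples, n_code=100, n_data=500, noise_var=0.01):
    """Correlated inputs from a single-layer generative model
    X = W_gen Y + eps, with binary labels y = sign(Y_1)."""
    w_gen = rng.normal(0.0, np.sqrt(1.0 / n_code), size=(n_data, n_code))
    codes = rng.standard_normal((n_samples, n_code))  # separable prior P_Y
    eps = rng.normal(0.0, np.sqrt(noise_var), size=(n_samples, n_data))
    x = codes @ w_gen.T + eps
    labels = np.sign(codes[:, 0])   # the rest of the code acts as nuisance
    return x, labels
```

Unlike the teacher-student inputs, the rows of x here have a non-diagonal covariance W_gen W_gen^T + noise_var I, i.e. correlated components.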
In both cases, the chain going from X to any later layer is a Markov chain of the form (1). In the first scenario, model (1) directly maps to the student network. In the second scenario, however, model (1) actually maps to the feed-forward combination of the generative model followed by the recognition model. This shift is necessary to verify the assumption that the starting point (now given by Y) has a separable distribution. In particular, it generates correlated input data X while still allowing for the computation of the entropy of any T_ℓ.
At the start of a neural network training, weight matrices initialized as i.i.d. Gaussian random matrices satisfy the necessary assumptions of the formula of Claim 1: in their singular value decomposition W_ℓ = U_ℓ S_ℓ V_ℓ, the matrices U_ℓ ∈ O(n_ℓ) and V_ℓ ∈ O(n_{ℓ−1}) are typical independent samples from the Haar measure across all layers. To make sure weight matrices remain close enough to independent during learning, we define a custom weight constraint which consists in keeping U_ℓ and V_ℓ fixed while only the matrix S_ℓ, constrained to be diagonal, is updated. The number of parameters is thus reduced from n_ℓ × n_{ℓ−1} to min(n_ℓ, n_{ℓ−1}). We refer to layers following this weight constraint as USV-layers. For the replica formula of Claim 1 to be correct, the matrices S_ℓ from different layers should furthermore remain uncorrelated during the learning. In Section 3, we consider the training of linear networks, for which information-theoretic quantities can be computed analytically, and confirm numerically that with USV-layers the replica-predicted entropy is correct at all times. In the following, we assume that this is also the case for non-linear networks. In Section 3.2 of the Supplementary Material we train a neural network with USV-layers on a simple real-world dataset (MNIST), showing that these layers can learn to represent complex functions despite their restriction. We further note that such a product decomposition is reminiscent of a series of works on adaptive structured efficient linear layers (SELLs and ACDC) [26,27], motivated this time by speed gains, where only diagonal matrices are learned (in these works the matrices U_ℓ and V_ℓ are chosen instead as permutations of Fourier or Hadamard matrices, so that the matrix multiplication can be replaced by fast transforms). In Section 3, we discuss learning experiments with USV-layers on synthetic datasets.
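A USV-layer can be implemented by fixing Haar-sampled U and V at initialization and training only the singular values. The minimal NumPy sketch below is ours (not the Keras implementation of [49]): for a linear USV-layer, the gradient of the loss with respect to the diagonal is dL/ds_k = (U^T ∇_h L)_k (V x)_k, so an SGD step touches only the min(n_out, n_in) trainable entries.

```python
import numpy as np

rng = np.random.default_rng(3)

def haar(n):
    """Haar-orthogonal sample via a sign-corrected QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

class USVLayer:
    """Linear layer W = U S V with U, V fixed Haar-orthogonal matrices;
    only the min(n_out, n_in) diagonal entries of S are trained."""
    def __init__(self, n_out, n_in):
        self.u, self.v = haar(n_out), haar(n_in)
        self.k = min(n_out, n_in)
        self.s = np.ones(self.k)           # trainable singular values

    def forward(self, x):
        self._vx = (self.v @ x)[: self.k]  # cached for the backward pass
        h = np.zeros(self.u.shape[0])
        h[: self.k] = self.s * self._vx
        return self.u @ h

    def sgd_step(self, grad_out, lr=0.05):
        # dL/ds_k = (U^T grad_out)_k * (V x)_k -- only the diagonal moves
        self.s -= lr * (self.u.T @ grad_out)[: self.k] * self._vx
```

Training reduces to calling forward, computing the gradient of the loss w.r.t. the layer output, and applying sgd_step; U and V are never modified.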
While we have defined model (1) as a stochastic model, traditional feed-forward neural networks are deterministic. In the numerical experiments of Section 3, we train and test networks without injecting noise, and only assume a noise model in the computation of information-theoretic quantities. Indeed, for continuous variables the presence of noise is necessary for mutual informations to remain finite (see the discussion of Appendix C in [5]). To obtain H(T_ℓ) and I(T_ℓ; T_{ℓ−1}), we assume at layer ℓ an additive white Gaussian noise of small amplitude injected just before passing through the activation function, while keeping the mapping X → T_{ℓ−1} deterministic. This choice attempts to stay as close as possible to the deterministic neural network, but remains inevitably somewhat arbitrary (see again the discussion of Appendix C in [5]).
Other related works-The strategy of studying neural network models with random weight matrices and/or random data, using methods originating in statistical physics heuristics such as the replica and cavity methods [10], has a long history. Before the deep learning era, this approach led to pioneering results in learning for the Hopfield model [28] and for the random perceptron [29,30,15,16].
Recently, the successes of deep learning, along with the disqualifying complexity of studying real-world problems, have sparked a revived interest in the direction of random weight matrices. Recent results (without aiming at exhaustivity) were obtained on the spectrum of the Gram matrix at each layer using random matrix theory [31,32], on the expressivity of deep neural networks [33], on the dynamics of propagation and learning [34,35,36,37], on the high-dimensional non-convex landscape where the learning takes place [38], and on the universal random Gaussian neural nets of [39].
The information bottleneck theory [1] applied to neural networks consists in computing the mutual information between the data and the learned hidden representations on the one hand, and between the labels and these hidden representations on the other hand [2,3]. A successful training should maximize the information with respect to the labels and simultaneously minimize the information with respect to the input data, preventing overfitting and leading to good generalization. While this intuition suggests new learning algorithms and regularizers [40,41,42,43,44,45,46], we can also hypothesize that this mechanism is already at play in a priori unrelated, commonly used optimization methods, such as simple stochastic gradient descent (SGD). This hypothesis was first tested in practice by [3] on very small neural networks, to allow the entropy to be estimated by binning of the hidden neuron activities. Afterwards, the authors of [5] reproduced the results of [3] on small networks using the continuous entropy estimator of [44], but found that the overall behavior of mutual information during learning is greatly affected when changing the nature of the non-linearities. Additionally, they investigated the training of larger linear networks on i.i.d. normally distributed inputs, where entropies at each hidden layer can be computed analytically for an additive Gaussian noise. The strategy proposed in the present paper allows us to evaluate entropies and mutual informations in non-linear networks larger than in [5,3].

Numerical experiments
We present a series of experiments aiming both at further validating the replica estimator and at leveraging its power in noteworthy applications. A first application, presented in Section 3.1, consists in using the replica formula, in settings where it is proven to be rigorously exact, as a basis of comparison for other entropy estimators. The same experiment also contributes to the discussion of the information bottleneck theory for neural networks by showing how, without any learning, information-theoretic quantities behave differently for different non-linearities. In Section 3.2, we validate the accuracy of the replica formula in a learning experiment with USV-layers, where it is not proven to be exact, by considering the case of linear networks for which information-theoretic quantities can be otherwise computed in closed form. We finally consider, in Section 3.3, a second application testing the information bottleneck theory for large non-linear networks. To this aim, we use the replica estimator to study compression effects during learning.
3.1 Estimators and activation comparisons-Two non-parametric estimators have already been considered by [5] to compute entropies and/or mutual informations during learning. The kernel-density approach of Kolchinsky et al. [44] consists in fitting a mixture of Gaussians (MoG) to samples of the variable of interest and subsequently computing an upper bound on the entropy of the MoG [47]. The method of Kraskov et al. [48] uses nearest-neighbor distances between samples to directly build an estimate of the entropy. Both methods require the computation of the matrix of distances between samples. Recently, [45] proposed a new non-parametric estimator for mutual informations which involves the optimization of a neural network to tighten a bound. It is unfortunately hard to test how these estimators behave in high dimension, as even for a known distribution the computation of the entropy is intractable (#P-complete) in most cases. However, the replica method proposed here is a valuable point of comparison for cases where it is rigorously exact.
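For concreteness, the nearest-neighbor approach of [48] builds on the Kozachenko-Leonenko estimator, which converts k-th nearest-neighbor distances into an entropy estimate without ever forming a density model. The SciPy sketch below is our own simplified version (in nats), not the exact implementation compared against in the experiments.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    """Kozachenko-Leonenko k-nearest-neighbor entropy estimate (nats):
    H ~ psi(n) - psi(k) + log(c_d) + (d/n) * sum_i log(eps_i),
    where eps_i is the distance from sample i to its k-th neighbor
    and c_d is the volume of the d-dimensional unit ball."""
    n, d = samples.shape
    tree = cKDTree(samples)
    # query k+1 neighbors: column 0 is the point itself at distance 0
    eps = tree.query(samples, k=k + 1)[0][:, k]
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))
```

As noted in the text, the cost is dominated by the neighbor queries, which is why such estimators are evaluated on a limited number of samples in high dimension.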
In the first numerical experiment we place ourselves in the setting of Theorem 1: a 2-layer network with i.i.d. weight matrices, where the formula of Claim 1 is thus rigorously exact in the limit of large networks, and we compare the replica results with the non-parametric estimators of [44] and [48]. Note that the requirement for smooth activations (H2) of Theorem 1 can be relaxed (see the discussion below the Theorem). Additionally, non-smooth functions can be approximated arbitrarily closely by smooth functions with equal information-theoretic quantities, up to numerical precision.
We consider a neural network with layers of equal size n = 1000 that we denote X → T_1 → T_2. The input variable components are i.i.d. Gaussian with mean 0 and variance 1. The weight matrix entries are also i.i.d. Gaussian with mean 0; their standard deviation is rescaled by a factor 1/√n and then multiplied by a coefficient σ varying between 0.1 and 10, i.e. around the recommended value for training initialization. To compute entropies, we consider noisy versions of the latent variables where an additive white Gaussian noise of very small variance (σ²_noise = 10⁻⁵) is added right before the activation function, which is also done in the remaining experiments to guarantee that the mutual informations remain finite. The non-parametric estimators [44,48] were evaluated using 1000 samples, as the cost of computing pairwise distances is significant in such high dimension, and we checked that the entropy estimate is stable over independent draws of a sample of such a size (error bars smaller than marker size). On Figure 2, we compare the different estimates of H(T_1) and H(T_2) for different activation functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise linear approximation of the tanh, hardtanh(x) = −1 for x < −1, x for −1 ≤ x ≤ 1, and 1 for x > 1, for which the integrals in the replica formula can be evaluated faster than for the tanh.
In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate when σ is varied, but appear to systematically over-estimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also a multivariate Gaussian and therefore entropies can be directly computed in closed form (exact in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of T_ℓ and augmenting the resulting variance with σ²_noise, as done in [44] (Kolchinsky et al. parametric in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights (linear approx in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to exact). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half of the linear network entropy (1/2 linear approx in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while being restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
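The closed-form baseline ("exact" in the plot legend) follows from the Gaussian log-det formula: for a linear network with Gaussian input and additive Gaussian noise at the measured layer, T_ℓ is Gaussian with covariance M Mᵀ + σ²_noise I, where M = W_ℓ · · · W_1. A small illustrative helper (ours, not from the provided packages):

```python
import numpy as np

def linear_network_entropy(weights, noise_var=1e-5):
    """Closed-form entropy (nats) of T_l = W_l ... W_1 X + xi, with
    Gaussian input X ~ N(0, I) and noise xi ~ N(0, noise_var * I)
    injected at layer l only: T_l is Gaussian, so
    H = 0.5 * logdet(2 * pi * e * Sigma), Sigma = M M^T + noise_var * I."""
    m = weights[0]
    for w in weights[1:]:
        m = w @ m                       # M = W_l ... W_1
    sigma = m @ m.T + noise_var * np.eye(m.shape[0])
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * sigma)
    return 0.5 * logdet
```

The same formula explains the "1/2 linear approx" curve for ReLU: each output coordinate is zeroed with probability 1/2, cutting the log-volume growth rate in half relative to the linear case.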
Figure 2: Entropy of latent variables in stochastic networks X → T_1 → T_2, with equally sized layers n = 1000, inputs drawn from N(0, I_n), weights with i.i.d. N(0, σ²/n) entries, as a function of the weight scaling parameter σ. An additive white Gaussian noise N(0, 10⁻⁵ I_n) is added inside the non-linearity. Left column: linear network. Center column: hardtanh-hardtanh network. Right column: ReLU-ReLU network.

Beyond informing about estimator accuracy, this experiment also unveils a simple but possibly important distinction between activation functions. For the hardtanh activation, as the random weights' magnitude increases, the entropies decrease after reaching a maximum, whereas they only increase for the unbounded activation functions we consider, even for the single-side saturating ReLU. This loss of information for bounded activations was also observed by [5], where entropies were computed by discretizing the output of each neuron into bins of equal size. In this setting, as the tanh activation starts to saturate for large inputs, the extreme bins (at −1 and 1) concentrate more and more probability mass, which explains the information loss. Here we confirm that the phenomenon is also observed when computing the entropy of the hardtanh layers (without binning and with a small noise injected before the non-linearity). We check via the replica formula that the same phenomenology arises for the mutual informations I(X; T_ℓ) (see Section 3.1 of the Supplementary Material).
3.2 Learning experiments with linear networks-In the following, and in Section 3.3 of the Supplementary Material, we discuss training experiments on different instances of the deep learning models defined in Section 2. We seek to study the simplest possible training strategies achieving good generalization. Hence for all experiments we use plain stochastic gradient descent (SGD) with constant learning rates, without momentum and without any explicit form of regularization. The sizes of the training and testing sets are taken equal and scale typically as a few hundred times the size of the input layer. Unless otherwise stated, plots correspond to single runs, yet we checked over a few repetitions that independent runs lead to identical qualitative behaviors. The values of the mutual informations I(X; T_ℓ) are computed by considering noisy versions of the latent variables, where an additive white Gaussian noise of very small variance (σ²_noise = 10⁻⁵) is added right before the activation function, as in the previous experiment. This noise is neither present at training time, where it could act as a regularizer, nor at testing time. Given that the noise is only assumed at layer ℓ, the preceding layer T_{ℓ−1} is a deterministic mapping of the input variable; hence the replica formula yielding mutual informations between adjacent layers gives us directly I(T_ℓ; T_{ℓ−1}) = H(T_ℓ) − H(T_ℓ|T_{ℓ−1}) = H(T_ℓ) − H(T_ℓ|X) = I(T_ℓ; X). We provide a second Python package [49] to implement learning experiments on synthetic datasets in Keras, using USV-layers and interfacing with the first Python package [12] for replica computations.
To start with, we consider the training of a linear network in the teacher-student scenario. The teacher must also be linear to be learnable: we consider a simple single-layer network with additive white Gaussian noise, Y = W_teach X + ε, with input x ∼ N(0, I_n) of size n, teacher matrix W_teach with entries i.i.d. normally distributed as N(0, 1/n), noise ε ∼ N(0, 0.01 I_{n_Y}), and output of size n_Y = 4. We train a student network of three USV-layers plus one fully connected unconstrained layer, X → T_1 → T_2 → T_3 → Ŷ, on the regression task, using plain SGD for the MSE loss ||Ŷ − Y||². We recall that in the USV-layers (7) only the diagonal matrix is updated during learning. On the left panel of Figure 3, we report the learning curve and the mutual informations between the hidden layers and the input in the case where all layers but the output have size n = 1500. Again, this linear setting is analytically tractable and does not require the replica formula; a similar situation was studied in [5]. In agreement with their observations, we find that the mutual informations I(X; T_ℓ) keep on increasing throughout the learning, without compromising the generalization ability of the student. Now, we also use this linear setting to demonstrate (i) that the replica formula remains correct throughout the learning of the USV-layers and (ii) that the replica method gets closer and closer to the exact result in the limit of large networks, as theoretically predicted (2). To this aim, we repeat the experiment for n varying between 100 and 1500, and report the maximum and the mean value of the squared error on the estimation of I(X; T_ℓ) over all epochs of 5 independent training runs. We find that even if errors tend to increase with the number of layers, they remain objectively very small and decrease drastically as the size of the layers increases.
3.3 Learning experiments with deep non-linear networks-Finally, we apply the replica formula to estimate mutual informations during the training of non-linear networks on correlated input data.
We consider a simple single-layer generative model X = W_gen Y + ε, with normally distributed code Y ∼ N(0, I_{n_Y}) of size n_Y = 100, data of size n_X = 500 generated with a matrix W_gen with entries i.i.d. normally distributed as N(0, 1/n_Y), and noise ε ∼ N(0, 0.01 I_{n_X}). We then train a recognition model to solve the binary classification problem of recovering the label y = sign(Y_1), the sign of the first neuron in Y, using plain SGD, but this time to minimize the cross-entropy loss. Note that the rest of the initial code (Y_2, . . . , Y_{n_Y}) acts as noise/nuisance with respect to the learning task. We compare two 5-layer recognition models with 4 USV-layers plus one unconstrained layer, of sizes 500-1000-500-250-100-2, and activations either linear-ReLU-linear-ReLU-softmax (top row of Figure 4) or linear-hardtanh-linear-hardtanh-softmax (bottom row). Because USV-layers only feature O(n) parameters instead of O(n²), we observe that they require more iterations to train in general. In the case of the ReLU network, adding interleaved linear layers was key to successful training with 2 non-linearities, which explains the somewhat unusual architecture proposed. For the recognition model using hardtanh this was actually not an issue (see the Supplementary Material for an experiment using only hardtanh activations); however, we consider a similar architecture for a fair comparison. We further discuss the learning abilities of USV-layers in the Supplementary Material. This experiment is reminiscent of the setting of [3], yet now tractable for networks of larger sizes. For both types of non-linearities, we observe that the mutual information between the input and all hidden layers decreases during the learning, except at the very beginning of training, where we can sometimes observe a short phase of increase (see zoom in insets). For the hardtanh layers this phase is longer and the initial increase is of noticeable amplitude.
In this particular experiment, the claim of [3] that compression can occur during training even with non-double-saturated activations seems corroborated (a phenomenon that was not observed by [5]). Yet we do not observe that the compression is more pronounced in deeper layers, and its link to generalization remains elusive. For instance, we do not see a delay in the generalization w.r.t. training accuracy/loss in the recognition model with hardtanh, despite an initial phase without compression in two layers. Further learning experiments, including a second run of this last experiment, are presented in the Supplementary Material.

Conclusion and perspectives
We have presented a class of deep learning models together with a tractable method to compute the entropy and mutual information between layers. This, we believe, offers a promising framework for further investigations, and to this aim we provide Python packages that facilitate both the computation of mutual informations and the training, for an arbitrary implementation of the model. In the future, extending the proposed formula to allow for biases would improve the fitting power of the considered neural network models.
We observe in our high-dimensional experiments that compression can happen during learning, even when using ReLU activations. While we did not observe a clear link between generalization and compression in our setting, there are many directions to be further explored within the models presented in Section 2. Studying the entropic effect of regularizers is a natural step towards formulating an entropic interpretation of generalization. Furthermore, while our experiments focused on supervised learning, the replica formula derived for multi-layer models is general and can be applied in unsupervised contexts, for instance in the theory of VAEs. On the rigorous side, the greater perspective remains proving the replica formula in the general case of multi-layer models, and further confirming that the replica formula stays true after the learning of the USV-layers. Another question worthy of future investigation is whether the replica method can be used to describe not only entropies and mutual informations for learned USV-layers, but also the optimal learning of the weights itself.

Supplementary Material
Replica formula for the entropy

Background
The replica method [1,2] was first developed in the context of disordered physical systems where the strengths of the interactions $J$ are randomly distributed, $J \sim P_J(J)$. Given the distribution of microstates $x$ at a fixed temperature $\beta^{-1}$, $P(x|\beta, J) = \frac{1}{Z(\beta, J)} e^{-\beta H_J(x)}$, one is typically interested in the average free energy, from which typical macroscopic behavior is obtained. Computing (8) is hard in general, but can be done with the use of specific techniques. The replica method in particular employs the mathematical identity $\mathbb{E}_J \log Z = \lim_{a \to 0} \frac{1}{a} \log \mathbb{E}_J Z^a$. Evaluating the average on the r.h.s. leads, under the replica-symmetry assumption, to an expression of the form $\mathbb{E}_J Z^a = e^{-\beta n a \cdot \mathrm{extr}_q \phi(\beta, q)}$, where $\phi(\beta, q)$ is known as the replica-symmetric free energy and $q$ are order parameters related to macroscopic quantities of the system. We then write $F(\beta) = \mathrm{extr}_q\, \phi(\beta, q)$, so that computing $F$ amounts to solving the saddle-point equations $\nabla_q \phi\,|_{q^*} = 0$.
Computing (8) is of interest in many problems outside of physics [3,4]. Early applications of the replica method in machine learning include the evaluation of the optimal capacity and generalization error of the perceptron [5,6,7,8,9]. More recently it has also been used in the study of problems in telecommunications and signal processing, such as code division multiple access [10] and compressed sensing [11,12,13,14]. For a review of these developments see [15].
These particular examples all share a common probabilistic structure: for fixed $W$, $x \sim P_X$ and $y \sim P_{Y|Z}(\cdot\,|Wx)$; in other words, they are all specific instances of generalized linear models (GLMs), for different choices of $P_{Y|Z}$ and $P_X$. Using Bayes' theorem, one writes the posterior distribution of $x$ as $P(x|W, y) = \frac{1}{P(y|W)} P_{Y|Z}(y|Wx)\, P_X(x)$; the replica method is then employed to evaluate the average log-marginal likelihood $\mathbb{E}_{W,y} \log P(y|W)$, which gives us typical properties of the model. Note this quantity is nothing but minus the entropy of $y$ given $W$, $-H(y|W)$.
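The generative structure of a GLM, and the unnormalized posterior one would like to study, can be sketched in a few lines of NumPy. The dimensions, the noise level and the choice of a Gaussian channel are illustrative, not the ones used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 100                 # input/output dimensions, alpha = m / n
sigma = 0.1                     # output noise level (illustrative value)

W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))   # Gaussian i.i.d. weights

# generative process of the GLM: x ~ P_X, then y ~ P_{Y|Z}(. | W x)
x = rng.normal(size=n)                               # P_X = N(0, 1), i.i.d.
y = W @ x + sigma * rng.normal(size=m)               # Gaussian channel P_{Y|Z}

def log_posterior_unnorm(x_cand):
    """log P_{Y|Z}(y | W x') + log P_X(x'), up to x'-independent constants."""
    log_lik = -0.5 * np.sum((y - W @ x_cand) ** 2) / sigma ** 2
    log_prior = -0.5 * np.sum(x_cand ** 2)
    return log_lik + log_prior
```

The replica method then computes the typical value of the normalization of this posterior, i.e. $\mathbb{E}_{W,y}\log P(y|W)$, in the high-dimensional limit.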
The distribution P J (or P W in the notation above) is usually assumed to be i.i.d. on the elements of the matrix J. However, one can also use the same techniques to approach J belonging to arbitrary orthogonally-invariant ensembles. This approach was pioneered by [16,17,18,19], and in the context of generalized linear models by [20,21,22,23,24,25,26].
Generalizing the analysis of (10) to multi-layer models was first considered by [27] in the context of Gaussian i.i.d. matrices, and by [28,29] for orthogonally-invariant ensembles. In particular, [29] gives an expression for the replica free energy which should in principle be equivalent to the one we present, although its focus is on the derivation of this expression rather than on applications or explicit computations.
Finally, it is worth mentioning that even though the replica method is usually considered to be non-rigorous, its results have been proven to be exact for different classes of models, including GLMs [30,31,32,33,34,35], and are widely conjectured to be exact in general. In fact, in Section 2 we show how to prove the formula in the particular case of two-layer networks with Gaussian matrices.

Single-layer
For a single-layer generalized linear model with $P_X$ and $P_{Y|Z}$ separable in the components of $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, and $W \in \mathbb{R}^{m \times n}$ Gaussian i.i.d., $W_{\mu i} \sim \mathcal{N}(0, 1/n)$, define $\alpha = m/n$ and $\rho = \mathbb{E}_x[x^2]$. Then the entropy of $y$ in the limit $n \to \infty$ is given by the variational formula (13) of [15,35], with $\xi_0, \xi_1$ both normally distributed with zero mean and unit variance (here $Dz$ denotes integration over a standard Gaussian measure). This can be adapted to orthogonally-invariant ensembles by using the techniques described in [22]. Let $W = U S V^T$, where $U$ is orthogonal, $S$ diagonal and arbitrary, and $V$ is Haar distributed. We denote by $\pi_W(\lambda_W)$ the distribution of eigenvalues of $W^T W$, and the second moment of $z = Wx$ by $\tilde\rho = \frac{\mathbb{E}[\lambda_W]}{\alpha}\, \rho$. The entropy is then written in terms of the spectral potential $F_W$. If the matrix is Gaussian i.i.d., $\pi_W(\lambda_W)$ is Marchenko-Pastur and $F_W(A V) = \alpha A V$. Extremizing over $A$ gives $\tilde V = V$, so that (13) is recovered. In this precise case, the formula has been proven rigorously in [35].
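The Gaussian i.i.d. statement above is easy to check numerically: for $W_{\mu i} \sim \mathcal{N}(0, 1/n)$ the mean eigenvalue of $W^T W$ concentrates on $\alpha = m/n$, so that $\tilde\rho = \mathbb{E}[\lambda_W]\rho/\alpha$ reduces to $\rho$. A minimal sketch (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 400, 200
alpha = m / n

W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
lam = np.linalg.eigvalsh(W.T @ W)   # spectrum of W^T W (Marchenko-Pastur-like)

mean_eig = lam.mean()               # concentrates around alpha = m / n
# With z = W x and rho = E[x_i^2], rho_tilde = (E[lambda] / alpha) * rho,
# which therefore reduces to rho_tilde = rho for this Gaussian i.i.d. ensemble.
```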

Multi-layer
Consider the following multi-layer generalized linear model, where the $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are fixed and the index $i$ runs from 1 to $n_\ell$. Using Bayes' theorem we can write the posterior in terms of $W = \{W_\ell\}_{\ell=1,\dots,L}$. Performing posterior inference requires one to evaluate the marginal likelihood, which is in general hard to do. Our analysis employs the framework introduced in [27] to compute the entropy of $t_L$ in the limit $n_0 \to \infty$ with $\tilde\alpha_\ell = n_\ell/n_0$ finite for $\ell = 1, \dots, L$, with the replica potential $\phi$ given by (20) and the $\xi_\ell$ normally distributed with zero mean and unit variance. The $t_\ell$ in the expression above are distributed according to the generative model, where $Dz\,(\cdot) = dz\, \mathcal{N}(z; 0, 1)\,(\cdot)$ denotes integration over the standard Gaussian measure.
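One consequence of this formalism is easy to verify directly: for Gaussian i.i.d. weights of variance $1/n_{\ell-1}$, the pre-activation components are asymptotically Gaussian with variance $\rho_{\ell-1}$, so the layer second moments obey the scalar recursion $\rho_\ell = \mathbb{E}_{z \sim \mathcal{N}(0, \rho_{\ell-1})}[\varphi(z)^2]$; for ReLU this halves the second moment at every layer. A Monte Carlo sketch (sample size and seed are arbitrary):

```python
import numpy as np

def next_rho(rho_prev, phi, n_samples=2_000_000, seed=0):
    """Second moment of t_l = phi(z), z ~ N(0, rho_prev), by Monte Carlo."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, np.sqrt(rho_prev), size=n_samples)
    return np.mean(phi(z) ** 2)

relu = lambda z: np.maximum(z, 0.0)

rho0 = 1.0
rho1 = next_rho(rho0, relu)   # analytically rho0 / 2 for ReLU
rho2 = next_rho(rho1, relu)   # analytically rho0 / 4
```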

A simple heuristic derivation of the multi-layer formula
Formula (20) can be derived using a simple argument. Consider the case $L = 2$, with $t_\ell \in \mathbb{R}^{n_\ell}$ and $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$. For the problem of estimating $t_1$ given the knowledge of $t_2$, we compute $\lim_{n_1 \to \infty} n_1^{-1} H(t_2|W_1, W_2)$ using the replica free energy (14). Moreover, $H(t_1 + \tilde\xi_1/\sqrt{\tilde A_2})$ can be obtained from the replica free energy of another problem: that of estimating $t_0$ given the knowledge of a noisy version of $t_1$, which can again be written using (14), with the noise $\tilde\xi_1$ being integrated in the computation of $H(t_1|\tilde\xi_1)$, see (22). Replacing (26)-(29) in (25) gives our formula (20) for $L = 2$; further repeating this procedure allows one to write the equations for arbitrary $L$.

Formulation in terms of tractable integrals
While expression (20) is more easily written in terms of conditional entropies and mutual informations, evaluating it requires us to state it explicitly in terms of integrals, which we do below. We first consider the Gaussian i.i.d. case. Here, the multi-layer formula was derived with the cavity and replica methods by [27], and we shall use their results. Assuming that $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ with $W_{\ell,\mu i} \sim \mathcal{N}(0, 1/n_{\ell-1})$ and using the replica formalism, Claim 1 from the main text becomes (31), with the replica potential $\phi$ evaluated from (32)-(34). The constants $\alpha_\ell$, $\tilde\alpha_\ell$ and $\rho_\ell$ are defined below, together with the measures over which expectations are computed. We typically pick the likelihoods $P_\ell$ so that $Z_\ell$ can be computed in closed form, which allows for a number of activation functions (linear, probit, ReLU, etc.). However, our analysis is quite general and can be carried out for arbitrary likelihoods, as long as evaluating (33) and (34) is computationally feasible.
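In practice, the $Dz$ terms are one-dimensional Gaussian expectations and can be evaluated by Gauss-Hermite quadrature; a minimal helper (the node count is an arbitrary choice):

```python
import numpy as np

def gauss_mean(f, n_nodes=60):
    """E_{z ~ N(0,1)} f(z) = int Dz f(z), by Gauss-Hermite quadrature.

    hermgauss provides nodes/weights for int e^{-x^2} g(x) dx, so the
    standard-Gaussian expectation is sum_i w_i f(sqrt(2) x_i) / sqrt(pi).
    """
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return (w * f(np.sqrt(2.0) * x)).sum() / np.sqrt(np.pi)

# sanity checks against known moments
m2 = gauss_mean(lambda z: z ** 2)                     # E z^2 = 1
relu2 = gauss_mean(lambda z: np.maximum(z, 0) ** 2)   # E relu(z)^2 = 1/2
```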
Finally, the replica potential above can be generalized to the orthogonally-invariant case using the framework of [22], which we have described in subsection 1.2. If the matrix $W_\ell$ is Gaussian i.i.d., the distribution of eigenvalues of $W_\ell^T W_\ell$ is Marchenko-Pastur and (31) is recovered. Moreover, for $L = 1$, one obtains the replica free energy proposed by [22,23,24].

Recovering the formulation in terms of conditional entropies
One can rewrite the formulas above in a simpler way. By manipulating the measures (36), one obtains a scalar-channel representation for $x \sim P_0(x)$ and $b \sim \mathcal{N}(b; Ax, A)$. Introducing a standard normal variable $\xi_0$ and using the invariance of mutual informations under such reparametrizations, this can be written in terms of mutual informations. Similar manipulations apply to the intermediate layers, where $Dz\,(\cdot) = dz\,\mathcal{N}(z; 0, 1)\,(\cdot)$ denotes integration over the standard Gaussian measure, and to the $K_\ell$ terms. We can then rewrite (32) and, replacing the result in (31), obtain the formulation in terms of conditional entropies.

Solving saddle-point equations
In order to deal with the extremization problem above, one needs to solve the saddle-point equations. In what follows we propose two different methods to do so: a fixed-point iteration, and the state evolution of the ML-VAMP algorithm [28].

Method 1: fixed-point iteration
We first introduce the following function, which is related to the derivatives of $F_W$, where $S_\ell(z) = \mathbb{E}_\lambda\!\left[\frac{1}{\lambda - z}\right]$ is the Stieltjes transform of $W_\ell^T W_\ell$, see e.g. [36]. In our experiments we have evaluated $S_\ell$ approximately by using the empirical distribution of eigenvalues.
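The empirical evaluation of $S_\ell$ mentioned above amounts to a one-line average over sampled eigenvalues; a sketch, checked on a spectrum where the transform is known in closed form:

```python
import numpy as np

def stieltjes(eigs, z):
    """Empirical Stieltjes transform S(z) = E_lambda 1 / (lambda - z),
    estimated from a sample of eigenvalues of W^T W (z outside the spectrum)."""
    return np.mean(1.0 / (eigs - z))

# check on a degenerate spectrum: all eigenvalues equal to 1
# gives S(z) = 1 / (1 - z) exactly, so S(-1) = 0.5
eigs = np.ones(100)
val = stieltjes(eigs, -1.0)
```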
The fixed-point iteration consists in looping through layers $L$ down to 1, first computing the $\vartheta_\ell$ which minimizes (15) and $\tilde V_\ell$, then $A_\ell^{(t+1)}$, which for layers $1 \le \ell \le L-1$ comes from one update equation and, for the $L$-th layer, from another. Finally, we recompute $\vartheta_\ell$ using $A_\ell^{(t+1)}$ and $\tilde A_\ell$, and move on to the next layer. After these quantities are computed for all layers, we compute all $V_\ell^{(t+1)}$, and for the first layer $V_0^{(t+1)}$. This particular order has been chosen so that, if $W_\ell$ is Gaussian i.i.d., $\theta_\ell^{(t)} = A_\ell^{(t)} V_\ell^{(t)}$ and one recovers the state evolution equations of [27]. The set of initial conditions is picked so as to cover the basin of attraction of typical fixed points, and in our experiments the initializations were chosen accordingly.
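The overall structure of such a scheme is that of a damped fixed-point iteration; a generic sketch on a scalar toy equation (the damping value and the test equation are illustrative, not the update rules of the scheme above):

```python
import numpy as np

def fixed_point(g, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Damped fixed-point iteration x <- (1 - damping) * x + damping * g(x).

    Damping trades speed for stability, a common trick when the bare
    iteration x <- g(x) oscillates or diverges.
    """
    x = x0
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# toy saddle-point-style equation x = cos(x)
root = fixed_point(np.cos, 1.0)
```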

Method 2: ML-VAMP state evolution
While the fixed-point iteration above works well in most cases, it is not provably convergent. In particular, it relies on a solution for $\theta_\ell = \psi_\ell(\theta_\ell, A_\ell^{(t)} V_\ell^{(t)})$ being found, which might not happen throughout the iteration. An alternative is to employ the state evolution (SE) of the ML-VAMP algorithm [28], which leads to the same fixed points as the scheme above under certain conditions. Let us first look at the single-layer case, for which the ML-VAMP SE equations are given by (57)-(61). Combining (57) and (61) yields the fixed-point conditions; one can show that these conditions also hold for the above scheme, under a suitable mapping of variables. These equations are easily generalized to the multi-layer case: the equations for $A_z^+$ and $A_x^-$ remain the same, while the equations for $A_x^+$ and $A_z^-$ are modified, see (68) and (69). Note that the quantities in (58), (61), (68) and (69) were already being evaluated in the scheme described in the previous subsection.

Mutual information from entropy
While in our computations we focus on the entropy $H(T_\ell)$, the mutual information $I(T_\ell; T_{\ell-1})$ can be easily obtained from the chain rule $I(T_\ell; T_{\ell-1}) = H(T_\ell) - H(T_\ell|T_{\ell-1})$, where in going from the first to the second line of the corresponding computation we have used the central limit theorem. In particular, if the mapping $X \to T_{\ell-1}$ is deterministic, as typically enforced in the models we use in the experiments, then $I(T_\ell; T_{\ell-1}) = I(T_\ell; X)$.

Equivalence in linear case
In the linear case, $Y = W_L W_{L-1} \cdots W_1 X + \mathcal{N}(0, \Delta)$, our formula reduces to the min-extr problem of [22,25,37], involving the integrated R-transform, with $\lambda$ the eigenvalues of the product matrix. If $P_0$ is Gaussian, then $I(x; x + \xi/\sqrt{A}) = \frac{1}{2}\log(1 + A)$; extremizing over $A$ and $V$ then gives a closed-form expression in terms of the Stieltjes transform $S(z)$ of $W^T W$. This same result can be obtained analytically with much less effort, since in this case $P_Y(y) = \mathcal{N}(y; 0, \Delta I_m + W W^T)$.
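The closed-form Gaussian computation mentioned in the last sentence is easy to implement and provides a useful check of the replica result in the linear case; a sketch (layer sizes and $\Delta$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def linear_mi(Ws, delta):
    """I(Y; X) in nats for Y = W_L ... W_1 X + N(0, delta I), X ~ N(0, I_n).

    Uses P_Y = N(0, delta I + W W^T) with W the product of the layer
    matrices, so that I = 0.5 * [logdet(delta I + W W^T) - m log(delta)].
    """
    W = Ws[0]
    for Wl in Ws[1:]:
        W = Wl @ W
    m = W.shape[0]
    cov = delta * np.eye(m) + W @ W.T
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (logdet - m * np.log(delta))

# two-layer linear network
W1 = rng.normal(0.0, 1.0 / np.sqrt(20), size=(15, 20))
W2 = rng.normal(0.0, 1.0 / np.sqrt(15), size=(10, 15))
mi = linear_mi([W1, W2], delta=0.1)

# scalar sanity check: W = [[1]], delta = 1 gives 0.5 * log(2)
mi_scalar = linear_mi([np.array([[1.0]])], delta=1.0)
```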

Proof of the replica formula by the adaptive interpolation method
In this section we prove Theorem 1 (which we restate below more explicitly) using the adaptive interpolation method of [38,35] in a multi-layer setting, as first developed in [39].

Two-layer generalized linear estimation: Problem statement
We give here a generic description of the observation model, a two-layer generalized linear model (GLM). Let $n_0, n_1, n_2 \in \mathbb{N}^*$ and define the triplet $n = (n_0, n_1, n_2)$. Let $P_0$ be a probability distribution over $\mathbb{R}$ and let $X^0_i \sim P_0$, i.i.d., be the components of a signal vector $X^0$. One fixes two functions $\varphi_1 : \mathbb{R} \times \mathbb{R}^{k_1} \to \mathbb{R}$ and $\varphi_2 : \mathbb{R} \times \mathbb{R}^{k_2} \to \mathbb{R}$. Here $Z_\mu \sim \mathcal{N}(0, 1)$ is an additive Gaussian noise, $\Delta > 0$, and $W_1 \in \mathbb{R}^{n_1 \times n_0}$, $W_2 \in \mathbb{R}^{n_2 \times n_1}$ are measurement matrices whose entries are i.i.d. with distribution $\mathcal{N}(0, 1)$. Equivalently, the model can be written in terms of a transition density w.r.t. Lebesgue's measure, see (76). Our analysis uses both representations (75) and (76). The estimation problem is to recover $X^0$ from the knowledge of $Y = (Y_\mu)_{\mu=1}^{n_2}$, $\varphi_1$, $\varphi_2$, $W_1$, $W_2$, $\Delta$ and $P_0$. In the language of statistical mechanics, the random variables $Y, W_1, W_2, X^0, A_1, A_2, Z$ are called quenched variables because, once the measurements are acquired, they have a fixed realization. An expectation taken w.r.t. all quenched random variables appearing in an expression will simply be denoted by $\mathbb{E}$, without subscript. Subscripts are only used when the expectation carries over a subset of the random variables appearing in an expression, or when some confusion could arise.
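The observation model (75) can be sketched directly. Here $\varphi_1$ is taken to be ReLU and $\varphi_2$ the identity for illustration; the theorem allows general $\varphi_{1,2}$ acting componentwise together with auxiliary noise variables $A_1, A_2$, which we omit in this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

n0, n1, n2 = 50, 40, 30
delta = 0.01

X0 = rng.normal(size=n0)               # signal with prior P_0 = N(0, 1)
W1 = rng.normal(size=(n1, n0))         # i.i.d. N(0, 1) entries
W2 = rng.normal(size=(n2, n1))

# hidden layer: X1 = phi_1(W1 X0 / sqrt(n0)); phi_1 = ReLU for illustration
X1 = np.maximum(W1 @ X0 / np.sqrt(n0), 0.0)
# output: Y = phi_2(W2 X1 / sqrt(n1)) + sqrt(delta) Z; phi_2 = identity here
Y = W2 @ X1 / np.sqrt(n1) + np.sqrt(delta) * rng.normal(size=n2)
```

The inference task is then to recover `X0` from `Y` and the matrices, which is what the free entropy analysis below characterizes in the high-dimensional limit.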
Defining the Hamiltonian, the joint posterior distribution of $(x, a_1)$ given the quenched variables $Y, W_1, W_2$ reads, by Bayes' formula, as in (79), with $dP_0(x) = \prod_{i=1}^{n_0} dP_0(x_i)$ being the prior over the signal and $dP_{A_1}(a_1)$ defined analogously. The partition function $Z(Y, W_1, W_2)$ is the normalization constant of this posterior. One introduces a standard statistical mechanics notation for the expectation w.r.t. the posterior (79), the so-called Gibbs bracket $\langle - \rangle$, defined for any continuous bounded function $g$. One important quantity is the associated averaged free entropy (or minus the averaged free energy) $f_n$. It is perhaps useful to stress that $Z(Y, W_1, W_2)$ is nothing else than the density of $Y$ conditioned on $W_1, W_2$; so we have an explicit representation (used later on) with $dY = \prod_{\mu=1}^{n_2} dY_\mu$. Thus $f_n$ is minus the conditional entropy $-H(Y|W_1, W_2)/n_0$ of the measurements. This appendix presents the derivation, thanks to the adaptive interpolation method, of the thermodynamic limit $\lim_{n\to\infty} f_n$ in the high-dimensional regime, namely when $n_0, n_1, n_2 \to +\infty$ such that $n_2/n_1 \to \alpha_2 > 0$ and $n_1/n_0 \to \alpha_1 > 0$. In this high-dimensional regime, the measurement rate satisfies $n_2/n_0 \to \alpha := \alpha_1 \alpha_2$.

Important scalar inference channels
The thermodynamic limit of the free entropy will be expressed in terms of the free entropies of simple scalar inference channels. This "decoupling property" results from the mean-field approach in statistical physics, used throughout the replica method to perform a formal calculation of the free entropy of the model [2,4]. This section presents these three scalar denoising models.
a) The first channel is an additive Gaussian one. Let $r \ge 0$ play the role of a signal-to-noise ratio. Consider the inference problem consisting of retrieving $X_0 \sim P_0$ from the observation $Y_0 = \sqrt{r}\, X_0 + Z_0$, with $Z_0 \sim \mathcal{N}(0, 1)$. The free entropy associated to this channel is just the expectation of the logarithm of the normalization factor of the posterior.

b) The second scalar channel appearing naturally in the problem is linked to $P_{\mathrm{out},2}$ through the following inference model. Suppose that $V, U \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, where $V$ is known, while the inference problem is to recover the unknown $U$ from an observation drawn from $P_{\mathrm{out},2}(\,\cdot\,|\sqrt{q}\, V + \sqrt{\rho - q}\, U)$, where $\rho > 0$ and $q \in [0, \rho]$. The free entropy for this model is again related to the normalization factor of the posterior.

c) The third scalar channel to play a role is linked to the hidden layer. From this last description, it is easy to see that the free entropy for this model is given by a formula similar to (88). Introducing $\delta(\,\cdot - \varphi_1(x, a))$, the Dirac measure centred on $\varphi_1(x, a)$, it can be written explicitly.
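For a concrete instance of the first scalar channel, the free entropy can be reduced to a one-dimensional Gaussian expectation; a sketch assuming a Rademacher prior $P_0 = \mathrm{Unif}\{-1, +1\}$ (an illustrative choice, not a prior used in the experiments):

```python
import numpy as np

def psi_rademacher(r, n_nodes=80):
    """Free entropy of Y0 = sqrt(r) X0 + Z0 for a Rademacher prior.

    The inner integral is closed-form: Z(Y0) = exp(-r/2) cosh(sqrt(r) Y0),
    so psi(r) = -r/2 + E log cosh(sqrt(r) Y0) with Y0 ~ N(0, r + 1),
    evaluated by Gauss-Hermite quadrature.
    """
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    y = np.sqrt(2.0 * (r + 1.0)) * x      # quadrature nodes for N(0, r + 1)
    u = np.sqrt(r) * y
    # log cosh(u) = logaddexp(u, -u) - log 2, numerically stable for large u
    e_logcosh = (w * (np.logaddexp(u, -u) - np.log(2.0))).sum() / np.sqrt(np.pi)
    return -0.5 * r + e_logcosh
```

The free entropy vanishes at zero signal-to-noise ratio and is non-decreasing in $r$, which provides simple sanity checks.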

Replica-symmetric formula and mutual information
Our goal is to prove Theorem 1, which gives a single-letter replica-symmetric formula for the asymptotic free entropy of model (75).

Interpolating estimation problem
The proof of Theorem 1 follows the same steps as the proof of the replica formula for a one-layer GLM in [35]. Let $t \in [0, 1]$ be an interpolation parameter. We introduce an interpolating estimation problem that interpolates between the original problem (76) at $t = 0$ and two analytically tractable problems at $t = 1$.
Let $\{s_{n_0}\}_{n_0 \ge 1} \in (0, 1/2]^{\mathbb{N}^*}$ be a bounded sequence that decreases to 0. This sequence will be explicitly specified later; it defines a sequence of intervals $\mathcal{B}_{n_0} = [s_{n_0}, 2s_{n_0}]^2$. For a fixed $n_0$, we pick a pair $\epsilon = (\epsilon_1, \epsilon_2) \in \mathcal{B}_{n_0}$. Let $q_\epsilon : [0, 1] \to [0, \rho_1(n_0)]$ and $r_\epsilon : [0, 1] \to [0, r_{\max}]$ be two continuous "interpolation functions", whose dependence on $\epsilon$ will also be specified later. We will be interested in functions $r_\epsilon, q_\epsilon$ that satisfy a regularity property: the associated map is a $C^1$-diffeomorphism whose Jacobian is greater than or equal to 1.
Let $S_{t,\epsilon} \in \mathbb{R}^{n_2}$ be the vector with the entries defined below. The inference problem is to estimate both unknowns $X^0$ and $U$ from the knowledge of $V, W_1, W_2$ and two kinds of observations, $Y_{t,\epsilon} = (Y_{t,\epsilon,\mu})_{\mu=1}^{n_2}$ and $Y'_{t,\epsilon} = (Y'_{t,\epsilon,i})_{i=1}^{n_1}$, the "time- and perturbation-dependent" observations, where the added noise is $\mathcal{N}(0,1)$. Define, with a slight abuse of notation, $s_{t,\epsilon,\mu}(x, a_1, u_\mu) \equiv s_{t,\epsilon,\mu}$. We now introduce the interpolating Hamiltonian, which involves $\ln P_{\mathrm{out},2}(Y_\mu|s_{t,\epsilon,\mu})$. It depends on $W_2$ and $V$ through the terms $(s_{t,\epsilon,\mu})_{\mu=1}^{n_2}$, and on $W_1$ through both $(s_{t,\epsilon,\mu})_{\mu=1}^{n_2}$ and the sum over $i \in \{1, \dots, n_1\}$. When the $(t, \epsilon)$-dependent observations (99) are considered, it involves $\ln P_{\mathrm{out},2}(Y_{t,\epsilon,\mu}|s_{t,\epsilon,\mu})$ instead. The corresponding Gibbs bracket $\langle - \rangle_{n,t,\epsilon}$, i.e. the expectation operator w.r.t. the $(t, \epsilon)$-dependent joint posterior distribution of $(x, a_1, u)$ given $(Y_{t,\epsilon}, Y'_{t,\epsilon}, W_1, W_2, V)$, is defined for every continuous bounded function $g$ on $\mathbb{R}^{n_0} \times \mathbb{R}^{n_1 \times k_1} \times \mathbb{R}^{n_2}$ as in (103). In (103), $Du = (2\pi)^{-n_2/2} \prod_{\mu=1}^{n_2} du_\mu\, e^{-u_\mu^2/2}$ is the $n_2$-dimensional standard Gaussian distribution. Finally, the interpolating free entropy is denoted $f_{n,\epsilon}(t)$. Note that the perturbation $\epsilon = (\epsilon_1, \epsilon_2)$ induces only a small change in the free entropy, namely of the order of $s_{n_0}$:

Lemma 1 (Small free entropy variation under perturbation). There exists a constant $C$ such that $|f_{n,\epsilon}(0) - f_{n,\epsilon=(0,0)}(0)| \le C s_{n_0}$ for all $\epsilon \in \mathcal{B}_{n_0}$, uniformly w.r.t. $n$ and the choice of $(q_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$, $(r_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$.
Proof. A simple computation shows that Lemma 11 (see Appendix C.2) applied at $t = 0$ reads $\mathbb{E}\langle \mathcal{L} \rangle_{n,0,\epsilon} = -\frac{1}{2} \mathbb{E}\langle Q \rangle_{n,0,\epsilon}$, with $Q$ being the overlap. Let $u_y(x) := \ln P_{\mathrm{out},2}(y|x)$ and $u'_y(x)$ its $x$-derivative. In a similar fashion to what is done in Appendix B.1, we compute the $\epsilon$-derivative of $f_{n,\epsilon}(0)$. This quantity is bounded uniformly under hypothesis (H2) (see the first part of the proof of Proposition 4). Then, by the mean value theorem, $|f_{n,\epsilon}(0) - f_{n,(0,0)}(0)| \le K\,\|\epsilon\| \le C s_{n_0}$ for appropriate numerical constants $K$ and $C$.
By Jensen's inequality, one also has a lower bound. Put together, these lower and upper bounds give the lemma.
To conclude this section, the interpolating model is such that:
• at $t = 0$, it reduces to the two-layer GLM;
• at $t = 1$, it reduces to one scalar inference channel associated to the term $\Psi_{P_{\mathrm{out},2}}$, plus a one-layer GLM whose free entropy $\tilde f_{n_0,n_1}$ in the thermodynamic limit is already known from [35].

Free entropy variation along the interpolation path
By the fundamental theorem of calculus and (109), (113), the free entropy decomposes as $f_n = f_{n,\epsilon}(0)$ plus the integral of the $t$-derivative of the interpolating free entropy, see (114). Most of the terms that form the potential (92) can already be identified in the expression (114). For the missing terms to appear, the $t$-derivative of the free entropy has to be computed first.
Recall the definitions:
• $u_y(x) := \ln P_{\mathrm{out},2}(y|x)$, and $u'_y(x)$ is its derivative w.r.t. $x$;
• the overlap $Q$.

In Appendix B we show:

Proposition 2 (Free entropy variation). The derivative of the free entropy (105) verifies, for all $t \in (0, 1)$, the identity (115), where $\mathcal{O}_{n_0}(1)$ is a quantity that goes to 0 in the limit $n_0, n_1, n_2 \to +\infty$, uniformly in $t \in [0, 1]$ and $\epsilon \in \mathcal{B}_{n_0}$.

Overlap concentration
An important quantity appearing naturally in the $t$-derivative of the average free entropy is the overlap $Q$ between the hidden output $X^1$ and the sample $x^1 := \varphi_1\!\left(W_1 x/\sqrt{n_0},\, a_1\right)$, where the triplet $(x, u, a_1)$ is sampled from the posterior distribution associated to the Gibbs bracket $\langle - \rangle_{n,t,\epsilon}$. The next proposition mirrors Proposition 4 in [35] and states that the overlap concentrates around its mean.
Note from (92) and (114) that the second summand in (115) is precisely the term needed to obtain the expression of the potential on the r.h.s. of (114). We would like to "cancel" the Gibbs bracket in (115) to prove Theorem 1. One way to cancel this so-called remainder is to choose $q_\epsilon(t) = \mathbb{E}\langle Q \rangle_{n,t,\epsilon}$, around which $Q$ concentrates by Proposition 3. However, $\mathbb{E}\langle Q \rangle_{n,t,\epsilon}$ depends on $t$, on $\int_0^t q_\epsilon(v)\,dv$ and on $\int_0^t r_\epsilon(v)\,dv$. The equation $q_\epsilon(t) = \mathbb{E}\langle Q \rangle_{n,t,\epsilon}$ is therefore a first-order differential equation over $t \mapsto \int_0^t q_\epsilon(v)\,dv$. This will be addressed in detail in the next section. For now we assume that we can take $q_\epsilon(t) = \mathbb{E}\langle Q \rangle_{n,t,\epsilon}$ and prove Proposition 4. Assume that (H1), (H2), (H3) hold, that the interpolation functions $(q_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$, $(r_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ are regular, and that $q_\epsilon(t) = \mathbb{E}\langle Q \rangle_{n,t,\epsilon}$ for all $(t, \epsilon) \in [0, 1] \times \mathcal{B}_{n_0}$. Then the remainder vanishes, where $\mathcal{O}_{n_0}(1)$ is a quantity that vanishes as $n_0 \to \infty$, uniformly w.r.t. the choice of the interpolation functions.
Proof. By the Cauchy-Schwarz inequality, the remainder is bounded by a product of two terms; we first look at the first term of this product. The second summand on the left-hand side is further upper bounded in terms of $\langle u'_{Y_{t,\epsilon,\mu}}(S_{t,\epsilon,\mu})\, u'_{Y_{t,\epsilon,\mu}}(s_{t,\epsilon,\mu}) \rangle^2_{n,t,\epsilon}$. The first inequality follows from the Cauchy-Schwarz inequality, the subsequent equality from the Nishimori identity (Proposition 9 in Appendix A.1), the second inequality from Jensen's inequality, and the last inequality from the Nishimori identity again. Remember that $u_y(x) := \ln P_{\mathrm{out},2}(y|x)$ is itself defined through an integral over $dP_{A_2}(a)$. Putting everything together, we obtain a bound that is uniform w.r.t. the choice of the interpolating functions. With our assumption on $q_\epsilon$ and Proposition 3, the second term in (118) is easily bounded by $C(\varphi_1, \varphi_2, \alpha_1, \alpha_2, S)\, n^{-1/8}$. Therefore the integral of (115) over $(t, \epsilon)$ yields (119), where the $\mathcal{O}_{n_0}(1)$ summand vanishes uniformly w.r.t. the choice of $q_\epsilon$ and $r_\epsilon$. Replacing (119) in (114) leads to the claimed identity.

Lower and upper matching bounds
To end the proof of Theorem 1, one has to go through the following two steps: (i) prove that, under the assumptions (H1), (H2) and (H3), the limit of $f_n$ equals $\sup f_{\mathrm{RS}}(q_0, r_0, q_1, r_1; \rho_0, \rho_1)$; (ii) invert the order of the optimizations on $r_1$ and $q_1$. To tackle (i), we prove that $\liminf_{n_0 \to \infty} f_n$ and $\limsup_{n_0 \to \infty} f_n$ are, respectively, lower and upper bounded by the same quantity $\sup f_{\mathrm{RS}}(q_0, r_0, q_1, r_1; \rho_0, \rho_1)$.
In these proofs we will need the following useful proposition. For $t \in [0, 1]$ and $\epsilon \in \mathcal{B}_{n_0}$, we write $R(t, \epsilon) = (R_1(t, \epsilon), R_2(t, \epsilon))$. $\mathbb{E}\langle Q \rangle_{n,t,\epsilon}$ can be written as a function $\mathbb{E}\langle Q \rangle_{n,t,\epsilon} = F_n(t, R(t, \epsilon))$ of $n$, $t$ and $R(t, \epsilon)$, with $F_n$ a function defined on a suitable domain $\mathcal{D}_n$.

Proposition 5. Let $\mathcal{D}_n^\circ$ be the interior of the set $\mathcal{D}_n$. $F_n$ is a continuous function from $\mathcal{D}_n$ to $[0, \rho_1(n_0)]$, and admits partial derivatives with respect to its second and third arguments on $\mathcal{D}_n^\circ$. These partial derivatives are both continuous and non-negative on $\mathcal{D}_n^\circ$.
Proof. Let us define the function $F_n : \mathcal{D}_n^\circ \to \mathbb{R}$ explicitly. Fix $(t, r_1, r_2) \in \mathcal{D}_n^\circ$. Consider the two sets of observations, one of which is $Y_{t,r_2,\mu} \sim P_{\mathrm{out},2}(\,\cdot\,|S_{t,r_2,\mu})$, with
$S_{t,r_2} := \sqrt{\tfrac{1-t}{n_1}}\, W_2\, \varphi_1\!\left(\tfrac{W_1 X^0}{\sqrt{n_0}},\, A_1\right) + \sqrt{r_2}\, V + \sqrt{\rho_1(n_0)\,t + 2 s_{n_0} - r_2}\; U \in \mathbb{R}^{n_2}.$
Now, in complete analogy with (101), we define the Hamiltonian, with $s_{t,r_2}$ defined similarly for the sample. The Gibbs bracket corresponding to this Hamiltonian is denoted $\langle - \rangle_{n,t,r_1,r_2}$ and is used to define $F_n(t, r_1, r_2) := \mathbb{E}\langle Q \rangle_{n,t,r_1,r_2}$. Now that $F_n$ is defined, its continuity and differentiability properties follow from the standard theorems of continuity and differentiation under the integral sign. The domination hypotheses are easily verified under assumptions (H1), (H2).
The overlap $\mathbb{E}\langle Q \rangle_{n,t,r_1,r_2}$ is related to the minimum mean-squared error (using a Nishimori identity to get the second equality). $r_1$ only appears in the definition of $Y'_{t,r_1} := \sqrt{r_1}\, X^1 + Z'$, where $Z'$ is a standard Gaussian vector.
Then $\mathrm{MMSE}\big(X^1 \,\big|\, Y_{t,r_2}, Y'_{t,r_1}, W_1, W_2, V\big)$ is clearly a non-increasing function of $r_1$, so $F_n(t, r_1, r_2)$ is a non-decreasing function of $r_1$. The only role of $r_2$ is in the generation process of $Y_{t,r_2}$: let $0 \le r_2 \le r'_2 < \rho_1(n_0)\,t + 2 s_{n_0}$ and let $V' \sim \mathcal{N}(0, I_m)$ be independent of everything else. Comparing the two resulting observation models shows that $F_n(t, r_1, r_2) \le F_n(t, r_1, r'_2)$, which proves that $F_n(t, r_1, r_2)$ is a non-decreasing function of $r_2$.
This is one of the assumptions needed to apply Proposition 4. Now we show that the functions $(q_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ and $(r_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ are regular (see Definition 1). Fix $t \in [0, 1]$. $R \equiv R(t, \cdot) : \epsilon \mapsto (R_1(t, \epsilon), R_2(t, \epsilon))$ is the flow of (129); it is thus injective by uniqueness of the solution, and $C^1$ by the regularity properties of $F_n$ (see Proposition 5). The determinant of the Jacobian of the flow is given by the Liouville formula, and is greater than or equal to 1 because $\partial F_n/\partial r_2$ is non-negative (see Proposition 5). Since the Jacobian determinant does not vanish, the local inversion theorem and the injectivity of the flow imply that $R$ is a $C^1$-diffeomorphism. Moreover, since the Jacobian determinant is greater than or equal to 1, the functions $(q_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ and $(r_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ are regular. Proposition 4 can now be applied to this special choice of regular functions, yielding a bound involving $f_{\mathrm{RS}}(q_0, r_0, q_1, r; \rho_0, \rho_1(n_0))$; by a continuity argument this extends to $f_{\mathrm{RS}}(q_0, r_0, q_1, r; \rho_0, \rho_1)$.

Plugging this back in, the upper bound (128) follows.

Activations comparison in terms of mutual informations
Here we assume the exact same setting as the one presented in the main text to compare activation functions on a two-layer network with random weights. We compare the mutual information estimated with the proposed replica formula, instead of the entropy behaviors discussed in the main text. As was the case for entropies, we can see that the saturation of the doubly-saturated hardtanh leads to a loss of information for large weights, while the mutual informations are always increasing for linear and ReLU activations.

Learning ability of USV-layers
To ensure that weight matrices remain close enough to independent during learning, we introduce USV-layers, a custom type of weight constraint. We recall that in such layers, weight matrices are decomposed in the manner of a singular value decomposition, $W_\ell = U_\ell S_\ell V_\ell$, with $U_\ell$ and $V_\ell$ drawn from the corresponding Haar measures (i.e. uniformly among the orthogonal matrices of the given size) and kept fixed, while $S_\ell$, constrained to be diagonal, is the only matrix being learned. In the main text, we demonstrate on a linear network that USV-layers ensure that the assumptions necessary to our replica formula are met by the learned matrices in the case of linear networks. Nevertheless, a USV-layer of size $n \times n$ has only $n$ trainable parameters, which implies that it is harder to train than a usual fully connected layer. In practice, we notice that USV-layers tend to require more parameter updates, and that interleaving linear USV-layers to increase the number of parameters between non-linearities can significantly improve the final result of training.
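The constraint can be sketched in a few lines of NumPy: sample $U$ and $V$ once from the Haar measure, keep them fixed, and differentiate the loss only with respect to the diagonal of $S$. The sizes, loss and data below are illustrative choices, not those of the experiments:

```python
import numpy as np

rng = np.random.default_rng(4)

def haar_orthogonal(n, rng):
    """Sample an orthogonal matrix from the Haar measure via QR of a Gaussian."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))      # fix column signs for exact Haar

n_out, n_in = 6, 6
U = haar_orthogonal(n_out, rng)         # fixed, never trained
V = haar_orthogonal(n_in, rng)          # fixed, never trained
s = np.ones(min(n_out, n_in))           # the only trainable parameters

def usv_forward(x, U, s, V):
    """Forward pass of a USV-layer: W x with W = U diag(s) V."""
    return U @ (s * (V @ x))

x = rng.normal(size=n_in)
y = usv_forward(x, U, s, V)

# gradient of the toy loss L = 0.5 ||y - target||^2 w.r.t. s only:
# dL/ds_k = (V x)_k * (U^T (y - target))_k
target = np.zeros(n_out)
grad_s = (V @ x) * (U.T @ (y - target))

# finite-difference check of grad_s[0]
eps = 1e-6
s_pert = s.copy()
s_pert[0] += eps
loss = lambda yy: 0.5 * np.sum((yy - target) ** 2)
fd = (loss(usv_forward(x, U, s_pert, V)) - loss(y)) / eps
```

In a deep-learning framework the same effect is obtained by marking $U$ and $V$ as non-trainable buffers and exposing only the diagonal as a parameter.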
To convince ourselves that the training ability of USV-layers is still relevant for studying learning dynamics on real data, we conduct an experiment on the MNIST dataset. We study the classification problem of the classical MNIST dataset (60 000 training images and 10 000 test images) with a simple fully-connected network featuring one non-linear (ReLU) hidden layer of 500 neurons. On top of the ReLU layer, we place a softmax output layer whose 500 × 10 weight parameters are learned in all versions of the experiment. Conversely, before the ReLU layer, we either (1) do not learn the 784 × 500 parameters at all, which then define random projections, (2) constrain them with one or several stacked USV-layers, or (3) learn them as a standard unconstrained fully-connected layer. Train and test losses and accuracies for the different architectures are given in Table 1, and some learning curves are displayed in Figure 6. As expected, we observe that USV-layers achieve better classification accuracy than the random projections, yet worse than the unconstrained fully connected layer. Interestingly, stacking USV-layers to increase the number of trainable parameters allows the network to reach very good training accuracies; nevertheless, the test accuracies do not benefit to the same extent from these additional parameters. In Figure 6 we can actually see that the version of the experiment with 6 USV-layers overfits the training set (green curves, with test losses growing towards the end of learning). Therefore, particularly in this case, adding regularizers might improve the generalization performance of models with USV-layers.

Additional learning experiments on synthetic data
Similarly to the experiments of the main text, we consider simple training schemes with constant learning rates, no momentum, and no explicit regularization.
We first include a second version of Figure 4 of the main text, corresponding to the exact same experiment with a different random seed, and check that the results are qualitatively identical. We then consider a regression task created by a 2-layer teacher network of sizes 500-3-3, activations ReLU-linear, uncorrelated input data distribution $\mathcal{N}(0, I_{n_X})$, and additive white Gaussian noise of variance 0.01 at the output. The matrices of the teacher network are i.i.d. normally distributed, with variance equal to the inverse of the layer input dimension. We train a student network with 2 ReLU layers of sizes 2500 and 1000, each featuring 5 stacked USV-layers of the same size before the non-linear activation, and with one final fully-connected linear layer. We use plain SGD with a constant learning rate of 0.01 and a batch size of 50. In Figure 8 we plot the mutual informations between the input and each of the 10 effective hidden layers along the training. Except for the very first layer, where we observe a slight initial increase, all mutual informations appear to only decrease during learning, at least at this resolution (i.e. after the first epoch). We thus observe compression even in the absence of double-saturated non-linearities. We further note that in this case we observe an accentuation of the amount of compression with layer depth, as observed by [40] (see the second plot of the first row of Figure 8), which we did not observe in the binary classification experiment presented in the main text. In Figure 9 we reproduce the figure for a different seed.
In a last experiment, we show that merely changing the weight initialization can drastically alter the behavior of mutual informations during training while resulting in identical final training and testing performances. We consider a setting closely related to the classification on correlated data presented in the main text. The generative model is a simple single-layer model $X = W_{\mathrm{gen}} Y + \epsilon$, with a normally distributed code $Y \sim \mathcal{N}(0, I_{n_Y})$ of size $n_Y = 100$, from which data of size $n_X = 500$ are generated with a matrix $W_{\mathrm{gen}}$ whose entries are i.i.d. normally distributed as $\mathcal{N}(0, 1/\sqrt{n_Y})$, and noise $\epsilon \sim \mathcal{N}(0, 0.01\, I_{n_X})$. The recognition model attempts to solve the binary classification problem of recovering the label $y = \mathrm{sign}(Y_1)$, the sign of the first neuron of $Y$. Again the training is done with plain SGD to minimize the cross-entropy loss, and the rest of the initial code $(Y_2, \ldots, Y_{n_Y})$ acts as noise/nuisance with respect to the learning task. On Figure 10 we compare 3 identical 5-layer recognition models with sizes 500-1000-500-250-100-2 and activations hardtanh-hardtanh-hardtanh-hardtanh-softmax. For the model presented in the top row, initial weights were sampled according to $W_{\ell, ij} \sim \mathcal{N}(0, 4/n_{\ell-1})$; for the model of the middle row, $\mathcal{N}(0, 1/n_{\ell-1})$ was used instead; and finally $\mathcal{N}(0, 1/(4 n_{\ell-1}))$ for the bottom row. The first column shows that training is delayed for the weights initialized at smaller values, but eventually catches up and reaches accuracies superior to 0.97 both in training and testing.

Figure 10: Learning and hidden-layer mutual information curves for a classification problem with correlated input data, using 4 USV hardtanh layers and 1 unconstrained softmax layer, from 3 different initializations. Top: initial weights at layer $\ell$ of variance $4/n_{\ell-1}$, best training accuracy 0.999, best test accuracy 0.994. Middle: initial weights at layer $\ell$ of variance $1/n_{\ell-1}$, best training accuracy 0.994, best test accuracy 0.9937. Bottom: initial weights at layer $\ell$ of variance $0.25/n_{\ell-1}$, best training accuracy 0.975, best test accuracy 0.974. The overall direction of evolution of the mutual information can be flipped by a change in weight initialization without drastically changing final performance on the classification task.

Meanwhile, the mutual informations have different initial values for the different weight initializations and follow very different paths: they either decrease during the entire learning, or on the contrary only increase, or actually follow a hybrid path. We further note that it is somewhat surprising that the mutual information increases at all in the first row, if we expect the hardtanh saturation to instead induce compression. Figure 11 presents a second run of the same experiment with a different random seed; the findings are identical.
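For concreteness, the synthetic data generation described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code; function and variable names are ours.

```python
import numpy as np

def make_dataset(n_samples, n_y=100, n_x=500, noise_var=0.01, seed=0):
    """Sketch of the single-layer generative model described in the text:
    X = W_gen Y + eps, with label y = sign(Y_1). Illustrative only."""
    rng = np.random.default_rng(seed)
    # Code vector Y ~ N(0, I_{n_Y}); only its first coordinate carries the label.
    Y = rng.standard_normal((n_samples, n_y))
    # Generative weights: Gaussian entries with the 1/sqrt(n_Y) scaling of the text.
    W_gen = rng.standard_normal((n_x, n_y)) / np.sqrt(n_y)
    # Additive Gaussian noise of variance 0.01 per component.
    eps = np.sqrt(noise_var) * rng.standard_normal((n_samples, n_x))
    X = Y @ W_gen.T + eps
    labels = np.sign(Y[:, 0])  # the remaining code (Y_2, ..., Y_{n_Y}) is nuisance
    return X, labels

X, labels = make_dataset(1000)
```

The recognition network is then trained with SGD on the cross-entropy loss to predict `labels` from `X`, as described above.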
These observed differences and non-trivial behaviors raise numerous questions, and suggest that, within the examined setting, a simple information theory of deep learning remains out of reach.

A.1 The Nishimori identity
Proposition 9 (Nishimori identity). Let $(X, Y) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ be a couple of random variables. Let $k \geq 1$ and let $X^{(1)}, \ldots, X^{(k)}$ be $k$ i.i.d. samples (given $Y$) from the conditional distribution $P(X = \cdot \,|\, Y)$, independently of every other random variable. Let us denote $\langle - \rangle$ the expectation operator w.r.t. $P(X = \cdot \,|\, Y)$ and $\mathbb{E}$ the expectation w.r.t. $(X, Y)$. Then, for all continuous bounded function $g$,
$$\mathbb{E}\big\langle g\big(Y, X^{(1)}, \ldots, X^{(k)}\big)\big\rangle = \mathbb{E}\big\langle g\big(Y, X^{(1)}, \ldots, X^{(k-1)}, X\big)\big\rangle\,.$$
Proof. This is a simple consequence of the Bayes formula. It is equivalent to sample the couple $(X, Y)$ according to its joint distribution, or to sample first $Y$ according to its marginal distribution and then to sample $X$ conditionally on $Y$ from the conditional distribution $P(X = \cdot \,|\, Y)$. Thus the $(k+1)$-tuple $(Y, X^{(1)}, \ldots, X^{(k)})$ is equal in law to $(Y, X^{(1)}, \ldots, X^{(k-1)}, X)$, which proves the identity.
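The identity can be checked numerically on a small discrete toy model. The joint law and test function below are arbitrary illustrative choices; both sides are computed by exact summation for $k = 2$ replicas.

```python
import numpy as np

# Toy check of the Nishimori identity on a discrete model: X, Y in {0, 1} with
# an arbitrary joint law P, two conditionally i.i.d. replicas X^(1), X^(2) from
# P(X = . | Y), and an arbitrary bounded test function g. We verify
#   E< g(Y, X^(1), X^(2)) > = E< g(Y, X^(1), X) >
# exactly, without sampling. All numbers here are illustrative.
P = np.array([[0.1, 0.3],   # P[x, y] = P(X = x, Y = y)
              [0.4, 0.2]])
Py = P.sum(axis=0)          # marginal of Y
Pxy = P / Py                # conditional P(X = x | Y = y); columns sum to 1

g = lambda y, x1, x2: np.cos(y + 2 * x1 + 3 * x2)  # any bounded test function

# Left-hand side: both arguments are replicas from the conditional law.
lhs = sum(Py[y] * Pxy[x1, y] * Pxy[x2, y] * g(y, x1, x2)
          for y in (0, 1) for x1 in (0, 1) for x2 in (0, 1))
# Right-hand side: the last replica is replaced by the true signal X.
rhs = sum(P[x, y] * Pxy[x1, y] * g(y, x1, x)
          for y in (0, 1) for x in (0, 1) for x1 in (0, 1))
print(abs(lhs - rhs))  # zero up to floating-point error
```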
From now on, assume $\rho_0 > 0$. Given $X^0$, one has $(W_1 X^0 / \sqrt{n_0})_1 \sim \mathcal{N}(0, \|X^0\|^2/n_0)$. Therefore the expectation under study takes the form $h(\|X^0\|^2/n_0)$, where $h$ is a function on $(0, +\infty)$; it is easily shown to be continuous under (H2) thanks to the dominated convergence theorem. By the strong law of large numbers, $\|X^0\|^2/n_0$ converges a.s. to $\rho_0$; combined with the continuity of $h$, this gives $h(\|X^0\|^2/n_0) \to h(\rho_0)$ a.s. Noticing that $|h(\|X^0\|^2/n_0)| \leq \sup \varphi_1^2$, the dominated convergence theorem gives the convergence of the corresponding expectations.
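The law-of-large-numbers step above, $\|X^0\|^2/n_0 \to \rho_0$ a.s., is easy to illustrate numerically. The bounded prior below (uniform on $[-S, S]$, an instance of hypothesis (H1)) is an illustrative choice.

```python
import numpy as np

# Illustration of the concentration ||X^0||^2 / n_0 -> rho_0 as n_0 grows,
# for a bounded prior as in (H1). The uniform prior is an illustrative choice;
# its second moment is rho_0 = S^2 / 3.
rng = np.random.default_rng(0)
S = 1.0
rho0 = S**2 / 3
vals = {}
for n0 in (100, 10_000, 1_000_000):
    X0 = rng.uniform(-S, S, size=n0)          # i.i.d. entries, bounded support
    vals[n0] = np.linalg.norm(X0)**2 / n0     # empirical second moment
print(vals)  # approaches rho0 = 1/3 as n0 grows
```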

A.3 Properties of the third scalar channel
Proposition 10. Assume $\varphi_1$ is bounded (as is the case under (H2)). Let $V, U \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Then $\Psi_{\varphi_1}(q_0, \cdot\,;\rho_0)$ is twice differentiable, convex, non-decreasing and $\frac{\rho_1}{2}$-Lipschitz on $\mathbb{R}_+$, and the function $f$ built from it is convex, non-decreasing and $\frac{\alpha_1 \rho_1}{2}$-Lipschitz.

Proof. For fixed $\rho_0$ and $q_0$, let $\Psi_{\varphi_1} \equiv \Psi_{\varphi_1}(q_0, \cdot\,;\rho_0)$. With the properties imposed on $\varphi_1$, all the domination hypotheses needed to prove the twice-differentiability of $\Psi_{\varphi_1}$ are met. Denote $\langle - \rangle_r$ the expectation operator w.r.t. the joint posterior distribution whose normalization factor is $\mathcal{Z}(Y^0, V)$. Using Gaussian integration by parts and the Nishimori property (Proposition 9), one verifies that the first two derivatives of $\Psi_{\varphi_1}$ are non-negative for all $r \geq 0$; hence $\Psi_{\varphi_1}$ is non-decreasing and convex. The Lipschitz property follows simply from the expression of the first derivative, where the Nishimori identity is used once again to obtain the penultimate equality. Finally, the properties of $f$ are direct consequences of its definition as the "sup inf" of convex, non-decreasing, $\frac{\rho_1}{2}$-Lipschitz functions.

B Derivative of the averaged interpolating free entropy
This appendix is dedicated to the proof of Proposition 2, i.e. to the computation of the derivative of the interpolating free entropy $f_{n,\epsilon}(t)$ with respect to the time $t$. First we show that for all $t \in (0, 1)$ the derivative is given by (132), up to a remainder term $A_{n,\epsilon}(t)$ defined in (133), which involves the ratios $P''_{\mathrm{out},2}(Y_{t,\epsilon,\mu}\,|\,S_{t,\epsilon,\mu}) / P_{\mathrm{out},2}(Y_{t,\epsilon,\mu}\,|\,S_{t,\epsilon,\mu})$ summed over $\mu = 1, \ldots, n_2$; here $P'_{\mathrm{out},2}(y|x)$ and $P''_{\mathrm{out},2}(y|x)$ denote the first and second $x$-derivatives of $P_{\mathrm{out},2}(y|x)$. Once this is done, we prove that $A_{n,\epsilon}(t)$ goes to 0 uniformly in $t \in [0, 1]$ as $n_0, n_1, n_2 \to +\infty$ (while $n_1/n_0 \to \alpha_1$ and $n_2/n_1 \to \alpha_2$), thus proving Proposition 2.
B.1 Computing the derivative: proof of (132)

Recall definition (105). Once written as a function of the interpolating Hamiltonian (102), it becomes the expression below. Here, and from now on, we drop the dependence on $(t, \epsilon)$ when writing $Y$ and $Y'$, as they are dummy variables over which the integration is performed. We will need the Hamiltonian $t$-derivative $\mathcal{H}'_{t,\epsilon}$, given below. The derivative of the interpolating free entropy for $0 < t < 1$ thus reads as follows, where $\mathcal{Z}_{n,t,\epsilon} \equiv \mathcal{Z}_{n,t,\epsilon}(Y, Y', W_1, W_2, V)$ is defined in (104). In the remainder of this subsection B.1, to lighten notations, the second argument of the function $\varphi_1$ will be omitted (except on a few occasions); namely, we will write, for $i = 1, \ldots, n_1$, the shorthand below. This does not hurt the understanding of the derivation of (132), as the latter relies on integration by parts w.r.t. the Gaussian random variables $W_1, W_2, V, U, Z'$.
Let us first compute $T_1$. For $1 \leq \mu \leq n_2$ one has, from (98), the expression below. By Gaussian integration by parts w.r.t. $(W_2)_{\mu i}$, $1 \leq i \leq n_1$, the first expectation becomes the following. In the last equality we used the identity
$$u''_{Y_\mu}(x) + u'_{Y_\mu}(x)^2 = \frac{P''_{\mathrm{out},2}(Y_\mu|x)}{P_{\mathrm{out},2}(Y_\mu|x)}$$
and the definition of the overlap $Q$. Now we turn our attention to the second expectation on the right-hand side of (137), using again Gaussian integration by parts, but this time w.r.t. the remaining Gaussian variables.
It remains to put the term $\mathbb{E}\langle \varphi_1 \cdots \rangle$ in a nicer form, as seen from (135). After taking the sum over $i \in \{1, \ldots, n_1\}$, we get the expression below. Therefore, for all $t \in (0, 1)$, the derivative takes the stated form. To obtain (132), we have to show that $T_2$ is zero. The Nishimori identity (see Proposition 9) gives the relation below. Performing the same integrations by parts as the ones leading to (140), we obtain the next identity; the last equality follows from a computation in the next section, see (149). The combination of (144), (145) and (146) shows that $T_2 = 0$.
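The Gaussian integration by parts used throughout this subsection is the identity $\mathbb{E}[Z f(Z)] = \mathbb{E}[f'(Z)]$ for $Z \sim \mathcal{N}(0, 1)$ and smooth $f$. A quick numerical check by deterministic quadrature, with the illustrative choice $f = \tanh$:

```python
import numpy as np

# Sanity check of Gaussian integration by parts: E[Z f(Z)] = E[f'(Z)],
# verified for f = tanh by the trapezoidal rule on a wide grid (the Gaussian
# tails beyond |z| = 10 are negligible).
z = np.linspace(-10.0, 10.0, 200_001)
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

def gauss_mean(y):
    """Trapezoidal-rule approximation of E[y(Z)] over the grid."""
    integrand = y * phi
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(z)) / 2)

lhs = gauss_mean(z * np.tanh(z))       # E[Z f(Z)]
rhs = gauss_mean(1 - np.tanh(z)**2)    # E[f'(Z)], with f'(z) = sech^2(z)
print(lhs, rhs)                        # the two values agree to high precision
```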
B.2 Proof that $A_{n,\epsilon}(t)$ vanishes uniformly as $n_0 \to +\infty$

The last step to prove Proposition 2 is to show that $A_{n,\epsilon}(t)$ (see definition (133)) vanishes uniformly in $t \in [0, 1]$ as $n_0 \to +\infty$, under conditions (H1)-(H2)-(H3). First we show (147). Once this is done, we use the fact that $n_0^{-1} \ln \mathcal{Z}_{n,t,\epsilon}$ concentrates around $f_{n,\epsilon}(t)$ to prove that $A_{n,\epsilon}(t)$ vanishes uniformly. Start by noticing the simple fact that $\int P''_{\mathrm{out},2}(y|s)\,\mathrm{d}y = 0$ for all $s \in \mathbb{R}$ (differentiate $\int P_{\mathrm{out},2}(y|s)\,\mathrm{d}y = 1$ twice w.r.t. $s$). Consequently, for every $\mu \in \{1, \ldots, n_2\}$, the tower property of the conditional expectation gives
$$\mathbb{E}\bigg[\sum_{\mu=1}^{n_2} \frac{P''_{\mathrm{out},2}(Y_{t,\epsilon,\mu}\,|\,S_{t,\epsilon,\mu})}{P_{\mathrm{out},2}(Y_{t,\epsilon,\mu}\,|\,S_{t,\epsilon,\mu})}\bigg] = 0\,,$$
which implies (147). Using successively (147) and the Cauchy-Schwarz inequality, we can bound $A_{n,\epsilon}(t)$. Making use of the tower property of the conditional expectation once more, one obtains that, conditionally on $S_{t,\epsilon}$, the random variables $\big(P''_{\mathrm{out},2}(Y_{t,\epsilon,\mu}|S_{t,\epsilon,\mu}) / P_{\mathrm{out},2}(Y_{t,\epsilon,\mu}|S_{t,\epsilon,\mu})\big)_{1 \leq \mu \leq n_2}$ are i.i.d. and centered; the variance of their sum is therefore the sum of their variances. Under condition (H2), it is not difficult to show that there exists a constant $C > 0$ such that (153) holds. Combining (153), (152) and (151), we obtain (154). It remains to prove that $\mathrm{Var}(\|X^1\|^2)/n_1$ is bounded, where we recall that $X^1 := \varphi_1(W_1 X^0 / \sqrt{n_0}, A_1)$. To do so, we use the law of total variance
$$\mathrm{Var}\big(\|X^1\|^2\big) = \mathbb{E}\big[\mathrm{Var}\big(\|X^1\|^2 \,\big|\, X^0\big)\big] + \mathrm{Var}\big(\mathbb{E}\big[\|X^1\|^2 \,\big|\, X^0\big]\big)$$
and show that both terms on the right-hand side are bounded. First consider the term $\mathbb{E}[\mathrm{Var}(\|X^1\|^2\,|\,X^0)]$: conditionally on $X^0$, the random variables $(X^1_i)_{1 \leq i \leq n_1}$ are i.i.d., and under (H2) the expectation $\mathbb{E}[\varphi_1^4(\cdot)]$ is bounded, so this first term is $O(n_1)$. For the second term, note that $\mathbb{E}[\|X^1\|^2\,|\,X^0] = n_1\, g(X^0)$, where
$$g(\mathbf{c}) := \mathbb{E}\,\varphi_1^2\Big(\big(W_1 \mathbf{c}/\sqrt{n_0}\big)_1, A_{1,1}\Big)$$
for any $n_0$-dimensional real vector $\mathbf{c} = (c_1, \ldots, c_{n_0})$. The partial derivatives of $g$ satisfy, for $1 \leq j \leq n_0$, the bound (159), obtained by integrating by parts w.r.t. $(W_1)_{1j}$. The prior $P_0$ has bounded support $\mathcal{X} \subseteq [-S, S]$ under hypothesis (H1). Then, for every $\mathbf{c} \in \mathcal{X}^{n_0}$, we have $|\partial g / \partial c_j| \leq C/n_0$ for some constant $C > 0$; here hypothesis (H2) was used to bound the expectation in (159). Thus the function $g$ satisfies the bounded difference property: for all $j \in \{1, \ldots, n_0\}$,
$$\sup_{\mathbf{c} \in \mathcal{X}^{n_0},\, c'_j \in \mathcal{X}} \big|g(c_1, \ldots, c_j, \ldots, c_{n_0}) - g(c_1, \ldots, c'_j, \ldots, c_{n_0})\big| \leq \frac{C'}{n_0}\,.$$
Applying Proposition 12 (see Appendix C.1), it follows that $\mathrm{Var}(g(X^0)) \leq C'^2/(4 n_0)$. This ends the proof of the boundedness of $\mathrm{Var}(\|X^1\|^2)/n_1$ in the limit $n_0 \to +\infty$. Combining (150), (154) and the above, we obtain (164) for $n_0$ large enough, with $K > 0$ a constant independent of both $t$ and $\epsilon$. The uniform convergence of $A_{n,\epsilon}(t)$ to 0 then follows from (164) and Theorem 2 in Appendix C.1, which shows that $\mathbb{E}[(n_0^{-1} \ln \mathcal{Z}_{n,t,\epsilon} - f_{n,\epsilon}(t))^2]$ vanishes uniformly in $t$ as $n_0 \to +\infty$.
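The variance decomposition used above is the law of total variance, $\mathrm{Var}(A) = \mathbb{E}[\mathrm{Var}(A|B)] + \mathrm{Var}(\mathbb{E}[A|B])$. It can be checked exactly on a small discrete example; the joint law and values below are arbitrary.

```python
import numpy as np

# Exact check of the law of total variance on a discrete toy joint law
# (illustrative values): Var(A) = E[Var(A|B)] + Var(E[A|B]).
P = np.array([[0.15, 0.25],   # P[a, b] over a in {0, 1}, b in {0, 1}
              [0.35, 0.25]])
a_vals = np.array([1.0, 3.0]) # the values taken by A
Pb = P.sum(axis=0)            # marginal of B
Pab = P / Pb                  # conditional P(A = a | B = b); columns sum to 1

EA   = (a_vals[:, None] * P).sum()              # E[A]
VarA = ((a_vals[:, None] - EA)**2 * P).sum()    # Var(A), computed directly

E_cond = (a_vals[:, None] * Pab).sum(axis=0)               # E[A | B = b]
V_cond = ((a_vals[:, None] - E_cond)**2 * Pab).sum(axis=0) # Var(A | B = b)
total = (Pb * V_cond).sum() + (Pb * (E_cond - EA)**2).sum()
print(VarA, total)  # the two quantities coincide
```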

C.1 Concentration of the free entropy
In this section, we prove that the free entropy of the interpolation model studied in Section 2.4 concentrates around its expectation (uniformly in $t$ and $\epsilon$), i.e. we prove Theorem 2 stated below. $C(\varphi_1, \varphi_2, \alpha_1, \alpha_2, S)$ will denote any generic positive constant depending only on $\varphi_1, \varphi_2, \alpha_1, \alpha_2, S$. Remember that $S$ is a bound on the absolute values of the signal. It is also understood that the dimensions $n_0, n_1, n_2$ are large enough so that $n_1/n_0 \approx \alpha_1$ and $n_2/n_1 \approx \alpha_2$.
The whole proof relies on two important variance bounds that are recalled below. The reader can refer to [41] (Chapter 3) for detailed proofs of these statements.
Proposition 11 (Gaussian Poincaré inequality). Let $U = (U_1, \ldots, U_N)$ be a vector of $N$ independent standard normal random variables. Let $g : \mathbb{R}^N \to \mathbb{R}$ be a continuously differentiable function. Then
$$\mathrm{Var}\big(g(U)\big) \leq \mathbb{E}\big[\|\nabla g(U)\|^2\big]\,.$$
Proposition 12. Let $\mathcal{U} \subseteq \mathbb{R}$. Let $g : \mathcal{U}^N \to \mathbb{R}$ be a function that satisfies the bounded difference property, i.e., there exist constants $c_1, \ldots, c_N \geq 0$ such that, for all $1 \leq i \leq N$,
$$\sup_{u_1, \ldots, u_N \in \,\mathcal{U},\; u'_i \in \,\mathcal{U}} \big|g(u_1, \ldots, u_i, \ldots, u_N) - g(u_1, \ldots, u'_i, \ldots, u_N)\big| \leq c_i\,.$$
Let $U = (U_1, \ldots, U_N)$ be a vector of $N$ independent random variables that take values in $\mathcal{U}$. Then
$$\mathrm{Var}\big(g(U)\big) \leq \frac{1}{4} \sum_{i=1}^N c_i^2\,.$$
The order in which the concentrations are proved matters. First, we show the concentration w.r.t. the Gaussian variables $Z, Z', V, U, W_2$ and w.r.t. $A_2$, thanks to the classical Gaussian Poincaré inequality and a bounded difference argument, respectively. This first part is performed working conditionally on $W_1, X^1$. Then the concentration w.r.t. $W_1$ and $X^1 := \varphi_1(W_1 X^0 / \sqrt{n_0}, A_1)$ is obtained by proving the concentration w.r.t. $X^1$, $W_1$ and $X^0$, in this order.
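Both variance bounds can be illustrated by a quick Monte Carlo experiment. The test functions below are arbitrary smooth/bounded choices, unrelated to the interpolation model.

```python
import numpy as np

# Monte Carlo illustration of the two variance bounds recalled above.
rng = np.random.default_rng(0)
N, M = 50, 100_000
U = rng.standard_normal((M, N))  # M i.i.d. copies of U in R^N

# Gaussian Poincare (Proposition 11) with g(u) = mean_i tanh(u_i):
#   Var(g(U)) <= E[||grad g(U)||^2],  (grad g(u))_i = (1 - tanh(u_i)^2) / N.
var_g   = np.tanh(U).mean(axis=1).var()
bound_g = (((1 - np.tanh(U)**2)**2).sum(axis=1) / N**2).mean()
print(var_g, "<=", bound_g)

# Bounded difference (Proposition 12) with g(u) = mean_i clip(u_i, -1, 1):
# changing one coordinate moves g by at most c_i = 2/N, hence the bound
# (1/4) * sum_i c_i^2 = 1/N.
var_h   = np.clip(U, -1, 1).mean(axis=1).var()
bound_h = 1.0 / N
print(var_h, "<=", bound_h)
```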
C.1.1 Concentration conditionally on $W_1$, $X^1$

In all of this subsection, we work conditionally on $W_1, X^1$ and we prove that $n_0^{-1} \ln \mathcal{Z}_{n,t,\epsilon}$ is close to its expectation w.r.t. all the other random variables, namely $Z, Z', V, U, W_2$ and $A_2$; this is the content of Lemma 3, which follows simply from Lemmas 4, 5 and 6 proven below.
Proof. Here $g = n_0^{-1} \ln \mathcal{Z}_{n,t,\epsilon}$ is seen as a function of $Z, Z'$ only, and we work conditionally on all the other random variables, i.e. $V, U, W_2, A_2, X^1, W_1$. The squared norm of the gradient of $g$ reads as below. Each of these partial derivatives is of the form $\partial g = -n_0^{-1} \langle \partial \mathcal{H}_{t,\epsilon} \rangle_{\mathcal{H}_{t,\epsilon}}$, where the Gibbs bracket $\langle - \rangle_{\mathcal{H}_{t,\epsilon}}$ pertains to the effective Hamiltonian (171). Computing these brackets and replacing in (176), one gets $\|\nabla g\|^2 \leq 4\, n_0^{-1}$ almost surely. Taking the expectation in (177) gives the lemma.
Proof. Here $g = n_0^{-1}\, \mathbb{E}[\ln \mathcal{Z}_{n,t,\epsilon} \,|\, V, U, W_2, A_2, X^1, W_1]$ is seen as a function of $V, U, W_2$ only, and we work conditionally on all the other random variables, i.e. $A_2, X^1, W_1$. From now on, and until the end of the proof, we denote by $\tilde{\mathbb{E}}[\,\cdot\,]$ the conditional expectation $\mathbb{E}[\,\cdot\, | V, U, W_2, A_2, X^1, W_1]$. The same inequality holds for $\partial g / \partial U_\mu$. To compute the derivative w.r.t. $(W_2)_{\mu i}$, first remark the identity below. Putting these inequalities together, one ends up with the stated bound. The lemma then follows once again from Proposition 11.
Next, the variance bound of Proposition 12 is applied to show the concentration of $n_0^{-1}\, \mathbb{E}[\ln \mathcal{Z}_{n,t,\epsilon} \,|\, A_2, X^1, W_1]$ w.r.t. $A_2$.
Proof. Here $g = n_0^{-1}\, \mathbb{E}[\ln \mathcal{Z}_{n,t,\epsilon} \,|\, W_1, X^0]$ is seen as a function of $W_1$ only, and we work conditionally on $X^0$. To lighten the equations, we write $\tilde{\mathbb{E}}[\,\cdot\,]$ in place of $\mathbb{E}[\,\cdot\, | W_1, X^0]$, and the second argument of $\varphi_1$ and $\varphi_2$ is not written explicitly. The partial derivative of $g$ w.r.t. $(W_1)_{ij}$ reads as below. In a similar fashion to what is done in the previous proofs, the absolute value of the second term in this partial derivative can be upper-bounded by
$$\frac{\sqrt{r_{\max}}}{n_0^{3/2}} \Big( 2 \sqrt{r_{\max}}\, \sup|\varphi_1| + \sqrt{\tfrac{2}{\pi}} \cdot 2S\, \sup|\varphi_1| \Big)\,.$$
As mentioned before, this implies Theorem 2 thanks to (169).

C.2 Concentration of the overlap
This section presents the proof of Proposition 3. The proof is essentially the same as the one provided for the one-layer GLM in [35]. All along this section, $t \in [0, 1]$ is fixed, and the averaged free entropy is treated as a mapping $(R_1, R_2) \mapsto f_{n,\epsilon}(t)$ of $R_1 = R_1(t, \epsilon)$ and $R_2 = R_2(t, \epsilon)$, given by (97). The same is true for the free entropy corresponding to a realization of the quenched variables:
$$(R_1, R_2) \mapsto F_{n,\epsilon}(t) := \frac{1}{n_0} \ln \mathcal{Z}_{n,t,\epsilon}\big(Y_{t,\epsilon}, Y'_{t,\epsilon}, W_1, W_2, V\big)\,.$$
Define $x^1 := \varphi_1(W_1 x / \sqrt{n_0}, a^1)$, where the triplet $(x, u, a^1)$ is sampled from the posterior distribution associated with the Gibbs bracket $\langle - \rangle_{n,t,\epsilon}$. Hence $x^1$ is nothing but a sample from the conditional distribution of $X^1$ given $(Y_{t,\epsilon}, Y'_{t,\epsilon}, W_1, W_2, V)$. Let $\mathcal{L} := \frac{1}{n_1}(\cdots)$. First we prove a formula that we use extensively, in particular in Lemma 1.

Proof. For any realization of the quenched variables, the derivatives of the free entropy read
$$\frac{\mathrm{d}F_{n,\epsilon}(t)}{\mathrm{d}R_1} = -\frac{n_1}{n_0} \langle \mathcal{L} \rangle_{n,t,\epsilon} - \frac{n_1}{2 n_0}(\cdots)\,.$$
Averaging both identities with respect to the quenched disorder gives
$$\frac{\mathrm{d}f_{n,\epsilon}(t)}{\mathrm{d}R_1} = -\frac{n_1}{n_0}\, \mathbb{E}\langle \mathcal{L} \rangle_{n,t,\epsilon} - \frac{n_1}{2 n_0}\, \rho_1^{(n_0)} = -\frac{n_1}{2 n_0}\Big( \rho_1^{(n_0)} - \mathbb{E}\langle Q \rangle_{n,t,\epsilon} \Big)\,,$$
where we made use of $\sum_{i=1}^{n_1} \mathbb{E}\langle (x^1_i)^2 \rangle_{n,t,\epsilon} = \mathbb{E}[\|X^1\|^2]$ and $R_1 \geq 1$. By assumption, $(q_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ and $(r_\epsilon)_{\epsilon \in \mathcal{B}_{n_0}}$ are regular. Therefore $R : (\epsilon_1, \epsilon_2) \mapsto (R_1(t, \epsilon), R_2(t, \epsilon))$ is a $\mathcal{C}^1$-diffeomorphism whose Jacobian $J_R$ verifies $J_R(\epsilon) \geq 1$ for all $\epsilon \in \mathcal{B}_{n_0}$. Integrating over $\epsilon \in \mathcal{B}_{n_0}$, we obtain