Reexamining the principle of mean-variance preservation for neural network initialization

Before backpropagation training, it is common to randomly initialize a neural network so that the mean and variance of activity are uniform across neurons. Classically these statistics were defined over an ensemble of random networks. Alternatively, they can be defined over a random sample of inputs to the network. We show analytically and numerically that these two formulations of the principle of mean-variance preservation are very different in deep networks using the rectification nonlinearity (ReLU). We numerically investigate training speed after data-dependent initialization of networks to preserve sample mean and variance.


I. INTRODUCTION
The procedure used to initialize the weights and biases of a neural network has a large impact on backpropagation training dynamics, in some cases determining whether the network will train at all [1]. One rule of thumb, dating back to the 1990s, is to set the biases to zero and randomly initialize the weights so that neural activities (or preactivities) have the same mean and variance throughout the network [2]. This is known as "Kaiming initialization" for networks using the rectification nonlinearity [1]. This nonlinearity is also known as the "ReLU" nonlinearity and is defined as

$$\mathrm{relu}(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{if } x \leq 0. \end{cases}$$

To illustrate the two formulations of the principle, we generate random networks with Kaiming initialization, present Gaussian white noise input vectors, and visualize the preactivity distributions for single neurons in various layers (Fig. 1). The sample mean of the preactivity for a neuron can be far from zero.
The sample variance of the preactivity is similar for neurons in a layer but becomes smaller as we move deeper into the network. In short, Kaiming initialization does not preserve sample mean and variance. However, if we now randomize over networks, the preactivity mean is zero for all neurons, and the preactivity variance is the same across layers (Fig. 1).
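The Fig. 1 experiment can be sketched in a few lines of NumPy. This is a minimal sketch with toy sizes (width 512, depth 30, 200 inputs, rather than the paper's 1024-wide, 50-layer configuration):

```python
import numpy as np

# Minimal sketch of the Fig. 1 experiment (toy sizes, not the paper's).
rng = np.random.default_rng(0)
N, depth, samples = 512, 30, 200

x = rng.normal(0.0, np.sqrt(0.5), size=(samples, N))    # white noise inputs
sample_vars, sample_mean_sq = [], []
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N))  # Kaiming initialization
    y = x @ W.T                                         # preactivities (biases are zero)
    sample_vars.append(y.var(axis=0).mean())            # layer-average sample variance ~ (v^l)^2
    sample_mean_sq.append((y.mean(axis=0) ** 2).mean()) # spread of sample means ~ (m^l)^2
    x = np.maximum(y, 0.0)                              # ReLU

# Sample variance shrinks with depth while the sample means spread out.
print(sample_vars[0], sample_vars[-1])
print(sample_mean_sq[0], sample_mean_sq[-1])
```

With depth, the per-neuron sample variance decays while the spread of per-neuron sample means grows, even though the total variance is preserved in expectation.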
In the following, we will elucidate the phenomena observed in Fig. 1 by applying the formalism of "signal propagation" [5-7]. The formalism was invented for a superficially different purpose but can be applied to compute the variance of preactivity in the limit of an infinitely wide network. We will compare the analytical result to numerical simulations in finite width networks.
We explore one method to implement the sample mean-variance preservation principle: a data-dependent initialization that, at each layer, sets the bias to subtract out the sample mean of the preactivity and scales the weights to make the sample variance of the preactivity equal unity. We empirically show that this initialization scheme, similar to that of Krähenbühl et al. [8], can improve the training speed of two moderately deep networks (10 and 20 layers) on an object recognition and an image segmentation task. This may account for part of the success of batch normalization and related schemes [9-13] that can be interpreted as extensions of the principle of mean-variance preservation to hold throughout training rather than only at the beginning. Note that most of these results have appeared in preprint form [14].

II. NETWORK ENSEMBLE AND DEFINITIONS
We study fully connected, feedforward networks of uniform width N,

$$y_i^l = \sum_{j=1}^N W_{ij}^l\, x_j^{l-1} + b_i^l, \qquad x_i^l = [y_i^l]_+, \qquad (1)$$

where l is the layer, i = 1, 2, ..., N is the neuron index, and $[u]_+ = \max(u, 0)$ is (half-wave) rectification. We call $y^l$ the vector of preactivities and $x^l$ the vector of activities at layer l.

FIG. 1. Numerical experiments on preactivity distributions in (a) layer 1, (b) layer 5, (c) layer 10, and (d) layer 50. Using Kaiming initialization, we randomly generate a 50-layer-deep, 1024-neuron-wide network with ReLU nonlinearity. Gaussian white noise is input to the network. We display as a solid histogram the distribution of a single preactivation over white noise inputs. This process is repeated for three random networks (green, blue, and orange). We then generate 1024 networks using the same procedure and visualize the distribution of a preactivation over the joint distribution of networks and white noise inputs (red histogram).
We distinguish between two kinds of randomness. First, the input $x^0$ to the network is drawn randomly from some distribution. Second, the weights of the network are drawn independently at random from a zero-mean Gaussian distribution with variance 2/N, and biases are initialized to zero (Kaiming initialization). Averages with respect to the random input will be denoted by angle brackets $\langle\cdot\rangle$ and will be called "sample averages." Averages with respect to the random configuration of weights will be denoted by double angle brackets $\langle\langle\cdot\rangle\rangle$ and called "quenched averages."

The preactivity fluctuates about the sample mean, and the sample mean in turn fluctuates about zero, $y_i^l = (y_i^l - \langle y_i^l \rangle) + \langle y_i^l \rangle$. This leads to the variance decomposition

$$(\sigma^l)^2 = (v^l)^2 + (m^l)^2, \qquad (2)$$

where

$$(\sigma^l)^2 = \langle\langle \langle (y_i^l)^2 \rangle \rangle\rangle, \qquad (v^l)^2 = \langle\langle\, \langle (y_i^l)^2 \rangle - \langle y_i^l \rangle^2 \,\rangle\rangle, \qquad (m^l)^2 = \langle\langle \langle y_i^l \rangle^2 \rangle\rangle \qquad (3)$$

are the "total variance," the quenched mean of the sample variance, and the variance of the sample mean. By symmetry these are the same for all neurons in a layer, so the index i is suppressed on the left-hand sides. Note that the "total mean" of the preactivity vanishes for all neurons in all layers, $\langle\langle \langle y_i^l \rangle \rangle\rangle = 0$, and the quenched average also vanishes, $\langle\langle y_i^l \rangle\rangle = 0$. In contrast, the "sample mean" $\langle y_i^l \rangle$ is typically nonzero, is different for each neuron in a layer (Fig. 1), and depends on the configuration of weights.
In practice, we are interested in the sample variance $\langle (y_i^l)^2 \rangle - \langle y_i^l \rangle^2$ for a frozen or "quenched" configuration of weights. For wide nets, Fig. 1 suggests that the sample variance can be approximated by its quenched average $(v^l)^2$.
In Fig. 1, $\sigma^l$ is the width of the red line histogram, $v^l$ is the average width of a solid histogram, and $m^l$ characterizes the variability in the locations of the solid histograms. Under Kaiming initialization, the total variance $(\sigma^l)^2$ is the same across all layers l; all red line histograms have the same width. In the first layer, the total variance is entirely sample variance, $(\sigma^1)^2 = (v^1)^2$ and $m^1 = 0$. With increasing layer l, the sample variance becomes a smaller and smaller fraction of the total variance, until in the limit $l \to \infty$ the total variance becomes entirely variance of the sample mean, $(m^l)^2$.
Before deriving the preceding with mathematical formalism, we provide an intuitive explanation. It is straightforward to show that rectifying y increases its sample mean and decreases its sample variance (assuming that y has nonzero probability of being negative). Therefore rectification is expected to increase m/v. The other part of Eq. (1) is multiplication by a random matrix. We can show that this operation tends to preserve m/v because $\langle\langle \langle y_i^{l+1} \rangle^2 \rangle\rangle = 2 \langle\langle \langle x_j^l \rangle^2 \rangle\rangle$ and $\langle\langle (y_i^{l+1})^2 \rangle\rangle = 2 \langle\langle (x_j^l)^2 \rangle\rangle$, assuming that the $W_{ij}^{l+1}$ are independent and identically distributed with zero mean and variance 2/N. The essential point is that $y_i^{l+1} = \sum_j W_{ij}^{l+1} x_j^l$ has zero mean when averaged over weights but is typically nonzero when sample averaged, which is why $\langle\langle \langle y_i^{l+1} \rangle^2 \rangle\rangle$ is nonvanishing. Because the networks we examine are repeated applications of random matrix multiplication followed by half-wave rectification, we expect $m^2$ to grow relative to $v^2$ in deeper layers. Note that this claim is specific to ReLU and may not hold for other nonlinear activation functions.

III. CALCULATING $m^l$, $v^l$ IN WIDE NETWORKS
In Eq. (3), the quenched averages are easier to calculate because $y_i^l$ is a linear combination of Gaussian random weights by Eq. (1). The sample averages are difficult because l nonlinearities intervene between $y_i^l$ and the random input. Therefore it is advantageous to perform the quenched average before the sample average, as in $(\sigma^l)^2 = \langle\, \langle\langle (y_i^l)^2 \rangle\rangle\, \rangle$. To switch the order of averaging for $(m^l)^2$, we use $y_{i1}^l$ and $y_{i2}^l$ to refer to the preactivities when the network is presented with two different input vectors drawn independent and identically distributed. Then

$$(m^l)^2 = \langle\, \langle\langle\, y_{i1}^l\, y_{i2}^l\, \rangle\rangle\, \rangle, \qquad (4)$$

where the outer sample average is over independent input pairs. This identity relates the variance preservation principle to a body of work on "signal propagation" [5-7], which addresses the following question: if two different inputs are given to the same random net, do their preactivity vectors become more or less similar as we move deeper into the net? According to Eq. (4), the variance of the sample mean $(m^l)^2$ is a measure of similarity between the two preactivity vectors in layer l.
We can write $(\sigma^l)^2$ and $(m^l)^2$ as the diagonal and off-diagonal elements of the 2×2 matrix $\langle Q^l \rangle$, where $Q^l$ is the covariance matrix,

$$Q^l_{\alpha\beta} = \langle\langle\, y_{i\alpha}^l\, y_{i\beta}^l\, \rangle\rangle, \qquad \alpha, \beta \in \{1, 2\}. \qquad (5)$$

The index i is suppressed on the left-hand side because the right-hand side is independent of i by symmetry. It will also be convenient to define a normalized version of $Q^l$, the matrix of Pearson correlation coefficients,

$$C^l_{\alpha\beta} = \frac{Q^l_{\alpha\beta}}{\sqrt{Q^l_{\alpha\alpha}\, Q^l_{\beta\beta}}}. \qquad (6)$$

Substituting $y_{i\alpha}^{l+1} = \sum_{j=1}^N W_{ij}^{l+1} [y_{j\alpha}^l]_+$ into Eq. (5) and using the fact that the weights are uncorrelated with variance 2/N, we obtain

$$Q^{l+1}_{\alpha\beta} = 2\, \langle\langle\, [y_{j\alpha}^l]_+\, [y_{j\beta}^l]_+\, \rangle\rangle.$$

In the limit of an infinitely wide network, Poole et al. [5] introduced a mean-field approximation in which the preactivities on the right-hand side are Gaussian random variables with covariance matrix $Q^l$. Evaluating the Gaussian integral [6,7] yields the recursion relation

$$Q^{l+1}_{\alpha\beta} = \sqrt{Q^l_{\alpha\alpha}\, Q^l_{\beta\beta}}\; K(C^l_{\alpha\beta}), \qquad (7)$$

where

$$K(c) = \frac{1}{\pi}\left[ \sqrt{1 - c^2} + (\pi - \arccos c)\, c \right]. \qquad (8)$$

Since the diagonal elements of $C^l$ are unity [see definition in Eq. (6)] and $K(1) = 1$, Eq. (7) implies $Q^{l+1}_{\alpha\alpha} = Q^l_{\alpha\alpha}$, i.e., the diagonal elements of $Q^l$ remain constant with l. Then the recursion of Eq. (7) simplifies to

$$C^{l+1}_{\alpha\beta} = K(C^l_{\alpha\beta}), \qquad (9)$$

which is solved by the mapping K iterated l times, $C^l_{\alpha\beta} = K^l(C^0_{\alpha\beta})$ [6,7]. The correlation in the first layer is proportional to the inner product of the input vectors, $Q^1_{\alpha\beta} = (2/N)\, x_\alpha^0 \cdot x_\beta^0$, and $C^0_{\alpha\beta}$ is the cosine similarity between the same input vectors. Recursion yields the correlations in layer l,

$$Q^l_{\alpha\beta} = \frac{2}{N}\, |x_\alpha^0|\, |x_\beta^0|\; K^l(C^0_{\alpha\beta}).$$

Averaging over inputs, the diagonal terms yield

$$(\sigma^l)^2 = \frac{2}{N}\, \langle |x^0|^2 \rangle, \qquad (10)$$

and the off-diagonal terms yield

$$(m^l)^2 = \frac{2}{N}\, \left\langle |x_\alpha^0|\, |x_\beta^0|\; K^l(C^0_{\alpha\beta}) \right\rangle, \qquad (11)$$

where the average is over independently chosen sample pairs α, β. In deeper layers, we can simplify this equation using the fact that $K^l(C^0_{12}) \to 1$ for large l, so long as the input pair is not perfectly antiparallel ($C^0_{12} = -1$):

$$(m^l)^2 \to \frac{2}{N}\, \langle |x^0| \rangle^2.$$

Using the identity $(\sigma^l)^2 = (m^l)^2 + (v^l)^2$, we can derive the behavior of $v^l$ in the large-l regime:

$$(v^l)^2 \to \frac{2}{N}\left[ \langle |x^0|^2 \rangle - \langle |x^0| \rangle^2 \right].$$

In a wide ReLU network, the preactivity sample variance approaches twice the variance per element of the length of the input vector.
When the input length has zero variance per element, which occurs when inputs are explicitly normalized to uniform length or when their elements are independent and identically distributed, the sample variance decays to zero as l increases.
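The iterated map K can be examined numerically. The sketch below assumes the closed-form "arccosine kernel" expression for K that results from the ReLU Gaussian integral, $K(c) = [\sqrt{1-c^2} + (\pi - \arccos c)\, c]/\pi$; the fixed point at c = 1 and the convergence $K^l(c) \to 1$ are easy to verify:

```python
import numpy as np

# Correlation map for ReLU signal propagation. The closed form below is the
# standard arccosine-kernel result; it is stated here as an assumption.
def K(c):
    c = np.clip(c, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi

# c = 1 is a fixed point, and iterates from c = 0 converge toward 1.
c, history = 0.0, [0.0]
for _ in range(50):
    c = K(c)
    history.append(c)
print(history[1], history[-1])   # K(0) = 1/pi; K^50(0) is close to 1
```

The slow approach of the iterates to 1 is what governs the large-l behavior of $m^l$ and $v^l$ in the next section.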

IV. WHITE NOISE INPUTS
Suppose that the elements of the input vector are independent and identically distributed with zero mean and variance equal to 1/2. Then $N^{-1} |x^0|^2 \to 1/2$ and $C^0_{\alpha\beta} \to 0$ for $\alpha \neq \beta$ almost surely as $N \to \infty$. If we further assume that the number of samples is large, so that $\alpha \neq \beta$ almost surely for any randomly chosen input pair, then Eqs. (2), (10), and (11) yield

$$(\sigma^l)^2 = 1, \qquad (m^l)^2 = K^l(0), \qquad (v^l)^2 = 1 - K^l(0).$$

We plot $m^l$, $v^l$ as a function of l (Fig. 2) in this white noise, large sample number regime. The ratio $m^l / v^l$ grows linearly with depth. This can be shown analytically by combining a result from Hayou et al. [7] on the behavior of $K^l(c)$ for large l with Eq. (11), which expresses $m^l / v^l$ in terms of $K^l$. The result is that $m^l / v^l \approx \sqrt{2/(9\pi^2)}\, l$ for large l.
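The mean-field prediction for this regime can be generated by iterating the map directly; the large-l slope of $m^l/v^l$ can then be compared against $\sqrt{2/(9\pi^2)}$. A sketch, taking the arccosine-kernel form of K as an assumption:

```python
import numpy as np

def K(c):
    # Arccosine-kernel form of the ReLU correlation map (assumed).
    return (np.sqrt(1.0 - c ** 2) + (np.pi - np.arccos(c)) * c) / np.pi

# White-noise regime: (sigma^l)^2 = 1, (m^l)^2 = K^l(0), (v^l)^2 = 1 - K^l(0).
c, ratios = 0.0, []
for l in range(1, 201):
    c = K(c)
    m, v = np.sqrt(c), np.sqrt(max(1.0 - c, 0.0))
    ratios.append(m / v)

# The ratio grows linearly; its slope approaches sqrt(2/(9*pi^2)).
slope = (ratios[-1] - ratios[-51]) / 50.0
print(slope, np.sqrt(2.0 / (9.0 * np.pi ** 2)))
```

The empirical slope over the last 50 layers agrees with the analytical value to within a few percent.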
We test the validity of the infinite width approximation via numerical simulations in finite width networks with white Gaussian noise inputs. We sample 30 random networks of depth 50 with uniform width N per layer, comparing N = 30, 100, 300, 1000, 3000. We sample 100 input vectors with elements chosen independent and identically distributed from an N(0, 1/2) distribution. These are then propagated through each network, and the empirical ratio of squared sample mean to average sample variance for each network is computed (Fig. 2). This ratio grows with network width and appears to asymptote to the value calculated in the infinite width regime.
V. DATA-DEPENDENT INITIALIZATION

One of the most widely used normalization schemes is batch normalization [9], which normalizes every preactivation by subtracting its mean and dividing by its standard deviation, computed over some random subset of samples. Because preactivations can be far from zero mean and unit variance at initialization, normalization can have a dramatic impact on the network's initialization. We now show that normalized initialization can modify training dynamics.
We do so by comparing training dynamics for two data-dependent initialization schemes, along with batch normalized networks. The first scheme, total init, implements the total mean-variance preservation principle, and the second, sample init, roughly implements the sample mean-variance preservation principle. We do not use Kaiming initialization because we train nets containing pooling and residual connections that modify the total variance of units. In practice, it may be acceptable to ignore this complication; however, for a fair comparison we implement a data-dependent scheme to preserve the total variance at each layer.
Total init first estimates the total variance of a preactivation in each layer by estimating the variance over all $n_l$ preactivations in the layer and B inputs. The weights are then rescaled so this empirical joint preactivation distribution has unit variance. No extra work is needed to preserve the total mean of a preactivation (it is zero because $y_i^l = \sum_j W_{ij}^l x_j^{l-1}$, so $\langle\langle y_i^l \rangle\rangle = 0$). This algorithm is similar to Algorithm 1 from Mishkin and Matas [15].

Algorithm 1: Total Init
  Sample B inputs $\{x_1^0, x_2^0, \ldots, x_B^0\}$ from the dataset;
  for each layer l do
      Initialize parameters: $W_{ij}^l \sim N(0, 1)$ and $b_i^l = 0$;
      Compute preactivations: $\{y_1^l, y_2^l, \ldots, y_B^l\}$;
      Estimate variance: $(\sigma^l)^2 = \frac{1}{n_l B} \sum_{i=1}^{n_l} \sum_{b=1}^B (y_{ib}^l)^2$;
      Normalize weights: $W_{ij}^l \leftarrow W_{ij}^l / \sqrt{(\sigma^l)^2 + \epsilon}$;
  end

Sample init first reinitializes the biases, per feature, to subtract out each preactivation's estimated sample mean. Rescaling is then done per layer using the estimated sample variance of each unit. This is similar to Algorithm 1 from Krähenbühl et al. [8].
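A minimal NumPy sketch of both schemes for a fully connected ReLU net is given below (hypothetical toy setting; the paper applies the schemes to convolutional nets, and the small constant EPS guarding the division is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-8  # assumed small constant guarding the division

def total_init(layer_sizes, X):
    """Total init: rescale each layer so the joint (units x inputs)
    preactivation distribution has unit variance; biases stay zero."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0, size=(n_out, n_in))
        Y = X @ W.T                              # preactivations, zero biases
        W /= np.sqrt((Y ** 2).mean() + EPS)      # unit total variance
        params.append((W, np.zeros(n_out)))
        X = np.maximum(X @ W.T, 0.0)             # propagate with rescaled weights
    return params

def sample_init(layer_sizes, X):
    """Sample init: center each unit's sample mean with its bias,
    then rescale the layer by the average per-unit sample variance."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0, size=(n_out, n_in))
        b = -(X @ W.T).mean(axis=0)              # subtract per-unit sample mean
        scale = np.sqrt((X @ W.T + b).var(axis=0).mean() + EPS)
        W, b = W / scale, b / scale
        params.append((W, b))
        X = np.maximum(X @ W.T + b, 0.0)
    return params
```

After sample_init, every preactivation has zero sample mean on the initializing batch, and the layer-averaged sample variance is unity.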
We compare both initialization schemes, along with batch normalization, for two convolutional networks trained on two popular computer vision tasks: object recognition with ALL-CNN-C [17] on the CIFAR10 dataset [18] and image segmentation with UNet [19] on the ISBI2012 dataset [20].
Note that these networks are more complicated than the fully connected ones we have so far discussed. The ALL-CNN-C network contains nine convolutional layers, a global pooling layer, and a single fully connected output layer. UNet contains 23 convolutional layers, four layers of max pooling, and additionally contains skip connections from earlier to higher layers. We empirically confirm that the sample variance decays in both nets when initialized with total init. See the Supplemental Material [21] for figures showing the empirical sample mean-to-standard deviation ratio in these nets.

Algorithm 2: Sample Init

  Sample B inputs $\{x_1^0, x_2^0, \ldots, x_B^0\}$ from the dataset;
  for each layer l do
      Initialize parameters: $W_{ij}^l \sim N(0, 1)$ and $b_i^l = 0$;
      Compute preactivations: $\{y_1^l, y_2^l, \ldots, y_B^l\}$;
      Estimate sample means: $\mu_i^l = \frac{1}{B} \sum_{b=1}^B y_{ib}^l$;
      Center biases: $b_i^l \leftarrow -\mu_i^l$;
      Estimate variance: $(\sigma^l)^2 = \frac{1}{n_l B} \sum_{i=1}^{n_l} \sum_{b=1}^B (y_{ib}^l - \mu_i^l)^2$;
      Normalize weights: $W_{ij}^l \leftarrow W_{ij}^l / \sqrt{(\sigma^l)^2 + \epsilon}$, $b_i^l \leftarrow b_i^l / \sqrt{(\sigma^l)^2 + \epsilon}$;
  end

We train networks using stochastic gradient descent with momentum 0.9. We use a minibatch of 32 images for All-CNN-C and 1 image for UNet. We choose the optimal learning rate for each setting; see the Supplemental Material [21] for the learning rates. The optimal learning rate ended up being larger for sample init nets and batch norm nets than for total init nets. The learning curves and their standard deviation over three randomly initialized runs are shown in Fig. 3. The standard deviation is generally too small to see.
In both tasks, initializing the biases to subtract out each preactivation's sample mean results in improved training speed (Fig. 3). Using batch normalization (which both reinitializes and reparameterizes networks) sped up training more than just reinitializing networks. This suggests that part of the training speedup that is frequently associated with batch normalization may be a result of the impact it has on a network's initialization.

VI. DISCUSSION
We have thus far focused on the rectified linear activation, but our formalism applies to other activations as well. In sigmoid networks (initialized instead with Xavier initialization) we observe that the sample mean to standard deviation ratio grows exponentially with depth. It has long been argued that the tanh nonlinearity, which is simply a rescaled and centered sigmoid, $\tanh(y) = 2\,\mathrm{sigmoid}(2y) - 1$, should be used instead [2]. With this modification, the nonlinearity is antisymmetric around the origin, and we see no growth of the sample means.
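This contrast can be sketched numerically. Assuming a simple Xavier-style initialization (weight variance 1/N) and toy sizes not taken from the paper, the final-layer sample mean to standard deviation ratio under sigmoid dwarfs the tanh value:

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth, samples = 512, 20, 200

def m_over_v(nonlin):
    """Ratio m/v of rms sample mean to rms sample fluctuation
    for the final-layer preactivities."""
    x = rng.normal(0.0, 1.0, size=(samples, N))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))  # Xavier-style init
        y = x @ W.T
        x = nonlin(y)
    m2 = (y.mean(axis=0) ** 2).mean()   # squared sample means, averaged over units
    v2 = y.var(axis=0).mean()           # sample variances, averaged over units
    return np.sqrt(m2 / v2)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
r_sigmoid, r_tanh = m_over_v(sigmoid), m_over_v(np.tanh)
print(r_sigmoid, r_tanh)   # sigmoid ratio is far larger than the tanh ratio
```

Because sigmoid outputs are strictly positive, every layer hands a nonzero sample mean to the next, while the antisymmetric tanh keeps sample means near zero.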
We have studied feedforward networks, but our research bears similarity to studies of random recurrent networks [22][23][24].Our feedforward nets can be viewed as a discrete time dynamical system with different couplings between states at each layer.
Our research is mathematically similar to studies of random feedforward neural nets in the physics literature but conceptually distinct, because those studies have been motivated by explaining generalization performance. Li and Saad [25] investigated the sensitivity of neural activities in each layer to small perturbations of a randomly initialized net. De Palma et al. [26] showed that ReLU network output becomes less sensitive to perturbations of binary-valued inputs with increasing network depth. Mingard et al. [27] applied an information theoretic argument to show that typical functions implemented by deep random ReLU networks are "low entropy." This observation was used in Ref. [28] to hypothesize that this relative invariance to input perturbations could be a reason for neural network generalization performance.
In the computer science literature, Hayou et al. [7] applied the signal propagation formalism of Poole et al. [5] to show that neural activations from two different inputs become increasingly similar in higher layers. Hanin and Rolnick [29] used a combinatorial argument to show that deep random ReLU networks have "surprisingly few activation patterns." Our formalism provides a simple interpretation of the above phenomena: samples are mapped to increasingly small fluctuations around an increasingly large mean as they propagate through the layers of a deep ReLU network. This suggests two peculiar aspects of neural activities in higher layers of these networks. First, half the neurons are "nearly dead": half of the preactivations y are negative for most samples, so the rectified activity $[y]_+$ is zero for most samples. Second, the other half are "nearly linear": half of the preactivations y are positive for most samples, so $[y]_+ = y$. Centering each preactivation thus ensures that the nonlinearity of every unit is effectively used at initialization.
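The polarization into "nearly dead" and "nearly linear" units can be checked directly. The sketch below (toy sizes, not from the paper) measures, for each deep-layer unit, the fraction of white-noise inputs on which its preactivation is positive:

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth, samples = 512, 40, 400

x = rng.normal(0.0, np.sqrt(0.5), size=(samples, N))    # white noise inputs
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / N), size=(N, N))  # Kaiming initialization
    y = x @ W.T
    x = np.maximum(y, 0.0)

# Fraction of inputs on which each final-layer preactivation is positive.
p_active = (y > 0).mean(axis=0)
# Most units are active on nearly all samples ("nearly linear")
# or on almost none ("nearly dead").
polarized = ((p_active > 0.9) | (p_active < 0.1)).mean()
print(polarized)
```

Because each unit's sample mean is large relative to its sample fluctuation in deep layers, the per-unit activation probabilities cluster near 0 and 1 rather than near 1/2.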

FIG. 2. Preactivation statistics in random ReLU networks with white noise inputs. This provides a quantitative description of the phenomenon observed in Fig. 1. (a) Predicted sample standard deviation (v), standard deviation of the sample mean (m), and total standard deviation (σ) of preactivations in the wide net limit. (b) Numerically simulated ratio m/v in finite width networks. Solid lines indicate the average and shaded regions the standard deviation of m/v over 30 random configurations of networks.

FIG. 3. Training curves for (a) the ALL-CNN-C object recognition network and (b) the UNet segmentation network. Sample init, which implements the sample mean-variance preservation scheme, improved training speed compared to total init, which implements total mean-variance preservation. Batch normalization yielded faster-training networks than either initialization.