Abstract
Before backpropagation training, it is common to randomly initialize a neural network so that the mean and variance of activity are uniform across neurons. Classically, these statistics were defined over an ensemble of random networks. Alternatively, they can be defined over a random sample of inputs to the network. We show analytically and numerically that these two formulations of the principle of mean-variance preservation differ substantially in deep networks with the rectified linear unit (ReLU) nonlinearity. We numerically investigate training speed after data-dependent initialization of networks to preserve sample mean and variance.
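To make the distinction concrete, the sketch below contrasts an ensemble-based initialization (He-style scaling, which preserves variance only on average over random networks) with a data-dependent one that rescales each layer so the empirical variance of its activations over a sample batch of inputs is one. This is a minimal NumPy illustration under assumed layer sizes and a simple per-layer rescaling rule, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_ensemble(sizes):
    """He-style initialization: variance is preserved on average over
    an ensemble of random ReLU networks, not for any single network."""
    return [rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m))
            for m, n in zip(sizes[:-1], sizes[1:])]

def init_sample(sizes, X):
    """Data-dependent initialization: rescale each layer so that the
    sample standard deviation of its pre-activations over the batch X
    is 1, then propagate the ReLU activations to the next layer."""
    weights = init_ensemble(sizes)           # start from an ensemble init
    h = X
    for W in weights:
        z = h @ W.T                           # pre-activations for the batch
        W *= 1.0 / z.std()                    # rescale to unit sample std
        h = np.maximum(0.0, h @ W.T)          # ReLU activations fed forward
    return weights

# Example: propagate a batch of 256 inputs through a 20-layer ReLU net
sizes = [100] + [100] * 20
X = rng.normal(size=(256, sizes[0]))

for name, ws in [("ensemble", init_ensemble(sizes)),
                 ("sample", init_sample(sizes, X))]:
    h = X
    for W in ws:
        h = np.maximum(0.0, h @ W.T)
    print(f"{name:8s} final-layer activation std: {h.std():.3f}")
```

In a deep ReLU network the ensemble-initialized activations typically drift away from unit scale for any particular weight draw, whereas the sample-rescaled network keeps the batch statistics near the target by construction.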
Received 30 January 2020; accepted 23 June 2020
DOI: https://doi.org/10.1103/PhysRevResearch.2.033135
Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.