1 Introduction

The idea of noisy optimization originates from the physical process of annealing [14], where noise helps to stabilize the state of the system through changing it by small random perturbations. Injecting noise in various ways is exploited both in convex problems [3] and non-convex ones, specifically, in the training of neural networks [1]. Research empirically shows that noise injection into the neural network objective optimization benefits the training process and improves generalization [7, 19]. In our work we consider the particular case of noise injection into the model parameters, i.e., the neural network weights.

Optimization of the objective function for various tasks can also be performed in a decentralized manner, i.e., multiple learners train on distributed data sources and synchronize according to a chosen schedule [11, 13, 17]. Decentralized training is strongly motivated by the growing number of devices, e.g., mobile phones, with possibly privacy-sensitive data sources and the ability to perform local computations. However, models obtained by synchronizing the results of local training often do not reach the performance of a serial baseline trained on a centralized dataset.

To improve the quality of distributed training, various synchronization protocols have been investigated that address communication timing and aggregation operators [11,12,13]. We consider applying an existing approach for improving serial training, namely noise injection, to the decentralized setup in order to reach higher accuracies of the local models.

We give an overview of the related research in Sect. 2. As a preliminary theoretical investigation, we prove in Sect. 3 that in the special case of linear neural networks zero-mean noise does not have an adverse effect on the results of training, because it cancels out in expectation. This is supported by experiments in Sect. 4.1. Since non-linear neural networks are of greater practical use, in Sect. 4.2 we empirically study noise injection into non-linear neural networks trained in a decentralized manner and show that it improves performance. The simplest case of noise injection into the neural network weights is initialization noise. Experiments of McMahan et al. [17] suggest that averaging neural networks trained independently on different local datasets results in a model that performs worse than any model of the local learners. However, when averaging periodically, our experiments show that noisy initialization is indeed beneficial. Further experiments in Sect. 4.2 are concerned with noise injection at each synchronization step and show an improved quality of the trained models compared to the non-noisy setup for the two considered classification tasks. Section 5 summarizes the research results and suggests possible future work.

2 Related Work

Various ways of injecting noise into the training of neural networks have been thoroughly investigated. For example, Bishop [4] has shown that additive noise on the inputs is equivalent to a regularization term in the loss function if the noise amplitude is sufficiently small. An [1] explored noise injection into the inputs, outputs, and weights of neural networks. He observed that additive noise injected into the weights at each update step, in either on-line or batch gradient descent optimization, leads to better generalization of the learned solution. It was shown that noise added to the inputs affects only the smoothness of the resulting function, while noise injected into the weights also penalizes large values and activations. Injected output noise changes the loss function only by a constant value and thus does not affect the quality of the learned function. Later, Wang and Principe [22] demonstrated that noise added to the desired signal affects the variance of the weights and leads to faster convergence. A different approach to noise injection is to perturb the gradient updates. It is investigated, for example, in the works of An [1] and Neelakantan et al. [19]. Empirical results show that adding decaying noise to the gradients helps to reach a global minimum, but does not affect the generalization quality of the resulting network. Yet reaching a global minimum might even be harmful for generalization, since such solutions tend to overfit the training data.

Decentralized training of neural networks on distributed data sources with different ways of synchronization is investigated in the works of McMahan et al. [17], Jiang et al. [10], Smith et al. [20] and others. The issue closest to our research is the initialization aspect considered by McMahan et al. [17]. The optimization of the neural network objective is sensitive to the initial state of the model, as shown by Kolen and Pollack [15]. According to McMahan et al. [17], averaging is harmful if the locally trained models are initialized differently, which corresponds to injecting random noise into initially equal states. Nevertheless, the effect of initialization noise together with periodic averaging has not been investigated.

3 Noisy Averaging

To start our theoretical investigation we consider the simplest class of models, namely linear models, and analyze the effect of noise injection into their parameters for a common training setup. Here we summarize the basic concepts used and prove that noise injection in the considered special case of linear models has no detrimental effect on the final model.

3.1 Basic Concepts

The task of training a neural network is an optimization problem which is commonly solved by using stochastic gradient descent (SGD) or SGD-based algorithms. SGD is an on-line optimization algorithm in which a local minimum of the objective function is determined by moving in the negative direction of its gradient [6]. SGD is derived from the gradient descent algorithm and estimates the gradient by considering only one training example. Let X denote the input space, Y the output space, \(\mathcal {F}\) the model parameters space and \(f\in \mathcal {F}\) the model parameters. The SGD update rule for minimizing an objective function \(\ell : \mathcal {F} \times X \times Y \rightarrow \mathbb {R}_+\) reads

$$\begin{aligned} \varphi _{\eta }^{SGD}(f) = f- \eta \nabla \ell (f, x, y), \end{aligned}$$

where the learning rate \(\eta \) controls the step size.

In our work we consider mini-batch SGD, which approximates the gradient by taking B training examples into account, where B is the batch size:

$$\begin{aligned} \varphi _{B,\eta }^{mSGD}(f) = f- \eta \sum _{j=1}^{B}\nabla \ell ^j(f), \end{aligned} \tag{1}$$

where \(\ell ^j(f) = \ell (f, x_j, y_j).\)

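An example of a loss function is the squared loss, which for a linear model reads \(\ell (f, x, y) = \frac{1}{2}\big (\langle f, x \rangle - y\big )^2\).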

Choosing this loss function and computing its gradient, we can write the learning algorithm update rule in the following form:

$$\begin{aligned} \varphi _{B,\eta }^{mSGD}(f) = f- \eta \sum _{j=1}^{B}\bigg (\Big (\langle f, x_j \rangle -y_j \Big )x_j \bigg ). \end{aligned}$$
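The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `msgd_step` and the toy batch are assumptions, and parameters are stored as flat lists of floats.

```python
def msgd_step(f, batch, eta):
    """Return f - eta * sum_j (<f, x_j> - y_j) * x_j for a linear model."""
    grad = [0.0] * len(f)
    for x, y in batch:
        residual = sum(fk * xk for fk, xk in zip(f, x)) - y  # <f, x_j> - y_j
        for k, xk in enumerate(x):
            grad[k] += residual * xk  # gradients are summed over the batch, not averaged
    return [fk - eta * gk for fk, gk in zip(f, grad)]


# Toy batch with B = 2 (values are illustrative only).
batch = [([1.0, 0.0], 1.0), ([0.0, 2.0], -1.0)]
f_new = msgd_step([0.0, 0.0], batch, eta=0.1)
```

Because the gradient is summed rather than averaged over the batch, the effective step size grows with B, which has to be kept in mind when choosing the learning rate.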

3.2 Periodic Averaging Protocol with Noise Injection

As a protocol for the theoretical investigation of the effect of noise injection in the decentralized setup we consider the periodic averaging protocol with zero-mean noise injection (Algorithm 1). Let \(\varPsi \) denote a probability distribution with mean zero and \(\epsilon _t\ge 0\) the time-dependent noise level factor.

[Algorithm 1: periodic averaging protocol with zero-mean noise injection]

In the following we analyze the influence of noise injection on the behavior of the learning algorithm for the special case of linear models. For this case we can prove that adding zero-mean noise to the parameters optimized by mini-batch SGD with squared loss does not change the model parameters in expectation.
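One possible reading of the periodic averaging protocol with noise injection is sketched below in plain Python. The function name `noisy_periodic_averaging`, the uniform choice for the zero-mean distribution, the callable noise schedule `eps`, and the toy data are all assumptions made for illustration; noise is added to the weights before each local update, and every b rounds all learners are replaced by the average model.

```python
import random


def msgd_step(f, batch, eta):
    """Mini-batch SGD step for a linear model with squared loss."""
    grad = [0.0] * len(f)
    for x, y in batch:
        residual = sum(fk * xk for fk, xk in zip(f, x)) - y
        for k, xk in enumerate(x):
            grad[k] += residual * xk
    return [fk - eta * gk for fk, gk in zip(f, grad)]


def noisy_periodic_averaging(batches, f_init, eta, b, eps, rng):
    """batches[i][t] is the mini-batch seen by learner i in round t (0-indexed).

    eps(t) returns the noise level of round t; psi is drawn uniformly from
    [-0.5, 0.5] per weight (an assumed choice of zero-mean distribution).
    """
    m, rounds = len(batches), len(batches[0])
    models = [list(f_init) for _ in range(m)]  # identical initialization
    for t in range(rounds):
        for i in range(m):
            noisy = [w + eps(t) * rng.uniform(-0.5, 0.5) for w in models[i]]
            models[i] = msgd_step(noisy, batches[i][t], eta)
        if (t + 1) % b == 0:  # synchronize every b rounds
            avg = [sum(ws) / m for ws in zip(*models)]
            models = [list(avg) for _ in range(m)]
    return models


rng = random.Random(0)
# Two learners, four rounds, B = 1; the data values are illustrative only.
batches = [
    [[([1.0, 0.5], 1.0)], [([0.2, 1.0], -1.0)], [([1.0, 1.0], 1.0)], [([0.5, 0.1], -1.0)]],
    [[([0.3, 0.9], -1.0)], [([1.0, 0.2], 1.0)], [([0.4, 0.4], -1.0)], [([0.8, 0.6], 1.0)]],
]
models = noisy_periodic_averaging(batches, [0.0, 0.0], eta=0.1, b=2,
                                  eps=lambda t: 0.5 / (t + 1), rng=rng)
```

Since the last round coincides with a synchronization step in this toy run, all learners end up holding the same averaged model.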

Lemma 1

For a linear model let \(f_t\) denote the model parameters attained by using mini-batch SGD with squared loss and scaled additive zero-mean weight noise injection, i.e.

$$\begin{aligned} f_{t+1} = \varphi _{B,\eta }^{mSGD}(f_{t} + \epsilon _t\psi ). \end{aligned}$$

Let \(g_t\) denote the model parameters attained by using common mini-batch SGD, i.e.

$$\begin{aligned} g_{t+1} = \varphi _{B,\eta }^{mSGD}(g_{t}). \end{aligned}$$

If the learning algorithms are identically initialized, it holds

$$\begin{aligned} \mathbb {E}\big [f_t\big ] = g_t\quad \quad {\textit{for all }} t=1,\ldots ,T. \end{aligned} \tag{2}$$

Proof

We use induction over t. The case \(t=1\) follows immediately since by initialization the model parameters trained with and without noise are the same. Applying the definitions above, the expected model parameters for mini-batch SGD with noise injection read

$$\begin{aligned} \mathbb {E}\Big [f_t\Big ]&= \mathbb {E}\Big [\varphi _{B,\eta }^{mSGD}\left( f_{t-1} + \epsilon _{t-1}\psi \right) \Big ] \\&= \mathbb {E}\bigg [f_{t-1} + \epsilon _{t-1}\psi - \eta \sum _{j=1}^{B}\bigg (\Big (\langle f_{t-1}+ \epsilon _{t-1}\psi , x_{(t-1,j)} \rangle -y_{(t-1,j)} \Big )x_{(t-1,j)} \bigg ) \bigg ] \\&= \epsilon _{t-1}\underbrace{\mathbb {E}\big [\psi \big ]}_{=0} + \mathbb {E}\bigg [f_{t-1} - \eta \sum _{j=1}^{B}\bigg (\Big (\langle f_{t-1}, x_{(t-1,j)} \rangle -y_{(t-1,j)} \Big )x_{(t-1,j)} \bigg ) \bigg ] \\&\qquad - \underbrace{\mathbb {E}\bigg [ \eta \sum _{j=1}^{B}\Big (\epsilon _{t-1}\langle \psi , x_{(t-1,j)} \rangle x_{(t-1,j)} \Big ) \bigg ]}_{=0}, \end{aligned}$$

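where we used that \(\psi \) is centered and that the inner product is linear in its first argument, so that both noise terms vanish in expectation. By the induction hypothesis \(\mathbb {E}\big [f_{t-1}\big ] = g_{t-1}\) and the linearity of the expected value, we conclude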

$$\begin{aligned} \mathbb {E}\Big [f_t\Big ]&= g_{t-1} - \eta \sum _{j=1}^{B}\bigg (\Big (\langle g_{t-1}, x_{(t-1,j)} \rangle -y_{(t-1,j)} \Big )x_{(t-1,j)} \bigg ) \\&= \varphi _{B,\eta }^{mSGD}(g_{t-1}) \\&= g_{t}. \end{aligned}$$

   \(\square \)

The equivalence of Algorithm 1 to the non-noisy periodic averaging protocol in expectation directly follows from the linearity of the expected value.

Corollary 1

For linear models trained with mini-batch SGD and squared loss given identical initialization, the expected model obtained by the periodic averaging protocol with zero-mean noise injection is equivalent to the model obtained by the non-noisy periodic averaging protocol.
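The statement of Lemma 1 can be checked numerically with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the helper names `msgd_step` and `train`, the uniform noise, the constant noise level, and the toy data are not taken from the paper. It averages the final parameters of many noisy runs and compares them with a single noise-free run on the same sequence of mini-batches.

```python
import random


def msgd_step(f, batch, eta):
    """Mini-batch SGD step for a linear model with squared loss."""
    grad = [0.0] * len(f)
    for x, y in batch:
        residual = sum(fk * xk for fk, xk in zip(f, x)) - y
        for k, xk in enumerate(x):
            grad[k] += residual * xk
    return [fk - eta * gk for fk, gk in zip(f, grad)]


def train(batches, f_init, eta, eps, rng=None):
    """Run mini-batch SGD; if rng is given, add eps * psi to the weights before each step."""
    f = list(f_init)
    for batch in batches:
        if rng is not None:
            f = [w + eps * rng.uniform(-0.5, 0.5) for w in f]  # psi ~ U[-0.5, 0.5], zero mean
        f = msgd_step(f, batch, eta)
    return f


# Fixed toy data: 5 steps with B = 2 (values are illustrative only).
batches = [
    [([1.0, 0.5], 1.0), ([0.2, 1.0], -1.0)],
    [([0.3, 0.9], -1.0), ([1.0, 0.2], 1.0)],
    [([1.0, 1.0], 1.0), ([0.4, 0.4], -1.0)],
    [([0.5, 0.1], -1.0), ([0.8, 0.6], 1.0)],
    [([0.7, 0.3], 1.0), ([0.1, 0.9], -1.0)],
]

g = train(batches, [0.0, 0.0], eta=0.1, eps=0.0)  # noise-free reference run

rng = random.Random(0)
runs = 20000
mean_f = [0.0, 0.0]
for _ in range(runs):
    f = train(batches, [0.0, 0.0], eta=0.1, eps=0.5, rng=rng)
    mean_f = [mk + fk / runs for mk, fk in zip(mean_f, f)]

# Largest coordinate-wise deviation between the empirical mean and the noise-free model.
max_gap = max(abs(mk - gk) for mk, gk in zip(mean_f, g))
```

The individual noisy runs differ from the reference run, but their empirical mean should approach it as the number of runs grows, which is what the lemma predicts for linear models.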

For the specific case of linear models considered here, we observe that noise injection has no influence on the model parameters in expectation, both for serial and for distributed learning. This theoretical result is supported by empirical evidence in the following section. In contrast, for non-linear models An [1] and Edwards and Murray [7] have shown that noise injection in serial training helps to improve generalization. We conjecture that in the distributed training setup injecting noise into non-linear models might also improve the generalization properties of the obtained solution compared to non-noisy training. We leave a theoretical investigation of the effect of noise injection into non-linear models in the distributed training setup for future work. To substantiate our conjecture, in the following section we perform an empirical analysis of zero-mean noise injection in a distributed setup for non-linear neural networks.

4 Empirical Evaluation

To investigate the effect of noise injection into neural networks trained in a decentralized manner, we performed the set of experiments described below.

The decentralized setup of these experiments is periodic synchronization via averaging (cf. Algorithm 1). Apart from the distributed synchronized models, two baselines are trained: a local model without any synchronization with other local learners (no-sync) and a model with full data centralization (serial). These baselines are necessary to assess the performance of the synchronizing local learners, since synchronization aims to reach the performance of the serial model and to outperform the no-sync baseline. Noise injection into the serial baseline is also of interest, since it allows us to compare the possible gains in generalization ability in the centralized and distributed cases. For evaluation of the synchronizing learners, the last averaged model obtained during the training process is used. For evaluation of the no-sync baseline, we pick one random model among the locally trained ones.

Fig. 1. Effect on test accuracy of noise injection to local learners and to the serial baseline.

Since we explore the effect of noise injection, we are interested in the behavior of models trained in experiments that differ only in the noise used. Thus, all setups were run 10 times without a fixed random seed. This gives an indication of what the distribution of possible outcomes of the training process looks like. The results are presented in the form of box plots. Here the box shows the observed values from the first to the third quartile, with a line at the median. The whiskers show the range of the results, and points represent outliers.

4.1 Linear Neural Networks

The first set of experiments is performed to empirically evaluate the theoretical result obtained in Sect. 3. We employed a linear neural network with three layers having 2, 64 and 1 neurons, respectively, for approximating the target column of the SUSY dataset [2]. For the linear model experiments the dataset was normalized to have \(-1\) and 1 as targets, and accuracy was calculated with a threshold of 0. The training parameters we determined to be optimal were a mini-batch size of \(B = 10\) examples and a learning rate of \(\eta = 1\mathrm{e}{-5}\) for the serial and no-sync baselines and \(\eta = 2.5\mathrm{e}{-5}\) for the local learners. We employed squared loss as explained in Sect. 3. During training each local learner was presented 20000 examples, while the serial baseline had \(m \cdot 20000\) examples with \(m = 10\). The noise used for this experiment is additive uniform noise in the range \([-0.5, 0.5]\) with a decay factor equal to the synchronization round number. Without the decay factor, training quickly ran into numerical overflow, which made the experiments impossible to run.
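The accuracy computation with a threshold of 0 for targets in {-1, 1} can be sketched as follows. The function name `accuracy_at_zero`, the handling of an output of exactly 0, and the toy values are assumptions made for illustration.

```python
def accuracy_at_zero(outputs, targets):
    """Fraction of examples whose output falls on the same side of 0 as the target (+1 or -1)."""
    hits = 0
    for out, target in zip(outputs, targets):
        predicted = 1 if out > 0 else -1  # an output of exactly 0 counts as -1 (assumption)
        hits += predicted == target
    return hits / len(targets)


# Toy outputs and targets (illustrative only): three of four predictions match.
acc = accuracy_at_zero([0.8, -0.3, 0.1, -1.2], [1, -1, -1, -1])
```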

Figure 1 shows the test accuracy evaluated over 10 runs for the baselines and the synchronizing learners on an independent test set of 1000000 examples. We observe that larger noise injected into the weights of the models leads to a larger variance of the obtained accuracy, while the median value stays the same across the different setups. This empirically supports the effect of noise cancellation in expectation for linear models in this training setup.

4.2 Non-linear Neural Networks

In the following we evaluate noise injection to the decentralized training of non-linear neural networks on the basis of two classification tasks. For our experiments we choose to add uniform noise in range \([-0.5, 0.5]\) and Gaussian noise in the same range.
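A minimal sketch of how such noise could be sampled and added to a set of weights is given below. The text does not specify how Gaussian samples are confined to the range, so the clipping and the default standard deviation `sigma` are assumptions of this sketch, as are the helper names `sample_noise` and `perturb` and the toy weights.

```python
import random


def sample_noise(kind, rng, sigma=0.25):
    """Draw one zero-mean noise value in [-0.5, 0.5].

    sigma and the clipping of Gaussian samples are assumptions of this sketch;
    the text only states that both noise types share the same range.
    """
    if kind == "uniform":
        return rng.uniform(-0.5, 0.5)
    if kind == "gaussian":
        return max(-0.5, min(0.5, rng.gauss(0.0, sigma)))
    raise ValueError(f"unknown noise kind: {kind}")


def perturb(weights, eps, kind, rng):
    """Return a copy of a flat weight list with eps * psi added to every entry."""
    return [w + eps * sample_noise(kind, rng) for w in weights]


rng = random.Random(0)
weights = [0.1, -0.2, 0.3]  # illustrative values only
noisy_u = perturb(weights, eps=0.1, kind="uniform", rng=rng)
noisy_g = perturb(weights, eps=0.1, kind="gaussian", rng=rng)
```

Symmetric clipping keeps the Gaussian samples centered at zero, so both variants remain zero-mean; the noise level `eps` scales the perturbation, which here never exceeds 0.05 in absolute value.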

Binary Classification. For a preliminary evaluation of the approach in the non-linear case we considered the classification task on the SUSY dataset. In contrast to the linear case, the employed model architecture is a three-layer dense network with sigmoid activations. The first layer has 32 neurons, the second 64, and the output layer has 2 neurons with softmax activation. We determined the optimal parameters of the non-noisy learning algorithm on a small fraction of the dataset: a mini-batch size of \(B = 10\) examples and a learning rate of \(\eta =0.1\) for the serial and no-sync baselines and \(\eta =0.25\) for the local learners.

Initialization Noise. The simplest way of injecting noise into distributed neural networks is a one-step noise injection right after initialization. McMahan et al. [17] showed that such noise together with one-time synchronization after local training results in a model that is worse than each local one in terms of training loss. We want to investigate whether periodic synchronization is more robust to initial noise injection.

With regard to Algorithm 1, initialization noise corresponds to choosing \(\epsilon _1> 0\) and \(\epsilon _t= 0\) for all \(t > 1\). It means that we add randomly sampled noise to each initial weight before starting the training process.

Fig. 2. Effect on cumulative training loss of uniformly and normally distributed noise added to the equal initializations of local learners and to the serial baseline.

Figure 2 shows that both Gaussian and uniform initialization noise improve the serial and periodically synchronizing models in terms of cumulative training loss when applied at small levels (\(\epsilon _t< 1\)). In contrast, higher levels of noise (e.g. \(\epsilon _t= 5\)) make training harder in each setup. Interestingly, a noise level of \(\epsilon _t= 2\) for both Gaussian and uniform noise leads to a higher cumulative training loss in the serial model, while the distributed setup benefits from it.

Even though the Gaussian distribution is a very popular choice for initializing neural networks (Glorot and Bengio [9]), adding large levels of noise drawn from the normal distribution degrades the training process more than uniformly distributed noise. One possible explanation is that the initial weights are already normally distributed according to best practices and the additional Gaussian noise interferes with this in a destructive way.

The initialization noise experiments reveal that one-time noise injection into the initially equal model weights helps the training process, as indicated by a lower cumulative training loss. This motivates follow-up experiments with continuous noise injection.

Continuous Noise. Extending the initialization noise setup, we now perform additional noise injection steps: zero-mean noise is injected into the local models’ weights after every synchronization step. Formally, in Algorithm 1 this corresponds to setting \(\epsilon _1> 0\), \(\epsilon _{t+1} > 0\) for all \(t\) with \(t \bmod b = 0\), and \(\epsilon _t= 0\) otherwise. Following the work of Murray and Edwards [18], the noise decays with a factor equal to the index of the synchronization step, i.e., the noise level is given by \(\epsilon _t / t\).
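The two noise schedules (initialization noise only, and continuous noise with decay) can be written as a single function of the training step. This is a sketch under assumptions: the function name `noise_level` is hypothetical, steps are counted from 1, and the decay is read as dividing the base level by the index k of the synchronization step that has just been completed.

```python
def noise_level(t, b, eps, continuous):
    """Noise level applied at training step t (1-indexed) for sync period b.

    continuous=False: initialization noise only (eps at t = 1, zero afterwards).
    continuous=True : additionally eps / k right after the k-th synchronization,
                      i.e. at step t = k * b + 1 (assumed reading of the decay).
    """
    if t == 1:
        return eps
    if continuous and (t - 1) % b == 0:
        k = (t - 1) // b  # index of the synchronization step just completed
        return eps / k
    return 0.0


init_only = [noise_level(t, b=5, eps=2.0, continuous=False) for t in range(1, 12)]
decaying = [noise_level(t, b=5, eps=2.0, continuous=True) for t in range(1, 12)]
```

With b = 5, the continuous schedule is non-zero only at steps 1, 6 and 11, and the level at step 11 is half of the level at step 6.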

We want to explore whether continuous noise injection improves the generalization ability of the resulting models in the distributed setup. Therefore we calculated the accuracy on an independent test set of 1000000 examples for each of the trained models. During training, each local learner is presented 1000 examples from the training dataset, while the serial model sees \(m\cdot 1000\) examples. The various setups together with the resulting test accuracies are depicted as box plots in Fig. 3.

Fig. 3. Effect on test accuracy of injecting uniformly or normally distributed noise every b time steps throughout the training process of local learners and the serial baseline.

The evaluation shows that noise injection can substantially improve the generalization ability of the models. When comparing the results of the setups with 10 and 20 learners, we see that a larger number of learners leads to a larger spread of the training outcomes, which can mean either better generalization or, on the contrary, convergence to a worse model.

One can also observe that as the uniform noise level grows, the resulting test accuracy becomes more unstable, i.e., a possibly higher median comes at the cost of a larger range of values. In the experiments, Gaussian noise is more stable than uniform noise, but it produces more outliers below the median. The spread of the serial baseline with noise injection is very small compared to the distributed models. It might be interesting to investigate why noise has a more pronounced effect in the distributed setup than in the serial one.

Fig. 4. Effect on test accuracy of injecting uniformly and normally distributed noise every b time steps throughout the training process of local learners and the serial baseline.

Image Classification. To further investigate the effect of noise injection into non-linear models in a distributed setup, we chose the classification task on the MNIST dataset [16]. The model architecture is more complex than in the previous experiment. It has two dense layers with 512 neurons each and a dropout layer after each of them. The output layer performs a softmax activation to predict one of the ten classes. The activation of the dense layers is ReLU. We determined the optimal parameters of the non-noisy learning algorithm on a small fraction of the dataset. As in the first experiment, these are a mini-batch size of \(B = 10\) examples and a learning rate of \(\eta =0.1\) for the serial and no-sync baselines and \(\eta =0.25\) for the local learners. During training each learner is presented 500 examples from the training set, and evaluation is performed on the 10000 images of the test set.

We observe that in this experiment, compared to the previous one, different noise levels are required in order to improve the quality of the models (Fig. 4). More precisely, for noise levels \(\epsilon _t\ge 1\) we get prohibitively low test accuracies for both uniform and Gaussian noise. One possible explanation is that the ReLU activation is much more sensitive to noise perturbation of the weights than the bounded sigmoid function used in the first experiment. Moreover, the dropout layers might contribute to this effect, since dropout is also supposed to prevent a model from overfitting [21]. More refined research on this effect is left for future work. Concentrating on lower levels of noise (\(\epsilon _t< 0.5\)), we see that we can again improve the generalization ability of the trained models. As in the first set of experiments, this effect is more pronounced in the distributed setup than in the serial one.

5 Conclusion

The research presented in this paper investigates noise injection into neural networks trained in a decentralized manner on distributed data sources. We have proven that for linear models in a common training setup zero-mean noise injection retains, in expectation, the results of the non-noisy setup. We have performed experiments that empirically support this theoretical statement. Further experiments show that with non-linear models in a distributed setup noise injection improves the quality of the models. The evaluation shows that carefully chosen levels of noise indeed have a positive effect on the generalization ability of the synchronized models. This might be explained by the fact that noise enforces wider exploration of the space of solutions [5, 8], which is an interesting subject for further investigation. The experiments also show that the impact of noise in distributed training is even greater than in the serial case.

Future research could investigate the theoretical background of noise injection to non-linear models in a decentralized training setup as well as the effect of various network architectures and training parameters. A promising framework for studying noise injection effects is regularization theory, since injected noise can be described as a regularization term that is added to the loss function [1, 4].