Synaptic 1/f noise injection for overfitting suppression in hardware neural networks

Overfitting is a common and critical challenge for neural networks trained with limited datasets. The conventional solution is software-based regularization, such as Gaussian noise injection. Semiconductor noise, such as 1/f noise, in artificial neuron/synapse devices is often regarded as an undesirable disturbance to hardware neural networks (HNNs), but it could also play a useful role in suppressing overfitting, a possibility that is as yet unexplored. In this work, we propose the idea of using 1/f noise injection to suppress overfitting in different neural networks, and demonstrate that: (i) 1/f noise can suppress overfitting in the multilayer perceptron (MLP) and long short-term memory (LSTM); (ii) 1/f noise and Gaussian noise perform similarly for the MLP but differently for the LSTM; (iii) the superior performance of 1/f noise on the LSTM can be attributed to its intrinsic long range dependence. This work reveals that 1/f noise, which is common in semiconductor devices, can be a useful solution to suppress overfitting in HNNs and, more importantly, provides further evidence that the imperfectness of semiconductor devices is a rich mine of solutions to boost the development of brain-inspired hardware technologies in the artificial intelligence era.


Introduction
In the artificial intelligence (AI) era, brain-inspired deep neural networks have demonstrated substantial potential in various neuromorphic tasks such as visual recognition, natural language processing, and autonomous driving [1][2][3][4][5][6][7]. Despite the remarkable progress, those neural networks often encounter underfitting and overfitting problems, both resulting in unsatisfactory accuracy: underfitting leads to high prediction errors for both training and test data, while overfitting leads to a very low prediction error on the training data but a very high prediction error on the test data [8][9][10][11], as schematized in figure 1.
Underfitting happens because the neural network is too simple to capture all the features in the training data. Practical solutions include training the network for a longer duration or using a network with higher complexity. Overfitting happens because the neural network is too complex for a limited training data size, forcing the network to memorize irrelevant details and noise in the training data. Increasing the size of the training data is the most straightforward solution. However, in real-world situations, the training data size is often limited by time, budget or technical constraints [12], making overfitting practically more difficult to deal with than underfitting [13].
Recently, a group of techniques collectively referred to as 'regularization', i.e. the process of shrinking the coefficients in neural networks, has been used to control the networks' complexity by automatically penalizing features that make the network too complex. With regularization, the learning algorithm of a neural network is modified to reduce its generalization error but not its training error [14]. The most common regularization methods include [15]: (a) Early stopping: stop training automatically when a specific performance measure stops improving; (b) Weight decay: incentivize the network to use smaller weights by adding a penalty to the loss function; (c) Dropout: randomly ignore certain nodes in a layer during training; (d) Model combination: average the outputs of separately trained neural networks; (e) Noise injection: allow some random fluctuations in the data through augmentation.
Among them, noise injection is a very popular method against overfitting [16]. The addition of noise during training has a regularization effect and improves the robustness of the neural network [17]. In practice, noise can be added in between training iterations and onto different parts of the neural networks, such as input signal, weights and activation functions, to make it difficult for the network to find a solution that fits precisely to the original training data, and thereby reduces overfitting. In software-based DNNs, noise injection is normally realized with the addition of a separate zero-mean Gaussian noise layer, such as the 'tf.keras.layers.GaussianNoise' in TensorFlow [18].
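As a rough illustration of the software approach, the sketch below mimics a zero-mean Gaussian noise layer in plain NumPy. The function name `gaussian_noise_layer` and its parameters are hypothetical. It assumes, as is common for regularization layers in deep-learning frameworks, that noise is added only during training and that the input passes through unchanged at inference.

```python
import numpy as np

def gaussian_noise_layer(x, stddev, training, rng):
    """Add zero-mean Gaussian noise to x during training only."""
    if not training:
        return x  # inference: pass activations through unchanged
    return x + rng.normal(loc=0.0, scale=stddev, size=x.shape)

rng = np.random.default_rng(0)
activations = np.ones((4, 3))  # stand-in for the output of some layer
noisy = gaussian_noise_layer(activations, stddev=0.1, training=True, rng=rng)
clean = gaussian_noise_layer(activations, stddev=0.1, training=False, rng=rng)
```

The same function could in principle be applied to inputs, weights or activations, which are the injection points listed above.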
The software-based neural networks, which are still based on the traditional von Neumann architecture initially designed for sequential computing [19], are challenged by the proliferation of massive data in terms of computing ability and power consumption, driving people to look for alternative solutions. Again, inspiration comes from the brain: the power budget of the human brain is around 20 W, and its computation capability is on the order of 10^17 FLOPS, equivalent to the best supercomputers [20]. By comparison, the world's fastest supercomputer in 2021, Fugaku, has a computation capability of 4.42 × 10^17 FLOPS, but with a power of 29 899.23 kW [21].
In recent years, there has been a large push toward a hardware implementation of artificial neural networks, i.e. hardware neural networks (HNNs), aiming to overcome the calculation complexity of software-based implementations by using semiconductor technology to directly emulate the behaviour of neurons and synapses [22][23][24][25]. Unlike the conventional von Neumann architecture that is inherently sequential in nature [19], HNNs profit from massively parallel processing, and various architectures, such as the multilayer perceptron (MLP), convolutional neural network, recurrent neural network (RNN) and long short-term memory (LSTM), have been proposed using semiconductor devices (transistors, memristors, etc) and circuits.
Since HNNs are implemented with real-world devices, the naturally existing imperfectness of devices inevitably affects the performance of HNNs. Previously, such imperfectness was considered a detrimental factor that brings undesirable disturbance to an HNN's parameters, causing variation and drift in its performance [22][23][24]. However, since the brain has high error tolerance and so should an HNN, a more attractive prospect is that such intrinsic imperfectness of semiconductor devices might instead be utilized to improve the performance of HNNs. For example, the stochastic memristive switching behaviour has been used to realize the dropout function of an HNN [25]. The intrinsic read noise of memristive devices has been used to prevent an HNN from getting trapped in local minima and thus converging to sub-optimal solutions [26,27].
Motivated by the previous explorations, it is natural to link semiconductor noise to overfitting suppression in HNNs. An obvious benefit is that the intrinsic noise in devices removes the need to design complex circuitry specialized for Gaussian noise generation using Zener diodes or other devices [28]. Various types of noise exist in semiconductor devices, such as thermal noise [29], random telegraph noise [30], 1/f noise [31], etc [32,33], but a comprehensive study on the overfitting suppression effect of noise, at least for one or two types of noise, is still missing.
Among those noises, 1/f noise is the low frequency noise for which the noise power spectral density is inversely proportional to the frequency [34,35]. It can be observed in a wide range of semiconductor devices, such as transistors [36], memristors [37][38][39][40], diodes [41], and photoelectric devices [42]. 1/f noise is also the 'background noise' of the brain [43]. For example, the channel noise in neurons, which is thought to arise from the random opening and closing of ion channels in the cell membrane, is seen to be 1/f [44]. Similarly, it has been shown that both magnetoencephalography and electroencephalogram recordings of spontaneous neural activity in humans display 1/f-like power spectra in the α, μ, and β frequency ranges [45]. 1/f noise is also an optimal communication channel for complex networks, as in art or language, and may therefore be the channel through which the brain influences complex processes and is influenced by them [46]. This raises the question of whether the 1/f noise in real semiconductor devices could be used to mimic some natural neural behavior in human brains, and play a role in the HNN.
From the mathematical perspective, 1/f noise is well-known for its 'memory', or long-range dependence, which basically refers to the level of statistical dependence between two points in the time series [47]. More specifically, it relates to the rate of decay of statistical dependence between the two points if the distance between them increases. For example, if a time series has a short memory, it is predictable from only its immediate past. The memory of a time series can be expressed using the autocorrelation. Autocorrelation refers to the correlation of a given signal with itself at various points in time [48]. For a time series with short memory, its autocorrelations decay quickly as the number of intervening observations increases. 1/f noise is an intermediate between white noise (a process without memory) and brown noise (a process with an infinite memory) [49]. The long-term memory of 1/f noise can be quantified using the autocorrelation function (ACF).
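The two notions used above can be written compactly. The symbols below (x_t, μ, σ², k, β) are notation introduced here for illustration and are not taken from the original text; the ACF definition assumes a stationary series.

```latex
% Autocorrelation function (ACF) at lag k of a stationary series x_t
% with mean \mu and variance \sigma^2
\rho(k) = \frac{\mathbb{E}\left[(x_t - \mu)(x_{t+k} - \mu)\right]}{\sigma^2}

% Power spectral density of power-law noise
S(f) \propto \frac{1}{f^{\beta}}, \qquad
\beta = 0 \ \text{(white)}, \quad
\beta = 1 \ \text{(1/f)}, \quad
\beta = 2 \ \text{(brown)}
```

For white noise, ρ(k) vanishes for every k ≠ 0, whereas for (band-limited) 1/f noise ρ(k) decays only slowly with k, which is what 'long memory' refers to here.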
In this work, we propose the idea of noise injection on HNNs by using the intrinsic 1/f noise in semiconductor devices. We demonstrate the overfitting suppression ability of 1/f noise in the MLP and LSTM for handwriting data recognition and weather prediction tasks, and attribute the superior performance of 1/f noise on the LSTM, which is used to process time series data, to the long range dependence of 1/f noise. This work reveals that 1/f noise in semiconductor devices can be a useful solution to suppress overfitting in HNNs, and suggests that the imperfectness of semiconductor devices is a rich mine of solutions to boost the development of brain-inspired hardware technologies.

Noise measurement and simulation
For the purpose of demonstration, experimental 1/f noise is measured from the drain current in a back-gated MoS2 field effect transistor (FET) (figure 2(a)), in both the time domain and the frequency domain. The channel length is 5 μm, the width is 19.4 μm and the MoS2 has nine layers. The 1/f noise can be easily tuned with Vgs (figure 2(b)) or Vds (figure 2(c)). It should be emphasized that, due to the huge number of neurons and synapses in a neural network, noise simulated with software was used in this work, instead of experimental data measured from practical devices, to study the impact of noise injection on overfitting suppression. The 'pinknoise.m' function in Matlab [50], which includes a random stream generator, a series of randomly initiated second-order section (SOS) filters and a gain, was used to generate time-domain 1/f noise, as schematized in figure 2(d). The Gaussian noise was simulated using the 'randn.m' function in Matlab [51].
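For readers without Matlab, the following is a minimal Python sketch of one way to simulate time-domain 1/f noise. It does not reproduce the SOS-filter method of 'pinknoise.m'; instead it uses a different, commonly used technique, namely shaping the spectrum of white Gaussian noise so that its power falls off as 1/f. The function names `pink_noise` and `psd_slope` are hypothetical.

```python
import numpy as np

def pink_noise(n, rng):
    """Return n samples of approximate 1/f noise with zero mean and unit variance."""
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)           # 0 ... 0.5 cycles per sample
    gain = np.zeros_like(freqs)          # gain[0] = 0 removes the DC component
    gain[1:] = 1.0 / np.sqrt(freqs[1:])  # amplitude ~ f^(-1/2), so power ~ 1/f
    shaped = np.fft.irfft(spectrum * gain, n=n)
    return shaped / shaped.std()

def psd_slope(x):
    """Least-squares slope of log10(power) versus log10(frequency)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    slope, _ = np.polyfit(np.log10(freqs[1:]), np.log10(power[1:]), 1)
    return slope

rng = np.random.default_rng(0)
noise_1f = pink_noise(2 ** 14, rng)
gaussian = rng.standard_normal(2 ** 14)  # Python counterpart of Matlab's randn
```

Calling `psd_slope` on the two series gives a quick check of the spectral shape: a slope near -1 indicates 1/f behaviour, while a slope near 0 indicates white noise.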

Multi-layer perceptron (MLP) simulation
The multi-layer perceptron (MLP) is a popular and practical neural network model consisting of three types of layers: the input layer, output layer and hidden layer [52]. In an MLP the data flows in the forward direction from the input layer to the output layer, while the synapses in the MLP are trained with the back propagation learning algorithm. The major use cases of the MLP are pattern classification, recognition, prediction and approximation. The MLP is sometimes called a 'memoryless' classifier because if one presents a pattern on its input units, the output units respond with an activation pattern, and those outputs depend only on the inputs at that moment, regardless of the previous input history.
In this work, an MLP with two hidden layers was simulated using Python, with 784, 50, 50 and 10 neurons in the input layer, 1st hidden layer (tanh), 2nd hidden layer (tanh) and output layer (relu), respectively. The MLP was trained and validated using the Modified National Institute of Standards and Technology handwritten digit database [53], in which 60 000 images were used for training and the other 10 000 were used for validation. During training and validation, the batch size is 100 and the learning rate is 0.0005. The loss function is cross entropy loss and the optimizer is Adam.
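The architecture described above can be sketched in NumPy as follows. This is only an illustration of the layer sizes and activations, not the authors' code: the weight initialization, the omission of bias terms, the random stand-in data and the choice to feed the ReLU outputs into a softmax cross-entropy are all assumptions, and training with Adam is not shown.

```python
import numpy as np

LAYER_SIZES = (784, 50, 50, 10)  # input, 1st hidden, 2nd hidden, output

def init_weights(rng, sizes=LAYER_SIZES):
    """One weight matrix per pair of adjacent layers (biases omitted for brevity)."""
    return [rng.standard_normal((m, n)) * np.sqrt(1.0 / m)
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, weights):
    """tanh on both hidden layers, ReLU on the output layer."""
    h1 = np.tanh(x @ weights[0])
    h2 = np.tanh(h1 @ weights[1])
    return np.maximum(h2 @ weights[2], 0.0)

def cross_entropy(outputs, labels):
    """Mean softmax cross-entropy, treating the network outputs as logits."""
    shifted = outputs - outputs.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
weights = init_weights(rng)
batch = rng.random((100, 784))            # stand-in for 100 flattened 28x28 images
labels = rng.integers(0, 10, size=100)    # stand-in digit labels
outputs = forward(batch, weights)
loss = cross_entropy(outputs, labels)
```

The three entries of `weights` correspond to the three weight layers onto which noise can be injected.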
Overfitting is clearly observed in this MLP: after training starts, both the training loss and the validation loss decrease (underfit) until around the 10th epoch, when the training loss keeps decreasing but the validation loss reaches its lowest point (optimal) and starts to increase. After that, the training loss keeps decreasing while the validation loss keeps increasing, which is a typical feature of overfitting, as shown in figure 3(b). To evaluate the impact of 1/f noise on overfitting suppression, a simulated time-domain 1/f noise is added to the weights before validation, as schematized in figure 4(a). Its amplitude is calculated according to equation (1), where SNR is a fixed signal-to-noise ratio and the signal is the weight value updated after back propagation in each epoch. For comparison, Gaussian noise with the same SNR is injected in the same way.
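The exact amplitude formula is not reproduced in this text, so the sketch below assumes the common power-ratio definition SNR(dB) = 10 log10(P_signal / P_noise), with the mean square of the weights as the signal power. The function names and the stand-in weight matrix are hypothetical.

```python
import numpy as np

def scale_noise_to_snr(signal, noise, snr_db):
    """Rescale noise so that mean(signal^2) / mean(noise^2) equals the target SNR."""
    signal_power = np.mean(np.square(signal))
    noise_power = np.mean(np.square(noise))
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return noise * np.sqrt(target_noise_power / noise_power)

def measured_snr_db(signal, noise):
    """SNR in dB under the same power-ratio definition."""
    return 10.0 * np.log10(np.mean(np.square(signal)) / np.mean(np.square(noise)))

rng = np.random.default_rng(0)
weights = rng.standard_normal((50, 10)) * 0.1   # stand-in for a weight layer after an update
raw_noise = rng.standard_normal(weights.shape)  # any noise source (1/f or Gaussian) fits here
scaled_noise = scale_noise_to_snr(weights, raw_noise, snr_db=10.0)
noisy_weights = weights + scaled_noise          # weights used for the next evaluation
```

Because only the ratio of powers is fixed, the same helper applies unchanged to 1/f noise and to Gaussian noise, which keeps the comparison between the two at equal SNR.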
The simulated 1/f noise is applied onto the three layers of weights, i.e. the weights between the input layer and the 1st hidden layer (Wih1), between the 1st and 2nd hidden layers (Wh1h2) and between the 2nd hidden layer and the output layer (Wh2o), respectively (figure 4(b)). The location of noise injection makes a major difference: noise injection on Wih1 leads to converged training and validation curves but with a higher final loss for both. For noise injection on Wh1h2 the overfitting is even worse. For noise injection on Wh2o, the training and validation curves are closer and the final validation loss is ∼50% lower than the initial level, indicating that the overfitting has been suppressed by the injection of 1/f noise. The training and validation losses after 100 epochs are summarized in figure 4(c). As shown in figure 4(d), Gaussian noise with the same SNR shows a similar effect to the 1/f noise, probably because the MLP is static and has a memoryless network architecture [54], so it does not respond differently to 1/f noise and Gaussian noise, as shown in figure 5(b).

Long-short term memory (LSTM) simulation
Artificial neural networks are expected to mimic the architecture and performance of human thoughts, which have persistence. For example, the reader of this paper understands each word based on the understanding of previous words, instead of throwing everything away and starting to think from scratch again. However, traditional neural networks, such as the MLP, cannot do this, which is a major shortcoming. RNNs, which have loops inside and allow information to persist, address this issue. In practice, it is found that an RNN can learn to use past information well if the gap between the relevant information and the place where it is needed is small. If this gap becomes large, which is entirely possible, RNNs become less capable of learning to connect the information, for fundamental reasons such as vanishing or exploding gradients.
LSTM networks are a special kind of RNN explicitly designed to avoid the long range dependence problem. In standard RNNs, the repeating module has a very simple structure, such as a single tanh layer. For LSTMs, the repeating module has a different structure, consisting of a cell, an input gate, an output gate and a forget gate. The forget gate allows unneeded information to be erased and forgotten. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information associated with the cell, thus solving the long range dependence problem, as shown in figure 6(a). Since one of the key features of 1/f noise is its long range dependence, 1/f noise might play a special role in suppressing overfitting in an LSTM for time-series tasks such as weather prediction.
In this work, the climate time-series dataset recorded at the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany, is used to train the LSTM. The entire dataset consists of 14 features, which were recorded once per 10 minutes. Three of the 14 features, i.e. temperature, pressure and air density, are selected for a quick demonstration, as visualized in figure 6(b). The values of those three features recorded at one time point are collectively defined as the data of 'one moment'. In other words, each moment contains three data points: temperature, pressure and air density. Ten consecutive moments are used for training and 500 consecutive moments for validation. Specifically, the first ten moments are used for training and the moments from the 20000th to the 20500th are used for validation, to avoid overlap. Since the features have values with varying ranges, normalization is carried out to confine the feature values to a range of [0, 1] before training the LSTM, by subtracting the minimum and dividing by the difference between the maximum and minimum of each feature.
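The normalization step can be sketched as below. The numbers in `moments` are made-up values for illustration only and are not taken from the dataset; the function name `min_max_normalize` is hypothetical.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column (feature) of data to the range [0, 1]."""
    minimum = data.min(axis=0)
    maximum = data.max(axis=0)
    return (data - minimum) / (maximum - minimum)

# Made-up values; columns are temperature, pressure and air density.
moments = np.array([
    [-8.0, 996.5, 1307.8],
    [-8.4, 996.6, 1309.8],
    [-8.5, 996.5, 1310.2],
    [-8.3, 996.5, 1309.2],
])
normalized = min_max_normalize(moments)
```

Note that this sketch divides by zero if a feature is constant over the selected moments, so a real pipeline would need to guard against that case.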
During training and validation, the batch sizes are 7 and 497, respectively. Therefore, the 8th, 9th and 10th moments of training are used as training labels, while the 498th, 499th and 500th moments of validation are used as validation labels. The optimizer is Adam and the learning rate is 0.0001. The loss function is the mean square error loss. During training, a simulated time-domain 1/f noise is applied onto the hidden state (ht) for overfitting suppression. The amplitude of the noise is calculated according to equation (1), where SNR is a fixed signal-to-noise ratio and the signal is the hidden state value updated in each epoch. For comparison, Gaussian noise with the same SNR is injected in the same way.
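A minimal NumPy sketch of noise injection on the hidden state is given below, using the standard LSTM cell equations. Several points are assumptions rather than details from the text: the hidden size, the random stand-in weights and inputs, the power-ratio definition of the SNR, the use of one noise sample per time step, and the choice to pass the noisy hidden state on to the next step. Gradient computation and the Adam update are not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the input, forget, output and candidate blocks."""
    z = x @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # forget old content, write new
    h = sigmoid(o) * np.tanh(c)                        # output gate exposes the cell
    return h, c

def inject_noise(h, noise, snr_db):
    """Add noise to h after rescaling it to the requested SNR (power ratio, in dB)."""
    gain = np.sqrt(np.mean(h ** 2) / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return h + gain * noise

rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 3, 8, 7
W = rng.standard_normal((n_features, 4 * n_hidden)) * 0.5
U = rng.standard_normal((n_hidden, 4 * n_hidden)) * 0.5
b = np.zeros(4 * n_hidden)
inputs = rng.random((n_steps, n_features))        # stand-in for normalized moments
noise = rng.standard_normal((n_steps, n_hidden))  # placeholder; a 1/f sequence could be used

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(n_steps):
    h_clean, c = lstm_step(inputs[t], h, c, W, U, b)
    h = inject_noise(h_clean, noise[t], snr_db=0.0)  # noisy state feeds the next step
```

At SNR = 0 dB the injected noise carries the same power as the clean hidden state, which is the setting used for the initial demonstration.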
Without noise injection, overfitting can be clearly observed in figure 6(c). After training starts, both the training loss and the validation loss decrease (underfit) until around the 180th epoch, when the training loss keeps decreasing but the validation loss reaches its lowest point (optimal) and starts to increase sharply. After that, the training loss keeps decreasing while the validation loss keeps increasing. After around the 250th epoch, the training and validation losses finally saturate at around zero and 0.5, respectively. 1/f noise and Gaussian noise with SNR = 0 dB are injected onto this LSTM for an initial demonstration (figures 7(b) and (c)). For the 1/f noise injection, the loss is held at the optimal point between the 280th and 470th epochs, while for the Gaussian noise injection, the loss reaches its lowest point at the 270th epoch, sharply increases to a peak of around 0.38 at the 380th epoch, and then gradually decreases afterwards. This supports that, for the LSTM, 1/f noise and Gaussian noise can both suppress overfitting, but with different performance. In the Gaussian noise case, although the loss is lower than in the 'without noise injection' case, it fluctuates and can be as high as ∼0.38, whereas the 1/f noise effectively fixes the loss at around the optimal point for around 200 epochs. The impact of SNR on the loss is further demonstrated in figure 7(d), showing that the optimal SNR for overfitting suppression is −1 dB.
The 1/f noise shows a ∼100% stronger overfitting suppression effect compared with the Gaussian noise: 1/f noise lowers the loss by 0.4 (from ∼0.5 to ∼0.1, which is the optimal level in figure 7(b)), while Gaussian noise only lowers the loss by 0.2 (from ∼0.5 to ∼0.3), as shown in figure 7(c). This is very different from the MLP and strongly indicates that there might be some 'coupling effect' through which the long-term memory of 1/f noise enhances overfitting suppression in the LSTM. At SNRs below 10 dB, 1/f noise shows a stronger capability to suppress overfitting than Gaussian noise (figure 7(d)). For SNRs below −5 dB, the noise is far larger than the signal, and the network is practically learning the noise instead of the signal.
To evaluate if the memory ability of LSTM really makes a difference, the 1/f noise is sampled at a fixed interval length to mimic the training time per epoch (default interval = 1). For comparison, a zero-mean Gaussian noise is also simulated and added.
This assumption is further confirmed in figure 8(a), where the noises are sampled at different intervals, i.e. using different sampling frequencies. For the 1/f noise, the loss depends on the sampling frequency (on a logarithmic scale), while for the Gaussian noise the loss is almost independent of the sampling frequency. If the interval is larger than 50, the losses for 1/f noise and Gaussian noise become similar. This is strong evidence that the autocorrelation of 1/f noise plays an important role in the overfitting suppression of the LSTM. Considering the LSTM's sequential structure, we can predict that the autocorrelation of 1/f noise could make it a special solution for overfitting suppression in the LSTM. The Gaussian noise, which is memoryless, does not have such a benefit. Figure 8(b) compares the loss during 1000 epochs using the optimal SNR of −1 dB. For the 1/f noise injection, the loss remains at its lowest point between the 280th and 470th epochs, and starts to increase gradually afterwards. For the Gaussian noise injection, the loss reaches its lowest point at the 270th epoch, sharply increases to a peak at the 380th epoch, and then gradually decreases afterwards. Although the 1/f noise injection does not totally eliminate the overfitting phenomenon, it still shows a significant effect in suppressing overfitting and keeping the LSTM at the optimal condition for 190 epochs (from the 280th to the 470th), much better than the Gaussian noise.
The different performance of 1/f noise and Gaussian noise can be explained by using the 'xcorr' function in Matlab to calculate the ACF of 1/f noise and Gaussian noise [55], as shown in figure 9. The 1/f noise and Gaussian noise are sampled in the same way as in figure 5(a), with various intervals. As the lag increases, the ACF of 1/f noise decays gradually, supporting that 1/f noise has long memory with autocorrelation. However, for Gaussian noise, the ACF is almost independent of the lag, indicating that it does not have memory, or long range dependence. It should also be noted that if 1/f noise is sampled using a very large interval, say 50, its ACF also seems independent of the lag, which is a strong indication that the long range dependence has decayed and the sampled 1/f noise is now very similar to the Gaussian noise, which also leads to the similar effect in figure 8(a).
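The ACF comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of figure 9: the 1/f noise is generated by spectral shaping of white noise rather than with Matlab's 'pinknoise', the ACF is computed directly rather than with 'xcorr', and the sequence length and intervals are arbitrary choices. The function names are hypothetical.

```python
import numpy as np

def pink_noise(n, rng):
    """Approximate 1/f noise by spectrally shaping white noise (an assumed method)."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    gain = np.zeros_like(freqs)          # zero gain at DC gives a zero-mean series
    gain[1:] = 1.0 / np.sqrt(freqs[1:])  # power ~ 1/f
    return np.fft.irfft(spectrum * gain, n=n)

def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag, normalized so acf[0] = 1."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
n = 2 ** 16
pink = pink_noise(n, rng)
white = rng.standard_normal(n)

# Keep every `interval`-th sample, then compute the ACF of the subsampled series.
results = {}
for interval in (1, 10, 50):
    results[interval] = {
        "pink": acf(pink[::interval], max_lag=5),
        "white": acf(white[::interval], max_lag=5),
    }
```

In this sketch the correlation between neighbouring samples of the 1/f series weakens as the sampling interval grows, while the white noise stays uncorrelated at every interval. How quickly the 1/f correlation fades depends on the generator and on the sequence length, so the numbers need not match those obtained with Matlab.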
The aim of this paper is to give a preliminary demonstration that physical properties of materials, such as 1/f noise, show some advantage over the software-based approach in the development of HNNs based on semiconductor devices such as transistors and memristors, where implementing software-style noise injection could be difficult: complex peripheral circuitry needs to be designed to generate and modulate the Gaussian noise. For example, a conventional way to generate additive white Gaussian noise is to use a Zener diode in a reverse-biased circuit. This brings additional area and power consumption to the HNN.
On the other hand, the semiconductor devices that form the HNN are naturally great sources of different noises, such as thermal noise originating from the thermal agitation of charge carriers, shot noise originating from the discrete nature of electric charge, random telegraph noise from the trapping of carriers, and 1/f noise originating from carrier number fluctuation, in addition to the Gaussian noise. This is a significant advantage, as noise can be conveniently obtained from the semiconductor devices that form the HNN, without using a Zener diode or other devices/circuits for noise generation. Furthermore, those noises have different characteristics, such as the long-term memory/dependence of 1/f noise, which can provide even better overfitting suppression for some special architectures such as the LSTM, if they are properly selected and modulated.

Conclusions
In this work, we proposed the idea of using 1/f noise injection to suppress overfitting in different neural networks, and demonstrated that: (i) 1/f noise can suppress overfitting in the MLP and LSTM; (ii) 1/f noise and Gaussian noise perform similarly for the MLP but differently for the LSTM; (iii) the superior performance of 1/f noise on the LSTM can be attributed to its intrinsic long range dependence. This work reveals that 1/f noise, which is common in semiconductor devices, can be a useful solution to suppress overfitting in HNNs. It also provides strong support for the view that the imperfectness of semiconductor devices can be exploited to provide solutions for the development of hardware AI technologies, mimicking human brains, which are not always precise but have worked efficiently and accurately for millions of years.