Biologically-inspired neuronal adaptation improves learning in neural networks

ABSTRACT Since humans still outperform artificial neural networks on many tasks, drawing inspiration from the brain may help to improve current machine learning algorithms. Contrastive Hebbian learning (CHL) and equilibrium propagation (EP) are biologically plausible algorithms that update weights using only local information (without explicitly calculating gradients) and still achieve performance comparable to conventional backpropagation. In this study, we augmented CHL and EP with Adjusted Adaptation, inspired by the adaptation effect observed in neurons, in which a neuron’s response to a given stimulus is adjusted after a short time. We add this adaptation feature to multilayer perceptrons and convolutional neural networks trained on MNIST and CIFAR-10. Surprisingly, adaptation improved the performance of these networks. We discuss the biological inspiration for this idea and investigate why Neuronal Adaptation could be an important brain mechanism to improve the stability and accuracy of learning.


Introduction
Deep neural networks outperform humans in Atari games [1] and the game of Go [2], but fall short of humans in tasks such as art, music, and translation. Thus, looking to the brain for inspiration may help to improve current deep neural networks, and many researchers are doing so to close the gap between biological and machine learning. For example, while the backpropagation algorithm (BP) has long been used to train deep neural networks by backpropagating error signals through the layers of the network [3], the debate over whether biological neurons could support such an operation [4,5] has led to several more biologically plausible algorithms being proposed [6][7][8][9][10][11][12]. This paper will focus on two of these: contrastive Hebbian learning (CHL) [13][14][15] and the closely related equilibrium propagation (EP) [12][16][17][18]. These algorithms model the network as a dynamical system and learn from temporal differences in activity rather than explicitly calculated errors or gradients.
CHL and EP consist of two learning phases: a free or negative phase, followed by a clamped or positive phase (note that these learning "phases" are computational stages in CHL and EP models, and are not to be confused with the phases of the sinusoidal components that make up larger-scale brain rhythms). During the free phase, an input signal is presented and the network is allowed to equilibrate to a steady state. The clamped phase then clamps output neurons to the desired targets, and the network is allowed to re-equilibrate. CHL clamps output neurons completely, while EP uses "soft" clamps that nudge the output activities toward the desired level, which may be more biologically plausible. Weight updates are based on the differences between free- and clamped-phase activities of neurons on either side of the weight. Our previous study [19] showed that free-phase steady-state activity can be well predicted from the first few steps of neural dynamics. This would mean that a full free-phase equilibration may not actually be required before output clamping, and that learning could occur without two distinct phases, making the EP concept even more biologically plausible.
In this study, we add adjusted adaptation to CHL and EP. This is based on the Neural Adaptation effect that has been observed in biological neurons in several systems [20,21]. During neural adaptation, a neuron's response to a given stimulus will usually decrease over a short period of time from its initial level. This could be interpreted as a smooth change of sensitivity to the stimulus.
The neural adaptation phenomenon is well studied from both biological and computational perspectives. For example, the integrate-and-fire model of Hertäg et al. [22] included adaptation and reproduced realistic spiking behaviors. Jolivet et al. [23] showed that spike-frequency adaptation was a key feature for a computational model to accurately predict spike trains of real neurons. Reutimann et al. [24] studied the cortex of monkeys, suggesting that firing rate adaptation in inhibitory neurons causes climbing activity in the cortex which represents the passage of time and predicts the timing of important events. Stemmler et al. [25] derived an unsupervised learning rule describing how firing rates could be modulated for maximum information compression. Fontaine et al. [26] studied membrane potential variability in barn owls, concluding that it is a genuine feature of neurons as opposed to a result of noise. Carandini et al. [27] used a rectification model [28] to model the effect of adaptation in the cat visual cortex, and Treves [29] has discussed adaptation as a prerequisite for several key computational mechanisms throughout the brain [30][31][32]. Given the vast research on adaptation and its importance in the brain, here we suppose that adaptation may offer advantages for artificial neural networks and test its effect on networks trained to perform benchmark classification tasks.
Here, we model adaptation during the clamped phase, such that each neuron's clamped-phase activity is gradually pushed back toward its free-phase activity. This reduces the difference between the free- and clamped-phase activations, modulating weight updates in a way that makes training smoother. This paper extends our previous work [19,33] by applying the adaptation concept to multilayer perceptrons (MLPs) and convolutional neural networks (CNNs) trained on MNIST [34] and CIFAR-10 [35] by CHL and EP. We make the following contributions: (1) We demonstrate that models with adaptation achieve better performance than models without adaptation on both MNIST and CIFAR-10 tasks. (2) We provide an explanation of why adaptation may work by comparing the training gradients created with and without adaptation to the gradients created by backpropagation.

Method
In this section, we discuss some notations, methods, and model specifications for our experiments.

EP and CHL
During the free phase of EP, the network calculates the dynamics of activity without any target signals or gradient signals. In a one-hidden-layer MLP, the equations of dynamics for activity at each layer follow previous work [12,16]. In these equations, x is an activity; w represents the weights for each layer; i, j, and o are indexes of input, hidden, and output layer neurons; and b is a bias. p is an activation function such as the sigmoid function, γ is the feedback parameter, and h is the time step of the Euler method. Please note that for consistency with our previous work [19], we use the letter o for indexing output units; it should not be confused with the number 0, which is not present in our equations. During the clamped phase, the output neurons are influenced by target signals. In the output-layer dynamics of this phase, y is a target signal and β is a nudging parameter that pushes output-layer activations toward their targets. Given the activations at the free and clamped phases, the weights are then updated; in the update rule, x̂ is an activity at the clamped phase, x̌ is an activity at the free phase, and α is the learning rate. pre and post are the indexes of the presynaptic and postsynaptic neurons, respectively (e.g. for Δw_{i,j}, pre and post are i and j, respectively).
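As a rough illustration of the two phases described above, the following Python sketch runs a free phase and then a softly clamped phase on a tiny network. It is not the authors' implementation: it assumes the commonly used Euler form x ← x + h(−x + p(input)), assumes that feedback reuses the hidden-to-output weights scaled by γ, and assumes a nudging term β(y − x) at the output. The function names (`euler_step`, `run_phase`), the 2-3-2 architecture, and all numerical values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def euler_step(x_in, x_hid, x_out, w_ih, w_ho, b_h, b_o, gamma, h, beta=0.0, y=None):
    """One Euler step of the assumed dynamics; y=None gives the free phase."""
    new_hid = []
    for j in range(len(x_hid)):
        bottom_up = sum(w_ih[i][j] * x_in[i] for i in range(len(x_in)))
        # Assumption: feedback reuses the hidden-to-output weights, scaled by gamma.
        top_down = gamma * sum(w_ho[j][o] * x_out[o] for o in range(len(x_out)))
        new_hid.append(x_hid[j] + h * (-x_hid[j] + sigmoid(bottom_up + top_down + b_h[j])))
    new_out = []
    for o in range(len(x_out)):
        drive = sum(w_ho[j][o] * x_hid[j] for j in range(len(x_hid)))
        dx = -x_out[o] + sigmoid(drive + b_o[o])
        if y is not None:
            dx += beta * (y[o] - x_out[o])  # soft clamp: nudge toward the target
        new_out.append(x_out[o] + h * dx)
    return new_hid, new_out

def run_phase(x_in, x_hid, x_out, params, steps, beta=0.0, y=None):
    """Iterate the dynamics for a fixed number of steps with a constant input."""
    for _ in range(steps):
        x_hid, x_out = euler_step(x_in, x_hid, x_out, *params, beta=beta, y=y)
    return x_hid, x_out

# Hypothetical 2-3-2 network with hand-picked small weights.
w_ih = [[0.2, -0.3, 0.1], [0.4, 0.1, -0.2]]
w_ho = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.2]]
params = (w_ih, w_ho, [0.0, 0.0, 0.0], [0.0, 0.0], 0.1, 0.1)  # ..., gamma, h
x_in, y = [1.0, 0.5], [1.0, 0.0]

# Free phase from rest, then a clamped phase started from the free-phase state.
free_hid, free_out = run_phase(x_in, [0.0] * 3, [0.0] * 2, params, steps=120)
clamped_hid, clamped_out = run_phase(x_in, free_hid, free_out, params,
                                     steps=120, beta=0.5, y=y)
```

In this sketch the clamped-phase outputs end up closer to the target than the free-phase outputs, and the difference between the two sets of activities is what a contrastive weight update would use.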
CHL is very similar to EP, but it replaces Equations 3 and 4 with its own clamped-phase equations. Note that Equation 5 is used only for the clamped phase at the output layer. The free phase uses Equation 1 at the output layer. Thus, CHL clamps output neurons completely, whereas EP uses a "soft" clamping effect. The soft clamping weakly nudges the free-phase outputs toward their targets to reduce the difference between current outputs and targets. CHL, on the other hand, uses "hard" clamping, where the output neurons are held at the desired target values. Note that when the gaps between the clamped-phase activity x̂ and the free-phase activity x̌ are large, especially at the top layer, Δw is also large, potentially leading to abrupt jumps in network weights. Softening the clamping by reducing β does not solve this problem, because the 1/β term in Equation 4 amplifies the differences. We can reduce α, but this slows learning convergence [36,37], and a smaller learning rate could lead the network to find sharp local minima [38].

Predictive learning rule
In this subsection, we discuss our learning rule. Our rule modifies Equation 6 by replacing x̌_pre (the free-phase activity of the presynaptic neuron) with x̂_pre (the clamped-phase activity of the presynaptic neuron). Our previous work [19] showed that a rule of this form emerges naturally if we assume that each neuron is working to maximize its metabolic energy. In the case of CHL, the 1/β factor is removed from the new rule. We call this update rule the predictive learning rule because our previous study used predicted free-phase steady-state activity in place of the actual free-phase activity x̌ (allowing neurons to predict their own free-phase steady state may be more biologically plausible, as it allows learning to occur without requiring two distinct phases). However, for the purpose of investigating adaptation, this study computes free-phase steady-state activities in the conventional way, which is the same as assuming perfect predictions of free-phase activity (a reasonable assumption, as our previous work found that the correlation between predicted and actual free-phase activity was R = 1 ± 0.0001 SD [19]). For consistency with this previous study [19], we apply a delay to the clamped (teaching) signal in models trained by CHL. For example, Figure 1 shows that the signal is clamped after 12 steps. In our previous paper, we discuss why such a delay could be more biologically plausible: in the visual cortex, a top-down "teaching" signal (similar to the clamped phase) arrives tens of milliseconds later than the initial bottom-up signal (similar to the free phase).
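The substitution described above can be made concrete with a small sketch. It assumes that the conventional contrastive update has the standard form α(x̂_pre x̂_post − x̌_pre x̌_post), so that replacing x̌_pre with x̂_pre gives α x̂_pre (x̂_post − x̌_post), with an optional 1/β factor for EP. The function names and the numerical values are hypothetical and are not taken from the authors' code.

```python
def chl_update(x_pre_free, x_post_free, x_pre_clamped, x_post_clamped, alpha):
    """Conventional contrastive Hebbian update (assumed standard form)."""
    return alpha * (x_pre_clamped * x_post_clamped - x_pre_free * x_post_free)

def predictive_update(x_pre_clamped, x_post_free, x_post_clamped, alpha, beta=None):
    """Predictive rule: the presynaptic free-phase activity is replaced by
    the presynaptic clamped-phase activity. beta=None drops the 1/beta
    factor (CHL case); a numeric beta keeps it (EP case)."""
    scale = alpha if beta is None else alpha / beta
    return scale * x_pre_clamped * (x_post_clamped - x_post_free)

# When the presynaptic activity is the same in both phases (for example a
# constant input unit), the two rules give the same update.
dw_chl = chl_update(0.5, 0.2, 0.5, 0.6, alpha=0.1)
dw_pred = predictive_update(0.5, 0.2, 0.6, alpha=0.1)
dw_pred_ep = predictive_update(0.5, 0.2, 0.6, alpha=0.1, beta=0.5)
```

Under these assumptions the predictive rule needs only the presynaptic activity of one phase, and the update vanishes whenever the postsynaptic neuron's clamped-phase and free-phase activities agree.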

Adjusted adaptation
Adjusted Adaptation was introduced in our previous study [33], based on neural adaptation observed in the brain. To implement Adjusted Adaptation, the activities at the clamped phase are nudged toward the activities at the free phase to reduce the gap between them. We speculate that these smaller gaps give smoother weight updates compared to conventional EP. We model adaptation by Equation 8, in which the parameter c sets the strength of adaptation and x̌ is the steady state at the free phase. Equation 8 is applied during additional steps after computing the clamped-phase dynamics. The form of Equation 8 was chosen to allow for a simple formula to vary the strength of adaptation. This formulation results in a smooth, exponential change in neuronal activity across time steps (Figure 1, steps 120-140), which is similar to changes observed experimentally in actual neurons during adaptation [20]. However, the exact relation of our parameter c to the biological time constant(s) of adaptation is left for future work, as it may require experimental work to properly relate the two. Both the free and clamped phases run for 120 time steps, and teaching signals are given after 12 steps. After the 120 steps of the clamped phase, an additional 20 adaptation steps are applied using Equation 8.
Figure 1 depicts the adapted dynamics of activities during a CHL clamped phase. In this case, the clamped (teaching) signal is given 12 steps after the input signal is presented, as explained earlier. Adaptation is applied after 120 steps and nudges the clamped-phase activity back toward the free-phase activity. The input signal is constant throughout the free and clamped phases. Details of the training process of our model with adaptation are given as pseudocode in Algorithm 1, shown next to Figure 1.
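The adaptation stage can be sketched as follows. The source's Equation 8 is not reproduced here, so the update below is an assumption: a simple linear relaxation x̂ ← x̂ + c(x̌ − x̂), chosen because it yields the smooth exponential approach to free-phase activity described above. The function name `adapt` and the example activities are hypothetical; c = 0.1 and 20 steps are the values reported for the MLP experiments.

```python
def adapt(x_clamped, x_free, c=0.1, steps=20):
    """Relax a clamped-phase activity toward the free-phase steady state.

    Assumed update: x <- x + c * (x_free - x), applied `steps` times.
    Returns the whole trajectory, starting with the clamped-phase value.
    """
    trajectory = [x_clamped]
    x = x_clamped
    for _ in range(steps):
        x = x + c * (x_free - x)
        trajectory.append(x)
    return trajectory

# Hypothetical activities of a single neuron at the end of each phase.
x_free, x_clamped = 0.30, 0.80
trajectory = adapt(x_clamped, x_free, c=0.1, steps=20)

# Under this update the gap shrinks by a factor (1 - c) per step.
remaining_fraction = (trajectory[-1] - x_free) / (x_clamped - x_free)
```

With this assumed form, the remaining gap after 20 steps is (1 − 0.1)^20, or about 12% of the original difference, so the clamped-phase activity still carries the teaching signal but in a strongly attenuated form. A larger c would shrink the gap faster, which is consistent with the observation that large c diminishes the impact of the teaching signal.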

Model specifications
For the MLPs trained with CHL on the MNIST dataset, the number of time steps for both the free and clamped phases was set to 120. We tried seven different architectures: 782-6-10 (the numbers of neurons in the input-hidden-output layers), 782-50-10, 782-50-10 with lateral connections in the hidden layer, 782-50-10 with lateral connections in both the hidden and top layers, 782-1000-10, 782-1000-10 with lateral connections in the hidden layer, and 782-1000-10 with lateral connections in both the hidden and top layers. We ran experiments on these different architectures to check that our results remain robust when model parameters change. We used a learning rate of 0.1 for the 782-6-10 and 782-50-10 models, and a learning rate of 0.03 for the 782-1000-10 models. The teaching signal delay was 12 steps and h was 0.1 for all the networks. For our model with adjusted adaptation, 20 extra steps were used for adaptation and c was 0.1 (c = 0.1 gave the best performance in our simulations; however, a wide range of values of c between 0.01 and 0.2 also resulted in consistent improvements over the model with no adaptation; we did not extensively test values of c above 0.2 because a larger c diminishes the impact of the clamped (teaching) signal, which leads to very slow learning in the network). All models use the sigmoid activation function. For all of these MLP experiments, we used the AdaGrad optimizer [39] (for how to implement CHL with AdaGrad, please see our code at https://github.com/ykubo82/bioCHL/blob/add-license-1/CHL_clamped.py, specifically Line 107).
For the CNNs trained with EP on the CIFAR-10 dataset, the numbers of time steps for the free and clamped phases are 130 and 30, respectively, and β is 0.18. We set the learning rates to 0.21, 0.021, and 0.021 for the first convolutional layer, the second convolutional layer, and the fully connected layer, respectively. Our model consists of two convolutional layers with 256 and 512 filters, respectively, each of size 3 × 3, followed by one dense layer for the output. h is 1.0 for this model. For the clamped phase, activities from the free phase at time step 110 are used as the initial activities; this plays the same role as the teaching signal delay in our previous experiments. For our model with adjusted adaptation, 10 extra steps are used for adaptation and c is 0.1. The activation function for these models is the hard sigmoid function [17]. We did not use any optimizer for these CNN models. Our CNN models are based on [16] and [33].
Code for our networks showing all the implementation details is available at https://github.com/ykubo82/AdpNet.

Results
Table 1 and Figure 2 show the results for MLPs with and without adaptation on the MNIST dataset. We found that adaptation improved performance for all tested models. Similarly, CNNs with adjusted adaptation (CNN-ADP) on CIFAR-10 achieved a better test error of 19.51 ± 0.6%, compared to 22.46 ± 0.59% for the model without adjusted adaptation (CNN). Figure 3 shows the learning curves for these models. Note that the learning curves for models with adaptation are smoother. We consider this smoothness to be the result of adaptation, which reduces the gap between activities at the free and clamped phases.
In previous work, we showed that our model, even without adaptation, achieved an error rate similar to that of neural networks with comparable architecture trained with the BP algorithm on the MNIST dataset [19]. Moreover, we previously compared our model directly with the BP algorithm in a convolutional network trained on the CIFAR-10 dataset. To ensure the generality of that comparison, we repeated the training with BP three times using a different learning rate for each simulation. Using BP, we achieved a smallest error rate of 21.23%, which was similar to that of our algorithm when applied to the same convolutional network (error rate of 20.88%) [19]. Therefore, we did not repeat those comparisons to BP here, as they have already been published.

Gradient checks
Why does adaptation work? To answer this question, we calculated the angles between the weight gradients of CHL models with and without adaptation and those of models trained with backpropagation on the MNIST dataset. These gradients were calculated without applying the optimizer. This comparison is inspired by Lillicrap et al. [40]. Specifically, we calculated the angle θ between two weight gradient vectors, a and b. Each value in vector a represents a single weight update calculated as explained above; similarly, the corresponding values in vector b represent the change in weights calculated using BP in the same network. While the gradients for BP are not always optimal in general (sometimes BP converges to local minima), BP is currently the gold standard for neural network training, and thus a good comparator. We found that weight gradients calculated in a network with adaptation were closer to BP gradients than gradients calculated without adaptation (Figure 4). Note that the shapes of the curves in Figure 4 are similar (indicating that the learning process progresses at similar rates), but the curve for adaptation (blue) is always closer to 0°, indicating that adaptation consistently produces individual updates that are closer to those of BP. Moreover, to ensure a fair comparison, we used the same initial weights to calculate the gradients for the networks with and without adaptation. This might be one of the reasons for the similar shape of the curves. In addition, we confirmed the robustness of our results in a network with a larger number of neurons on the MNIST dataset (Figure 4, right side). In summary, our results suggest that adaptation changes network dynamics in a way that makes weight gradients closer to those of BP. The meaning of these angles is best understood as the difference in the direction in which each network is moving through the multidimensional parameter space during training. An angle of zero would mean, for example, that every weight that is increasing or decreasing in one network is also increasing or decreasing in the other. The fact that the networks with adaptation have lower angles in Figure 4 shows that adaptation makes weight updates more similar to those that would be produced by BP.
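For readers who want to reproduce this kind of comparison, the angle between two flattened gradient vectors can be computed from the standard relation cos θ = a·b / (‖a‖ ‖b‖). The sketch below is a generic illustration of that formula, not the authors' analysis code; the function name `angle_degrees` and the example vectors are hypothetical.

```python
import math

def angle_degrees(a, b):
    """Angle in degrees between two equal-length, nonzero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    # Clip to [-1, 1] to guard against floating-point round-off before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

# Hypothetical flattened weight updates from two training rules.
update_local = [0.02, -0.01, 0.03, 0.00]
update_bp = [0.04, -0.02, 0.06, 0.00]   # same direction, twice the size
update_other = [0.00, 0.00, 0.00, 0.05]  # orthogonal to update_local

angle_same = angle_degrees(update_local, update_bp)
angle_orth = angle_degrees(update_local, update_other)
```

Because the angle ignores vector length, two rules that change every weight in the same proportions give an angle near 0° even if their step sizes differ, whereas unrelated update directions give angles near 90°.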
As a sanity check, we also tested whether implementing adaptation is simply equivalent to reducing the learning rate. To do so, we trained an MLP (782-6-10) without adaptation but with the learning rate reduced to (1) 0.05, (2) 0.01, and (3) 0.001. We trained these models until the test error started increasing. The smallest test errors for networks with learning rates (1), (2), and (3) were 16.03 ± 1.38%, 13.95 ± 0.39%, and 13.22 ± 1.25%, respectively. This shows that networks without adaptation and with smaller learning rates still have larger test errors than the network with adaptation (11.93 ± 0.84%; learning rate: 0.1).
In addition, we checked the angles in the network trained on the CIFAR-10 dataset. The mean angle between weight gradients calculated using BP and our model with adaptation was ∆Adp∡∆BP = 8.1° ± 4.29 SD. The model with no adaptation resulted in angles larger on average by 1.06° ± 0.3 SD. Thus, analyses of networks on the CIFAR-10 and MNIST datasets provide consistent results.
Figure 4. Mean angle between weight gradients calculated using BP and CHL (left: 6 neurons in a hidden layer; right: 12 neurons in a hidden layer). The blue line shows the angle calculated for CHL with adaptation, and the orange line for CHL without adaptation. "W1" means the gradients for the weights between the input and hidden layers, and "W2" means the gradients for the weights between the hidden and output layers. The angle between the gradients with adaptation and BP is denoted as ∆Adp∡∆BP. The angle between the gradients without adaptation and BP is denoted as ∆NoAdp∡∆BP. Note that the blue line is consistently closer to 0, which demonstrates that CHL with adaptation provides weight updates more similar to BP gradients.

Conclusions
Neural adaptation has been observed in all types of neurons in vertebrates, as well as in invertebrates [20,21]. Neuronal adaptation can be defined as a change in activity over time in response to the same sensory stimulus, such as sound, light, or tactile stimulation. Usually, neuronal activity adapts most rapidly at the beginning and then plateaus at a steady-state value, similar to what is implemented in our model (Figure 1). Interestingly, it was also proposed that neuronal adaptation could be a brain mechanism for surprise minimization, which may underlie conscious perception [33]. Considering that neural adaptation is a ubiquitous phenomenon across neuronal systems, it may serve an important function in neural information processing. Here, we provide the first quantitative account of how neuronal adaptation can improve learning in deep neural networks.
Such improved learning could be due to the fact that if activity in the clamped phase is very different from activity without the clamp, then learning may deteriorate, as those two network states could be in different modes of the energy function [10,12,33]. Adaptation may thus reduce this problem by bringing the clamped state closer to the already learned state (free phase). This could also have a more cognitive interpretation. Consider an analogy: if part of a car is occluded by a tree, then based purely on sensory information, we cannot say what is behind that tree. However, based on what we have learned about the world so far, we know what shape a car has, and thus we can assume that the rest of the car is likely behind the tree. Similarly, neuronal adaptation may allow a network to use already learned information (manifested by activity in the free phase) as a strong context to more appropriately store new information (the clamped signal).
We also note that adaptation could provide a regularization effect: if there is a large difference between a particular neuron's clamped- and free-phase activities, that neuron would experience a stronger push back toward its free-phase activity. The result would be a reduction of that neuron's effect on the overall network's direction during learning. Thus, adaptation could be seen as a new activation regularization method, similar to the commonly used dropout method [41], which is also a form of activation regularization.
It is also interesting to ask why adaptation would make weight updates more similar to those calculated with BP. The most likely explanation comes from earlier theoretical work, which showed equivalence between CHL and BP [10,42], and between EP and BP [17,18], under certain assumptions. Despite different assumptions in those theoretical derivations (e.g. different network architectures), the common result is that when the difference between activity in the clamped and free phases approaches zero, the gradients become proportional to the error derivatives calculated with BP. In other words, if the change in network state caused by clamping the output neurons is infinitesimal, then it is equivalent to the error signal spread by backpropagation, except for a scalar prefactor [42]. In our model, adaptation reduces exactly that difference between activity in the clamped and free phases for each neuron. Therefore, adaptation could result in gradients that follow BP gradients more closely. This suggests the intriguing idea that biology may use adaptation to implement a better approximation of the BP algorithm, which in many practical applications is a very effective learning algorithm. However, the meaning of the angles being quite large (e.g. >50°) is not entirely clear to us. It is possible that there are many ways to solve a given classification problem, and that EP tends to learn a different solution than BP (but adaptation makes the solution more like that found by BP).
It is also interesting to notice that our model is consistent with oscillatory brain activity. We suggest that a single ~2-20 Hz oscillation in the brain, consisting of ~50-500 ms long burst of neuronal population activity, can perform computation as in our algorithm. For example, the initial steps 1-12 in Figure 1 could correspond to activity evoked by the bottom-up sensory information. The following clamped phase in steps 13-120 may correspond to activity driven by top-down feedback information from higher-order areas. This activity is then modified toward levels expected from the initial bottom-up signals, which corresponds to adaptation in steps 121-140 in Figure 1. All those "sub-stages" may be seen as nested oscillations within a single slower oscillation which encompasses steps 1-140 in Figure 1. Interestingly, this idea implies that sensory information may be processed in discrete units, and each <~20 Hz oscillation may represent computation of a single percept [19]. For instance, this could explain why we cannot process visual information faster than ~20 frames per second, as it can take ~50 ms for the brain to evaluate if a single image is consistent with expectation, and only after that the next image could be processed in the next oscillation. However, more experimental work is needed to investigate how our model could be related to the nested network of rhythms or clocks in the brain.
In future work, we plan to apply models with adaptation to neuronal data analyses [43][44][45][46][47] and to reinforcement learning tasks [48,49]. The regularization effect of adaptation may also help to improve the training of networks with BP. This might also help us to better understand the biological function of adaptation.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This study is supported by NSERC DG, Compute Canada, and CIHR Project grants to AL.