Spike-based computation using classical recurrent neural networks

Spiking neural networks (SNNs) are a class of artificial neural networks in which neurons communicate only through events, also called spikes. This property allows neural networks to perform asynchronous and sparse computations and therefore to drastically decrease energy consumption when run on specialized hardware. However, training such networks is known to be difficult, mainly due to the non-differentiability of the spike activation, which prevents the use of classical backpropagation. This is because state-of-the-art SNNs are usually derived from biologically-inspired neuron models, to which machine learning methods are applied for training. Nowadays, research on SNNs focuses on the design of training algorithms whose goal is to obtain networks that compete with their non-spiking versions on specific tasks. In this paper, we attempt the symmetrical approach: we modify the dynamics of a well-known, easily trainable type of recurrent neural network (RNN) to make it event-based. This new RNN cell, called the spiking recurrent cell, therefore communicates using events, i.e. spikes, while being completely differentiable. Vanilla backpropagation can thus be used to train any network made of such RNN cells. We show that this new network can achieve performance comparable to other types of spiking networks on the MNIST benchmark and its variants, the Fashion-MNIST and the Neuromorphic-MNIST. Moreover, we show that this new cell makes the training of deep spiking networks achievable.


Introduction
In the last decade, artificial neural networks (ANNs) have become increasingly powerful, overtaking human performance in many tasks. However, the functioning of ANNs diverges strongly from that of biological brains. Notably, ANNs require a huge amount of energy for training and inference, whereas biological brains consume much less power. This energy greediness prevents ANNs from being used in some environments, for instance in embedded systems. One of the considered solutions to this problem is to replace the usual artificial neurons by spiking neurons, mimicking the function of biological brains. Spiking Neural Networks (SNNs) are considered the third generation of neural networks [Maass, 1997]. Such networks, when run on neuromorphic hardware (like Loihi [Davies et al., 2018] for instance), can show very low power consumption. Another advantage of SNNs is their event-driven computation. Unlike usual ANNs, which propagate information through every layer and every neuron at each forward pass, SNNs only propagate information when a spike occurs, leading to more event-driven and sparse computations. Nonetheless, the development of SNNs faces a challenging problem: the activation function that is usually used to generate spikes is not differentiable, therefore preventing any training using usual backpropagation [Rumelhart et al., 1986], which is at the core of ANNs' success. Several solutions are being considered nowadays, as discussed in section 2. The classical approach consists in using a simple model for the spiking neurons, to which learnable weights are added. Then, methods inspired from classical machine learning are used for training, either by directly training the SNN, or by first training an ANN and then converting it into an SNN.
In this paper, we approach the problem from the other side: from the well-known Gated Recurrent Unit (GRU) [Cho et al., 2014], we derive a new event-based recurrent cell, called the Spiking Recurrent Cell (SRC). SRC neurons communicate via events, generated with differentiable equations. The SRC and its equations are described in section 3. Such an event-based cell makes it possible to leverage classical recurrent neural network (RNN) training approaches to create networks that compute using spikes.
The performance of SRC-based RNNs has been tested on classical benchmarks and their neuromorphic versions, namely the MNIST benchmark and some of its variants; the results are discussed in section 4. SNNs built with SRCs achieve results comparable to other types of SNNs on these benchmarks.

Related Works
This section introduces RNNs and SNNs, and describes different approaches to train SNNs.

Recurrent Neural Networks
RNNs are a type of neural network that carries fading memory by propagating a vector, called the hidden state, through time. More precisely, an RNN is usually composed of recurrent layers, also called recurrent cells, and classical fully-connected layers. Each recurrent cell has its own hidden state. At each time step, a new hidden state is computed from the received input and the previous hidden state, which allows RNNs to process sequences. Mathematically, this gives

h[t] = ϕ(h[t−1], x[t]; Θ),

where h[t] and x[t] are respectively the hidden state and the input at time t, ϕ is the recurrent cell and Θ its parameters.
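This recurrence can be sketched in a few lines of PyTorch; the single-layer tanh cell used here as ϕ is purely illustrative:

```python
import torch

def run_rnn(phi, x_seq, h0):
    """Unrolls the recurrence h[t] = phi(h[t-1], x[t]) over an input
    sequence and returns the final hidden state."""
    h = h0
    for x in x_seq:
        h = phi(h, x)
    return h

# Illustrative cell: a single tanh layer with parameters Theta = (W_h, W_x).
torch.manual_seed(0)
W_h, W_x = 0.1 * torch.randn(4, 4), 0.1 * torch.randn(4, 3)
phi = lambda h, x: torch.tanh(h @ W_h.T + x @ W_x.T)
h_final = run_rnn(phi, torch.randn(5, 3), torch.zeros(4))   # 5 timesteps, 3 inputs
```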
Training RNNs has always been difficult, especially on long sequences, due to vanishing and exploding gradients [Pascanu et al., 2013]. Indeed, RNNs are trained using backpropagation through time (BPTT) [Werbos, 1990]. This algorithm consists in first unfolding the RNN in time, i.e. turning it into a very deep feedforward network whose number of hidden layers is equal to the sequence length and whose weights are shared among layers. Usual backpropagation is then applied to this network. However, due to the huge number of layers, gradient problems are much more prone to appear than in usual feedforward networks. Several solutions exist to solve, or at least attenuate, these problems. For instance, exploding gradients can be easily handled using gradient clipping [Pascanu et al., 2013]. But the most notable improvement in RNNs was the introduction of the gating mechanism: gates, i.e. vectors of reals between 0 and 1, are used to control the flow of information, i.e. what is added to the hidden state, what is forgotten, etc. This has led to the two best-known recurrent cells: the Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014]. LSTM uses three gates, while GRU is more lightweight and uses two. The new recurrent cell introduced in this paper (section 3) is derived from GRU and can be expressed as a usual recurrent neural network.

Spiking Neural Networks
Biological neurons communicate using spikes, i.e. short pulses of the neuron membrane potential generated by a non-linear phenomenon. These membrane potential variations are created by the flow of ions in and out of the cell. Many models of excitable neuron membranes exist, the most notable being the Hodgkin-Huxley model [Hodgkin and Huxley, 1952] and similar models called conductance-based models. Such models represent the neuron membrane as a capacitance in parallel with several voltage sources and variable conductances, which respectively model the electrochemical gradients acting on the different ions and the ion gates. Despite being very physiological, these models contain too many equations and parameters to be used in machine learning.
This is why much simpler, phenomenological models are usually used to model spiking neurons in an SNN.
A classical model of this type is the Leaky Integrate-and-Fire (LIF) model. It is composed of a leaky integrator, which integrates the input current into a membrane potential variation, associated with a reset rule that is triggered once a threshold potential is reached: a spike is then emitted and the potential is reset to its resting value. Unlike conductance-based models, the LIF model generates binary spikes, i.e. spikes that last one timestep and whose value is always 1. Mathematically, this gives

V[t] = V_rest + α_V (V[t−1] − V_rest) + x[t],
s[t] = 1 and V[t] ← V_rest if V[t] ≥ V_thresh, otherwise s[t] = 0,

where V[t], x[t] and s[t] are the membrane potential, the input and the output at time t, respectively, α_V is the leakage factor, V_thresh the threshold and V_rest the resting potential. The LIF model is far less physiological than conductance-based models, but it is much more lightweight and retains the core of spike-based computation.
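A minimal discrete-time sketch of these LIF dynamics (the parameter values below are illustrative, not taken from this paper):

```python
def lif_step(V, x, alpha_V=0.9, V_thresh=1.0, V_rest=0.0):
    """One step of a Leaky Integrate-and-Fire neuron: leaky integration
    toward V_rest, then the threshold/reset rule emitting a binary spike."""
    V = V_rest + alpha_V * (V - V_rest) + x   # leaky integration
    if V >= V_thresh:
        return V_rest, 1                      # emit a spike and reset
    return V, 0

# Drive the neuron with a constant input and record the spike train.
V, spikes = 0.0, []
for _ in range(20):
    V, s = lif_step(V, x=0.3)
    spikes.append(s)
```

With this constant drive the potential charges over a few steps, fires, resets, and repeats, producing a regular spike train.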
LIF neurons can be organized in layers to form a complete network. The question is then how to train such a network. Due to the non-differentiable reset rule, usual backpropagation cannot be used, at least not directly. To achieve reasonable training performance, many approaches to train SNNs have been proposed [Yamazaki et al., 2022, Tavanaei et al., 2018], which can be split into three categories. First, SNNs can be trained using unsupervised learning rules that are local to the synapses [Masquelier and Thorpe, 2007, Neftci et al., 2014, Diehl and Cook, 2015, Lee et al., 2019]. These learning rules are often derived from the spike-timing-dependent plasticity process [Markram et al., 1997], which strengthens or weakens synaptic connections depending on the coincidence of pre- and post-synaptic spikes. This non-optimization-based training method is usually slow, often unreliable, and leads to modest performance. The second category is indirect training: a usual ANN is first trained (under some constraints) and then converted into an SNN [Cao et al., 2015, Diehl et al., 2015, Esser et al., 2016]. Indeed, ANNs can be seen as special spiking networks that use a rate-based coding scheme. These methods can leverage all the algorithms developed for training ANNs, and thus reach high performance. However, they do not unlock the full potential of spiking networks, as rate coding is not the only way of transmitting information through spikes. Moreover, rate-based coding usually results in a higher number of generated spikes, weakening the energy efficiency of SNNs. The third and last approach relies on gradient-based optimization to directly train the SNN [Bohté et al., 2000, Sporea and Grüning, 2013, Hunsberger and Eliasmith, 2015, Zenke and Ganguli, 2018, Shrestha and Orchard, 2018, Lee et al., 2016, Neftci et al., 2019]. These methods usually smooth the entire network or use a surrogate smoothed gradient for the non-differentiable activation to allow backpropagation. SNNs trained by gradient-based algorithms have achieved good performance, even competing with ANNs on some benchmarks. Notably, Huh and Sejnowski [2018] used a smooth spike-generating process to replace the non-differentiable activation of the LIF neurons. This approach is closely related to ours, as both use soft non-linear activations to generate spikes.

Spiking Recurrent Cell
The new spiking neuron introduced in this paper is derived from the well-known GRU. This section describes its derivation and the two parts of the neuron, namely spike generation and input integration.

Spike-generation
As the starting point of the derivation of the SRC equations, we use another recurrent cell, itself derived from GRU: the Bistable Recurrent Cell (BRC) introduced by Vecoven et al. [2021]. Its main property is a never-fading memory, created by the bistability of its neurons.
The GRU equations are

z[t] = σ(W_z x[t] + U_z h[t−1] + b_z),
r[t] = σ(W_r x[t] + U_r h[t−1] + b_r),
h[t] = z[t] ⊙ h[t−1] + (1 − z[t]) ⊙ tanh(W_h x[t] + U_h (r[t] ⊙ h[t−1]) + b_h),

and the BRC equations are

r[t] = 1 + tanh(W_r x[t] + w_r ⊙ h[t−1]),
z[t] = σ(W_z x[t] + w_z ⊙ h[t−1]),
h[t] = z[t] ⊙ h[t−1] + (1 − z[t]) ⊙ tanh(W_h x[t] + r[t] ⊙ h[t−1]).

Both use two gates (z and r) to control the flow of information. There are two major differences between GRU and BRC. First, the memory in BRC is cellular, meaning that each neuron of the cell has its own internal memory that is not shared with the others (the recurrent weights w_r and w_z act element-wise), while in GRU all internal states can be accessed by each neuron. The second difference is the range of possible values of r: in GRU it lies between 0 and 1, while in BRC it lies between 0 and 2. This difference allows a BRC neuron to switch from monostability (r ≤ 1) to bistability (r > 1).
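One BRC step can be sketched in PyTorch as follows, following the formulation of Vecoven et al. [2021] (weight shapes are illustrative and biases are omitted):

```python
import torch

def brc_step(x, h, U, U_r, U_z, w_r, w_z):
    """One step of a Bistable Recurrent Cell."""
    r = 1.0 + torch.tanh(x @ U_r.T + w_r * h)   # in (0, 2): r > 1 -> bistable
    z = torch.sigmoid(x @ U_z.T + w_z * h)      # update gate in (0, 1)
    return z * h + (1.0 - z) * torch.tanh(x @ U.T + r * h)

# Note the element-wise recurrent weights w_r, w_z: the memory is cellular.
torch.manual_seed(0)
U, U_r, U_z = (0.1 * torch.randn(2, 3) for _ in range(3))
w_r, w_z = torch.ones(2), torch.ones(2)
h_new = brc_step(torch.randn(3), torch.zeros(2), U, U_r, U_z, w_r, w_z)
```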
These two properties of BRC, i.e. the cellular memory and the bistability, can be used to generate spikes. The cellular memory can represent the membrane potential of the spiking neuron, while the bistability is created by a local positive feedback, which is the first step of a spike. Indeed, a spike can be described in two steps: a fast local positive feedback that brings the potential to a high value, followed by a slower global negative feedback that brings the potential back to its resting value. Integrating such a negative feedback into the BRC equations therefore allows the cell to generate spikes. This can be done by adding a second hidden state h_s, which lags behind h (Equation 1c), and a new term in the update equation of h (Equation 1a). As no information can be transmitted between neurons except when a spike occurs, the fast hidden state h is passed through a ReLU function to isolate the spikes from the small, subthreshold variations of h. This creates the output spike train s_out (Equation 1d). The integration of the input pulses will be discussed afterwards; for now, we simply use x to denote the input used by the spike generation.
This leads to the following spike-generating equations:

h[t] = z ⊙ h[t−1] + (1 − z) ⊙ tanh(x[t] + r ⊙ h[t−1] + r_s ⊙ h_s[t−1] + b_h),   (1a)
z_s[t] = z_s^dep + (z_s^hyp − z_s^dep) ⊙ σ(−h[t−1]),   (1b)
h_s[t] = z_s[t] ⊙ h_s[t−1] + (1 − z_s[t]) ⊙ h[t],   (1c)
s_out[t] = ReLU(h[t]).   (1d)

Two new gates (r_s and z_s) are introduced. To enforce that no computation can be achieved through alterations in the shape of a spike, the four gates no longer depend on learnable weights. Three of them are fixed to constant values: r = 2, r_s = −7 and z = 0. The fourth one, z_s, controls the speed at which h_s catches up with h: the lower, the faster. To create spikes with short depolarization periods, z_s should be low at depolarization potentials and larger at subthreshold potentials, mimicking the voltage dependency of ion-channel time constants in biological neurons. This is modeled by Equation 1b, where z_s^hyp is the value at hyperpolarized potentials (low h) and z_s^dep the value at depolarized potentials (high h). In practice, z_s^hyp = 0.9 and z_s^dep = 0.
Finally, the bias b_h controls the propensity of neurons to fire spikes: the higher, the easier. However, if it reaches too high a value, the neurons may saturate. As this is a behavior we would rather avoid, the bias is constrained to always remain smaller than some value. In the experiments, we have fixed this upper bound to −4.
Figure 1 shows the behavior of one SRC neuron given different inputs x and biases b h .It can be observed that for a high bias (Figure 1b), the neuron is able to spike even with a null input, while for a lower one (Figure 1a), the neuron remains silent.
SNNs are often put forward for their very small energy consumption, due to the sparse activity of spiking neurons. It is thus important to be able to measure the activity of these neurons. In the case of SRC neurons, spikes do not last exactly one timestep; it is therefore better to count the number of timesteps during which spikes are emitted rather than the number of spikes. This leads us to define the relative number of spiking timesteps

A = (1 / (N T)) Σ_{i=1}^{N} Σ_{t=1}^{T} H(s_out,i[t]),   (2)

where N is the number of neurons, T the sequence length, and H denotes the Heaviside step function.
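This measure can be computed directly from the output tensor; a short sketch (the (T, N) layout is an assumption):

```python
import torch

def spiking_activity(s_out):
    """Relative number of spiking timesteps: the fraction of
    (timestep, neuron) pairs with a non-zero output.
    s_out is assumed to have shape (T, N)."""
    return (s_out > 0).float().mean().item()

# 4 timesteps, 2 neurons: 3 of the 8 entries are part of a spike.
s = torch.tensor([[0.0, 0.8],
                  [0.0, 0.0],
                  [0.5, 0.0],
                  [0.0, 0.9]])
activity = spiking_activity(s)   # 3/8 = 0.375
```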

Input-integration
The last point to be addressed before constructing networks of SRCs is how to integrate the input spikes. We have decided to use leaky integrators with learnable weights w_i:

i[t] = α i[t−1] + Σ_i w_i s_in,i[t],

where α is the leakage factor.
To prevent the SRC from saturating due to large inputs, we also apply a rescaled hyperbolic tangent to i. The equations of a whole SRC layer therefore write, starting from the input pulses s_in up to the output pulses s_out:

i[t] = α ⊙ i[t−1] + W s_in[t],   (3a)
x[t] = ρ tanh(i[t] / ρ),   (3b)
z_s[t] = z_s^dep + (z_s^hyp − z_s^dep) ⊙ σ(−h[t−1]),   (3c)
h[t] = z ⊙ h[t−1] + (1 − z) ⊙ tanh(x[t] + r ⊙ h[t−1] + r_s ⊙ h_s[t−1] + b_h),   (3d)
h_s[t] = z_s[t] ⊙ h_s[t−1] + (1 − z_s[t]) ⊙ h[t],   (3e)
s_out[t] = ReLU(h[t]).   (3f)

To sum up, Equation 3a first integrates the input pulses using a leaky integrator. The result then passes through a rescaled hyperbolic tangent (Equation 3b). The gate z_s is computed from h in Equation 3c. This forms the input used by the spike-generation part (Equations 3d and 3e) to update h and h_s. Finally, Equation 3f isolates the spikes from the small variations of h and generates the output pulses. The rescaling factor ρ is set to 3, forcing x to lie between −3 and 3. Finally, like other recurrent cells, SRCs can be organized in networks with several layers.
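One forward step of an SRC layer can be sketched in PyTorch as follows. The constants follow the text (r = 2, r_s = −7, z = 0, z_s^hyp = 0.9, z_s^dep = 0, ρ = 3); the sigmoid interpolation used for z_s is an assumption, not the paper's exact formula:

```python
import torch

def src_step(s_in, i, h, h_s, W, b_h, alpha=0.9, rho=3.0,
             r=2.0, r_s=-7.0, zs_hyp=0.9, zs_dep=0.0):
    """One timestep of an SRC layer (sketch; the z_s form is assumed)."""
    i = alpha * i + s_in @ W.T                            # leaky input integration
    x = rho * torch.tanh(i / rho)                         # rescaled tanh, |x| < rho
    z_s = zs_dep + (zs_hyp - zs_dep) * torch.sigmoid(-h)  # lag gate from h
    h = torch.tanh(x + r * h + r_s * h_s + b_h)           # fast state (z = 0)
    h_s = z_s * h_s + (1.0 - z_s) * h                     # slow state lags behind h
    s_out = torch.relu(h)                                 # keep only the spikes
    return i, h, h_s, s_out

# One step of a tiny layer: 4 input channels, 3 SRC neurons.
torch.manual_seed(0)
W = 0.1 * torch.randn(3, 4)
b_h = torch.full((3,), -4.0)          # bias bounded above by -4, as in the text
i = h = h_s = torch.zeros(3)
i, h, h_s, s_out = src_step(torch.ones(4), i, h, h_s, W, b_h)
```

Unrolling `src_step` over a spike-train input yields the layer's output spike train s_out.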

Experiments
This section describes the different experiments that were made to assess SRC performance.

Benchmarks
The SRC has been tested on the well-known MNIST dataset [Deng, 2012], as well as two variants.
The Fashion-MNIST dataset [Xiao et al., 2017] contains images of fashion products instead of handwritten digits and is known to be more difficult than the original MNIST. The second variant is the Neuromorphic-MNIST (N-MNIST) [Orchard et al., 2015] which, as its name suggests, is a neuromorphic version of MNIST where the handwritten digits have been recorded by an event-based camera.
The MNIST and Fashion-MNIST datasets are not made to be used with spike-based networks; their images must therefore first be encoded into spike trains. To do so, a rate-based coding and a latency-based coding were used in the experiments. The first one creates one spike train per pixel, where the number of spikes per time period is proportional to the value of the pixel. More precisely, the pixel is converted into an approximately Poisson spike train by using its value as the success probability of a Bernoulli draw at each timestep.
To avoid having too many spikes, we have scaled the pixel values by a factor (the gain) of 0.25. A white pixel (value of 1) thus spikes with a probability of 25% at each timestep, while a black one (value of 0) never spikes. The latency-based coding is much sparser, as each pixel spikes at most once. In this case, the information is contained in the time at which the spike occurs: brighter pixels spike sooner than darker ones. The spike time t_spk of a pixel is defined as the time needed by the potential of a (linearized) RC circuit, fed by a current I equal to the pixel value, to reach a threshold V_th:

t_spk = τ V_th / I,

where τ is the time constant of the RC circuit. In our experiments, we have used τ = 10 and V_th = 0.01. The spike times are then normalized to span the whole sequence length, and the spikes located at the last timestep (i.e. those produced by the dimmest pixels) are removed. The encodings were performed using the snnTorch library [Eshraghian et al., 2021]. All the experiments were made using spike trains of length 200. The MNIST (or Fashion-MNIST) inputs of dimension (1, 28, 28) are therefore converted into tensors of size (200, 1, 28, 28).
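Both encodings can be sketched in plain PyTorch; these are approximations of the snnTorch behaviour, not its exact implementation:

```python
import torch

def rate_encode(img, num_steps=200, gain=0.25):
    """Rate coding sketch: each pixel spikes at each timestep with
    probability gain * pixel value (independent Bernoulli draws)."""
    p = (gain * img).clamp(0.0, 1.0)
    return (torch.rand(num_steps, *img.shape) < p).float()

def latency_encode(img, num_steps=200, tau=10.0, v_th=0.01):
    """Latency coding sketch: each pixel spikes at most once, at the
    linearized RC first-spike time t = tau * v_th / I, normalized to
    span the sequence length."""
    t = tau * v_th / img.clamp(min=v_th)           # brighter pixels -> earlier
    t = (t - t.min()) / (t.max() - t.min() + 1e-12) * (num_steps - 1)
    spikes = torch.zeros(num_steps, *img.shape)
    spikes.scatter_(0, t.round().long().unsqueeze(0), 1.0)
    spikes[-1] = 0.0                               # drop last-timestep spikes
    return spikes

img = torch.rand(1, 28, 28)
rate = rate_encode(img)       # shape (200, 1, 28, 28), binary
lat = latency_encode(img)     # at most one spike per pixel
```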
On the other hand, N-MNIST already has event-based inputs: each sample contains the data created by an event-based camera, which only needs to be converted into tensors of spikes. An event-based camera pixel outputs an event each time its brightness changes. There are therefore two types of events: those issued when the brightness increases and those issued when it decreases.
An N-MNIST sample is a list of such events, each containing a timestamp, the coordinates of the pixel that emitted it, and its type. The Tonic library [Lenz et al., 2021] was used to load the N-MNIST dataset and convert its samples into tensors of size (200, 2, 34, 34). The first dimension is the time, the second is related to the type of the event, and the last two are the x and y spatial coordinates.

Readout layer
In order to extract the predictions from the outputs of an SRC network, the final SRC layer is connected with predefined and frozen weights to a readout layer of leaky integrators, with one integrator per label and a leakage factor of 0.99.Each integrator is excited (positive weight) by a small group of SRC neurons and is inhibited (negative weight) by the others.In our experiments, this final SRC layer contains 100 neurons.Each integrator is connected to all neurons: 10 of these connections have a weight of 10, while the others have a weight of -1.The prediction of the model corresponds to the integrator with the highest value at the final timestep.
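A sketch of this frozen readout (assigning each integrator's 10 excitatory connections to a contiguous block of SRC neurons is an assumption):

```python
import torch

def build_readout(n_src=100, n_labels=10, w_pos=10.0, w_neg=-1.0):
    """Frozen readout weights: each label's integrator is excited by its
    own group of 10 SRC neurons (weight 10) and inhibited by the other
    90 (weight -1)."""
    W = torch.full((n_labels, n_src), w_neg)
    for k in range(n_labels):
        W[k, 10 * k:10 * (k + 1)] = w_pos
    return W

def predict(s_out_seq, W, leak=0.99):
    """Leaky integration of the final SRC layer's outputs; the predicted
    label is the integrator with the highest value at the last timestep."""
    v = torch.zeros(W.shape[0])
    for s in s_out_seq:            # s: spikes of the 100 SRC neurons at time t
        v = leak * v + W @ s
    return v.argmax().item()

# If only the neurons of group 3 fire, the prediction is label 3.
s = torch.zeros(100)
s[30:40] = 1.0
label = predict([s] * 5, build_readout())
```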

Loss function
The networks were trained using the cross-entropy loss, which is usual in classification tasks. This function takes as inputs the values x of the leaky integrators at the final timestep and the target class y:

l(x, y) = −log( exp(x_y) / Σ_j exp(x_j) ),

i.e. it applies the Softmax function to x and then computes the negative log-likelihood of the target. For a whole batch, we simply take the mean of the l's.

Learning
With the loss function defined, it is now possible to train networks of SRCs using the usual automatic differentiation of PyTorch. Experiments showed that bypassing the ReLU during backpropagation significantly speeds up learning. As explained in subsection 3.1, this ReLU is used to isolate the spikes (high values of h) from the small fluctuations. During the backward pass, this ReLU blocks the gradients whenever no spike is occurring, i.e. when h[t] < 0. We therefore let these gradients pass even when no spike is occurring. This is reminiscent of the surrogate gradient optimization [Neftci et al., 2019] used to train LIF neurons: in our case, the forward activation is a ReLU, while the backward pass assumes it was a linear activation, i.e. ∂s_out[t]/∂h[t] := 1. Figure 2 shows the evolution of the accuracy and cross-entropy of two SRC networks composed of 3 layers, one trained with the surrogate gradient, the other without. For each network, 5 models were trained. Apart from the usage of the surrogate gradient, all other parameters are identical. The surrogate gradient clearly speeds up learning, and is therefore used in all our experiments.
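In PyTorch, this surrogate can be implemented as a custom autograd function whose forward pass is a ReLU and whose backward pass is the identity; a minimal sketch:

```python
import torch

class SpikeReLU(torch.autograd.Function):
    """ReLU in the forward pass, identity in the backward pass: gradients
    also flow through neurons that are not currently spiking."""

    @staticmethod
    def forward(ctx, h):
        return torch.relu(h)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # surrogate: pretend the activation was linear

h = torch.tensor([-0.5, 0.8], requires_grad=True)
s = SpikeReLU.apply(h)
s.sum().backward()
# s keeps only the spike; h.grad is 1 everywhere, including where h < 0.
```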

Results
All experiments have been performed using PyTorch on GPUs, without any modification of the automatic differentiation and backpropagation algorithms, except for the ReLU that is bypassed during backward passes. All trainings lasted 30 epochs. We used the Adam optimizer with an initial learning rate of 0.005, decayed exponentially with a factor of 0.97. For each set of parameters, 5 models were trained.

Shallow networks
As a first experiment, we have tested several shallow networks with either 1, 2, or 3 SRC layers.The final layer always contains 100 neurons connected to the readout layer, as described in subsection 4.2.
The size of the hidden layers, if the model has any, was fixed to 512 neurons. The leakage factor of the SRC integrators was set to 0.9, except in the experiments using the latency-based coding, where it was set to 0.99 to deal with the high sparsity of the inputs. The leakage factor of the readout integrators was fixed to 0.99. Table 1 shows the different test accuracies achieved by these networks. We observe that SRC networks were able to learn and achieved performance comparable to other non-convolutional SNNs, despite only being trained for 30 epochs. We also observe that multi-layer networks perform better than single-layer ones.
As previously mentioned, another important aspect of such networks is the activity of their neurons. Using the measure defined in Equation 2, the mean activity of the neurons has been computed and is shown in Table 2. The mean activity stays quite low, which is desirable. It can also be observed that when the encoding is not sparse (rate coding), the shallower the network, the lower the activity, while the opposite holds for the sparse encoding (latency coding).

Training deeper neural networks
Shallow networks of SRC neurons have thus been successfully trained. However, one of the breakthroughs in deep learning was the ability to train deep neural networks, and training deep SNNs is known to be difficult. We have therefore tested several networks with different numbers of hidden layers to see whether SRCs still manage to learn when the network becomes deeper. As previously, all trainings lasted 30 epochs and were performed on the MNIST dataset with the rate-based coding. All hidden layers consist of 512 neurons, while the final SRC layer still contains 100 neurons.
Figure 3 shows the results of this experiment. All networks manage to learn and achieve good performance after 30 epochs. However, the higher the number of hidden layers, the slower the training. This explains why the models with many hidden layers do not perform as well as shallow networks at the end of the 30 epochs. Nevertheless, the goal of this experiment was not to assess performance but rather the ability of deep networks to learn. Furthermore, the top-right graph shows the duration of each epoch for each number of hidden layers. It obviously increases with the network depth, but the training duration stays quite small even for a large number of hidden layers. For instance, training the networks with 10 hidden layers lasted about one day.

Conclusion
In this paper, we have introduced a new type of artificial spiking neuron. Instead of deriving this neuron from existing spiking models, as is classically done, we have started from a widely used RNN cell. This new spiking neuron, called the Spiking Recurrent Cell (SRC), can be expressed as a usual recurrent cell. Its major advantage is the differentiability of its equations. This property allows the usual backpropagation algorithm to be applied directly to train SRCs. Spiking neural networks made of SRCs have been tested on the MNIST benchmark as well as on two variants, the Fashion-MNIST and the Neuromorphic-MNIST. These networks have achieved results comparable to those obtained with other non-convolutional SNNs. Multi-layer networks have also shown to be able to learn. This proof of concept shows promising results and paves the way for new experiments.

For instance, trying a convolutional version of the SRC on more complex image classification tasks would be interesting. Adding feedback connections could also increase the computational power of SRC networks, which are up to now purely feedforward SNNs. Improving the initialization of the synaptic weights should also be considered. Finally, the network is currently only able to modify the synaptic weights and the biases via backpropagation. New learnable parameters could be added to the SRC equations to give the network control over more aspects of its neurons, such as the firing rate or the firing pattern.

Figure 1 :
Figure 1: Simulation of an SRC neuron for different input sequences x and biases b_h.

Figure 2 :
Figure 2: Evolution of the cross-entropy and the accuracy on the MNIST dataset for two networks composed of 3 SRC layers. One of these two networks has been trained with the surrogate gradient, while the other has not.

Figure 3 :
Figure 3: (left) Evolution of the cross-entropy (top) and the accuracy (bottom) on the validation set during the training of several SRC networks with different numbers of hidden layers. (top, right) Evolution of the epoch duration in seconds for these networks. (bottom, right) Cross-entropies and accuracies obtained on the MNIST test set by the models with respect to their number of hidden layers.

Table 1 :
Accuracies (in %) obtained on the test sets of the different datasets and encodings after 30 epochs. Neuron biases b_h are initialized to 6 and the Xavier uniform initialization is used for the synaptic weights W_s.

Table 2 :
Mean spiking activity (in %) on the test sets of the different datasets and encodings after 30 epochs.