Keys to Accurate Feature Extraction Using Residual Spiking Neural Networks

Spiking neural networks (SNNs) have become an interesting alternative to conventional artificial neural networks (ANNs) thanks to their temporal processing capabilities and energy-efficient implementations in neuromorphic hardware. However, the challenges involved in training SNNs have limited their accuracy and thus their applications. Improving learning algorithms and neural architectures for more accurate feature extraction is therefore one of the current priorities in SNN research. In this paper we present a study on the key components of modern spiking architectures. We design a spiking version of the successful residual network architecture and provide an in-depth study on the possible implementations of spiking residual connections. This study shows how, depending on the use case, the optimal residual connection implementation may vary. Additionally, we empirically compare different techniques, taken from the best performing networks, on image classification datasets. Our results provide a state-of-the-art guide to SNN design, which allows informed choices to be made when building the optimal visual feature extractor. Finally, our network outperforms previous SNN architectures on the CIFAR-10 (94.14%) and CIFAR-100 (74.65%) datasets and matches the state of the art on DVS-CIFAR10 (72.98%), with fewer parameters than the previous state of the art and without the need for ANN-SNN conversion. Code available at https://github.com/VicenteAlex/Spiking_ResNet


Introduction
Artificial Neural Networks (ANNs) have achieved unprecedented performance in many computer vision tasks in recent years. However, these artificial systems still cannot be compared to a real brain in terms of robustness, energy consumption or generalization capabilities. Therefore, as an attempt to imitate more of the valuable properties of the brain, artificial Spiking Neural Networks (SNNs) have been proposed as an alternative to conventional ANNs. SNNs closely replicate the functioning of biological neurons, allowing for sparse asynchronous computations and time-dependent neuronal functionality. The full potential of these properties is yet to be explored, but it has already been shown that substantial improvements in energy efficiency can be obtained by implementing SNNs in neuromorphic hardware [1,2], with energy consumption up to 100 times lower than that of standard ANNs on CPU/GPU hardware [3]. Given the ever increasing network size and power demands of standard ANNs, such energy efficiency gains are of particular interest, as they make it possible to reduce SWaP (Size, Weight, and Power) for energy-efficient operation [4].
However, training SNNs is a more challenging task than training regular non-spiking networks. Non-spiking ANNs owe most of their success to the back-propagation of error (BP) algorithm [5], but in the case of SNNs the spiking behaviour inside the neurons creates a non-differentiable function, hindering the application of BP. Moreover, the time dependencies of the neuronal states add extra complexity to the credit assignment calculations. These drawbacks result, in most cases, in SNNs having a lower final accuracy than regular ANNs.
In order to overcome the aforementioned challenges, some approaches use conversion methods [6,7,8], where they train non-spiking ANNs and then approximate their computations using an SNN. Compared to directly training an SNN, these methods are not able to perform online learning, they lose temporal resolution, and in most cases they have higher latency and energy consumption. This is why improving directly trained SNNs is still a necessity.
Direct training can be performed through bio-plausible unsupervised methods such as spike-timing-dependent plasticity (STDP), but when ground truth is available for the task to solve, supervised learning through surrogate gradient BP [9] is the best performing method. In this work we focus on the latter.
In order to improve the feature extraction process of SNNs in visual tasks, in this paper we present a study on the key components of modern spiking architectures and use the obtained conclusions to propose a novel and highly optimized SNN. Our results show that directly trained SNNs can already outperform conversion methods, making it possible to exploit all the benefits of spiking computations without compromising accuracy. Additionally, the lessons learned from our experiments can also be valuable for those designing new SNN feature extractors in the future.
Specifically, the contributions of the paper are as follows. First, it presents an in-depth study on the possible implementations of spiking residual connections which highlights their properties in terms of accuracy, network activity, characteristics of their derivatives and implications of the computations in hardware requirements. This study introduces a novel residual connection for SNN which has been named the "Voltage to Voltage" connection and a revamped implementation of the "Spikes to Spikes" connection.
Then, it provides empirical results demonstrating the effects of different network design choices on the final accuracy. These include network size, batch normalization strategies, boosting methods, spike generation for frame-based datasets, hyper-parameter optimization and fine-tuning. When designing an SNN, the conclusions drawn from these experiments make it possible to choose the design options that maximize the accuracy of the system. Finally, a new spiking network is defined which achieves higher accuracy than the previous state of the art on CIFAR-10 and CIFAR-100, and matches it on DVS-CIFAR10 with far fewer parameters than previous methods.
Additionally, a study on the compromise between latency and accuracy is presented. Through the experiments performed in it we also obtain novel results demonstrating a relationship between the processing time and the optimal leakage factor for a leaky integrate-and-fire model.

Related work
As mentioned in the previous section, one limitation in implementing SNNs is the difficulty of training them. Conventional gradient descent algorithms are not directly applicable given the intrinsic presence of non-differentiable spiking functions. As a result, different workaround strategies have been proposed. These strategies can be mainly categorized into two groups: ANN-to-SNN conversion methods and direct training methods. In this section, we overview the state of the art of these two approaches.

Conversion methods
In order to overcome the challenges in SNN training and to obtain the most accurate SNN systems, many works have adopted conversion approaches. These methods bypass the training challenges of SNNs by training a non-spiking ANN and then transforming it to spiking format. This transformation reconstructs each of the neurons in the original network using spiking neurons; therefore, the key challenge is to represent continuous activation values using the binary outputs of spiking neurons.
Most of these techniques are based on rate-based conversion [6,7,10,11], where the network is set up such that the spiking frequency of the converted neuron is proportional to the activation value of the original one. These methods can only convert ANNs using the Rectified Linear Unit (ReLU) activation function.
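As a rough illustration of the rate-based principle (not the procedure of any specific cited method), the sketch below drives a non-leaky integrate-and-fire neuron with a constant input equal to a ReLU-style activation. The function name `rate_code`, the unit threshold and the reset by subtraction are assumptions made for this example.

```python
def rate_code(activation, num_steps, threshold=1.0):
    """Count the spikes a non-leaky IF neuron emits for a constant input."""
    membrane = 0.0
    spikes = 0
    for _ in range(num_steps):
        membrane += activation        # integrate the constant input
        if membrane >= threshold:     # fire once the threshold is reached
            spikes += 1
            membrane -= threshold     # reset by subtraction keeps the remainder
    return spikes


# The firing rate (spikes per step) approximates max(0, activation)
# for activations in [0, 1]; negative inputs never trigger a spike.
for a in (-0.5, 0.25, 0.5):
    print(a, rate_code(a, num_steps=100) / 100)
```

Because negative inputs never reach the threshold, the rate behaves like a rectified value, which gives some intuition for why this family of conversions is tied to ReLU networks.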
In order to reduce the energy cost of these conversions, Temporal-Switch-Coding (TSC) [12] was proposed, where the activation value is encoded in the latency of spiking rather than the frequency, thus generating fewer spikes. On the other hand, methods such as [13,14] focus on reducing the conversion error without the use of a large number of time-steps, which allows for competitive results without long simulation times.
Finally, ReLU networks can also be approximated using the method in [15], where a binary ANN is trained in order to approximate the original in just one time-step. The reported results are less accurate than state of the art SNN conversions, but they allow for a 1 step inference without temporal computations.
Alternatively, other approaches such as [8,16] can be applied to any type of network. The first one manages to do this by using circuits of neurons in order to approximate arbitrary functions. The second one does the same by using FS-neurons, a parametric neuron model that can be optimized to approximate any function.
Converted networks can be implemented in energy-efficient neuromorphic hardware; however, forcing the SNN to imitate non-spiking computations makes it lose some of its properties. A converted network cannot perform online learning and, because it approximates dense activation maps, it is prone to lose sparsity. Moreover, it has a lower temporal resolution, which is likely to cause underperformance when processing neuromorphic data, as shown in [17].

Direct training
Directly training the SNN without conversion allows one to exploit all its valuable properties; however, the challenge then becomes to successfully train it, given that gradient-descent-based methods cannot be applied to non-differentiable spiking functions. The most common strategy in state-of-the-art methods is the use of surrogate gradients [18,19], a method where the spiking function is used in the forward pass but, when calculating its derivative in the backward pass, a continuous tractable function is used instead, which tries to approximate the behaviour of the real derivative.
Another option is to use a version of the SNN model that is directly differentiable. Some examples can be found in [9]. We can find models using soft non-linearities [20], probabilistic models [21] or latency-based networks [22].
Alternatively, supervised learning can also be performed without the differentiation of the whole network. Some examples use local approaches with algorithms such as [23], where the loss is computed locally in each neuron, or by using three factor learning rules [24].
Depending on the needs of the system, the optimal learning method might change, but when talking about final task accuracy, surrogate gradient BP is the best performing method so far. All the best SNN feature extractors consistently use this method, but the BP implementations and the surrogate functions they use vary between them.
Concerning the BP implementation, different variations can be found among the best performing networks. Some works such as [19] choose to simply unroll the network in time and use Back-propagation Through Time (BPTT). A slightly different implementation is found in [25], where the authors use a Spike-based BP algorithm which proposes a novel way of accounting for the leak factor of LIF neurons. Finally, there are also BP approaches where the input spikes are convolved with spike response kernels like in [26], which allows for convenient spike response implementations at the cost of saving more spike time-stamps in memory.
For the surrogate functions, there is no consensus either. We find triangle shape surrogates in [19], rectangular shaped in [27], and arc-tangent shaped in [28,29].

SNN architectures
Regarding the state of the art of SNN topologies, literature usually measures their feature extraction capabilities by assessing their image classification accuracy in public datasets. In the present day, among directly trained networks, the highest accuracies are reported for networks basing their topologies on VGG [30] and ResNet [31] architectures.
In non-spiking deep learning, after the development of deep feed-forward networks such as VGG, the next big improvement came with the addition of residual connections. As demonstrated in [31], residual connections made it possible to successfully train much deeper architectures, giving rise to a more accurate and efficient family of networks.
The reason for this improved performance is that residual connections help alleviate the problem of depth-induced accuracy degradation. Without residual connections, when increasing the depth of the network, the accuracy first saturates and then degrades rapidly. This is because extra layers increase the complexity of the optimization problem, so it can reach a point where the benefit of adding extra layers does not compensate for the harm of increased optimization difficulty. Residual networks solve this problem by making the network easier to optimize. Given an input $x$ and the mapping function of a layer $F(x)$, the output of a layer with a residual connection will be:

$$H(x) = F(x) + x \tag{1}$$

Then, the residual mapping $F(x) = H(x) - x$ should become easier to optimize than the original $F(x) = H(x)$. This is because an identity mapping $H(x) = x$ can be accomplished just by setting the weights in the layer to zero ($F(x) = 0$), allowing the network to easily ignore unnecessary layers, and therefore not degrading the result. Alternatively, when the optimal solution is not an identity mapping, it might still be closer to it than to a zero mapping, making for a better initialization [31].
In order to port these benefits to SNN, Lee et al. [25], Zheng et al. [27] and Fang et al. [29] implement the first trainable spiking ResNets, managing to train deeper networks than VGGs and achieving competitive results. On the other hand, [19,28] implement VGG-like architectures which are shallower, but larger in number of parameters. These non-residual feed-forward networks still outperform the aforementioned ResNets in many datasets (see Table 14 in Section 5).

Spiking neuron model
In order to perform their computations, SNNs simulate the behaviour of biological neurons by means of mathematical models. In this work we use the Leaky Integrate-and-Fire (LIF) model [32]. Despite their simplicity, LIF neurons have found great success in many state-of-the-art systems.
The LIF model can be formulated as the differential equation seen in Eq. 2, where $U(t)$ is the membrane potential, $U_{rest}$ the resting potential, $\tau$ the time constant and $I(t)$ the input current. When the voltage $U(t)$ surpasses a set threshold $U_{th}$, the neuron emits a spike and the potential is reset by subtraction.
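For reference, a commonly used form of the LIF dynamics that is consistent with the symbols defined above is sketched below, together with its forward-Euler discretisation. The unit membrane resistance, the step size $\Delta t$ and the choice $U_{rest} = 0$ in the second line are assumptions of this sketch, not details taken from Eq. 2.

```latex
% Assumed standard LIF form (unit membrane resistance)
\tau \frac{dU(t)}{dt} = -\bigl(U(t) - U_{rest}\bigr) + I(t)

% Forward-Euler discretisation with step \Delta t and U_{rest} = 0
U[t] = \Bigl(1 - \frac{\Delta t}{\tau}\Bigr)\, U[t-1] + \frac{\Delta t}{\tau}\, I[t]
```

Under these assumptions, the factor $1 - \Delta t / \tau$ plays the role of a per-step leak applied to the previous membrane potential.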
In order to easily program this behaviour in machine learning models, explicit iterative versions of this differential equation are used. Let $i$ be a post-synaptic neuron, $u_{i,t}$ its membrane potential, $o_{i,t}$ its spiking activation and $\lambda$ the leak factor. The index $j$ denotes the pre-synaptic neurons, and the weights $w_{i,j}$ define the strength of the synapses between neurons. Then, the iterative update of the neuron activation is calculated as follows:

$$u_{i,t} = \lambda u_{i,t-1} + \sum_j w_{i,j} o_{j,t}, \qquad o_{i,t} = g(u_{i,t}) \tag{3}$$

where $g(x)$ is the thresholding function, which converts voltage to spikes:

$$g(x) = \begin{cases} 1, & x \geq U_{th} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

After spiking, a reset is performed by the subtraction $u^*_{i,t} = u_{i,t} - U_{th}$, where $u^*_{i,t}$ is the membrane potential after resetting.
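The iterative update described above can be sketched for a single neuron as follows. This is a minimal illustration rather than the authors' implementation: the function name `lif_step`, the example weights and the use of a greater-or-equal comparison at the threshold are assumptions.

```python
def lif_step(u_prev, inputs, weights, leak, threshold=1.0):
    """One time-step of a single LIF neuron with reset by subtraction.

    u_prev  : membrane potential left after the previous step (post-reset)
    inputs  : binary pre-synaptic spikes o_j at this step
    weights : synaptic weights w_ij, one per pre-synaptic neuron
    """
    # Leak the previous potential and integrate the weighted input spikes.
    u = leak * u_prev + sum(w * o for w, o in zip(weights, inputs))
    spike = 1 if u >= threshold else 0
    # Reset by subtraction only when the neuron has fired.
    u_after = u - threshold if spike else u
    return spike, u_after


u = 0.0
spike_train = []
for _ in range(6):
    s, u = lif_step(u, inputs=[1, 1], weights=[0.3, 0.3], leak=0.9)
    spike_train.append(s)
print(spike_train)
```

With a constant drive below the threshold, the neuron needs more than one step to charge, so it fires on alternating steps in this particular setup.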

Spiking Residual Network
With the objective of building the most accurate SNN feature extractor, our starting point is to implement a spiking residual network (S-ResNet).
The motivation for choosing this architecture is that almost all state-of-the-art non-spiking ANNs make use of residual connections in order to allow for the training of very deep networks. In contrast, in the SNN domain, the state of the art is still based on VGG-like architectures for datasets such as CIFAR-10, CIFAR-100 and DVS-CIFAR10. Therefore, we define a new S-ResNet that outperforms the previous state of the art and justifies the use of residual connections also in the SNN domain.

Implementation of a spiking residual connection
In order to design our S-ResNet, the first step is to define the implementation of the spiking residual connection. The skip connection in a non-spiking network just sums the activation value of a previous layer to the activation of the current one (Eq. 1), but when using spiking neurons this sum can be performed in several ways.
Given a multilayered feed-forward SNN of LIF neurons, the membrane state vector $u_{l,t}$ of a layer $l$ at time $t$ is given by Eq. 5, where $o_{l,t}$ is the layer's spiking activation and $W_l$ the synaptic weight matrix. These spiking activations are obtained by means of the spiking function $g$ (Eq. 6).

$$u_{l,t} = \lambda u_{l,t-1} + W_{l-1} o_{l-1,t} \tag{5}$$

$$o_{l,t} = g(u_{l,t}) \tag{6}$$
Then, the residual information coming from a previous layer at position $l-n$ can be integrated into the current layer $l$ using one of the following strategies:
Spike output to membrane (S2M): The spiking output of a previous layer $l-n$ feeds the membrane potential of the neurons in layer $l$. A set of synaptic weights $W_{l-n}$ is needed to define the amount of voltage communicated by these spikes (Eq. 7):

$$u_{l,t} = \lambda u_{l,t-1} + W_{l-1} o_{l-1,t} + W_{l-n} o_{l-n,t} \tag{7}$$

These weights will typically be a non-learnable parameter. If $W_{l-n} = U_{th}$, the residual connection implements an identity mapping when $W_{l-1} o_{l-1,t} + \lambda \cdot u_{l,t-1} = 0$. In any other case, the final activations are not guaranteed to satisfy $o_{l,t} = o_{l-n,t}$.
Regarding its training through back-propagation, the properties of the residual connection can be observed in the network's derivative. Consider a generic residual block where the residual input $W_{l-n} o_{l-n,t}$ has $n = 2$ (Eq. 7), skipping the intermediate layer $l-1$, and where $l-1$ has no residual input (Eq. 8).
Then, differentiating Eq. 7 with respect to $o_{l-2,t}$, we get:

$$\frac{\partial o_{l,t}}{\partial o_{l-2,t}} = \frac{\partial o_{l,t}}{\partial u_{l,t}} W_{l-1} \frac{\partial o_{l-1,t}}{\partial o_{l-2,t}} + W_{l-2} \frac{\partial o_{l,t}}{\partial u_{l,t}} \tag{9}$$

Eq. 9 shows how the residual connection adds an extra $W_{l-2} \frac{\partial o_{l,t}}{\partial u_{l,t}}$ term to the gradient, a term which is not influenced by the value of the learnable weights $W_{l-1}$, in contrast to $W_{l-1} \frac{\partial o_{l-1,t}}{\partial o_{l-2,t}}$. This is the reason why this residual connection alleviates the vanishing gradient problem even when $W_{l-1}$ is arbitrarily small. Still, given that $\frac{\partial o_{l,t}}{\partial u_{l,t}}$ is the derivative of the spiking function, the skip connection defined by this implementation has its gradient scaled by the value of the surrogate function, which might be a concern depending on the setup.
The authors in [29] argue that the surrogate derivative $g'$ of $g(u_{l,t})$ will typically not implement a function such that $g'(W_{l-n} o_{l-n,t}) = 1$ when $o_{l-n,t} = 1$. Therefore, scaling the derivative of the residual stream by this value could contribute to the vanishing or explosion of the gradient.
This kind of connection has previously been used in [25] with $W_{l-n} = U_{th} = 1$ and in [27] weighted by their threshold-dependent batch normalization (potentially compromising the identity mapping). The S2M connection is represented in Fig. 1 as the green connection.
Spike output to spike output (S2S): The spiking output of a previous layer $l-n$ is added to the spiking output of layer $l$ (Eq. 10):

$$o_{l,t} = g(u_{l,t}) + o_{l-n,t} \tag{10}$$

If the neurons in layer $l$ emit no spikes of their own ($g(u_{l,t}) = 0$), this residual connection implements an identity mapping $o_{l,t} = o_{l-n,t}$.
Additionally, this implementation avoids applying the thresholding function to the residual path. Therefore, when using back-propagation, the contribution of the residual connection will be unaltered by the value of the surrogate function (Eq. 11).
$$\frac{\partial o_{l,t}}{\partial o_{l-2,t}} = \frac{\partial g(u_{l,t})}{\partial u_{l,t}} W_{l-1} \frac{\partial o_{l-1,t}}{\partial o_{l-2,t}} + 1 \tag{11}$$

Regarding the information flow inside the SNN, this kind of connection has some implications that are worth noting. It is implemented as an addition between activation maps, which is a different operation from adding voltages to a membrane and needs to be supported by the substrate implementing it (or else extra synapses will be needed). Moreover, it allows for the generation of non-binary activation maps, as the sum of activations can result in a value greater than 1. To implement this, the substrate must either sum activation maps and communicate non-binary values in the spike activation (as some neuromorphic hardware already supports [33]) or avoid grouping spikes in one synapse by defining multiple individual connections (Eq. 12).
Finally, in network topologies such as our S-ResNet (defined in the following section), we can find situations where the number of neurons $d_1$ in $o_{l,t} \in \mathbb{N}^{d_1}$ is different from $d_2$ in $o_{l-n,t} \in \mathbb{N}^{d_2}$. As proposed in [31], we solve this by applying a 1×1 convolution $f$ to $o_{l-n,t}$ such that $f: \mathbb{N}^{d_2} \to \mathbb{N}^{d_1}$. This is relevant for the S2S connection because, as seen in Eq. 13, by applying this convolution $o_{l-n,t}$ is multiplied by the learnable $W$, which weights the activations, transforming them into non-binary voltage values. The implications of these non-binary spiking activations are no different from those of multiple spikes: they can be implemented as graded spikes in neuromorphic hardware or by defining extra synapses. The formulation for the latter can be seen in Eq. 14, where the contribution of $o_{l,t}$ and $o_{l-n,t}$ to the membrane $u_{l+1,t}$ is split into two different incoming connections.
This kind of connection has been used in [29]. Its implementation is the same as the one in this work for maps at the same resolution, but it differs in the downsample paths. Unlike in our proposal, in [29] a spiking neuron layer is added after the 1×1 convolution. This was avoided in this work in order to eliminate the effect of the surrogate function on the derivatives of the residual path.
The S2S connection is represented in Fig. 1 as the purple arrow.
Voltage to voltage (V2V): The previous two implementations created a residual mapping in the activation map. This residual mapping can also be enforced at the membrane potential level if a V2V connection is defined.
Let the spiking input to a layer $l-n$ be $W_{l-n-1} o_{l-n-1}$ plus a residual input $r_{l-n,t}$. Then, in a V2V implementation, the input that feeds a layer $l-n$ will also become the residual input to the layer $l$ (Eq. 16). In this way, if $W_{l-1} o_{l-1,t} = 0$ and $u_{l,t-1} = u_{l-n,t-1}$, the residual will implement an identity mapping of the membrane potentials such that $u_{l,t} = u_{l-n,t}$. This will also cause $o_{l,t} = o_{l-n,t}$ if the thresholds of the two layers are the same.
Regarding the derivative of the network, differentiating with respect to $o_{l-n-1,t}$ in the same setup as before ($n = 2$) yields Eq. 17. As was the case for S2M, the derivative of the residual path also depends on the surrogate function. Still, in the context of a hierarchical network, compared to an S2M implementation, the surrogate derivative has less influence on this residual path, as $r_{l,t}$ is a function of $r_{l-n,t}$, which does not depend on $\frac{\partial o_{l-n-1,t}}{\partial u_{l-n-1,t}}$. In the case of the S2M implementation the residual is $r_{l,t} = W_{l-n} o_{l-n,t}$, which fully depends on $\frac{\partial o_{l-n,t}}{\partial u_{l-n,t}}$, adding an additional spiking function into the residual path with each residual block.
Finally, notice that implementing the V2V connection has the same effect on the information flow as S2S. This is caused by the dependency of Eq. 15 on $r_{l-n,t}$. In Eq. 18 we unravel this expression in order to show how the voltage sent by the residual connection $r_{l,t}$ is just a sum of post-synaptic potentials (PSPs) from previous layers $W_{l-i \cdot n-1} o_{l-i \cdot n-1,t}$. Therefore, this can be implemented either by defining $(l/n) - 1$ extra connections for each $r_{l,t}$ or by summing the PSPs together and then communicating the voltage value through graded spikes.
The V2V connection is represented in Fig. 1 as the red connection.
From an implementation point of view, this analysis showed how an S2M connection can be accomplished by a single conventional synapse, while S2S and V2V require either defining multiple synapses or performing a special kind of computation. This computation sums spiking activations together for the S2S connection and sums PSPs together in the case of V2V. The resulting value is then transmitted to the membrane of the target neuron. With the neuromorphic hardware available in the present [...]
In this work, we test the three approaches (Section 4), analysing their spiking activity (Fig. 3) and final accuracy (Table 2). We choose S2S for the final implementation, as it provides the most accurate results. This is consistent with the previous theoretical analysis, as S2S is the only solution avoiding spiking functions in the residual path.
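To make the difference between the three connections concrete, the following sketch evaluates a single time-step of each variant on plain Python lists. It is a simplified illustration under stated assumptions: the function names are hypothetical, `psp` stands for the already weighted input $W_{l-1} o_{l-1,t}$, the S2M weight is fixed to the threshold, and the membrane reset is not modelled because only one step is shown.

```python
U_TH = 1.0  # firing threshold shared by all layers (assumed)


def g(u):
    """Thresholding function: 1 where the voltage reaches U_TH, else 0."""
    return [1 if v >= U_TH else 0 for v in u]


def s2m(u_prev, psp, o_skip, leak):
    """S2M: skipped-layer spikes feed the membrane, weighted by U_TH."""
    u = [leak * a + b + U_TH * s for a, b, s in zip(u_prev, psp, o_skip)]
    return g(u)


def s2s(u_prev, psp, o_skip, leak):
    """S2S: skipped-layer spikes are added after thresholding."""
    u = [leak * a + b for a, b in zip(u_prev, psp)]
    return [own + s for own, s in zip(g(u), o_skip)]


def v2v(u_prev, psp, r_skip, leak):
    """V2V: the voltage that fed the skipped layer is added to the membrane."""
    u = [leak * a + b + r for a, b, r in zip(u_prev, psp, r_skip)]
    return g(u)


rest = [0.0, 0.0, 0.0]
o_skip = [1, 0, 1]

# With no input from layer l-1 and an empty membrane, both variants copy the spikes.
print(s2m(rest, rest, o_skip, leak=0.9))
print(s2s(rest, rest, o_skip, leak=0.9))

# When layer l also fires, S2M stays binary while S2S yields a graded value.
psp = [1.2, 0.0, 0.0]
print(s2m(rest, psp, o_skip, leak=0.9))
print(s2s(rest, psp, o_skip, leak=0.9))

# V2V forwards a voltage rather than spikes; only neurons reaching U_TH fire.
print(v2v(rest, rest, [1.3, 0.2, 0.0], leak=0.9))
```

The second S2S call produces a value of 2 in the first position, which is the non-binary activation that requires graded spikes or extra synapses in hardware.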

Network topology
With the residual connection implementation defined, the next choice to be made is the global network architecture. In the non-spiking domain it has already been shown that the original ResNet architecture [31] outperforms feed-forward architectures without residuals; therefore, in order to test whether the same principles apply to SNNs, the obvious choice is to reuse the same topology.
Depending on the resolution and complexity of the dataset to target, the optimal architecture can vary; that is why in [31] the architecture used for the ImageNet dataset and for CIFAR-10 are different. CIFAR images have a resolution of 32×32, while the images are 224×224 for ImageNet (after resizing), meaning that more downsampling operations will be needed in the second one in order to have a comparable receptive field. As we are targeting CIFAR-10, CIFAR-100 and DVS-CIFAR10, we will base our global network architecture on the smaller ResNet proposed for these datasets. The architecture is defined in [31] in a table, such as Table 1.
Regarding the batch normalization (BN) layers in the architecture, regular BN can be used in an SNN, but improved performance has been reported by using Batch Normalization Through Time (BNTT) [19], a time-varying BN that learns different statistics for each time-step. This is consistent with the studies performed in non-spiking RNNs, where works such as [34] argue that the statistics of different time-steps can differ significantly. For that reason, in our final architecture we use BNTT. As further evidence, Table 5 in Section 4 demonstrates the performance gains of using BNTT compared to regular BN. A diagram of the final architecture can be found in Fig. 2.
To the best of our knowledge, this work is the first to implement the aforementioned architecture for SNN training. The works in [25,27] implement alternative topologies with extra fully connected layers and larger numbers of channels in convolutional layers (see the difference in parameters in Fig. 4 in Section 5). The authors in [29] define their main network for ImageNet and reuse the original ResNet's topology for this dataset, which is different from the CIFAR-10 one. Additionally, they propose a residual network targeting DVS-CIFAR10. Compared to ours, this network is wider and shallower (resulting in a larger parameter count), it relies on max pooling instead of strided convolution for downsampling, and it processes inputs of 128×128 resolution. Apart from that, those three networks differ from ours in the normalization strategies, as they use time-averaged statistics where we use BNTT, and also in the residual connection implementation.

Boosting strategies
Boosting techniques combine the predictions of multiple weak classifiers to create a stronger one. Previous work in SNNs [28] has already applied simple versions of this strategy by converting the classification layer into a voting layer.
We tested the same approach as [28] and adapted the last fully-connected layer to have 10 × C neurons, where C is the number of classes. Then an average pooling layer of kernel size 10 and stride 10 reduces the dimension back to the number of classes C. This process computes the score of each class as the average of 10 neuron states, which can be seen as a voting scheme for 10 different sub-networks.
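The voting scheme can be sketched in a few lines. This is an illustrative stand-in for the average pooling layer rather than the network code itself: the function name `voting_layer` is hypothetical, and it assumes that the 10 neurons voting for a class are stored contiguously.

```python
def voting_layer(neuron_states, num_classes, votes_per_class=10):
    """Average consecutive groups of neuron states into one score per class.

    Equivalent to 1D average pooling with kernel size and stride both equal
    to votes_per_class.
    """
    assert len(neuron_states) == num_classes * votes_per_class
    scores = []
    for c in range(num_classes):
        group = neuron_states[c * votes_per_class:(c + 1) * votes_per_class]
        scores.append(sum(group) / votes_per_class)
    return scores


# Three classes, ten voting neurons each.
states = [1.0] * 10 + [0.5] * 10 + [0.0] * 10
print(voting_layer(states, num_classes=3))
```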
In Section 4, Tables 6 and 7 demonstrate the effects of adding the boosting layer. Some networks achieved improved performance when using this strategy, while others did not, so we keep this layer only in those cases where it is beneficial. In our final results, only the CIFAR-10 network uses it.

Training framework
Our network is trained to perform image classification through supervised learning. In order to allow for this classification, the last neuron layer is defined with no leak and cannot spike. Then the voltage accumulated in the layer after T time-steps divided by T is considered the output value.
The output class scores are compared to the ground truth by means of a cross-entropy loss (Eq. 19), where $C$ is the number of classes, $u_{i,T}$ the voltage of neuron $i$ after the last time-step, and $y_i$ are the ground truth labels:

$$L = -\sum_{i=1}^{C} y_i \log\left(\frac{e^{u_{i,T}}}{\sum_{k=1}^{C} e^{u_{k,T}}}\right) \tag{19}$$

With the loss defined, the weight updates for the learning process are calculated through BPTT.
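The readout and loss described above can be sketched as follows for a single sample. This is a minimal illustration with hypothetical function names; it assumes the time-averaged voltage is passed through a softmax, and that the label is given as an integer class index rather than a one-hot vector.

```python
import math


def readout_scores(input_currents):
    """Accumulate input in non-leaky, non-spiking output neurons over T steps
    and return the accumulated voltage divided by T for each class."""
    num_steps = len(input_currents)
    voltage = [0.0] * len(input_currents[0])
    for step in input_currents:
        voltage = [v + x for v, x in zip(voltage, step)]
    return [v / num_steps for v in voltage]


def cross_entropy(scores, target):
    """Softmax cross-entropy for one sample, computed in a stable way."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


# Two output neurons observed for T = 4 time-steps.
currents = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
scores = readout_scores(currents)
print(scores, cross_entropy(scores, target=0))
```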
The final voltage at each layer is dependent on the contribution of all previous time-steps; therefore, the derivative of the loss function with respect to the network weights can be defined as the sum in Eq. 20 for neurons in the output layer, and as the sum in Eq. 21 for neurons in the hidden layers.
where $p_{i,t}$ is the current transmitted through the synapses after applying the weights. Then, taking into account the temporal dependency of the membrane potential along with its dependency on input spikes, these derivatives can be expanded over time-steps. Notice that $\frac{\partial o_{i,t}}{\partial u_{i,t}}$ requires computing the derivative of the thresholding function, which is non-differentiable. We solve this by using a triangle-shaped surrogate gradient. As in [19], we set $\alpha = 0.3$.
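The idea of a triangle-shaped surrogate can be illustrated as below. The exact expression is not reproduced in this text, so the form used here, $\alpha \max(0, 1 - |u - U_{th}| / U_{th})$, is an assumption modelled on common triangular surrogates, and the function names are hypothetical.

```python
def spike(u, threshold=1.0):
    """Forward pass: the non-differentiable thresholding function."""
    return 1.0 if u >= threshold else 0.0


def triangle_surrogate(u, threshold=1.0, alpha=0.3):
    """Backward pass stand-in for d(spike)/du.

    Peaks at alpha when u equals the threshold and decays linearly to zero
    (assumed normalisation by the threshold).
    """
    return alpha * max(0.0, 1.0 - abs(u - threshold) / threshold)


for u in (0.0, 0.5, 1.0, 1.5, 2.5):
    print(u, spike(u), triangle_surrogate(u))
```

In an auto-differentiation framework, the same pair would typically be wrapped in a custom function whose forward pass calls the threshold and whose backward pass multiplies the incoming gradient by the surrogate.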
In practice this can be easily implemented using auto-differentiation tools such as PyTorch [35].

Input preprocessing
Frame-based datasets: Frame-based images need to be encoded into spikes in order for an SNN to process them. Works like [19] use a Poisson spike generation process which transforms the image frame into a sequence of spikes. Other works [27,28] feed the unprocessed frame to the first SNN layer, making the pixel intensity equivalent to a constant input voltage for the first neurons.
The latter allows for better results, as all of the information is presented at each time-step, while the former will require many steps to represent all of the information and will add variability to the data. Still, we believe using a spike generation process is a better representation of a scenario where the input data is spiking information (such as the data coming from event cameras), so choosing one method or another should depend on the objective of the simulation. Therefore in this work we use both approaches in order to compare results. Our best performing networks are trained without Poisson encoder in order to maximize accuracy. Additionally, images are always normalized with respect to the statistics of the dataset.
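The difference between the two input strategies can be illustrated with a small sketch of Poisson-style encoding. The function name `poisson_encode` is hypothetical, and the sketch assumes pixel intensities already scaled to [0, 1], which is not the case after dataset normalization.

```python
import random


def poisson_encode(pixels, num_steps, rng):
    """Turn intensities in [0, 1] into binary spike frames, one per time-step.

    Each pixel fires independently at every step with a probability equal
    to its intensity.
    """
    return [[1 if rng.random() < p else 0 for p in pixels]
            for _ in range(num_steps)]


rng = random.Random(0)
frames = poisson_encode([0.0, 0.5, 1.0], num_steps=200, rng=rng)

# Empirical firing rate of each pixel; it approaches the intensity only
# as the number of time-steps grows.
rates = [sum(frame[i] for frame in frames) / len(frames) for i in range(3)]
print(rates)
```

Feeding the frame directly corresponds to presenting the same `pixels` vector as a constant input at every step, which is why all the information is available from the first time-step in that setting.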
Neuromorphic datasets: Data produced by neuromorphic cameras represent the changes in the scene, and these are often presented in event format. An event is a discrete package of information indicating location, time-stamp and polarity (i.e. change in brightness).
We use the events to build frames containing spiking activations. Such frames have two channels, one for positive polarity and one for negative, and they accumulate all events occurring in a time window. The size of the time window is defined by the number of time-steps we want to have for each sequence. We implement this process using the SpikingJelly library [36].
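The accumulation of events into frames can be sketched as follows. This is not the SpikingJelly API, only a plain Python illustration with a hypothetical function name; it assumes events given as (time, x, y, polarity) tuples, windows of equal duration, and frames that store event counts.

```python
def events_to_frames(events, num_steps, height, width, duration):
    """Accumulate (t, x, y, polarity) events into num_steps two-channel frames.

    Channel 0 counts positive-polarity events, channel 1 negative ones.
    """
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_steps)]
    window = duration / num_steps
    for t, x, y, polarity in events:
        # Events at the very end of the recording fall into the last window.
        step = min(int(t / window), num_steps - 1)
        channel = 0 if polarity > 0 else 1
        frames[step][channel][y][x] += 1
    return frames


events = [(0.1, 0, 0, 1), (0.2, 0, 0, 1), (0.9, 1, 1, -1)]
frames = events_to_frames(events, num_steps=2, height=2, width=2, duration=1.0)
print(frames[0][0], frames[1][1])
```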
Data augmentation: Frame based datasets were augmented using random horizontal flips and random crops.

Hyper-parameters
The performance of the proposed network depends on certain hyper-parameters, such as the leak factor of the membrane, the number of time-steps or the learning rate for training. The optimal value of these parameters varies depending on the architecture of the network, the training procedure and the task at hand. That is why, in order to properly assess how useful an architecture or a training method is, we first need to find its optimal hyper-parameter setup.
We address this challenge by using BOHB [37], a hyper-parameter optimization technique that combines Bayesian Optimization (BO) and Hyperband (HB), a multi-armed bandit strategy. Using this method, we optimize the hyper-parameters for S-ResNet38 on the CIFAR-100 dataset. The learning rate for this training is divided by 10 at 70%, 80% and 90% of the training process. The resulting hyper-parameters are also used for the rest of the networks and datasets, as with the hardware available we could not afford to run an individual search per setup.
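The step schedule described above can be written as a small helper. The function name and the epoch-based notion of training progress are assumptions of this sketch; integer arithmetic is used for the milestones to avoid floating-point edge cases.

```python
def learning_rate(epoch, total_epochs, base_lr):
    """Step schedule: divide the rate by 10 on reaching 70%, 80% and 90%
    of the training process."""
    # Count how many milestones have been reached (epoch / total >= m / 10).
    drops = sum(epoch * 10 >= m * total_epochs for m in (7, 8, 9))
    return base_lr / (10 ** drops)


for epoch in (0, 69, 70, 80, 90):
    print(epoch, learning_rate(epoch, total_epochs=100, base_lr=0.0268))
```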
The best performing parameters are: leak = 0.874, timesteps = 50, learning rate = 0.0268 for a batch size of 21.
Notice that the target of the search was only to optimize accuracy, therefore the amount of time-steps tends to be maximized as it has a monotonically non-decreasing relationship with the accuracy. Section 5.2 demonstrates the effects of reducing the amount of time-steps.

Experiments
In order to maximize the accuracy of our method, we conducted a search for the key components in state of the art architectures that allow for improved performance. In this section we present empirical results obtained from testing these components in our networks. The results from these comparisons allow us to compose a network which outscores previous approaches in multiple datasets.
Residual connection implementation: In Section 3.2.1 three ways of implementing residual connections in SNNs were defined. We tested the performance of S-ResNet38 with each one of them (Table 2). The highest accuracy is obtained by the S2S connection. This result is consistent with our theoretical analysis, as the residual path in S2S does not go through spiking functions, therefore it allows a better flow of the gradient during back-propagation. Still, the performance of the V2V implementation is very close. On the other hand, the S2M implementation has a substantially lower accuracy. This decrease in accuracy could potentially be attenuated with further hyper-parameter search and improved optimization, but we hypothesize that such a setup is more difficult to find due to the less convenient gradient properties of S2M.
Apart from that, by adding any of these three residual connections, the network is expected to propagate more spikes to deeper layers. In order to analyse this effect, we averaged the spiking activity of the networks across the test set of the CIFAR-100 dataset (Fig. 3). We also display the spiking activation obtained with a non-residual network (spiking VGG11) for comparison.
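As an illustration of how such an activity map can be computed, the sketch below averages recorded spikes over samples and neurons for each layer and time-step. The recording format (one binary array of shape (samples, time-steps, neurons) per layer) and the function name are assumptions, not the format used in our code.

```python
import numpy as np

def activity_map(spikes_per_layer):
    """Fraction of active neurons per layer and time-step, averaged over samples.

    spikes_per_layer: list with one binary array per layer, each of shape
    (samples, time_steps, neurons_in_layer). Layers may differ in neuron count.
    Returns an array of shape (num_layers, time_steps).
    """
    return np.stack([s.mean(axis=(0, 2)) for s in spikes_per_layer])

# Toy recording: 2 samples, 3 time-steps, layers with 4 and 2 neurons.
layer_a = np.zeros((2, 3, 4))
layer_a[:, 0, :] = 1.0    # every neuron fires at the first time-step
layer_b = np.zeros((2, 3, 2))
layer_b[0, 2, 0] = 1.0    # one spike, first sample, last time-step
activity = activity_map([layer_a, layer_b])
```

Because each row is normalised by the number of neurons in its layer, layers of different resolution remain comparable in the same map.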
Before starting the comparison, it is important to understand the effect of BNTT on the spiking activation. As observed by [19], by learning a different weight γ per time-step, the network can scale the activation of each layer depending on the time-step. Because of this, it tends to localise the spiking activity of each layer in a certain time range. The value of this weight for each network is visualized in the second row of Fig. 3.
When looking at the S-ResNet networks, we observe how there are more layers active at each time-step, as the spiking connections propagate activations to deeper layers, bypassing the BNTT weighting. The effect of BNTT is more noticeable in the S2M implementation and less in V2V and S2S. Still, all of them learn a time-dependent weight distribution, indicating that, according to back-propagation, that is the optimal solution for image classification.
Apart from that, S-ResNet activity maps show a characteristic striped pattern. This is caused by how the residual connections always skip one layer, connecting only even-numbered layers (as defined in [31]).
Finally, the more abrupt changes in activation percentage localized in layers 14 and 26 are caused by the resolution change, which changes the total number of neurons in the layer and makes the residual connection go through a 1×1 convolution.
Overall, the contribution of the residual connections behaves as expected. It propagates the spiking activations to deeper layers, which allows the back-propagation algorithm to successfully train deeper architectures. Additionally, we see how the spiking activity is higher for S2S implementations compared to V2V or S2M, as the "multiple spikes" behaviour favours sending higher amounts of voltage between layers. This can be relevant for applications which are sensitive to the volume of spiking activity. In those tasks, the optimal choice for the residual implementation can vary, as there is a compromise between accuracy and volume of spikes. In cases where a lower network activation is needed, V2V poses an efficient alternative to S2S with a very similar accuracy. Regarding their implementation, S2S and V2V require defining extra synapses per residual connection or implementing a spike/PSP sum; therefore, S2M is the most suitable option for applications which want to avoid this.
Network depth: The residual connections in S-ResNet allow increasing the depth of the network without the concern of catastrophic accuracy degradation. As expected, this allows us to train very deep architectures. Table 3 presents the classification accuracy in CIFAR-10 achieved by the S-ResNet with different depths and the same training hyperparameters. The results show how the accuracy grows from 20 to 38 layers, but stays roughly the same from 38 to 44.
Given these results, for the rest of our experiments we choose S-ResNet38 as the default network. Still, the optimal depth of the network changes depending on the dataset and task to solve, therefore we encourage those researchers looking for optimal performance to tune this parameter for their specific task.
Spike generation for frame based datasets: As mentioned in Section 3.4, when working with frame based datasets, we tested two different methods for the spike encoding process. One consists in transforming the intensity values into spikes by means of a Poisson spike generation process. The other consists in transforming them by means of the first convolutional layer (i.e. feeding the raw image to the network).
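The two encoding options can be contrasted with a short NumPy sketch. It assumes intensities normalised to [0, 1] and a per-step firing probability equal to the pixel intensity; both function names are hypothetical, and the second function only shows the input that the first convolutional layer would receive at each time-step.

```python
import numpy as np

def poisson_encode(image, num_steps, rng):
    """Binary spike trains of shape (num_steps, *image.shape).

    At every time-step each pixel fires independently with probability
    equal to its intensity, which must lie in [0, 1].
    """
    image = np.asarray(image, dtype=np.float32)
    return (rng.random((num_steps,) + image.shape) < image).astype(np.float32)

def direct_encode(image, num_steps):
    """Repeat the raw image at every time-step, leaving the spike
    generation to the first convolutional layer of the network."""
    image = np.asarray(image, dtype=np.float32)
    return np.broadcast_to(image, (num_steps,) + image.shape)

rng = np.random.default_rng(0)
image = [[0.0, 1.0], [0.5, 0.25]]
spikes = poisson_encode(image, num_steps=2000, rng=rng)
repeated = direct_encode(image, num_steps=4)
```

With the Poisson process the intensity is only recovered as an average over many time-steps, whereas the repeated raw image carries the exact value at every step.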
As expected, the results in Table 4 show how encoding by means of the first convolutional layer gives a better result than generating spikes as a Poisson process. In order to maximize accuracy, for all of our experiments we use the encoding by convolution approach.
Batch normalization strategies: We compare performances using time-dependent BN statistics versus time averaged statistics. Table 5 shows how BNTT outperforms regular BN for the same network.
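The difference between the two normalization strategies can be sketched in NumPy. This is a simplified training-mode illustration, not the implementation of [19]: running statistics are omitted, no shift term is used, and the tensor layout (time-steps, batch, features) is an assumption.

```python
import numpy as np

def bn_time_dependent(x, gamma, eps=1e-5):
    """Normalise each time-step with its own batch statistics.

    x: (T, batch, features); gamma: (T, features), one scale per time-step.
    """
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma[:, None, :] * (x - mean) / np.sqrt(var + eps)

def bn_time_averaged(x, gamma, eps=1e-5):
    """Normalise with statistics shared by all time-steps.

    x: (T, batch, features); gamma: (features,), one scale for every step.
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Toy input whose mean grows with the time-step (0, 5 and 10).
x = rng.normal(size=(3, 8, 2)) + 5.0 * np.arange(3)[:, None, None]
per_step = bn_time_dependent(x, gamma=np.ones((3, 2)))
shared = bn_time_averaged(x, gamma=np.ones(2))
```

In the toy input the time-dependent version centres every time-step separately, while the time averaged version leaves the early steps below zero and the late steps above it.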
Boosting layer: As introduced in Section 3.2.3, a simple boosting layer can improve the accuracy of the system in some cases. Tables 6 and 7 show the effect of this component on the accuracy of our networks. In the CIFAR-10 dataset the accuracy is improved by using this technique, while in the CIFAR-100 one, where we have more classes, increasing the size of the last fully connected layer in order to perform boosting ends up being detrimental.
Parametric Leaky Integrate-and-Fire: The authors in [28] propose to learn the leak coefficient of the LIF neurons directly through back-propagation as another parameter of the network. By doing this they can also afford to learn a different leak value for each layer. They call this method the Parametric Leaky Integrate-and-Fire (PLIF) neuron. Table 8 shows our results after training S-ResNet38 with PLIF and with a single leak coefficient learned through hyperparameter search.
We do not achieve our best results using the PLIF neuron; still, we believe this strategy is a very efficient way of finding this hyper-parameter. For this reason, we test it again for the search of a shared leak value instead of calculating a different one per layer. Table 9 shows the difference between the leak value found through hyper-parameter search and the one found by back-propagation. It is interesting to see how the two values differ by a considerable amount, with the one found by back-propagation giving a slower leakage than the one found through the BOHB method.
Still, both values perform well when the network adapts its weights to work with them. The performance comparison between them can be found in Table 10, where we compare our network trained with the BOHB optimized value to an identical network which learned the shared leak value through PLIF.
Extra training data: In the deep learning domain, most state of the art performances in computer vision are achieved by means of fine tuning. This strategy consists in taking a network that has already been trained in a different dataset and then training it further for the task at hand. In the visual domain this strategy works well, as visual data has many transferable features.
We test this strategy by pre-training our networks with CIFAR-100 and then fine-tuning for DVS-CIFAR10 and CIFAR-10. The results are presented in Table 11 and Table 12. We obtain higher accuracy in all cases except for the larger S-ResNet in CIFAR-10. Moreover, these trainings converge faster, making pre-training a great solution for any further work building on top of these feature extractors. In our public code, users can find our pre-trained weights so that they can perform fine-tuning in any future system building from this one.

DVS-CIFAR10 image resolution: The event streams found in the DVS-CIFAR10 dataset were generated by recording 10,000 images from the original CIFAR-10 dataset with a DVS camera while applying a repeated closed-loop smooth movement [38]. Despite the resolution of CIFAR-10 being 32×32, the DVS camera resolution was 128×128 and therefore the resulting event maps also have a 128×128 resolution. As our S-ResNet architecture is optimized for inputs of size 32×32, in our previous experiments we downsampled the DVS-CIFAR10 dataset to that resolution.
In most datasets, downsampling the input causes information loss and therefore accuracy degradation. In order to test if this applies to the unique case of DVS-CIFAR10, we tested the performance using 64×64 and 128×128 resolution as input. We adapt the architecture of the network to the new input sizes by using, in the case of 64×64, a stride of 2 in the first convolution (c32k3s2), and in the case of 128×128, a stride of 2 and a 5×5 kernel in the first convolution (c32k5s2) followed by a max pooling with stride 2 and kernel 2 (MPk2s2). Table 13 presents the test results with the three resolutions. It can be seen how the best performance is obtained when using a 64×64 resolution. We do not obtain any improvement by using the full 128×128 resolution. Our best architecture for full resolution uses a bigger kernel and max pooling, similarly to how [31] handles the bigger ImageNet frames. We hypothesize that this setup does not bring improved performance because the down-scaled 64×64 events already contain the necessary information and therefore the bigger 128×128 network just brings unnecessary complexity.
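The following arithmetic sketch checks that both adapted input stems bring the feature maps back to the 32×32 size that the rest of the architecture expects. The padding values (kernel // 2 for the convolutions, 0 for the pooling) and the stride-1 3×3 stem for the 32×32 input are assumptions, as they are not specified in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_output(resolution):
    """Feature map size after the input stem for each tested resolution."""
    if resolution == 32:
        # Assumed default stem: 3x3 convolution with stride 1.
        return conv_out(32, kernel=3, stride=1, padding=1)
    if resolution == 64:
        # c32k3s2
        return conv_out(64, kernel=3, stride=2, padding=1)
    if resolution == 128:
        # c32k5s2 followed by MPk2s2
        size = conv_out(128, kernel=5, stride=2, padding=2)
        return conv_out(size, kernel=2, stride=2, padding=0)
    raise ValueError("unsupported resolution")

sizes = {r: stem_output(r) for r in (32, 64, 128)}
```

Under these assumptions all three variants feed 32×32 maps to the remaining layers, so only the stem differs between the compared networks.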

State of the art comparison
In this section we compare our final results to the current state of the art for image classification in the CIFAR-10, CIFAR-100 and DVS-CIFAR10 datasets.
As noted in [28], most previous works train on the training set, evaluate the test set at each step, and then report the highest test accuracy obtained. We consider this approach to be reporting validation accuracy rather than test accuracy. In our setup, we evaluate the test set after all the training epochs, without using its value for tuning the training. We also evaluate validation accuracy in the same manner as the previous methods in order to make a fair comparison.
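The distinction between the two reporting protocols can be made explicit with a small sketch. The accuracy curve below is made up purely for illustration and the function name is hypothetical.

```python
def reported_accuracies(test_acc_per_epoch):
    """Contrast the two reporting protocols on one training run.

    test_acc_per_epoch: accuracy on the test set measured after every epoch.
    """
    epochs = range(len(test_acc_per_epoch))
    best_epoch = max(epochs, key=lambda e: test_acc_per_epoch[e])
    return {
        # Highest value over training: the test set influenced the choice
        # of epoch, so we refer to it as validation accuracy.
        "validation": test_acc_per_epoch[best_epoch],
        # Value after the final epoch: the test set played no role in training.
        "test": test_acc_per_epoch[-1],
        "best_epoch": best_epoch,
    }

# Made-up accuracy curve, for illustration only.
report = reported_accuracies([0.60, 0.71, 0.74, 0.73])
```

The validation figure can never be lower than the test figure for the same run, which is why comparing one protocol against the other would be unfair.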
The developed S-ResNet outperforms all previous SNN methods in classification accuracy for the CIFAR-10 and CIFAR-100 datasets (Table 14). In the DVS-CIFAR10 dataset, we find that the validation accuracy for the best performing network outperforms ours, but when measuring test score, ours is superior.
Before our work, in the CIFAR-10 and CIFAR-100 datasets, the most accurate network was a conversion method. These new results prove how directly training an SNN can surpass the accuracy of conversion methods.
Table 14. Image classification validation performance on CIFAR-10, CIFAR-100 and DVS-CIFAR10. Our S-ResNet38 in CIFAR-10 and CIFAR-100 stands for the wider version of the architecture defined in Section 3.2.2 with n = 6, 32 base filters, and boosting layer. In DVS-CIFAR10 we use the 16 filters version without boosting and with the pre-training step. We refer to the residual network in [27] as S-ResNet', as it follows a different architecture than our S-ResNet.
In Table 15 we compare the performance of our S-ResNet to its non-spiking ANN version. We compare the versions with 16 and 32 base filters without boosting. We can see how the performance of the trained SNN is not far from its non-spiking counterpart, demonstrating how improvements in SNN training can push these technologies to levels comparable with conventional deep learning.
Compared to the previous trainable SNN architectures, our network uses far fewer parameters. Figs. 4, 5 and 6 show a map of the accuracy versus the number of parameters. The main cause for the difference in parameters is that our network has a smaller number of channels in convolutional layers and only a single fully-connected layer. Thus, even though our network is deeper than the others, it is actually lighter in terms of synaptic connections.

The latency-accuracy compromise
Apart from raw accuracy, the efficiency of algorithms is a major factor when deploying systems in the real world. For image classification in SNN, the amount of time-steps used for prediction regulates a trade-off between accuracy and time or volume of computations.
Figure 6. DVS CIFAR-10 accuracy versus number of parameters (legend: Val - Ours S-ResNet38; Test - Ours S-ResNet38; Val - Fang's Wide-7B-Net). We compare our network to the best performing trainable SNNs and the other spiking ResNets. The number of parameters for other works was counted using their publicly available code. The "Val" prefix stands for validation accuracy while "Test" stands for testing accuracy.
In order to elucidate the effect of this trade-off in our system, in Table 16 we present the accuracy of S-ResNet38 with different numbers of time-steps. Starting from our best network trained with 50 time-steps, we test how the accuracy degrades when dropping the last 10/20/30/40 steps. Additionally, we compare this to the result obtained by directly training with fewer time-steps. The results show how for CIFAR-100, the network trained with 20 steps performs better than dropping the last 30 steps of a 50-step network. Still, this same experiment in the CIFAR-10 dataset shows the opposite result by a close margin, indicating that the 50-step network had a more complete training.
At 10 steps, the degradation of the 50-step network becomes more obvious. Interestingly, the network trained with 20 time-steps does not degrade as much, as it is only losing half of its computations and is therefore still able to extract the core visual features.
Finally, we hypothesise that the optimal leakage coefficient for the neurons might be correlated to the number of time-steps the network is run for. Given that the leak factor that we use was obtained through the hyper-parameter search process, and given that this process prioritized large numbers of time-steps, we believe the optimal leak factor for 20-step inferences could be different from the one we are using. We empirically test this by training the network again with PLIF neurons, a process that allows us to optimize the leak value in a single training run. The results, as seen in Table 17, prove how we obtain a better performance when the leak coefficient is optimized for the number of inference steps, confirming our hypothesis.
From this study we learn that the optimal solution is to train with the same number of time-steps that we want to target at inference time and to optimize hyperparameters such as the leak factor for this same objective. Still, our SNNs can withstand the effect of early stopping, retaining most of their accuracy even when large percentages of their computation steps are dropped. This allows providing early estimates in time-sensitive tasks or reducing computational cost.
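Early stopping at inference time can be illustrated with the sketch below. It assumes that the classification decision is taken from the output accumulated over the time-steps run so far, which is an assumption of this example rather than a description of our implementation; the per-step outputs are toy values.

```python
import numpy as np

def predictions_by_step(outputs):
    """Predicted class if inference were stopped after each time-step.

    outputs: array of shape (time_steps, classes) with the contribution of
    the output layer at every time-step. Entry t of the result is the class
    obtained by accumulating only the first t + 1 steps.
    """
    return np.cumsum(outputs, axis=0).argmax(axis=1)

# Toy per-step outputs for a 3-class problem over 4 time-steps.
outputs = np.array([[1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 3.0]])
early = predictions_by_step(outputs)
```

Reading the prediction at an earlier entry corresponds to dropping the last time-steps, trading accuracy for latency and computation.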

Conclusions
In this paper we presented a new SNN architecture which outperforms the previous state of the art in different image classification datasets. This system is the product of an in-depth study on spiking residual connections and design choices based on the empirical results from our experiments. These experiments demonstrate the effects of multiple design choices on the final performance. On top of that, the analysis performed on residual connections sheds new light on the effects of these connections in terms of network activity and hardware requirements. The lessons learned from these studies also become a guide for SNN design, as they allow informed choices when building a visual feature extractor.
Table 16. Influence of the number of time-steps on the validation accuracy. Results of the evaluation of the best performing S-ResNet38 with boosting. Training time-steps specifies the number of steps used during training, inference time-steps the steps used for inference. If the inference number is smaller than the training one, early stopping is applied and the last N time-steps (and learned BNTT layers) are not used. For comparison, the training is also reproduced with 20 time-steps. Clarification: The architecture is the same, but the results for CIFAR-100 use the weights trained in CIFAR-100 and the CIFAR-10 results use the weights trained in CIFAR-10.
The results of this work demonstrate how SNNs do not need to use conversion methods in order to maximize their accuracy. Additionally, they contribute to pushing their performance closer to that of non-spiking deep learning. From here, we hope that new applications can benefit from increased accuracy by fine tuning our networks and that more experiments can follow in order to keep pushing the SNN state of the art.