
mlGeNN: accelerating SNN inference using GPU-enabled neural networks


Published 25 March 2022. © 2022 The Author(s). Published by IOP Publishing Ltd.
Focus Issue on Machine Learning for Neuromorphic Engineering
Citation: James Paul Turner et al 2022 Neuromorph. Comput. Eng. 2 024002. DOI: 10.1088/2634-4386/ac5ac5

Abstract

In this paper we present mlGeNN—a Python library for the conversion of artificial neural networks (ANNs) specified in Keras to spiking neural networks (SNNs). SNNs are simulated using GeNN with extensions to efficiently support convolutional connectivity and batching. We evaluate converted SNNs on CIFAR-10 and ImageNet classification tasks and compare the performance to both the original ANNs and other SNN simulators. We find that performing inference using a VGG-16 model, trained on the CIFAR-10 dataset, is 2.5× faster than BindsNet and, when using a ResNet-20 model trained on CIFAR-10 with FewSpike ANN to SNN conversion, mlGeNN is only a little over 2× slower than TensorFlow.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

For image classification, deep convolutional neural networks (CNNs), trained on large datasets such as ImageNet2012 [1] are the gold standard. However, even after a CNN has been trained, using it for inference requires on the order of millions (if not billions) of multiply-accumulate operations (see Hanif et al [2], figure 13.5) to propagate real-valued unit activations through many layers of trained weights. This results in computation and energy [3] requirements that make them unsuitable for many mobile and edge-computing applications and raises concerns about the carbon footprint of neural network algorithms. Various approaches have been employed to reduce compute time and energy cost, including simplifying model structure [4], using low-precision data types [5] and even specialised accelerator hardware [6].

Spiking neural networks (SNNs) represent a more fundamental change of approach. Taking closer inspiration from biological brains, neurons in SNNs communicate using spatio-temporal patterns of discrete events known as 'spikes', rather than the real-valued 'rates' exchanged by units in deep networks. Because typically only a small subset of neurons in an SNN are active (spiking) at any given time, simulating an SNN model for a single timestep requires significantly fewer operations. Furthermore, because each spiking neuron's activation is binary, propagating it through a layer of weights requires only addition rather than multiply-accumulate operations, which may allow much more efficient hardware implementations. A wide variety of specialised neuromorphic hardware has been developed to exploit these properties, using technologies ranging from analogue electronics [7, 8] to standard CMOS ASICs [9–11] and fully-programmable CPUs [12, 13].

Numerous techniques for directly training SNNs have been developed. Notably, back-propagation can be applied to exact spike times [14–16] or the non-differentiable transfer function of each spiking neuron can be replaced with a 'surrogate gradient' function, allowing back-propagation through time [17, 18] or more biologically-plausible learning rules [19–21] to be applied to spiking models. However, none of these techniques are yet capable of training deep feedforward SNNs. For these models, an alternative approach is to convert fully-trained traditional deep networks to SNNs. The most common method is to translate the activation of the units in a deep network to firing rates of spiking neurons in an SNN of the same geometry [22–24]. Although good inference performance has been obtained using transfer methods of this kind, the resulting SNNs need to be simulated for hundreds if not thousands of timesteps before a classification is obtained. This overhead cancels out the lower cost of each SNN timestep compared to each ANN step. An alternative approach is to translate the deep network model in such a way that the activations of ANN units are encoded in the spike times of the neurons in the SNN model [25, 26]. By using precise spike times, activations can be represented in many fewer timesteps allowing much more competitive speed of inference.

SNNs are likely to be ultimately deployed on specialised neuromorphic hardware, but researchers still need efficient tools for developing the conversion and training algorithms on standard hardware. There are a large number of SNN simulators [27–30] but most of them are designed for computational neuroscience applications and, as such, lack native support for common machine learning building blocks like convolutional layers and batching. Therefore, many ML researchers have chosen to develop alternative spike-based ML libraries [31–34] on top of ML platforms such as PyTorch [35], TensorFlow [36] or JAX [37], so that they can continue to take advantage of the ML infrastructure these platforms provide. These SNN packages each have their own strengths and weaknesses, and many specialise in simulating specific classes of models. For instance, SpykeTorch [34] is a Python package which specialises in training SNN models where each neuron produces at most one spike per stimulus whereas BindsNET [31] aims to be a more general purpose SNN simulation package.

The underlying ML platforms all work by assembling computational graphs from operations, such as matrix multiplications and reductions, operating on a tensor datatype. These operations are implemented as highly-optimised GPU kernels and, to ensure that they fully occupy GPU resources and to counteract the overheads of launching large numbers of kernels, operations are 'batched' across an additional tensor dimension. This is a very efficient approach for simulating feedforward ANNs as only a single traversal of the model is required for each input tensor. However, the tensor types employed have only minimal support for sparsity, meaning that, to simulate SNNs within these frameworks, the activity of spiking neurons is typically represented as dense tensors with many zeros (no spikes) and a few ones (spikes). These tensors are then propagated through the weights of the network using the standard matrix multiplication based approach. Therefore, each SNN timestep requires the same number of arithmetic operations as a complete traversal of an equivalent ANN. As SNN simulations involve processing multiple timesteps, they will typically be slower than an equivalent ANN by a factor equal to the number of required timesteps when using libraries built on top of standard ML platforms. Nonetheless, due to the highly-optimised kernels and powerful optimization tools provided by modern ML frameworks, this approach has proved popular for simulating small models.
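
As an illustration (not taken from any of the cited libraries), the following NumPy sketch shows why this is the case: the spike vector is stored densely, so the cost of one timestep is the same whether 2% or 100% of the neurons fire.

    import numpy as np

    rng = np.random.default_rng(0)
    batch, n_pre, n_post = 32, 1024, 512
    weights = rng.standard_normal((n_pre, n_post)).astype(np.float32)

    # Dense representation of spikes: mostly zeros, a few ones
    spikes = (rng.random((batch, n_pre)) < 0.02).astype(np.float32)

    # One SNN timestep in a tensor-based SNN library: a full dense matrix
    # multiplication, costing the same as one ANN layer traversal
    post_input = spikes @ weights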

In this work we present mlGeNN—a framework for converting feedforward ANN models, defined in Keras, to SNN models. It is freely available online at https://github.com/genn-team/ml_genn. Under the hood, rather than using a tensor-based ML library to provide GPU acceleration, mlGeNN uses PyGeNN [38]—a Python interface to the GeNN GPU-accelerated SNN simulator [39]—which has previously demonstrated excellent performance at accelerating a range of computational neuroscience models [40, 41]. We have extended GeNN with two methods for efficiently implementing convolutional connectivity. Firstly, we have used the 'procedural connectivity' approach, previously employed for simulating extremely large computational neuroscience models [41], to directly implement 2D convolutional connectivity. Secondly, we have implemented a new parallelism strategy tailored to the sparse structure of 2D convolutional connectivity. Finally, we have added batching support to GeNN. With these improvements, and because GeNN has been optimised for the spatial sparseness of connections and the temporal sparseness of spiking, it can considerably increase the simulation speed of SNNs. However, as we will demonstrate below, in spite of large improvements over standard ML platforms, the multiple timesteps required by SNN simulations mean that they remain slower than ANNs on GPU hardware.

2. Methods

2.1. GPU architecture

GeNN (and therefore mlGeNN) utilises NVIDIA graphics processing units (GPUs) to compute neuronal and synaptic updates. As such, GPU SNN simulations using GeNN have a significant performance advantage over equivalent CPU implementations [40]. NVIDIA provides the CUDA language—an extension of the C++ language—for writing programs which will run on NVIDIA GPUs.

An obstacle to CUDA and other GPU programming is the domain expertise required to maximise an application's GPU resource usage. While CPU architectures are optimised for low-latency computation, with fast local caches and branch prediction but just a few processor cores, GPU architectures are optimised for computational throughput, with vast pools of arithmetic cores and complex hierarchical memory systems to keep them fed with data. At the highest level of the memory hierarchy, global DRAM memory has the largest capacity but also the lowest memory bandwidth and highest latency. Special care needs to be taken to ensure that global memory reads and writes follow strict patterns (referred to as 'coalesced' memory access) to utilise the GPUs' DMA circuitry most efficiently and minimise access latency. As shown in figure 1, the computational resources of NVIDIA GPUs are arranged into units called 'streaming multiprocessors' (SMs), where each SM has a pool of relatively fast shared and cache memory, a number of arithmetic cores and a large register file.

Figure 1. Organisation of arithmetic and memory compute resources into 'SM' units on NVIDIA GPUs.

Unlike typical parallel CPU programming, where each thread of the application executes different instructions on data from potentially different memory regions, NVIDIA follows the single instruction multiple data paradigm. The concept of the function is replaced with the kernel, in which a single set of instructions is executed concurrently by many processor cores on largely monolithic memory segments. When a CUDA kernel is executed, the computation is organised into sets of threads called 'warps', which are divided among the available SMs. Each SM's warp scheduler then allocates registers and shared memory to warps and schedules them for execution. Execution is fastest when every thread in a warp has identical control flow—i.e. follows the same branch in conditional statements—as, otherwise, execution must be repeated for each divergence in control flow.

The existing GPU-based SNN optimisations in GeNN provide very fast and memory-efficient SNN simulations and hide the complexity of GPU programming from the end user, allowing them to focus entirely on model definition and evaluation.

2.2. Datasets

In our experiments, we use the CIFAR-10 [42] and ImageNet [1] ILSVRC datasets. CIFAR-10 [42] is a dataset of 60 000 low-resolution (32 × 32) RGB colour images divided into ten image classes, with 6000 images per class. The set is split into 50 000 training images and 10 000 validation images. CIFAR-10 is a very popular simple image classification dataset, often used when prototyping image classification ANN architectures. The ImageNet [1] ILSVRC dataset is a much larger dataset, consisting of 1 281 167 training images, 50 000 validation images and 1000 classes. Being more challenging, it is often used both to demonstrate the performance of ANNs on more realistic data and as a basis for transfer learning.

Each dataset is pre-processed before use, following the method described by Sengupta et al [24]. CIFAR-10 images are pre-processed to have zero mean and unit variance in each colour channel as well as being randomly cropped and randomly horizontally flipped with a probability of 0.5. ImageNet images are pre-processed in the same way but, additionally, each image has AlexNet-style PCA-based noise applied [43].
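
A rough NumPy sketch of the CIFAR-10 part of this pipeline is shown below; the function name is illustrative and the 4 pixel crop padding is an assumption, as the exact crop procedure is not specified here.

    import numpy as np

    def preprocess_cifar10(image, channel_mean, channel_std, rng, pad=4):
        # Zero mean and unit variance in each colour channel
        image = (image - channel_mean) / channel_std
        # Random crop back to 32 x 32 after zero-padding
        padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
        top, left = rng.integers(0, 2 * pad + 1, size=2)
        image = padded[top:top + 32, left:left + 32, :]
        # Random horizontal flip with probability 0.5
        if rng.random() < 0.5:
            image = image[:, ::-1, :]
        return image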

2.3. Models

In our experiments, our base ANN models follow those used by Sengupta et al [24], which are in turn based on the VGG [44] and ResNet [45] architectures. Due to the constraints of the rate-based ANN to SNN conversion methods discussed in section 2.6, all layers in the ANN model must have rectified linear unit (ReLU) activations and must not use bias tensors, with max pooling layers replaced by average pooling layers. Because batch normalisation cannot be used without biases, it is replaced with dropout.

The VGG-like ANN, illustrated in figure 2 (left), is essentially VGG-16 [44] modified to meet the above constraints. It is a CNN consisting of several blocks of convolution layers, separated by average pooling layers with a stride of 2, and topped with a final fully connected block. The number of feature maps in each of the five convolution blocks is 64, 128, 256, 512, 512, and the first two layers of the fully-connected block have 4096 units. Each convolution or fully-connected layer not preceded by a pooling layer is, instead, preceded by a dropout layer with a dropout probability of 0.25. When instantiated with an input size of 32 × 32 (e.g. for CIFAR-10), this model has 287 754 neurons and, with an input size of 224 × 224 (e.g. for ImageNet), it has 13 707 240 neurons.
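
The Keras sketch below shows how one such convolution block can be written so that it respects these conversion constraints (ReLU activations, no biases, average pooling and dropout); it is illustrative only and not the authors' exact model definition.

    import tensorflow as tf
    from tensorflow.keras import layers

    def vgg_block(x, filters):
        # Convolution layers use ReLU and no bias so they can be mapped to IF neurons
        x = layers.Conv2D(filters, 3, padding='same', activation='relu', use_bias=False)(x)
        # Dropout (probability 0.25) replaces batch normalisation before the next convolution
        x = layers.Dropout(0.25)(x)
        x = layers.Conv2D(filters, 3, padding='same', activation='relu', use_bias=False)(x)
        # Average pooling with stride 2 instead of max pooling
        return layers.AveragePooling2D(pool_size=2, strides=2)(x)

    inputs = tf.keras.Input((32, 32, 3))
    x = vgg_block(inputs, 64)  # first of the five blocks (64, 128, 256, 512, 512 feature maps)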

Figure 2. The architectures of the VGG-like (left) and ResNet-like (right) ANNs used in our experiments. The notation /2 indicates that downsampling is performed using a two-stride 1 × 1 convolution layer.

For the ResNet-like ANNs, the architectures again follow those described by Sengupta et al [24]. The ResNet-like architecture for classifying the CIFAR-10 dataset (see section 2.2) is shown in figure 2 (right). It has 224 266 neurons and begins with a block of initial pre-processing convolution layers. These are followed by a series of residual blocks [45] which consist of two pathways—a non-identity pathway consisting of two convolution layers, and an identity skip pathway—which converge via add layers. Finally, the network is topped with a global average pooling layer followed by a fully-connected output layer. In each residual block, the first convolution layer of the non-identity path has ReLU activation, as usual, but the ReLU activation of the second convolution layer in the path is deferred until after the junction between the identity and non-identity pathways. This corresponds to 'version one' of ResNet, given in the original paper [45]. As in the VGG-like ANNs, the ResNet-like models periodically downsample feature maps while increasing the number of said maps. However, rather than pooling layers, ResNet ANNs use 1 × 1 convolutions with a stride of two to downsample layers. This corresponds to 'type B' identity pathways described by He et al [45]. The model has 16 feature maps in the first three residual blocks, followed by 32 in the following three blocks, followed by 64 in the last three residual blocks. Dropout is applied after the first non-identity convolution layer in each residual block, with probability 0.25.

The ResNet-like ANN for classifying the ImageNet dataset is similar, except that the nine residual blocks are replaced as follows: the first three blocks have 64 feature maps, followed by four blocks with 128 feature maps, followed by six blocks with 256 feature maps, and finally three blocks with 512 maps. In addition, a stride-two average pooling layer is added after the initial pre-processing layers. This results in a model with 10 136 552 neurons.

Our training procedure mostly follows that described by Sengupta et al [24]. ANN models are trained in TensorFlow 2.4.0 with stochastic gradient descent, with batch size 256, weight decay 0.0001, and momentum 0.9. For the CIFAR-10 dataset, the ANN is trained for 200 epochs, with an initial learning rate of 0.05 which is divided by ten after 81 epochs, and again after 122 epochs. For the ImageNet dataset, models are trained for 120 epochs, with an initial learning rate of 0.05 which is divided by ten after epochs 30, 60 and 90.
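
A sketch of the CIFAR-10 training setup in TensorFlow is given below; the optimiser settings and learning-rate schedule follow the description above, but the choice of loss and the representation of weight decay (typically an L2 kernel regulariser on each layer, not shown) are assumptions.

    import tensorflow as tf

    def train_cifar10(model, train_images, train_labels):
        def lr_schedule(epoch, lr):
            # Initial learning rate 0.05, divided by ten after epochs 81 and 122
            return lr / 10.0 if epoch in (81, 122) else lr

        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),
                      loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        model.fit(train_images, train_labels, batch_size=256, epochs=200,
                  callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])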

2.4. Sparse connectivity for efficient convolutions

Convolutional layers are a key component of almost all modern deep network architectures [43–45]. Therefore, providing efficient means to propagate activity through these layers is vital for any ML framework. The multi-channel 2D convolution operation used in these layers can be described as follows:

$$y_{a,b,o} = \sum_{i=1}^{K_{\mathrm{height}}} \sum_{j=1}^{K_{\mathrm{width}}} \sum_{c=1}^{I_{\mathrm{chan}}} x_{a+i-1,\,b+j-1,\,c}\; k_{i,j,c,o} \qquad\qquad (1)$$

where x is an Iheight × Iwidth × Ichan input tensor, k is a Kheight × Kwidth × Ichan × Ochan kernel and y is an Oheight × Owidth × Ochan output tensor. Many different approaches have been taken to efficiently implement convolution operations on GPUs. The first ML libraries [46] directly implemented equation (1) but, in order to do this in efficient library code, there are a lot of corner cases which need optimising and tuning for ever-evolving GPU architectures. Alternatively, convolution can be 'lowered' into a matrix multiplication by a doubly-blocked Toeplitz matrix generated from the convolution kernel. Memory can be temporarily allocated to hold these matrices and then the multiplication can be performed using highly-optimised general purpose matrix multiplication libraries such as cuBLAS [47]. However, for large models, the memory requirements can be prohibitive so, instead, convolution kernel weights can be expanded on the fly within the matrix multiplication kernel [48]. Nevertheless, although the doubly-blocked Toeplitz matrix is very sparse, this approach still has the same computational complexity as general matrix multiplication. One approach for reducing this complexity is to compute convolutions using an equivalent fast Fourier transform [49]. Alternatively, in the common case where Kwidth = Kheight = 3 or Kwidth = Kheight = 5, algorithms based on Winograd's minimal filtering algorithms can reduce the complexity of convolutions by up to 4× [50]. These algorithms divide each input channel into small tiles (typically 4 × 4 when Kwidth = Kheight = 3). Each of these tiles is combined with the kernel elements from each output channel to obtain a minimal filter for each tile and these are summed across all output channels. Finally, this flattened representation is transformed back into output space by multiplying it by a much smaller matrix. TensorFlow and PyTorch both use the cuDNN library [51] to provide highly-optimised implementations of all of these algorithms for NVIDIA GPUs and switch between them based on each convolutional layer's parameters.

As discussed previously, propagating spikes through a connection matrix is a rather different problem because only a (small) subset of the input neurons will be active during a given timestep and no multiplications are required. Existing GPU SNN simulators have numerous efficient algorithms for propagating spikes through sparse connectivity [40, 52, 53]. However, using a general purpose sparse data-structure to store convolutions is rather wasteful of memory as the same weights from the convolution kernel are duplicated in many locations in the sparse matrix. For example, implementing a VGG-16 architecture for ImageNet in this way would require over 100 GB of GPU memory.

2.4.1. Procedural connectivity

In our previous work [38] we introduced 'procedural connectivity', a technique where neurons' outgoing sparse random connectivity is regenerated on the fly when they spike, rather than being stored in memory. Using procedural connectivity we demonstrated how a model—so large it could previously only be simulated on a supercomputer—could be simulated on a single GPU. Here we apply the same technique to directly implement equation (1). One thread handles each incoming spike using the following algorithm (support for padding and stride has been omitted for clarity):
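
The algorithm is sketched below in Python for readability (GeNN generates equivalent CUDA); it assumes no padding, a stride of one and row-major neuron indexing, and the function and variable names are illustrative. GENSYNAPSE is shown as a simple accumulation rather than an atomic add.

    def propagate_spike(pre, kernel, in_shape, out_shape, post_input):
        # One 'thread' handles one incoming spike from presynaptic neuron 'pre'
        I_height, I_width, I_chan = in_shape
        O_height, O_width, O_chan = out_shape
        K_height, K_width = kernel.shape[0], kernel.shape[1]

        # Convert the flat presynaptic index into a row, column and channel
        in_row, rem = divmod(pre, I_width * I_chan)
        in_col, in_chan = divmod(rem, I_chan)

        for k_row in range(K_height):
            out_row = in_row - k_row
            if 0 <= out_row < O_height:
                for k_col in range(K_width):
                    out_col = in_col - k_col
                    if 0 <= out_col < O_width:
                        for out_chan in range(O_chan):
                            neuronID = (out_row * O_width + out_col) * O_chan + out_chan
                            # GENSYNAPSE: add the kernel weight to the postsynaptic neuron's input
                            post_input[neuronID] += kernel[k_row, k_col, in_chan, out_chan]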

where GENSYNAPSE will extract the correct weight from the kernel and apply it to the input of the postsynaptic neuron selected by neuronID using an atomic add operation. Because this algorithm is instantiated using GeNN's code generator, constant terms will be hard-coded into the kernel, allowing the CUDA compiler to transform divisions by constants into more efficient instructions, unroll loops and optimise for special cases such as Ochan = 1. Furthermore, unlike with the probabilistic connectivity investigated in our previous work [38], combining procedural convolutions with learning is entirely possible as the algorithm to back-propagate through the convolutional connectivity is very similar and equally efficient.

2.4.2. Toeplitz connectivity

As discussed previously, 2D convolution is equivalent to multiplying by a doubly-blocked Toeplitz matrix. In this section, we show how convolutions are 'lowered' into such matrices. This explanation is based on [54] but updated to use machine-learning rather than signal-processing conventions. A Toeplitz matrix is one where the values along all diagonals are constant, i.e.

$$\begin{pmatrix} a_0 & a_{-1} & a_{-2} & \cdots \\ a_1 & a_0 & a_{-1} & \ddots \\ a_2 & a_1 & a_0 & \ddots \\ \vdots & \ddots & \ddots & \ddots \end{pmatrix}.$$

Furthermore, if we build a matrix A out of Toeplitz sub-matrices Ak and the structure of A with respect to these sub-matrices is also Toeplitz:

$$A = \begin{pmatrix} A_0 & A_{-1} & \cdots \\ A_1 & A_0 & \ddots \\ \vdots & \ddots & \ddots \end{pmatrix},$$

then this matrix is called a doubly-blocked Toeplitz matrix. A standard way to generate a Toeplitz matrix from a vector v is to use v as the first column vector, then make one cyclic permutation of it and use that as the second column vector, and so on.

2D convolution operations can be expressed as multiplication by a doubly-blocked Toeplitz matrix. In brief, to convolve an Iheight × Iwidth input with a Kheight × Kwidth kernel K, e.g. a 2 × 2 kernel

$$K = \begin{pmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{pmatrix},$$

we flip K across the horizontal and vertical axes, giving

$$\begin{pmatrix} k_{22} & k_{21} \\ k_{12} & k_{11} \end{pmatrix},$$

and pad it with zeros to the output size (Iheight + Kheight − 1) × (Iwidth + Kwidth − 1) of the convolution. For instance, a 2 × 3 input convolved with K above leads to an output size of 3 × 4, so the flipped kernel is padded to 3 × 4. Depending on the padding mode used by the convolution, typically only part of this output is actually required. We then convert each row vector of the flipped and padded kernel into a Toeplitz matrix Fi as described above and, finally, assemble these Fi into a doubly-blocked Toeplitz matrix F.

The convolution of an input matrix I is then given by turning I into a column vector Icol by stacking up its row vectors and multiplying it by F from the left, Ocol = F Icol.
Finally, Ocol can be reinterpreted as the output matrix O by arranging its entries row-wise in a 3 × 4 matrix.

However, if instead of treating I as a vector of ones and zeros, we represent it as the set of indices of the neurons that spiked (with indices starting at 0), we can reframe this operation as follows: for each spiking presynaptic neuron with index i, column i of F is simply added to the postsynaptic input vector.
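
The NumPy sketch below illustrates this reframing; F is assumed to be a precomputed doubly-blocked Toeplitz matrix with one column per presynaptic neuron and one row per postsynaptic neuron, and the function name is illustrative.

    import numpy as np

    def propagate_spikes_toeplitz(F, spike_indices):
        # F: (num_post, num_pre) doubly-blocked Toeplitz matrix
        # spike_indices: indices of the presynaptic neurons that spiked this timestep
        post_input = np.zeros(F.shape[0], dtype=F.dtype)
        for i in spike_indices:
            post_input += F[:, i]  # one column added per spike; no multiplications
        return post_input

    # Equivalent to the dense formulation: x = zeros(num_pre); x[spike_indices] = 1; F @ x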

This new operation not only removes the need for multiplication entirely but its computational cost will also decrease with sparse spiking. In GeNN, this is implemented in a similar manner to the parallelism strategy for sparse connectivity described by Knight and Nowotny [40] where each block of Nblock threads (in figure 3 Nblock = 4) is responsible for processing Nblock kernel entries. Processing begins by using Nblock threads to copy the indices (Iind) of the Nblock presynaptic spikes—which were written to global memory by the presynaptic population's neuron kernel—into shared memory so that they can be accessed by all threads in the block during the next phase. The threads are then synchronised and loop through the Nblock spikes with each thread processing the kernel entries in their column. The index of each thread is then converted into a row, column and output channel within the kernel and the following simple algorithm is used to process a kernel entry,
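
A Python sketch of the per-thread work, following the description above, is given below; in the real generated CUDA kernel the spike indices are staged in shared memory and GENSYNAPSE performs an atomic add. The sketch assumes a stride-one 'same' convolution and row-major neuron indexing, and the function name is illustrative.

    def process_kernel_entry(k_row, k_col, out_chan, spikes, kernel, in_shape, post_input):
        # Each 'thread' owns one kernel entry and loops over the block's spikes
        I_height, I_width, I_chan = in_shape
        K_height, K_width = kernel.shape[0], kernel.shape[1]
        O_chan = kernel.shape[3]
        pad_row, pad_col = K_height // 2, K_width // 2  # 'same' padding

        for pre in spikes:
            # Convert the flat presynaptic index into a row, column and channel
            in_row, rem = divmod(pre, I_width * I_chan)
            in_col, in_chan = divmod(rem, I_chan)
            out_row = in_row - k_row + pad_row
            out_col = in_col - k_col + pad_col
            if 0 <= out_row < I_height and 0 <= out_col < I_width:
                neuronID = (out_row * I_width + out_col) * O_chan + out_chan
                # GENSYNAPSE: add this kernel entry's weight to the postsynaptic neuron's input
                post_input[neuronID] += kernel[k_row, k_col, in_chan, out_chan]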

Figure 3. Parallelising spike propagation through a doubly-blocked Toeplitz matrix using one thread block with four threads. Snaking lines indicate CUDA threads.

where GENSYNAPSE plays the same role as it did in the algorithm for procedural connectivity described in section 2.4.1.

2.5. Batching

In order to fully occupy highly-parallel GPU devices, machine learning tools typically simultaneously apply batches of data to multiple copies of models. This reduces the detrimental effects of kernel launch latency on overall compute time. Furthermore, weights and other variables can be shared across all instances within a batch, improving the locality of memory accesses.

Because of the regular memory access patterns of the matrix multiplication operations used to propagate activity between ANN layers, the kernels employed by modern ML libraries are optimised to minimise global memory access across the batch. However, when propagating spikes, different neurons will be spiking across the instances of the batch so different weights will be accessed. This makes a software-based approach for reducing global memory accesses to weights difficult so, in GeNN, we implement batching by simply duplicating our neuron and spike propagation kernels across another thread block dimension and rely on the GPU caches to reduce global memory traffic to the shared weights.

2.6. Rate-based conversion

Rate-based conversion of ANNs to SNNs aims to emulate the transfer function of the ANN unit with that of a spiking model. Typically, the transfer function of rectified linear units (ReLU) with no bias tensors is emulated using simple integrate-and-fire (IF) neurons [22, 24]:

$$V_j(t) = V_j(t-1) + \sum_{i=1}^{N} w_{ji}\, z_i(t),$$

where Vj represents the membrane voltage of neuron j, into which the input spikes zi from N presynaptic neurons are accumulated via a weight matrix wji. When Vj crosses a threshold Vthr, a spike is emitted and Vj is reset to zero. The weights of each layer in the ANN model are then transplanted into the corresponding layers of the SNN model.

In order to maximise the classification accuracy of the converted SNN model, it is important to ensure that the available dynamic range of its spiking activity is representative of the range of activations occurring in the ANN model. If the integration step is too coarse, or the input current to the IF neurons is too strong, then it is possible for the membrane potential to far exceed the firing threshold value before being reset in the next timestep. The difference V − Vthr between the membrane potential V and the threshold Vthr in such a situation can be interpreted as the amount of lost information and needs to be minimised. Conversely, if the stimulation of a layer is unexpectedly weak, it can take a long time before neurons are stimulated enough to fire, and neurons that do not fire within the sample presentation time again represent lost information. This necessitates increasing the sample presentation time for accurate classification. The solution to this issue is threshold balancing, where the thresholds of the IF neurons in each layer are set to the layer's maximum expected activation, but no higher, such that the amount by which each IF neuron's V exceeds its threshold at firing is minimised throughout the simulation. An equivalent method of dividing the layer's incoming weights by the maximum activation may also be used. Various methods have been proposed for computing the expected activation of the IF neurons. Data-norm [22] and spike-norm [24] are two such threshold balancing methods.

Here we use the data-norm method, where a random subset of real training data is propagated through the ANN, and the activation of each unit in each layer of the ANN is measured. While the maximum activation of each layer alone can be used as the corresponding SNN layer's threshold, Diehl et al [22] further propose using the higher of either the maximum activation or the maximum incoming weight of that layer, to prevent any single incoming spike from instantly raising the neuron's membrane potential above threshold (hence losing information). The following algorithm demonstrates this method for a feedforward ANN model.
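
A minimal Python sketch of this procedure, following the description above and assuming a sequential Keras model, is given below; the function name is illustrative and the exact layer traversal in mlGeNN may differ.

    import numpy as np
    import tensorflow as tf

    def data_norm_thresholds(model, norm_data):
        # Propagate a subset of training data through the ANN, recording each
        # weighted layer's maximum activation, and set its threshold to the larger
        # of that maximum activation and its maximum incoming weight
        thresholds = []
        activations = tf.convert_to_tensor(norm_data)
        for layer in model.layers:
            if isinstance(layer, tf.keras.layers.InputLayer):
                continue
            activations = layer(activations)
            weights = layer.get_weights()
            if weights:
                max_activation = float(tf.reduce_max(activations))
                max_weight = float(np.max(weights[0]))
                thresholds.append(max(max_activation, max_weight))
        return thresholds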

2.7. Few-spike conversion

While the conversion of ReLU-based ANNs to rate-based SNNs can produce high classification accuracy [23, 24], in order to accurately represent analogue activations, a large number of spikes are required. Therefore, either large numbers of timesteps need to be simulated for each stimulus (2500 is typical [24]) or very high rates are required. Either way, the potential computation and energy efficiency gains of using spikes are likely to be lost.

Thorpe et al [55] analysed a number of different spike coding schemes and found that coding schemes which consider the timing of individual spikes—rather than just their rate—can increase the amount of information which can be transmitted in a fixed time by up to 10×. Therefore, it is unsurprising that several techniques have been proposed for converting ANNs to SNNs with a spike-timing based encoding scheme [25, 26]. Here we use the FewSpike approach [26] which aims to encode each stimulus using an average of two spikes per neuron and many fewer timesteps (typically $\approx 10$) and has already been shown capable of converting large-scale ANNs to SNNs. In GeNN, we perform inference using these neurons in a 'pipelined' manner so that, during one K timestep stimulus presentation, each neuron j emits spikes representing the previous stimulus and accumulates spikes representing the current stimulus. To do this, each neuron has one state variable ${\hat{F}}_{j}$ into which spikes z are accumulated from N input neurons i via a weight matrix wji:

$${\hat{F}}_{j} \leftarrow {\hat{F}}_{j} + d(t)\sum_{i=1}^{N} w_{ji}\, z_i(t),$$
where d(t) is a function common to all neurons with the same activation function. At the start of each stimulus presentation, the neuron's second state variable Vj is set to the value of ${\hat{F}}_{j}$ that was accumulated during the previous K timesteps and ${\hat{F}}_{j}$ is reset to zero. Vj then evolves for the next K timesteps as follows:

$$z_j(t) = \mathcal{H}\left(V_j(t) - T(t)\right), \qquad V_j(t+1) = V_j(t) - h(t)\, z_j(t),$$
where $\mathcal{H}$ represents the Heaviside step function. T(t) and h(t) are respectively the spiking threshold and the amount subtracted from the membrane potential when the neuron produces a spike, and are common to all neurons with the same activation function. In this paper we use ReLU neurons where $d(t) = h(t) = T(t) = \alpha 2^{K-t}$, which essentially encodes activations in a binary K bit fixed-point format. In a similar manner to the rate-based normalisation, we optimise the encoding of activations by calculating a per-layer scaling factor α based on the maximum activation value encountered in a subset of training data. In order to support ResNet models with multiple pathways, such as those described in section 2.3, a synaptic delay of K + 1 timesteps is added to the identity pathway to match the additional pipeline stage and extra timestep of delay in the non-identity pathway. This ensures that all inputs to the neurons at the junction points are correctly synchronised.
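
As an illustration (not mlGeNN's implementation), the following Python sketch encodes and decodes a single ReLU activation using these K-timestep dynamics; the scaling convention follows the formula above.

    def few_spike_encode(activation, alpha, K):
        # Emit K spikes encoding a ReLU activation as a binary fixed-point number
        V, spikes = activation, []
        for t in range(1, K + 1):
            T = alpha * 2.0 ** (K - t)  # d(t) = h(t) = T(t)
            z = 1 if V >= T else 0
            V -= T * z                  # subtract h(t) when a spike is emitted
            spikes.append(z)
        return spikes

    def few_spike_decode(spikes, alpha, K):
        # A downstream neuron recovers the activation by weighting spikes with d(t)
        return sum(z * alpha * 2.0 ** (K - t) for t, z in enumerate(spikes, start=1))

    # e.g. with alpha = 2 ** -K, activations in [0, 1) are represented to K-bit precision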

2.8. mlGeNN

mlGeNN is our new open source Python package which uses the PyGeNN library [38] to accelerate the simulation of SNN models using GPUs.

It is used by first constructing and training an ANN model in Keras, keeping the constraints discussed in sections 2.6 and 2.7 in mind. mlGeNN currently supports dense, 2D convolution and 2D average pooling layers, but could be extended with other layer types, such as 3D convolution and recurrent layers. mlGeNN also supports models with multiple input and output layers, and models constructed using both the Keras sequential and functional APIs. The fully trained ANN model is then automatically translated by mlGeNN into a description suitable for GeNN. Connectivity (dense, convolution) layers and subsequent activation layers are translated into synapse groups and neuron groups using the models required by the conversion scheme. Pooling layers are treated a little differently since they are not typically followed by activation functions. Inserting additional neuron groups here would introduce additional nonlinearities in the SNN model which would degrade classification accuracy so, rather than creating new neuron groups, pooling operations are instead merged with the connectivity immediately downstream of them. Furthermore, there are constraints on the type of pooling layer one can use. While several implementations of spiking max pooling layers have been proposed [23, 56], these are all tied to the coding scheme used by the network so, here, we prefer using average pooling layers.

The model conversion is performed through an instance of a model converter, corresponding to the chosen conversion method (see above). mlGeNN provides the DataNorm [22], SpikeNorm [24] and FewSpike [26] converter classes. When constructing each converter, one passes the signed_input argument, indicating whether the input data is signed or all positive, and the norm_data argument, which should be a list containing a subset of training data (one entry for each model input) used to optimise the model conversion. Signed input is implemented by using GeNN's normal spiking mechanism for positive inputs and GeNN's spike-like-event mechanism (see Knight et al [38]) to emit 'negative spikes', i.e. spike events that trigger updates by −w for synapses of weight w, if a negative threshold is crossed.

Finally, one calls the Model.convert_tf_model method of the mlGeNN Model class, passing the TensorFlow model and converter instance, and optionally passing the synapse connectivity type ('sparse', 'procedural' or 'toeplitz', as described in section 2.4), simulation timestep, batch size and random number generator seed (passing 0 yields a random seed). This method returns an mlGeNN SNN model corresponding to the given ANN model and also instructs the GeNN backend to generate, compile and load the GPU code of the model. The user then evaluates the mlGeNN model by calling its evaluate method, passing a list of input data lists (one entry per input layer), a list of output label lists (one entry per output layer), and the presentation time of each stimulus in milliseconds. An example of this process, using the rate-based data-norm conversion method, is given below.
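
The sketch below illustrates this workflow for the data-norm converter; the module paths and exact argument names are inferred from the description in this section, so the mlGeNN repository should be consulted for the definitive API.

    from ml_genn import Model
    from ml_genn.converters import DataNorm

    # tf_model is a trained Keras model obeying the constraints of sections 2.6 and 2.7;
    # x_norm is a subset of the training data used for threshold balancing
    converter = DataNorm(norm_data=[x_norm], signed_input=False)

    # Convert the ANN, choosing the connectivity type, timestep, batch size and seed
    snn = Model.convert_tf_model(tf_model, converter=converter,
                                 connectivity_type='procedural',
                                 dt=1.0, batch_size=32, rng_seed=0)

    # Evaluate on the test set, presenting each stimulus for 2500 ms
    accuracy = snn.evaluate([x_test], [y_test], 2500.0)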

3. Results

In this section we explore the effect on accuracy and inference latency of SNNs converted using the algorithms described in sections 2.6 and 2.7. We also compare the latency of SNNs to both the original TensorFlow ANNs and, for the VGG-16 model, to SNNs simulated using BindsNet [31] using the modifications developed by Lu and Sengupta [57] to support average pooling.

3.1. CIFAR-10

Table 1 compares the accuracy on the CIFAR-10 dataset of the VGG-16 and ResNet-20 models described in section 2.3 to prior work by Sengupta et al [24] as well as Stöckl and Maass [26]. As both models are based on the descriptions provided by Sengupta et al [24], unsurprisingly, both ANN models reach similar levels of accuracy. While mlGeNN does support the 'spike-norm' rate-based conversion algorithm used by Sengupta et al [24], here we use the simpler data-norm conversion algorithm of Diehl et al [22], modified to support ResNet architectures. This results in slightly lower rate-based conversion accuracies than those reported by Sengupta et al [24], but the few-spike models perform much better. To support rate-based conversion, we do not use batch normalization and bias when training our models. However, Stöckl and Maass [26] were not constrained by these restrictions so were able to train a rather more accurate ResNet-20 model and this improved performance was transferred to the SNN model.

Table 1. Comparison of CIFAR-10 accuracy between our best models and those trained by Sengupta et al [24] and Stöckl and Maass [26]. Rows marked '(ours)' are our own results.

Network | ANN accuracy (%) | Best rate-based SNN accuracy (%) | Best few-spike SNN accuracy (%)
VGG-16 (ours) | 91.2 | 90.9 | 91.4
VGG-16 (Sengupta et al [24]) | 91.7 | 91.6 | —
ResNet-20 (ours) | 89.4 | 84.3 | 89.4
ResNet-20 (Sengupta et al [24]) | 89.1 | 87.5 | —
ResNet-20 (Stöckl and Maass [26]) | 91.6 | — | 91.5

Both of the conversion algorithms described in sections 2.6 and 2.7 allow speed of inference to be 'traded off' for accuracy by varying the number of simulation timesteps each stimulus is presented for. When using rate-based conversion, more timesteps (T) allows for a rate code with a finer granularity and, when using few-spike conversion, additional timesteps (K) add extra 'bits' to the binary fixed-point format and thus reduce the quantisation error. Figure 4 shows the accuracy of VGG-16 and ResNet-20 models trained on the CIFAR-10 dataset and converted to rate-based and few-spike SNNs using different T and K values. When using the VGG-16 model, classification accuracy is significantly degraded with values of T < 2500 or K < 10 (figure 4(a)) whereas, when using the ResNet-20 model, classification performance is only significantly degraded with values of T < 1000 or K < 8 (figure 4(b)). This reduced 'convergence time' for the ResNet-20 model matches the findings presented by Sengupta et al [24].

Figure 4. Effect of varying T and K on accuracy of CIFAR-10 classification using VGG-16 (A) and ResNet-20 (B) models. A 12 GB Titan V GPU was used for all experiments.

Using these optimal T and K values, we simulated both models using the different connection algorithms described in section 2.4 across a range of batch sizes. Figure 5 shows the resultant time to perform inference on the entire CIFAR-10 test set using the VGG-16 and ResNet-20 models. At batch size 1, the procedural connectivity approach described in section 2.4.1 performs best across all conversion algorithms and models—with the ResNet-20 SNN actually performing more than 3.5× faster than the TensorFlow ANN. However, as batch size increases, the Toeplitz matrix algorithm described in section 2.4.2 performs the best, with the fastest FewSpike SNNs being only a little over 2× slower than the TensorFlow ANN when using the ResNet-20 model and around 12.5× slower when using the VGG-16 model. The procedural connectivity algorithm scales poorly because, while using one thread per neuron (rather than per matrix diagonal) allows the GPU's parallelism to be better exploited at small batch sizes, as the batch size increases, memory latency worsens due to high contention between the atomic operations used to update the inputs to the postsynaptic neurons and badly coalesced reading of weights.

Figure 5. Effect of varying batch size on CIFAR-10 inference time using VGG-16 (A) and ResNet-20 (B) models. The VGG-16 model was simulated with T = 2500 and K = 10 for rate-based and few-spike conversion respectively. The ResNet-20 model was simulated with T = 1000 and K = 8 for rate-based and few-spike conversion respectively. 'T ANN' and 'K ANN' lines indicate the time it would take to simulate T and K ANN timesteps respectively. A 12 GB Titan V GPU is used for all experiments.

Because the models converted using the rate-based algorithm need to be simulated for many more timesteps than those converted using the few-spike algorithm, they are slower across the board—with even the fastest model being over 100× slower than the TensorFlow ANN. However, the encoding scheme used by the few-spike model [26] actually results in approximately 10× more spikes per timestep than the rate-based model. Therefore, the spiking activity in each timestep is much sparser in the rate-based models, allowing GeNN to simulate individual timesteps significantly faster than an ANN step. Note that the total number of spikes per evaluation with few-spike conversion is still much lower than the total spikes per evaluation with rate-based conversion, since fewer timesteps are required per evaluation.

Finally, comparing the VGG-16 model simulated using mlGeNN against BindsNET [31], we can see that, because BindsNET uses the same dense tensor operations as the ANN under the hood (as discussed in the introduction), its performance is very similar to that of the ANN multiplied by T and over 2.5× slower than the fastest rate-based GeNN VGG-16 model.

3.2. ImageNet

Table 2 compares the accuracy on the ImageNet dataset of the VGG-16 and ResNet-34 models described in section 2.3 to the results presented by Sengupta et al [24]. Again, these models are based on the descriptions provided by Sengupta et al [24] so the ANN models reach similar levels of accuracy. Based on our CIFAR-10 results, presented in the previous section, we have not evaluated our rate-based conversion algorithm on ImageNet but both of our few-spike converted models out-perform the rate-based performances reported by Sengupta et al [24].

Table 2. Comparison of ImageNet top-1 accuracy between our best models and those trained by Sengupta et al [24].

Network | ANN accuracy (%) | Best rate-based SNN accuracy (%) | Best few-spike SNN accuracy (%)
VGG-16 (ours) | 70.3 | — | 70.2
VGG-16 (Sengupta et al [24]) | 70.5 | 69.96 | —
ResNet-34 (ours) | 69.3 | — | 68.4
ResNet-34 (Sengupta et al [24]) | 70.7 | 65.47 | —

Figure 6 shows the time taken to perform inference on the entire ImageNet test set using the VGG-16 and ResNet-34 models. The simulation with the Toeplitz connectivity is approximately twice as fast as the procedural algorithm in both models but still over 100× slower than the ANN simulated using TensorFlow. While perhaps surprising at first, this result can be understood in the context of how TensorFlow processes convolutions. As discussed in section 2.4, TensorFlow switches between multiplying by Toeplitz matrices or employing Winograd or FFT based approaches based on the geometry of each convolutional layer. Lavin and Gray [50] derived the following expressions for the total number of arithmetic operations required to convolve an Iwidth × Iheight × Ichan input with a Kheight × Kwidth × Ichan × Ochan kernel using these approaches:

where Nbatch is the batch size, m is a function of the size of the tiles the input is divided into when using the FFT or Winograd algorithms and the factors α, β and γ relate to the operations incurred within different parts of the three algorithms. For straightforward matrix multiplication-based approaches, α = Kwidth Kheight and β = γ = δ = 0, but, for Winograd and FFT approaches, α, β and γ are constants with values between 1 and 10, depending on the exact configurations. In models simulated with a large batch size and with large Iwidth, Iheight, Ichan and Ochan, such as the VGG and ResNet models operating on ImageNet considered here, the β, γ and δ terms are relatively small and the number of operations required for the Winograd and FFT based approaches is essentially only the leading term

$$L \approx \alpha\, N_{\mathrm{batch}}\, O_{\mathrm{height}}\, O_{\mathrm{width}}\, I_{\mathrm{chan}}\, O_{\mathrm{chan}}.$$
By exploiting sparsity, GeNN reduces the number of operations (L) compared to straightforward matrix multiplication by the fraction Pspike of neurons which spike each timestep. Additionally, as propagating spikes requires no multiplications, there should be some additional advantage. However, if α < Pspike Kwidth Kheight for the Winograd and FFT based approaches, an ANN step with one of these algorithms will be more efficient than a single timestep of GeNN's current spike propagation algorithms. Accordingly, as α = 2.25 for the Winograd and 1.81 for the FFT configurations used by TensorFlow in the models considered, and Kwidth = Kheight = 3, a sparsity of Pspike < 0.2 would be required for an SNN timestep to be faster than the ANN step. For both of the model implementations reported here, Pspike ≈ 0.4 and, if we also consider that we are presenting each stimulus for K = 10 timesteps, TensorFlow has approximately a 16× algorithmic advantage in the number of arithmetic operations (K Pspike Kwidth Kheight/α = 10 × 0.4 × 9/2.25 = 16). Furthermore, the number of arithmetic operations is only half the story. cuDNN is a highly-optimised library, where key components are written directly in SASS assembly language for each GPU architecture, whereas GeNN is based on CUDA and there remains significant scope for further optimisation, specifically to remove the need for atomic operations in global memory and to further optimise index calculations.

Figure 6. ImageNet inference time using VGG-16 (A) and ResNet-34 (B) models. SNNs were simulated using the optimal configuration of K = 10 for few-spike conversion. The VGG-16 SNNs were simulated with a batch size of 32 when using procedural connectivity and 48 when using Toeplitz connectivity. The ResNet-34 SNNs were all simulated with a batch size of 32. ANNs were simulated with the optimal batch size of 256. A 12 GB Titan V GPU is used for all experiments.

4. Discussion

In this paper we have introduced mlGeNN, a library for machine learning with SNNs that makes use of the GeNN GPU-accelerated SNN simulator. We have described how mlGeNN can be used to translate trained ANN architectures into different SNN implementations, using standard algorithms [22, 24, 26], and how the speed and accuracy of inference compares to both the original ANN models and SNN models simulated in other libraries. We found that accuracy is as described in the original publications reporting on ANN to SNN conversions, while speed of inference presents a very mixed picture. While mlGeNN is faster than BindsNET [31] on most of the considered examples, and comparable to ANN speeds for some, it is orders of magnitude slower for others.

As discussed in section 3.2, the gap in performance between ANN and SNN models running on the same hardware comes from the number of timesteps SNN models have to run per stimulus, the algorithmic advantage of specialised ANN convolution algorithms and the degree of optimisation and maturity of cuDNN [51] when compared to SNN libraries such as GeNN [39].

In this work—aside from the restrictions described in sections 2.6 and 2.7—we have only considered the conversion of models using standard 32 bit floating point weights and activations. However, representing ANN activations in as few SNN timesteps as possible—using either a rate code or the fixed-point spike code used by the few-spike method—is a similar problem to quantizing ANN activations for inference with low-precision data types. Therefore, applying some techniques from quantization-aware training [58] or fine-tuning [5] to our ANN activations could allow the number of SNN timesteps per stimulus to be minimised, reducing the gap between ANN and SNN performance.

While FewSpike representations require fewer SNN timesteps per stimulus, in the ImageNet-scale models we evaluate in section 3.2 they result in very high firing rates which make each timestep slower than an ANN update. One solution would be to 'sparsen' the neuron activations by training the ANN with L1 regularisation so that more activations are zero, which is represented by the absence of spikes. Alternatively, binary codes where more common activation values are represented with fewer bits could be employed, although this would add additional decoding complexity to the neurons or synapses of the model.

The new 'Toeplitz' connectivity scheme we describe in section 2.4.2 is significantly faster than both the general purpose sparse connectivity described by Knight and Nowotny [40] and the general purpose 'procedural' connectivity described by Knight et al [38], but significant levels of sparsity are still required for it to match the number of operations required for an ANN update. This property not only affects software-based SNN simulators like GeNN, but is also an important consideration for designers of spike-based neuromorphic chips aiming to compete with non-spiking ML accelerators. However, beyond the theoretical differences in the number of operations required to update an ANN which we previously discussed, there is significant scope to further optimise our implementation. Our current implementation accumulates the output from each layer into global memory using 'fire-and-forget' atomic add operations. Although these operations are handled entirely within the L2 cache of modern GPUs and thus have relatively low latency, this could be reduced further by accumulating updates into shared memory using the approach described by Bautembach et al [53]. Furthermore, profiling using the NVIDIA Nsight Compute profiler suggests that, while the algorithm is currently primarily compute bound, a large proportion of the arithmetic operations involve index calculations rather than the actual processing of convolutions. This is because our single-program multiple-data code generator (described in Knight et al [38]) will reuse the same generated code to simulate all layers with Toeplitz connectivity, meaning that the CUDA compiler will be unable to optimise away the divide operations within the index calculations required by the algorithm presented in section 2.4.2. Current GPU architectures do not have hardware support for integer division, so this results in a large number of superfluous operations to implement these divisions in software but, as the divisors are all known at build time, each one could be replaced with a single integer multiply-add operation [59].

In conclusion, the mlGeNN framework adds a new element to the GeNN ecosystem, allowing GeNN to be used for machine learning workflows that translate fully trained deep ANNs into SNNs for inference. This is a first step towards a fully SNN-based machine learning ecosystem built on GeNN that, in the future, will also allow direct training of SNNs. Ultimately, we hope that this will become the ideal tool for developing SNN-based algorithms for deployment on both standard and neuromorphic hardware.

Funding

This work was funded by the EPSRC (Brains on Board project, Grant Number EP/P006094/1, ActiveAI project, Grant Number EP/S030964/1, Unlocking spiking neural networks for machine learning research, Grant Number EP/V052241/1), the European Union's Horizon 2020 research and innovation programme under Grant Agreement 945539 (HBP SGA3), by Huawei Technologies Co and a Google Summer of Code grant to AS. Additionally we gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC) and the JADE2 consortium funded by the EPSRC (EP/T022205/1) for compute time on their systems.

Acknowledgments

We would like to thank Malin Sandström and everyone else at the International Neuroinformatics Coordinating Facility (INCF) for their hard work running the Google Summer of Code mentoring organisation every year. Without them, this and many other exciting projects would not be possible.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: 10.25377/sussex.17380853.
