Surrogate gradients for analog neuromorphic computing

Significance
Neuromorphic systems aim to accomplish efficient computation in electronics by mirroring neurobiological principles. Taking advantage of neuromorphic technologies requires effective learning algorithms capable of instantiating high-performing neural networks, while also dealing with inevitable manufacturing variations of individual components, such as memristors or analog neurons. We present a learning framework resulting in bioinspired spiking neural networks with high performance, low inference latency, and sparse spike-coding schemes, which also self-corrects for device mismatch. We validate our approach on the BrainScaleS-2 analog spiking neuromorphic system, demonstrating state-of-the-art accuracy, low latency, and energy efficiency. Our work sketches a path for building powerful neuromorphic processors that take advantage of emerging analog technologies.

In recent years, deep artificial neural networks (ANNs) have surpassed human-level performance on many difficult tasks (1–3). The human brain, however, remains unchallenged in terms of its energy efficiency and fault tolerance. A fundamental property underlying these capabilities is spatiotemporal sparseness (4), which is directly linked to the way biological spiking neural networks (SNNs) process and exchange information. Spiking neurons receive and integrate inputs on their analog membrane potentials and, upon reaching the firing threshold, emit action potentials, or spikes. These binary events propagate asynchronously through the SNN and are ultimately received by other neurons.
Neuromorphic engineering attempts to mirror the power efficiency and robustness of the brain by replicating its key architectural properties (5–9). Here, one distinguishes between fully digital, analog, and mixed-signal systems. Digital systems "simulate" the analog dynamics of spiking neurons, e.g., their membrane potentials (10–15). In contrast, analog and mixed-signal solutions "emulate" neuronal or synaptic dynamics and states by representing them as physical voltages, currents, or conductance changes evolving in continuous time (7,13,14,16). Thus, by explicitly taking advantage of physical properties and dynamics of the underlying hardware substrate, neuromorphic computing holds the key to building power-efficient and scalable SNNs in silico (15,19,20).
However, to serve a meaningful computational purpose, these analog devices require training. The most successful training schemes for ANNs are gradient-based. Yet, extending similar training techniques to SNNs and neuromorphic hardware poses several challenges. First, one has to overcome the binary nature of spikes, which impedes vanilla gradient descent (21–23). Second, training has to ensure sparse spiking activity to exploit the superior power efficiency of SNN processing (24,25). Finally, training has to achieve all of the above while coping with the analog hardware imperfections inevitably tied to the manufacturing process.
In this article, we tackle the above challenges by extending previous work on surrogate gradients, which have emerged as a powerful method for training SNNs end-to-end (23). Specifically, we developed an in-the-loop (ITL) training framework for surrogate gradient learning and applied it to the mixed-signal BrainScaleS-2 single-chip system (26–28). We demonstrate that SNNs trained using our approach solve several challenging benchmark problems by taking advantage of sparse, precisely timed spikes instead of firing rates. The resulting SNNs reach comparable accuracy levels to corresponding software simulations and perform energy-efficient inference with ultralow latency by taking full advantage of the accelerated nature and in-memory compute capabilities of BrainScaleS-2. Crucially, we show that ITL surrogate gradients achieve this through self-calibration, whereby training automatically corrects for device mismatch without the need for costly offline calibration.

The BrainScaleS-2 Analog Neuromorphic Substrate
In this article, we relied on the analog BrainScaleS-2 single-chip system. It features 512 analog neuron circuits, whose dynamics obey the leaky integrate-and-fire (LIF) equation

$$C \frac{dV}{dt} = -g_{\mathrm{leak}} \left(V - V_{\mathrm{leak}}\right) + I, \qquad [1]$$

which can optionally be augmented by adaptation currents and an exponential spiking nonlinearity. The membrane potential V is explicitly represented on the chip as an analog voltage measured across a capacitor and evolves continuously in time. The leak conductance g_leak pulls the membrane toward the leak potential V_leak, resulting in an exponential decay with time constant τ_m ≡ C/g_leak. Due to the substrate's small intrinsic capacitances and comparatively large currents, the dynamics of the spiking neurons implemented on BrainScaleS-2 evolve 10³ times faster than those of biological neurons. Whenever the membrane potential crosses the firing threshold ϑ, an outgoing spike is generated, and the membrane is reset. An on-chip event router propagates both internally generated and external spikes to connected neurons, which allows forming feed-forward as well as recurrent topologies. To that end, each neuron integrates stimuli from a column of 256 synapses, each with a 6-bit weight stored in local static random-access memory. The resulting postsynaptic currents I, which are integrated on the membrane capacitor, follow an exponential time course similar to the membrane dynamics themselves. The sign of a synapse is determined as a presynaptic property. However, we allowed for a continuous transition between positive and negative weights during training by merging synapse circuits of opposing signs (Fig. 1B).
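As an illustration of how a signed weight can be carried by a pair of synapses of opposing signs, the following minimal sketch maps a signed software weight onto two 6-bit values. The function names and the `scale` factor are hypothetical; the text does not specify how software weights are scaled or rounded for the hardware.

```python
from typing import Tuple

W_MAX = 63  # largest value representable by a 6-bit unsigned weight


def to_synapse_pair(w: float, scale: float = 63.0) -> Tuple[int, int]:
    """Return (excitatory, inhibitory) 6-bit weights for a signed weight w.

    `scale` (hardware units per unit of software weight) is an assumed,
    illustrative parameter and is not taken from the source.
    """
    magnitude = min(W_MAX, round(abs(w) * scale))  # quantize and saturate
    if w >= 0:
        return magnitude, 0
    return 0, magnitude


def effective_weight(pair: Tuple[int, int]) -> int:
    """Signed weight realized by the pair: excitatory minus inhibitory."""
    excitatory, inhibitory = pair
    return excitatory - inhibitory
```

Only one synapse of each pair is active at a time in this sketch, so a weight can cross zero during training without reassigning the sign of a presynaptic row.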
BrainScaleS-2 allows individually adjusting all neuronal parameters, including reference potentials and time constants, on a per-neuron basis to flexibly emulate different target dynamics. This fine-grained control also facilitates calibration to mitigate deviations induced by variations in the production process. In this article, we, however, make use of this parameterization to actively decalibrate the system, thereby allowing us to systematically explore self-calibration properties of our learning algorithm.

ITL Surrogate Gradients on Analog Hardware
To train SNNs on BrainScaleS-2, we developed a general learning framework to optimize recurrent and multilayer networks. Our approach is based on the notion of surrogate gradients, which overcome vanishing gradients and critical points associated with nondifferentiable spiking dynamics (23). Surrogate gradient learning flexibly supports arbitrary differentiable loss functions and can seamlessly exploit both rate-based and spike timing-based coding schemes. Broadly, our ITL approach works as follows (Fig. 2): First, we emulate the forward pass on the analog neuromorphic substrate and record both spikes and internal membrane traces (Fig. 2 A and B). By injecting the latter into an otherwise approximate software model, we effectively render the neuromorphic SNN differentiable. This permits the evaluation of surrogate gradients and the calculation of weight updates using backpropagation through time (BPTT) on graphics-processing unit-enabled autodifferentiation libraries (29), in combination with state-of-the-art optimizers. At the same time, our learning algorithm self-corrects for parameter mismatch of the analog components (Fig. 2C). Finally, we close the loop by transferring the updated weights back to the analog system.

Fig. 1 (partial caption): (B) Implementation of a multilayer network on the analog neuromorphic core. Input spike trains are injected via synapse drivers (triangles) and relayed to the hidden-layer neurons (green circles) via the synapse array. Spikes in the hidden layer are routed on-chip to the output units (red circles). Each connection is represented by a pair of excitatory and inhibitory hardware synapses, which holds a signed weight value. The analog membrane potentials are read out via the CADC and further processed by the PPU.
In the following, we elaborate on the two central steps, namely, the recording of data from the neuromorphic system and their integration into the computation graph.
Recording Spikes and Analog Membrane Traces. Surrogate gradient learning crucially relies on the neurons' membrane potentials.

On an analog system like BrainScaleS-2, these are represented as physical voltages and are hence not readily available for numerical computation. The required digitization is often challenging due to the inherent parallelism of these substrates. This bottleneck is further emphasized by accelerated systems. BrainScaleS-2 solves this problem by incorporating column-parallel analog-to-digital converters (CADCs) to simultaneously digitize the membrane potentials of all neurons (Fig. 1B). We trigger the ADC conversions via the embedded plasticity processing units (PPUs) (30) to ensure higher and more stable sampling rates compared to a host-based scheduling. This furthermore enables the implementation of a fast inference mode, where only classification results are transmitted to the host. When training the network, however, each recorded sample is instantly transferred to an intermediate external memory region, from where it is asynchronously read by the host machine at the end of an input pattern or batch. In total, we reach a sample rate of ∼0.6 MSample·s⁻¹, corresponding to a sampling interval of 1.7 μs. For 256 neurons, this yields a total data rate of 1.2 Gbit·s⁻¹. In addition to the sampled membrane traces, we continuously record and time-stamp the spike events emitted by the substrate.
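The quoted sampling interval and data rate can be checked with a few lines of arithmetic. The resolution of 8 bits per sample is an assumption made here because it reproduces the stated 1.2 Gbit·s⁻¹; the text above does not state the converter resolution.

```python
# Figures from the text
sample_rate = 0.6e6    # samples per second and neuron (~0.6 MSample/s)
n_neurons = 256        # neurons read out in parallel

# Assumption (not stated in the text): each sample is 8 bits wide
bits_per_sample = 8

# Time between two consecutive samples of the same neuron, in microseconds
sampling_interval_us = 1e6 / sample_rate

# Aggregate data rate over all neurons, in Gbit/s
data_rate_gbit = sample_rate * n_neurons * bits_per_sample / 1e9
```

Under this assumption, the interval evaluates to about 1.67 μs and the data rate to about 1.23 Gbit·s⁻¹, in line with the rounded values given above.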
A Computation Graph for Analog Circuits. To compute weight updates based on surrogate gradients, we incorporate these aggregated data into a computation graph that approximates the underlying neuronal dynamics on the neuromorphic substrate. To that end, we iteratively simulate the neuronal dynamics to obtain the graph in which we inject the actual recorded membrane traces. Thus, we use measured quantities, where available, and only rely on the model estimates for internal variables that are not measured, e.g., the synaptic currents.
We formulate the graph on a regular time grid of time step Δt derived from the sampling period of the membrane traces. Although the spike trains from the neuromorphic substrate are known with much higher temporal resolution, they are also aligned to these bins. Depending on the coding scheme and network topology, an increased resolution can be beneficial and allow us to better capture causal relations between spikes. In this case, the computation graph can be evaluated on a finer time scale and, for that purpose, operate on interpolated membrane traces.
To reconstruct the internal states, we start by assuming ideal LIF dynamics (Eq. 1), which we numerically integrate by taking into account its temporal decay and the calculated synaptic currents Ĩ[t], which, in turn, are based on the presynaptic spikes S̃_j[t] of neuron j:

$$\tilde{V}_i[t+1] = \tilde{V}_i[t]\, e^{-\Delta t/\tau_m} + \tilde{I}_i[t], \qquad [2]$$

$$\tilde{I}_i[t+1] = \tilde{I}_i[t]\, e^{-\Delta t/\tau_s} + \sum_j W_{ij}\, \tilde{S}_j[t], \qquad [3]$$

where W_ij denotes the synaptic weights and τ_s the synaptic time constant. For brevity, we consider a dimensionless formulation of the LIF dynamics, in which we assume a leak potential V_leak = 0, a capacitance C = 1, and a firing threshold ϑ = 1 (23). Physical variables can be readily obtained through appropriate rescaling (cf. Materials and Methods and SI Appendix). Eq. 3 can be augmented by an additional term to encompass recurrent connections. The modeled state variables, indicated by the tilde (˜), represent the estimates of the on-chip dynamics. Since these can deviate from the actual emulation and hence distort the resulting gradients, we insert the normalized recorded data in their place. For this purpose, we introduce an auxiliary identity function f(x, x̃) ≡ x and define surrogate derivatives ∂f/∂x = 0 and ∂f/∂x̃ = 1.
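Eq. 2 can now be modified to

$$\tilde{V}_i[t+1] = f\!\left(V_i[t], \tilde{V}_i[t]\right) e^{-\Delta t/\tau_m} + \tilde{I}_i[t]. \qquad [4]$$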
A similar approach is taken for spikes by defining S̃ analogously, with a surrogate derivative with respect to Ṽ, where β describes the steepness of the surrogate gradient (31). When performing the backward pass and, to this end, calculating ∂L/∂θ = … ∂S̃/∂Ṽ · ∂Ṽ/∂θ, the sampled values for the membrane potential are used whenever an expression containing V is evaluated, e.g., in ∂S̃/∂Ṽ. The estimates, in contrast, are used to determine further derivatives ∂Ṽ/∂θ, which occur in the recursion relation of BPTT.
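The split between measured values and modeled derivatives can be illustrated for a single neuron with one input weight w. The sketch below is not the authors' implementation: it assumes an exponential decay per time step for membrane and synaptic current, writes out the sensitivity dṼ/dw by hand instead of using an autodifferentiation library, and uses a fast-sigmoid-style surrogate 1/(β|V − ϑ| + 1)² as an assumed form of the surrogate derivative. All function and parameter names are hypothetical.

```python
import math


def surrogate_derivative(v, beta=10.0, threshold=1.0):
    """Assumed fast-sigmoid-style surrogate for dS/dV, evaluated at v."""
    return 1.0 / (beta * abs(v - threshold) + 1.0) ** 2


def forward_with_injection(w, in_spikes, v_recorded, dt, tau_m, tau_s):
    """Integrate the model for one neuron with a single input weight w.

    Values are taken from the recorded trace (f(V, V~) = V), while the
    derivative with respect to w follows the model (df/dV~ = 1).
    Returns (v_est, dv_dw) with one entry per time step.
    """
    decay_m = math.exp(-dt / tau_m)
    decay_s = math.exp(-dt / tau_s)
    i_est, di_dw = 0.0, 0.0          # estimated current and its sensitivity
    v_est, dv_dw = [0.0], [0.0]
    for t in range(len(in_spikes) - 1):
        # Value path: the recorded sample replaces the estimate
        v_next = v_recorded[t] * decay_m + i_est
        # Gradient path: the sensitivity is propagated through the model
        dv_next = dv_dw[t] * decay_m + di_dw
        # Synaptic current update driven by the presynaptic spikes
        i_est = i_est * decay_s + w * in_spikes[t]
        di_dw = di_dw * decay_s + in_spikes[t]
        v_est.append(v_next)
        dv_dw.append(dv_next)
    return v_est, dv_dw
```

In this picture, the contribution of time step t to a spike-based gradient would be `surrogate_derivative(v_recorded[t]) * dv_dw[t]`: the surrogate is evaluated at the measured potential, whereas the recursion for the sensitivity relies on the model.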
Flexible Choice of Loss Functions. The suggested framework allows us to operate on any differentiable loss that can be formulated on the data acquired from the neuromorphic system. This encompasses loss functions based on the spiking activity of the neurons, as well as on their membrane voltages (cf. Materials and Methods). The task-specific loss can be augmented by regularization functions. These might, on one hand, target an improved generalization performance or, on the other hand, an adaptation to hardware-specific constraints, such as finite weights and dynamic ranges of analog signals. Such terms can furthermore be directly tailored to shape the activity of the emulated SNNs and result in sparse firing patterns.

Results
We trained BrainScaleS-2 on a series of spike-based vision and speech-recognition tasks using our ITL learning framework. Specifically, we chose classification tasks requiring evidence integration on widely different time scales, which allowed us to probe the efficiency of our approach on both feed-forward and recurrent network topologies.
First, we trained a feed-forward network consisting of a single hidden layer with 246 neurons on the Modified National Institute of Standards and Technology (MNIST) dataset (32). To fit the data to a fan-in of 256 inputs, we reduced the original 28 × 28 images to 16 × 16 pixels. We then converted the pixels into a spike-latency code (cf. Materials and Methods). The network was optimized by using the Adam optimizer (33) to minimize a max-over-time loss L = NLL(softmax(max_t V^O_i[t]), y), with the negative log-likelihood (NLL), the membrane traces of the output layer V^O_i[t], and the true labels y. To prevent excessive amplitudes and, in turn, clipping of V^O_i on the analog substrate, we included a penalty term weighted by ρ_a and averaged over the output units i. We furthermore added an activity-shaping term to promote sparse activity patterns (cf. Eq. 6). Notably, this contribution could only reduce the network's activity and did not act as an upward-pulling homeostatic force. Being based on surrogate gradients, our approach nevertheless allowed training the network starting from a quiescent hidden layer.
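For a single sample, a max-over-time loss of this kind can be sketched in plain Python as follows. The function name and the list-based data layout are illustrative assumptions; the amplitude penalty and the activity-shaping term are omitted.

```python
import math


def max_over_time_nll(traces, label):
    """NLL(softmax(max_t V_i[t]), y) for one sample.

    traces: per-unit output membrane traces, indexed as traces[i][t]
    label:  index of the true class
    """
    scores = [max(trace) for trace in traces]   # maximum over time per unit
    peak = max(scores)                          # shift for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[label]             # equals -log softmax(scores)[label]
```

Because only the maximum of each trace enters the loss, the gradient for a given unit flows through the single time step at which its membrane potential peaks.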
During training, the neuromorphic substrate learned to infer and represent the correct class memberships as the maximally responsive output units (Fig. 3 A and B). Interestingly, the inhibition of the other units was not explicitly demanded by the loss function, but emerged naturally through optimization. After 100 epochs, the model almost perfectly fit the training samples and achieved an overall accuracy of 97.2 ± 0.1% on held-out test data (Table 1). We were able to reduce overfitting by augmenting the data through random rotations of up to 15°. Dropout similarly improved test performance, and combining it with data augmentation resulted in an accuracy of 97.6 ± 0.1% on BrainScaleS-2.
As a comparison, we trained the same SNN purely in software and, in that process, ignored all hardware-specific constraints, including the finite weight resolution. With an accuracy of 97.5 ± 0.1% on the test data, the software implementation only slightly surpassed BrainScaleS-2. As a baseline for the downscaled 16 × 16 MNIST dataset, we furthermore trained an equivalently sized ANN with rectified linear units, which resulted in an accuracy of 98.1 ± 0.1%. Dropout as well as augmentation again improved upon these numbers, resulting in a best-effort performance of 98.7 ± 0.1%. Importantly, these accuracy figures, within their uncertainties, resembled results on the full-size MNIST images, suggesting that these two datasets are comparable in their complexity.
To further explore the computational abilities of BrainScaleS-2 trained with surrogate gradients, we used the same network architecture to classify 16 × 16 Fashion-MNIST (34), which resulted in a test accuracy of 84.2 ± 0.2% (Table 1).
Low-Latency Neuromorphic Computation. The output traces of trained networks suggested that for latency-encoded inputs, as used above, the decision is available long before the end of a stimulus (cf. Fig. 3A). To determine the network's classification latency, we artificially restricted the readout layer's membrane traces (cf. Fig. 3A) to varying time intervals [0, T ], over which we based the network's decision, as given by the maximally active unit. We found that the readout reached its peak accuracy within 8 μs after the first input spike (Fig. 3C).
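The latency analysis described above amounts to re-evaluating the decision rule on truncated output traces. A minimal sketch, with hypothetical function names and a list-based data layout, could look as follows.

```python
def decision_within(traces, n_steps):
    """Class index when only the first n_steps samples of each trace are used."""
    scores = [max(trace[:n_steps]) for trace in traces]
    return scores.index(max(scores))


def accuracy_within(samples, n_steps):
    """Fraction of correct decisions for a list of (traces, label) pairs."""
    correct = sum(decision_within(traces, n_steps) == label
                  for traces, label in samples)
    return correct / len(samples)
```

Sweeping `n_steps` and multiplying by the sampling interval gives accuracy as a function of the window length T, from which the shortest window that reaches peak accuracy can be read off.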
Low classification latency, however, does not automatically translate into high inference rates, which are also limited by the neuronal and synaptic time constants. These time constants determine the rate by which state variables decay back to baseline within the neuron circuits and, hence, impose a minimum separation of independent stimuli. Still, to translate low classification latency into high inference rates, we added an artificial reset of the neuromorphic units 10 μs after inserting the first input spike. Specifically, we exploited a feature of BrainScaleS-2 that allowed us to concurrently reset the analog membrane circuits and clamp all synaptic currents to their respective baselines (Fig. 3D). This allowed us to infer images with a separation of 11.8 μs, so that our SNNs accurately classified more than 85,000 (85 k) images per second with a latency of 8 μs. Moreover, we measured the system's power consumption. When emulating the trained SNN, the full BrainScaleS-2 chip consumed ∼200 mW. This figure included the current draw from the analog neuromorphic core, the plasticity processors, all surrounding periphery, and the high-speed communication links. Combining this measurement with the above throughput results in an energy consumption of 2.4 μJ per classified image.
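The throughput and energy figures follow from the stimulus separation and the chip power by simple arithmetic, sketched here with the rounded values quoted in the text.

```python
# Rounded figures from the text
separation_s = 11.8e-6   # separation between consecutive images
power_w = 0.2            # full-chip power draw (~200 mW)

# One image is processed per separation interval
images_per_second = 1.0 / separation_s

# Energy per image = power x time per image, converted to microjoules
energy_per_image_uj = power_w * separation_s * 1e6
```

With these rounded inputs, the throughput evaluates to about 84,700 images per second and the energy to about 2.36 μJ per image, which agree with the quoted 85 k images per second and 2.4 μJ at the stated precision; the small difference presumably reflects rounding of the separation and power values.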
Efficiency through Sparse Spiking Activity. A key advantage of SNNs is their sparse temporal spiking activity, which is presumed crucial for the power efficiency of the brain (4). For similar reasons, it is also important for larger neuromorphic systems and particularly in scenarios in which several chips cooperate by exchanging spikes over communication channels with limited bandwidth.
To ensure sparse activity on the BrainScaleS-2 system, we augmented the training loss by a regularization term (Eq. 6) with the strength parameter ρ_b, the hidden-layer size N_H, and the corresponding hidden-layer spike trains S^H_i (35). We trained the above feed-forward SNNs for a range of different values ρ_b and measured both their accuracy and average hidden-layer spike counts. All resulting network configurations were able to fit the training data with high accuracy (Fig. 3E). More importantly, they reached a constant test accuracy of 97.2% for activity levels down to ∼20 hidden-layer spikes per image. When only using 10 spikes on average, we observed a slight decrease in performance. At such low spike counts, the networks operated in a regime far from the rate-coding limit and, hence, had to rely on individual spikes and their timing.
Self-Calibration through ITL Learning. The above results were obtained with a calibrated BrainScaleS-2 system in which the parameter deviations due to device mismatch were largely compensated, and the computation graph hence closely matched the emulated dynamics. Nevertheless, a certain degree of residual mismatch remained. To quantify whether and how well our ITL scheme self-calibrates the substrate during learning, we performed a series of additional experiments, in which we deliberately decalibrated the system. Specifically, we calibrated each neuron's time constants and threshold to individual target values. These were drawn from normal distributions with a mean corresponding to the original calibration targets. We generated parameter sets by varying their normalized SDs σ_d in the range of 0 to 50% (Fig. 4A). This notably exceeded the mismatch present on an uncalibrated BrainScaleS-2 system. To dissect the influence of poorly matching time constants and misaligned thresholds, we first detuned τ_{m,s} and ϑ − V_leak separately and, finally, all of these parameters at the same time. Each of these experiments was repeated for five random seeds.
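Drawing such decalibrated targets can be sketched as below. The sketch assumes that the normalized SD σ_d is expressed relative to the nominal target value; the function name, the seed handling, and the example value are hypothetical.

```python
import random


def draw_targets(nominal, sigma_d, n_neurons, seed=0):
    """Per-neuron calibration targets drawn from N(nominal, sigma_d * nominal).

    Assumption: sigma_d is the SD normalized by the nominal target.
    Values are not clipped here, so large sigma_d can produce extreme or
    even non-physical targets that a real setup would have to handle.
    """
    rng = random.Random(seed)
    return [rng.gauss(nominal, sigma_d * nominal) for _ in range(n_neurons)]
```

For example, `draw_targets(10.0, 0.3, 512, seed=1)` would return 512 targets scattered around a hypothetical nominal value of 10 with a relative spread of 30%, while `sigma_d = 0` reproduces the calibrated case.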
For each set of parameters, we trained the SNN on the neuromorphic system, still assuming ideal dynamics when constructing the computation graph, as done previously. In other words, we explicitly ignored the introduced mismatch. Nevertheless, learning performance was hardly affected by decalibration up to σ_d = 30%. Beyond that point, training error levels remained low, but gradually increased (Fig. 4B). The testing performance, however, was (except for the highest σ_d) unaffected by the artificial mismatch. Notably, for high decalibration levels, some network configurations suffered from pathological network states, which were caused by some neurons entering a suprathreshold regime without external input. Thus, even for mismatch levels far above the ones expected for BrainScaleS-2 and similar systems, ITL learning effectively self-calibrated the analog neuromorphic SNNs. To illustrate the added benefit of such self-calibration, we also established the baseline performance for weight transfer, whereby networks were trained in software, and the weights were transferred to the neuromorphic chip subsequently. While a performance gap between ITL and weight transfer was already noticeable for the calibrated system, higher decalibration levels σ_d dramatically widened this gap (Fig. 4B).

Fig. 4 (partial caption): In contrast, simply loading a software-trained network results in an increased test error, especially for a strong decalibration σ_d. For configurations with extreme mismatch, some networks suffered from dysfunctional states (e.g., leak-over-threshold). (C) When incorporating dropout regularization during training, networks become widely resilient to failure of hidden neurons.
Cramer et al., Surrogate gradients for analog neuromorphic computing, PNAS

Training for Robustness. We furthermore investigated the resilience of trained SNNs to defects occurring after deployment, e.g., failing neuron circuits. To this end, we simulated neuronal death by artificially silencing randomly selected units in the hidden layer of the network after training. As expected, performance deteriorated with an increasing fraction of disabled neurons (Fig. 4C).
However, when robustness was encouraged during training using dropout, the resilience to such neuronal failures was largely improved. For networks trained with a dropout rate of 40%, the test error increased by only 10% when silencing 15% of the hidden-layer units. In contrast, it grew by 37% when dropout was not used during training.
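The silencing experiment can be sketched as zeroing the spike trains of a randomly chosen subset of hidden units. The function name, the seed handling, and the list-based data layout are illustrative assumptions.

```python
import random


def silence_units(spike_trains, fraction, seed=0):
    """Return a copy of spike_trains with a random fraction of units silenced.

    spike_trains: per-unit spike trains, indexed as spike_trains[i][t]
    fraction:     fraction of units to disable (0.0 to 1.0)
    """
    rng = random.Random(seed)
    n_units = len(spike_trains)
    n_dead = round(fraction * n_units)
    dead = set(rng.sample(range(n_units), n_dead))   # units chosen without replacement
    return [[0] * len(train) if i in dead else list(train)
            for i, train in enumerate(spike_trains)]
```

Evaluating the test error on the silenced network for increasing `fraction`, with and without dropout during training, gives the kind of comparison described above.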
Speech Recognition with Recurrent SNNs. So far, our analysis was limited to tasks with short time horizons, which can be readily solved using feed-forward networks. But other tasks, such as speech recognition or keyword spotting, may require working memory and thus recurrent architectures. On BrainScaleS-2, recurrent connectivity is readily supported by a flexible event router. Further, recurrence is easily integrated into our ITL learning scheme by adding recurrent connections to Eq. 3.
To demonstrate successful learning of recurrent connections with our framework, we trained a network with 186 recurrently connected hidden-layer neurons on the Spiking Heidelberg Digits (SHD) dataset (36), which consists of spoken digits from "zero" to "nine" in both English and German, resulting in 20 classes total. This dataset is a natural benchmark for SNNs due to its inherent temporal dimension. Furthermore, it directly provides input spike trains and, hence, alleviates the need for additional preprocessing, which can confound comparison. To feed the data into our system, we reduced their dimensionality by subsampling 70 out of the original 700 channels (cf. Materials and Methods). The network was then trained by optimizing a sum-over-time loss L = NLL(softmax(Σ_t V^O_i[t]), y) (Fig. 5A). To prevent pathologically high firing rates, we employed homeostatic regularization during training. Specifically, we added a regularizer of the form ρ_r · max(0, Σ_{i,t} S_i[t] − ϑ_r)², where i and t iterate over the hidden-layer units and time steps, respectively; ρ_r defines the regularization strength; and ϑ_r is an activity threshold.
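A regularizer of this type penalizes only activity above a threshold. The sketch below reads the expression as a squared, rectified excess of the total hidden-layer spike count over ϑ_r; this reading, the function name, and the data layout are assumptions.

```python
def activity_regularizer(spikes, rho_r, theta_r):
    """rho_r * max(0, sum_{i,t} S_i[t] - theta_r)^2 for one sample.

    spikes: per-unit binary spike trains, indexed as spikes[i][t]
    """
    total = sum(sum(train) for train in spikes)   # spike count over units and time
    return rho_r * max(0.0, total - theta_r) ** 2
```

Below the threshold the term and its gradient vanish, so it only acts on networks whose firing becomes excessive.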
After 100 training epochs, the SNN reached 96.6 ± 0.5% on the training data and 76.2 ± 1.3% on the test set (Fig. 5B and Table 1). The large gap is presumably due to the nature of the dataset, which was designed to especially challenge a network's ability to generalize (36). The two languages included in the dataset exhibit classes with significant phonemic similarity ("nine" vs. "neun"), which are indeed harder to separate by the trained network (Fig. 5C). Most importantly, however, 81% of the test set comes from two speakers who are not part of the training set and for whom classification error rates are higher (Fig. 5D). To improve generalization performance, we employed data augmentation. For this purpose, we stochastically shifted events to neighboring input channels drawn from a normal distribution centered around their original channel (cf. Materials and Methods). This approach improved the test performance to 80.6 ± 1.0%. To test whether good performance was dependent on learning of recurrent connections, we initialized a set of networks with the shuffled recurrent weights from a trained network and only trained their input and readout weights (Fig. 5B). This resulted in a substantial reduction of classification performance to 64.3 ± 2.3%. Thus, our ITL learning framework can leverage recurrent connections to improve accuracy on this benchmark.
To again compare our hardware system to simulations, we trained and evaluated the SNN in an equivalent software-only implementation. Without augmentation, it reached an accuracy of 71.2 ± 0.3%, far below the corresponding hardware results. At the same time, the software simulation was able to perfectly fit the training data, which was not achieved on BrainScaleS-2. We speculate that this discrepancy results from the intrinsic stochasticity of the analog substrate, which is propagated and amplified by the network's recurrent dynamics and thereby acts as a form of regularization. When we applied the same data augmentation to the simulation as in our hardware emulation, this resulted in an improved test accuracy of 79.9 ± 0.7%, close to the performance of BrainScaleS-2 under equivalent conditions. Thus, our work suggests that intrinsic analog device noise could act as an efficient regularizer.
Finally, we compared the energy efficiency of our recurrent SNNs to recently published work on the Aloha keyword-spotting task. To that end, we trained a recurrent network with 176 hidden units on the task and found comparable performance at an energy efficiency competitive to Loihi and Movidius (ref. 37; SI Appendix, SI Text and Table S1).
In summary, our findings illustrate that the flexibility of ITL learning also applies to the realm of recurrent SNNs trained on challenging speech-processing problems.

Discussion
We have developed a general ITL learning method for recurrent and multilayer SNNs on analog neuromorphic substrates and demonstrated its capabilities on BrainScaleS-2. The combination of surrogate gradients with ITL training, facilitated by the massively parallel digitization of analog membrane potentials, allowed us to build on recent achievements in the field of SNN optimization and bring them to an analog substrate. This allowed us to achieve state-of-the-art classification accuracies on multiple benchmark problems, comparable to equivalent software simulations. During training, our framework automatically corrected for device mismatch and thus abolished the need for explicit calibration. The resulting SNNs exhibited spatially and temporally sparse activity patterns and could furthermore be optimized for resilience to neuron failure. Ultimately, our method allowed us to exploit BrainScaleS-2 for low-latency neuromorphic inference at high throughput and a low energy footprint.
Most current neuromorphic systems are fully digital and typically allow one to simulate software-trained models without performance loss (38,39). This approach is flexible with regard to the SNN training schemes used (23, 40–49), but fully leveraging recent advances in material sciences often requires dealing with analog or mixed-signal components (15,19,50,51). For instance, memristors, a key emerging technology, are ideal candidates for long-term memory storage in neuromorphic systems (52–54). However, these components are intrinsically analog and subject to drift and manufacturing variability. These imperfections can reduce performance when loading software-trained models onto the analog substrate.
While several studies approached this problem by optimizing for additional robustness during training (44,55), these techniques are intrinsically limited. Since mature on-chip training solutions are not yet available, ITL learning has emerged as a good compromise that efficiently takes device-specific nonidealities and heterogeneity into account (56–59). However, previous work relied on rate-based or time-to-first-spike coding schemes. Here, we expanded ITL techniques into the realm of surrogate gradient learning, which flexibly interpolates between rate- and timing-based coding schemes on multilayer and recurrent architectures, thereby simultaneously improving performance and energy efficiency, while also being conducive to fast inference (24).
Comparing the performance of neuromorphic SNN implementations is an intricate task, starting with a lack of standardized benchmarks (36,60). When surveying the broad spectrum of different neuromorphic architectures, one encounters diverse ways of determining a system's energy consumption, which range from presilicon estimates of a neuromorphic core's current draw to full-system laboratory measurements. Nevertheless, we attempted to contrast our findings with results from previous studies on both digital and analog systems (Table 2). Although it lacks the essence of temporal spike-based information processing, we considered the MNIST dataset due to its widespread adoption.
Our model on BrainScaleS-2 performed competitively in all metrics and was surpassed in accuracy only by much larger or convolutional networks. When considering the energy footprint, BrainScaleS-2 reached values only outperformed by optimized architectures fabricated in much smaller and hence more efficient technology nodes (38,63). In comparison to other neuromorphic systems, we set benchmarks in terms of throughput and latency, which even challenge dedicated ANN accelerators (SI Appendix, Tables S1 and S2).

Table 2. Comparison of MNIST benchmark results across neuromorphic platforms
One limitation of our study is that, in addition to MNIST, we primarily used speech-based benchmark datasets to compare to other systems. Using accelerated systems such as BrainScaleS-2 for speech recognition would require an extra conversion step at the sensor level and thus likely result in additional energy costs that we did not quantify in our study. However, it is conceivable that part of this cost could be offset by dynamical network approaches and effective time-multiplexing strategies in edge applications or data centers. We primarily chose speech benchmarks due to the lack of suitable alternatives (60) and as a proxy for challenging problems requiring temporal processing that fits the number of channels supported by our system. While we expect that our main findings will generalize to other task domains and other neuromorphic substrates, showing this equivalence is left for future work.
In summary, our work shows how learning can efficiently compensate for device-specific imperfections, thereby allowing us to employ analog neuromorphic substrates for complex, energy-efficient, and ultralow-latency information processing. Importantly, it is also the first step toward future on-chip learning algorithms that could even take advantage of such device heterogeneity (65). Thus, our work gives a glimpse of how powerful learning algorithms will empower future neuromorphic technologies.

Materials and Methods
Software Environment. Our training framework was based on PyTorch's autodifferentiation library (29). It furthermore builds upon the BrainScaleS-2 software stack to configure the neuromorphic system and execute the experiments (66).
Input Coding. For MNIST, we scaled down the dataset to 16 × 16 pixels by first discarding the two outermost rows and scaling the remaining pixels. The images were then converted to spikes by interpreting the normalized pixel grayscale values $x_i$ as input currents to LIF neurons. Strong enough currents trigger a spike at time $t_i = \tau_\mathrm{in} \log\left[x_i / (x_i - \vartheta_\mathrm{in})\right]$, where $\tau_\mathrm{in}$ denotes the input unit's time constant and $\vartheta_\mathrm{in}$ its threshold (SI Appendix, Table S3).
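As an illustration of this latency code, the following is a minimal Python sketch, not the implementation used in this study. The function names and the default values of `tau_in` and `theta_in` are hypothetical placeholders in arbitrary units; the values actually used are listed in SI Appendix, Table S3. The sketch assumes that a pixel whose normalized value does not exceed the threshold produces no spike.

```python
import math


def spike_latency(x, tau_in, theta_in):
    """First-spike time of an input unit driven by a constant current x.

    Returns None when the current is too weak to reach the threshold.
    """
    if x <= theta_in:
        return None  # assumption: subthreshold pixels stay silent
    return tau_in * math.log(x / (x - theta_in))


def encode_image(pixels, tau_in=1.0, theta_in=0.2):
    """Map normalized pixel values in [0, 1] to spike times (or None).

    tau_in and theta_in are placeholder values, not those of the paper.
    """
    return [spike_latency(x, tau_in, theta_in) for x in pixels]
```

In this sketch, brighter pixels spike earlier, and pixels at or below the threshold contribute no input spike.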
Since the SHD dataset is provided in the form of input spike times, a custom conversion was not required. For SHD, we reduced the original 700 input channels by subsampling. Specifically, we omitted the first 70 and then retained every 9th input unit. The time dimension was scaled by a factor of 2,000 to account for the system's acceleration factor of 1,000 and further shorten the experiment duration to reduce the computational burden on the host system. When employing data augmentation, a spike originating from input channel $i$ was reassigned to a neighboring channel drawn from $\mathcal{N}(\mu = i, \sigma)$. This augmentation was applied prior to downsampling the inputs.
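The SHD preprocessing steps can be sketched in Python as follows. This is a hedged illustration under several assumptions that the text does not specify: the retained channels are taken to be 70, 79, 88, and so on (70 channels in total), jittered channel indices are rounded to the nearest integer and clamped to the valid range, spikes that land on a discarded channel are dropped, and scaling the time dimension is read as dividing spike times by 2,000. All names are hypothetical.

```python
import random

N_CHANNELS = 700
# Assumption: skip the first 70 channels, then keep every 9th one.
KEPT_CHANNELS = list(range(70, N_CHANNELS, 9))
INDEX_OF = {channel: k for k, channel in enumerate(KEPT_CHANNELS)}
TIME_COMPRESSION = 2000.0


def augment_channel(channel, sigma, rng):
    """Reassign a spike to a nearby channel drawn from N(channel, sigma)."""
    jittered = round(rng.gauss(channel, sigma))
    # Assumption: out-of-range draws are clamped to the outermost channels.
    return min(max(jittered, 0), N_CHANNELS - 1)


def preprocess(spikes, sigma=0.0, rng=None):
    """Augment, downsample, and time-compress a list of (time, channel) spikes."""
    rng = rng or random.Random()
    result = []
    for time, channel in spikes:
        if sigma > 0.0:
            # Augmentation acts on the original 700 channels, before downsampling.
            channel = augment_channel(channel, sigma, rng)
        if channel in INDEX_OF:
            result.append((time / TIME_COMPRESSION, INDEX_OF[channel]))
    return result
```

Because the jitter is applied before downsampling, a spike from a discarded channel can move onto a retained one, and vice versa.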
Initialization. We used Kaiming's initialization (67) for both the hidden- and output-layer weights. Specifically, weights were drawn from a normal distribution with zero mean and an SD of $\sigma_w / \sqrt{N_{H,L}}$ (SI Appendix, Table S3).
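A minimal sketch of such an initialization is given below in plain Python. It assumes the standard Kaiming form, in which the SD scales with the inverse square root of a layer size `n`; which layer size is meant for each weight matrix, and the value of `sigma_w`, are not specified here, so both are treated as caller-supplied parameters. The function name is hypothetical.

```python
import math
import random


def init_weights(n, n_out, sigma_w=1.0, rng=None):
    """Draw an n x n_out weight matrix from N(0, (sigma_w / sqrt(n))^2).

    n is the layer size used for normalization (an assumption of this
    sketch); sigma_w is a placeholder gain, not the value from the paper.
    """
    rng = rng or random.Random()
    std = sigma_w / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n)]
```

Normalizing by the square root of the layer size keeps the variance of the summed synaptic input roughly independent of how many units project onto a neuron.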
Weight Scaling. Weight values had to be scaled, rounded, and cropped to the neuromorphic system's weight resolution of 7-bit signed integers resulting from merging two 6-bit synapse circuits. The exact scaling took into account analog bias currents and other technical parameters and was heuristically optimized to equalize the response of the analog neuronal circuits and the model dynamics of the computational graph.
Due to the absence of a threshold for the nonspiking output layer, its membrane traces could be scaled arbitrarily. For the MNIST classification, we adopted a dynamic weight scaling for the output weights by aligning the largest absolute weight value as represented in software to the maximum weight possible on the substrate.
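The scale, round, and clip procedure, together with the dynamic scaling of the output weights, can be illustrated with the following Python sketch. It is a simplification under stated assumptions: a single scalar factor stands in for the heuristically optimized scaling, which additionally accounted for analog bias currents and other technical parameters, and the representable range is assumed to be −63 to 63, as would result from two 6-bit circuits carrying the positive and negative parts. All names are hypothetical.

```python
# Assumption: each 6-bit synapse circuit covers magnitudes 0..63, so the
# merged signed weight spans -63..63.
W_MAX = 63


def quantize(weights, scale):
    """Scale, round, and clip software weights to signed hardware integers.

    Note: Python's round() rounds exact ties to the nearest even integer.
    """
    return [max(-W_MAX, min(W_MAX, round(w * scale))) for w in weights]


def dynamic_scale(weights):
    """Scale factor that maps the largest absolute weight onto W_MAX."""
    largest = max(abs(w) for w in weights)
    return W_MAX / largest if largest > 0.0 else 1.0


def quantize_output_layer(weights):
    """Quantize output weights using the full available weight range."""
    return quantize(weights, dynamic_scale(weights))
```

Rescaling the output weights in this way changes only the amplitude of the nonspiking membrane traces, which is harmless as long as the readout depends on their relative rather than absolute magnitude.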
Energy Measurements. We separately measured the current draw of the full application-specific integrated circuit on the individual supply rails via INA219 current/power monitors from Texas Instruments, which were for that purpose placed on the system's carrier board. The power readings were taken during the execution of the forward pass.