Brain-Inspired Learning on Neuromorphic Substrates

Neuromorphic hardware strives to emulate brain-like neural networks and thus holds the promise for scalable, low-power information processing on temporal data streams. Yet, to solve real-world problems, these networks need to be trained. However, training on neuromorphic substrates creates significant challenges due to the offline character and the required non-local computations of gradient-based learning algorithms. This article provides a mathematical framework for the design of practical online learning algorithms for neuromorphic substrates. Specifically, we show a direct connection between Real-Time Recurrent Learning (RTRL), an online algorithm for computing gradients in conventional Recurrent Neural Networks (RNNs), and biologically plausible learning rules for training Spiking Neural Networks (SNNs). Further, we motivate a sparse approximation based on block-diagonal Jacobians, which reduces the algorithm's computational complexity, diminishes the non-local information requirements, and empirically leads to good learning performance, thereby improving its applicability to neuromorphic substrates. In summary, our framework bridges the gap between synaptic plasticity and gradient-based approaches from deep learning and lays the foundations for powerful information processing on future neuromorphic hardware systems.


I. INTRODUCTION
Our brains simultaneously process various streams of temporal information allowing us to solve challenging real-world problems that ensure our survival. Importantly, brains do this with an aptitude that dwarfs existing computer technologies while only consuming a meek 25W of power.
Neuromorphic engineering has taken on the challenge of approaching such efficiency by building scalable, low-power systems that mirror the brain's essential architectural features [1][2][3]. Decades of research helped overcome major engineering challenges toward this goal, resulting in an increasing number of neuromorphic sub-strates available today [4][5][6] and allowing the efficient emulation of brain-inspired neural networks. One key challenge that remains and prevents the widespread application of neuromorphic systems is the lack of practical algorithms that run on such hardware and equip it with complex functionality.
Deep learning provides algorithmic blueprints to organize large neural networks into suitable function approximators that flexibly solve diverse real-world problems [7]. To achieve this feat, deep learning optimizes loss functions with gradient descent, which can be computed efficiently with the Back-Propagation (BP) algorithm. This efficiency, however, rests on the von Neumann computer architecture. In contrast, gradient BP is difficult to implement on non-von Neumann neuromorphic substrates. These difficulties mainly arise from limitations in their ability to communicate neural activities and weight values between different network elements, similar to the architectural constraints of biological neural networks [8]. For instance, synaptic plasticity implemented at a biological synapse may have access to the activity of the two neurons it connects, but not to the activity of other neurons that it is not physically connected to. This notion is often expressed by saying that plasticity in the brain is local. Local learning rules have been extensively studied in computational neuroscience, typically based on experimental data. However, these rules often lack the normative foundations of BP and hence the ability to instantiate complex functional neural networks. However, top-down driven synaptic plasticity rules can also be derived from gradient descent both in the case of single biologically inspired spiking neurons [9][10][11], thereby establishing a link to the Perceptron [12], and in more complex multi-layer networks [13][14][15][16][17][18][19][20][21]. Remarkably, many normative approaches do result in learning rules largely consistent with models of cortical neurons and synapses [17,[22][23][24][25][26][27]. In the present article, we discuss the remarkable commonalities across machine learning and computational neuroscience learning algorithms from the standpoint of neuromorphic engineering. To that end, we rely on the conceptual framework provided by deep learning allowing us to focus on three distinct aspects of building functional artificial neural networks: architecture, learning rules, and loss functions ( Fig. 1; [28]). Within this framework, we focus on online learning rules (Sections II & III) and loss functions (Section IV) due to their specific relevance for neuromorphic engineering. This relevance largely derives from the use of non-von Neumann architectures in this field, which we introduce next.

A. Von Neumann computers and biological brains
Conceptually, computation requires memory, communication, and arithmetic. Computation is a physical process in which these three components come together in the same place and at the same time. Doing so requires space, time, and energy. And the amount of resources used determines the overall cost of computation, and varies depending on both the computation itself and the architecture on which it runs.
The von Neumann architecture, which underlies virtually all human-made computers, posits a physical sepa-ration between processing units for arithmetic, program flow control, and memory. This separation induces a communication channel between the two. A significant design limitation of this architecture is that when the amount of data becomes too large, communication between the processing and the memory unit becomes a bottleneck. Thus, for memory-intensive computations, this so-called von Neumann Bottleneck impacts both latency and power efficiency. The impact is especially noticeable in deep learning applications that require large amounts of data to first propagate through immense deep neural network models. Then gradient information propagates back through the same nodes to compute gradients for learning. Since both the data and the model parameters reside in memory while the operations that act on them take place in the processing units, the communication needs become substantial. Although modern computer architectures use diverse strategies to mitigate this bottleneck, e.g., caching, branch prediction, and parallelism, all of these measures come at the expense of higher energy cost and larger on-chip space requirements. Hence, they eventually run into the same bottleneck. With the looming end of Moore's law, the communication bandwidth across processing and memory units is reaching a plateau. While the continued development of deep learning hinges in large part on ever faster and larger-scale hardware, some emerging neuromorphic developments seek to provide alternatives to the von Neumann architecture [29].
While von Neumann architectures simulate deep neural networks, neuromorphic solutions attempt to emulate them on a physical substrate, inspired by the brain. In the brain, neurons integrate inputs in an analog manner, apply nonlinear transformations to them, and communicate asynchronously through digital spikes or action potentials. Spikes are binary events in continuous time that evoke electric currents in receiving neurons whose strength is determined by their connecting synapses. Importantly, these connections are plastic and can change dynamically in an experience-dependent manner to implement learning and memory. Together, neurons and their intricate synaptic connectivity achieve neural processing and long-term storage in a completely parallel and time-continuous way. Since memory and arithmetic are physically co-localized and distributed within the same physical network, brains are fundamentally nonvon Neumann architectures.
Sparse, event-based communications and physically co-localizing memory and computation are two defining principles of brain-inspired computing and neuromorphic engineering [29,30]. Neuromorphic engineers recognized the benefits of this approach and developed large networks of VLSI neurons and synapses communicating asynchronously through Address Event Representation (AER) [31,32]. Recent digital hardware, such as Google's TPU [33], Graphcore's Intelligence Processing Unit (IPU), and Cerebras' wafer-scale CS-1, have also embraced some form of memory-computation co-localization to improve the energy and performance metrics of scientific computing and machine learning workloads. The development of these brain-inspired architectures is in large part motivated by the need to process the ever-increasing deluge of data generated by the pervasive sensors in everyday life.

B. Processing and learning from temporal data streams
A large fraction of sensor data are temporal and require real-time processing, thus warranting dedicated architectures. For example, autonomous vehicles need continuous monitoring and processing of time series data from their sensors to navigate safely. Similarly, internet of things (IoT) devices have to continuously monitor their environment to respond to speech commands, detect anomalies online from biosensor data, (e.g., a pacemaker implant), and to learn continually [34][35][36]. While data streams induce specific challenges, most of which we will discuss in Section IV, we first focus on the essential fact that most real-world data, whether streaming or not, is also temporal data that needs to be processed in realtime. In the following, we briefly review RNNs as the de-facto standard for processing temporal data before focusing specifically on neuromorphic implementations thereof and how we can accomplish gradient-based optimization in real-time. The brain is an inherently time-dependent dynamical system [37,38] that relies on biophysical processes, recurrence, and feedback of its physical substrate for computation [30,39]. These dynamics are different from the majority of deep neural networks, which are often strictly feedforward, and lack the fine temporal dynamics of brains. From a technological point of view, emulating neural dynamics on a physical substrate has the advantage of operating much more efficiently compared to simulations [40]. However, this comes with a key challenge in which "time represents itself." This implies that all computational processes occur at the timescales of the physical system. In VLSI technologies, this is achieved by operating CMOS transistors in their subthreshold regime, such that currents and consequently the time constants are matched to those of the brain [40,41]. As such, these systems are both online and streaming. Other technologies run in accelerated time [42], which can create a mismatch of time scales in certain real-world applications.
RNNs have proven highly effective for sequential processing such as keyword spotting, object recognition, or time series forecasting [7,43]. Neuromorphic processors generally implement Spiking Neural Networks (SNNs) [6], which can be viewed as a special class of RNNs inspired by biology. SNNs are particularly suited for energy-efficient processing thanks to their rich local dynamics but spatiotemporally sparse communication via spikes (events). This is in contrast to most RNNs used in machine learning which rely on dense and analog valued communications. Moreover, biological neurons presumably carry specific inductive biases in their internal dynamics that are potentially advantageous for realworld information processing.
Like all neural networks, SNNs can be trained to find suitable connection weights. Because SNNs are RNNs, they can be trained with similar gradient-based methods that only need to be modified slightly to accommodate the binary activation functions underlying action potentials [20]. Gradient descent on a loss function thereby automatically adjusts neuron and synapse parameters in the hidden layers of the network to reduce a scalar training loss. In the following, we review two common algorithms underlying gradient computation in RNNs, and highlight their limitations for gradient-based learning in neuromorphic substrates, before discussing a range of possible remedies.

II. GRADIENT-BASED LEARNING IN RECURRENT NEURAL NETWORKS
Training RNNs requires computing objective function gradients with respect to the network parameters. To gain a better understanding of why specific challenges arise when computing gradients for RNNs, we consider a simple RNN with the network state h t ∈ R k , the input x t ∈ R n and the parameters θ ∈ R p , where p is the number of parameters. We further define the output y t = g θ (h t ), and the associated target y t * and loss function L = t L t (y t , y t * ). Training an RNN requires computing the gradients of L with respect to all the parameters θ. Following the steps of Marschall et al. [44], we define θ t as the application of the parameter at time t to distinguish between their direct and indirect influence, and note that all parameters θ t are tied across all timesteps t. The gradient is given by: (1) Two different temporal summations appear in this expression. To be able to use Equation (1) for online learning, we have to first evaluate the sum over s which underlies ∂h t ∂θ . Fortunately, it is possible to compute ∂h t ∂θ with the help of a simple recursion relationship. To write this relationship compactly, we define the influence G t := ∂h t ∂θ ∈ R k×p , the immediate influence F t := ∂h t ∂θ t ∈ R k×p , and the RNN's dynamics H t := ∂h t ∂h t−1 ∈ R k×k . By further assuming G t=0 = 0, G t is given as the recursive product of Jacobians: which can be computed going forward in time. For a detailed derivation of the above expressions, see [44]. By inserting G t back into Equation (1), and assuming a small learning rate, the summation over t can be implemented as an online learning algorithm. This causal learning algorithm is called Real-Time Recurrent Learning (RTRL) [44,45].
However, a more common method to evaluate the gradient is Back-Propagation-Through-Time (BPTT). BPTT takes an acausal approach to evaluate the same Jacobian matrix products: it computes the product of Jacobians starting at the end and working its way backward through time, hence its name. To make this relationship explicit we define the credit assignment vector C t = ∂L ∂h t ∈ R k and the instantaneous credit vector D t = ∂L t ∂h t ∈ R k . The recursion relation underlying BPTT is then given by: This expression needs to be computed in an acausal manner [44], which precludes its use as an online learning algorithm. Although, for a given sequence, the gradients resulting from BPTT and RTRL are the same, BPTT remains the gold standard for training RNNs on von Neumann hardware. The reason is that the different implementations have specific advantages. To understand the origin of these differences, we now analyze the computational costs of BPTT and RTRL.

A. Cost analysis of BPTT
The first term in Equation (3) is the k-vector derivative of a scalar loss function, and is thus a row vector. The factorization afforded by the chain rule means that all products are Vector Jacobian products of dimensions k and k × k, respectively, and can be computed in k 2 evaluations. However, to evaluate the gradient in reversemode, it is necessary to record all evaluations in the forward phase. Thus, the full activation history needs to be stored in memory, and in the case of RNNs, memory requirement scales as O(kT ). The time complexity of each timestep is dominated by the number of scalar multiplications operations underlying the product of the Jacobians which is O(k 2 T ) (cf. Fig. 4). The dependence on T restricts BPTT to temporal inputs of limited duration and to substrates that offer enough memory to store the activation history. Both requirements are major shortcomings when we want to process continuous streaming data on non-von Neumann architectures on which locally accessible memory is limited. To reduce the complexity of BPTT, gradient propagation is generally truncated at a number of steps smaller than T . This temporal restriction improves complexity, but severely restricts learning performance on tasks that require long time horizons. Truncation is even suspected to render RNNs trained with BPTT to what is effectively a feedforward network [46].

B. Cost Analysis of RTRL
The computational cost of RTRL is determined by evaluating G t as a product of Jacobians of the shape R k×k and R k×p which requires k 2 p scalar multiplications. Since p is the number of parameters, the cost is sizable. RTRL requires O(kp) memory to store G t , a R k×p matrix. For instance, in a fully connected network, we have k (n + k) parameters for the feedforward and recurrent connections. Assuming that n ∼ k, the overall memory complexity scales cubically with the number of neurons (O(k 3 )) whereas time scales as O(k 4 ) per update step.
Nevertheless, one decisive advantage of RTRL is that, although G t is a potentially large matrix, it can be discarded after each update. In other words, RTRL's complexity is independent of T , which makes it an interesting contender for training on streaming data on neuromorphic substrates. Despite the benefit of online evaluation, RTRL's high computational burden O(k 4 ), compared to O(k 2 T ) for BPTT, makes it prohibitive for any practical RNN implementation.

C. Reducing the cost of online learning through sparseness
Fortunately, there are multiple ways to reduce the computational load of RTRL while at the same time retaining most of its efficiency. One way of lowering RTRL's high computational cost is to approximate the computation of the influence matrix G t , for instance, by decomposing it in a product of lower-order tensors [47,48] or Kronecker factors [49] (see [44] for a comprehensive review). Another way of reducing RTRL's computational cost is to consider sparse approximations of the influence matrix G t [50]. This situation arises naturally when the network connectivity is either sparse [51][52][53][54][55] or approximately sparse. By approximately, we mean that there are a few strong connections that dominate the temporal dynamics and thus the gradients.
Sparsity has additional benefits for neuromorphic substrates because it helps to alleviate two of their inherent limitations. First, it caters to limited on-device storage by requiring less memory for model parameters. Second, sparseness can help to overcome communication bottlenecks that result from the need of communicating non-local, but learning-relevant information to where it is needed. In both BPTT and RTRL, the update of a single network parameter indirectly depends on all other network parameters, thus rendering learning non-local.
Due to the relevance of sparseness and local learning rules to neuromorphic substrates, we dedicate the remainder of this article to efficient biologically inspired approximations of RTRL and describe how they empower neuromorphic devices to learn from streaming data. As we will see, the dynamics of biological neuron models naturally admit a formulation with sparse block Jacobians, which are both sparse and obey locality principles by tying gradient propagation to diagonal blocks that implement the underlying neural and synaptic dynamics. But before we can fully appreciate the algorithmic importance of sparse Jacobians, we will briefly review the notion of auto-differentiation, which most modern gradient-based learning algorithms rely on.

D. Auto-differentiation strategies in practice
Automatic Differentiation (AD) is a type of differentiable programming allowing to compute the gradients across entire computational pipelines [56][57][58]. AD has been used extensively in scientific computing before it was applied to neural networks [59], but it has been popularized with recent libraries such as Theano [60], Tensorflow [61], and PyTorch [62]. The key principle underlying AD is that numerical computations are compositions of a finite set of elementary operations for which derivatives can be defined. By combining the derivatives of the operations through the chain rule of derivatives, the gradient of the overall composition is systematically computed.
Just as in Equations (2) and (3), gradients in AD can be accumulated according to different modes. The two main modes are forward and reverse. For RNNs, forward-mode accumulation is equivalent to RTRL, and backward-mode accumulation corresponds to BPTT. These two modes are two extremes for gradient accumulation. In principle, it is possible to mix forward and backward within the same algorithm [63,64]. But for reasons relating to its superior time and space complexity, most machine learning frameworks use pure reverse mode accumulation. Later in this article, we discuss an example of mixed AD modes that is computationally advantageous for neuromorphic hardware.

III. APPLYING GRADIENT-BASED LEARNING TO BIOLOGICALLY-INSPIRED NEURAL NETWORKS
Biological neural networks are RNNs, but there are two defining characteristics that set them apart from the generic RNN models discussed in the previous section. First, biological neurons possess internal dynamics on different timescales due to a plethora of bio-chemical processes that interact with the membrane dynamics [30,65]. Because these dynamics play an important role for the approximations of the Jacobians H t and F t , which are the basis for the efficient online learning algorithms, we will discuss them in detail in Section III-B. Second, most biological neurons communicate through action potentials, or spikes. Since spikes are binary neuronal outputs, they render SNNs non-differentiable, which poses a problem for standard gradient-based optimization algorithms. Hence, special training methods are required, which we briefly review in the following.
While translation approaches require the training of a non-spiking proxy network whose connectivity is later translated into a spiking network, the other methods operate directly on SNNs. Variational learning approaches attempt to change the distribution of the network output toward a given target distribution by minimizing an upper bound on the Kullback-Leibler divergence between the two. To that end, these models employ stochastic neuron models, typically formulated within the scope of the spike response model (SRM) with escape noise or a stochastic firing threshold [79]. Variational methods can learn useful representations in hidden neurons and, importantly, the resulting learning rules often have natural interpretations as local learning rules with a global modulatory factor [13,23,80]. However, both low dimensional feedback [81] and stochasticity [13] are known to result in noisy gradient estimates, which can lead to slow convergence and can render learning practically impossible.
Surrogate gradient learning avoids such problems by using neuron specific feedback signals as in standard BP and dispensing with stochasticity in the forward pass. Nevertheless, gradients are computed as if noise was present to smooth out the non differentiable binary nonlinearities of spiking neurons. While a rigorous theoretical formulation of this interpretation still needs to be worked out, it could provide a compelling explanation of why surrogate gradient learning is robust to the choice of nonlinearity [78]. Within such a framework, different functional shapes of surrogate derivatives may simply correspond to different choices of neuronal noise distributions.
Irrespective of the underlying explanation, a host of recent studies have established the effectiveness of surrogate gradient learning at scale on various deep SNN architectures and diverse tasks and datasets [15,16,20,21,27,77,78]. To that end, surrogate derivatives have been used in combination with variants of both RTRL or BPTT.

B. Spiking Neuron Models have Implicit Recurrence
To capture the internal dynamics of real neurons, spiking neuron models possess implicit recurrence [20]. To understand this concept, we consider one of the simplest and most widely used spiking neuron models, the LIF neuron [79]. LIF neurons capture key electrophysiological properties of biological neurons, while being abstract, analytically tractable, and easy to simulate. The state of a single-compartment LIF neuron i is described by its membrane potential U i which obeys the following temporal dynamics where we have introduced the resting potential U rest , the membrane time constant τ m , and the input resistance R [79]. The membrane potential U i acts as a leaky integrator of the input spike trains S j,in (t) = k∈ζj δ(t k j − t) where delta is the Dirac delta and the sum runs over all firing times ζ j of neuron j. Output spike trains S k (t) are defined similarly. Matrices W and V here define feed-forward and recurrent weight matrices. On a digital computer, Eq. (4) is commonly integrated using an Euler numerical integration 1 on a discrete time grid as follows: where we further introduced the decay factor β := exp − ∆ τm with timestep ∆ and the discrete spike trains S t j,in , S t j which are equal to one if the respective neurons j spiked in timestep t and zero otherwise. Equation (5) makes the formal connection to our RNN To distinguish input and output dimensions, the recursion is shown here for the parameters of W only. For parameters V , this illustration would look similar, but with n = k. (Top) Point Neuron, single compartment neuron. For both cases, it was assumed that G t−1 is a step-wise diagonal matrix. This shows the "mixing" effect of H E,t . By virtue of block matrix algebra, the implicit recurrence preserves the (block) diagonal structure. Hence, H I,t is referred to as implicit recurrence. On the other hand, the off-diagonal terms of H E,t destroy the (block) diagonal property. Here we illustrate H E,t as having no diagonal terms, meaning there are no synaptic self connections.
example above more obvious. Similar to a long shortterm memory (LSTM) cell, the spiking neuron's leaky membrane potential maintains an internal state through implicit recurrence via the decay constant 0 < β < 1. The LIF (Equation (5)) model is related to the RNN and its learning dynamics in Equation (2), by replacing h t with the membrane potential state U t .
To understand the implications of this property for gradient computation and to use it for efficient algorithmic learning implementations, we distinguish between two types of recurrence, namely, explicit and implicit. In keeping with our convention (cf. Eq. (2)), we define the following notation: where H I,t and H E,t denote the implicit and explicit recurrent dynamics, respectively. This decomposition is motivated by the element-wise nature of the computations inside the network elements. Implicit recurrence captures the sensitivity to perturbations of the inter-nal neuronal dynamics, such as membrane dynamics. Explicit recurrence is due to inter-neuronal synaptic connections, like in any vanilla RNN model. With these definitions, the forward-mode recursion takes the familiar RTRL form that exhibits the contributions of the two types of recurrences: with initial states G 0 = 0. A consequence of element-wise operations within the neuron is that H I,t and ∂S t ∂U t are block sparse Jacobian matrices with a diagonal structure (Fig. 2).
The model descriptions above represent SNNs in the same form as artificial RNNs. However, compared to RNNs, the timestep used for SNN integration is much finer in practice to account for the internal neuronal dynamics. Further, a small time step provides a more accurate integration with respect to the continuous-time dynamics and thus the modeled neuromorphic substrate. Since this timestep is determined by the shortest time scale in the dynamical system, and thus independent of the input data, many SNNs are trained over hundreds of steps. This results in as many network instances in memory as there are time steps, which becomes cumbersome for networks of more than a few thousand neurons. As a result, in recent works, the size of SNNs trainable by BPTT has remained severely limited by the available GPU memory [15]. Hence, pure reversemode AD, i.e BPTT is not suitable to train large SNNs. The invariance of RTRL's complexity to the number of time steps T offers a potential solution to this problem, provided the unfavorable O(k 3 ) space and O(k 4 ) time scaling can be mitigated. But how can one achieve a reduction in complexity?
Let us, for a moment, ignore the explicit recurrences when computing G t . In this case, if H I and G t have the same block structure 2 , the resulting product H I,t G t−1 is also block-diagonal. Thus, only O(k 2 ) operations are necessary to compute the recursion of the derivative, and O(k 2 ) memory is required to store the non-zero values of G t . Therefore, maintaining the sparseness of the G t recursion has the potential of reducing the complexity of RTRL. Here, we focus on two solutions to maintain G t sparse.
The first solution is to work with sparse H E matrices, meaning in the case of a single layer RNN a sparse matrix V with few non-zero entries. Sparse connectivity patterns are pervasive in the brain as biological neural networks tend to be locally dense but globally sparse [84]. While a sparse V does not indefinitely maintain the G t sparse, it does so for a certain number of steps. This opens up possibilities for sparse n-step approximations, allowing to save computation by a factor of the sparsity squared [50]. Moreover, sparse connectivity caters to the fact that neuromorphic hardware often relies on sparse connectivity for better memory-efficiency [85], which is also mirrored in the hierarchical communication fabric [5,86,87]. Thus, from an implementation standpoint, sparse connectivity matrices are preferable on neuromorphic hardware. From a performance standpoint, the suitability of sparsely connected networks varies from case to case, and is an intensely studied topic, both in terms of implementation [88] and algorithms. Randomly pruning network weights typically impairs overall network performance unless special care is taken, such as intelligent sparse initialization schemes [52,53,55,89] or dynamic rewiring during the training [51,54]. Moving forward, the "lottery ticket hypothesis," which posits the existence of sparse, trainable feed-forward networks without loss in accuracy [52], is likely to spur further research on sparse RNNs.
The second solution is to approximate the gradient computation by assuming that certain connections contribute more to the gradient than others. For SNNs, a simple way of doing so is to ignore all contributions of H E to the gradient which amounts to assuming that most relevant temporal information is carried forward in time through implicit recurrence, i.e., H I . All the while, the recurrent connections remain in place in the network and contribute to the dynamics. But how well do such approximations work?
C. Implicit recurrence induces approximate, local, and efficient learning rules Although ignoring H E seems like a drastic simplification, several studies have used it to construct biologically plausible online learning rules as local approximations of RTRL. Empirically, these rules perform well on a number of complex problems either without recurrent connections, like in the case of SuperSpike [17], or by ignoring gradient flow through the recurrent synaptic connectivity as done in e-Prop [27], RFLO [92], and DECOLLE [19]. These findings suggest that explicit recurrence is either not necessary for many problems, or that ignoring explicit recurrence in gradient computations does not create a major impediment for successful learning, even when such recurrent connections are present.
To disambiguate between these different possibilities, we extended previous work [78] by running additional simulations in which we either ignored or included the contribution of H E during SNN training. We then systematically compared the resulting network performance of the two approaches and to networks without any explicit recurrent connections (Fig. 3). Since these results may be dataset dependent, we repeated this analysis for a range of different classification datasets that required different levels of temporal memory. The tasks can be coarsely divided according to the duration of their inputs. For the Randman and MNIST dataset all input spikes arrive within a short temporal window (< 50ms), whereas for the speech processing problems, we considered individual inputs with a duration of ∼ 1s.  [78]. Randman: A synthetic spike-timing dependent task based on smooth random manifolds. MNIST: A time-to-first spike latency encoded version of the MNIST dataset. SHD: The Spiking Heidelberg Digits dataset [90]. RawHD: The Heidelberg Digits dataset, but using analog currentbased input instead of spikes. RawSC: Same current-based input, but using the Speech Commands dataset [91]. We distinguish between the following types of synaptic connectivity and training approaches. FF (gray): feed-forward SNNs, RC (red): explicitly recurrent SNNs trained with surrogate gradients, and RD (blue): "recurrent detachted" networks that are architecturally identical to RC, but in which we ignored explicit recurrence for gradient computation (cf. H E ). Finally, we used two different readout configurations to compute the logits for the softmax cross-entropy loss. We either summed up the readout unit activation over all timesteps (Sum) or computed the Max over all timesteps. Due to the high computational cost of full RTRL and the absence of efficient training libraries, we performed all simulations using BPTT and determined optimal hyperparameters with a grid search comprising more than 8000 simulations. For each configuration we selected the ten best models using held-out validation data and computed the mean error and SEM on separate test data.
Not surprisingly, the addition of recurrent synaptic connections (RC) to a given SNN results in a reduction of error over strictly feed-forward (FF) synaptic connectivity, in most cases (Fig. 3). Note, that we did not correct for the larger parameter count of the recurrently connected models. This observed difference was generally larger for tasks with increased complexity and temporally longer stimuli, consistent with the idea that longer stimuli require longer memory time scales. However, when the same SNNs were trained ignoring gradient contributions through explicit recurrence through H E , error rates increased mildly on most datasets. Importantly, however, the errors remained lower than for the networks without recurrent synaptic connections. The effect was largest on tasks that required more temporal memory.
Thus, in the scenarios we tested, the cost of ignoring the contribution of H E to gradients is small, all the while the algorithmic benefits are substantial. As illustrated above, ignoring H E yields local online learning rules (cf. (6)), which are better suited for hardware implementations. Several online learning approaches therefore make use of this, or similar approximations [19,27,93]. The LIF neuron model used for the simulations in Fig. 3 had a two dimensional state. One variable was used to model the membrane potential, whereas the other described the time course of exponentially decaying synaptic currents. However, the scaling properties of the online learning algorithm are not affected by extending it to more complex multi-compartment neuron models, for instance, by incorporating additional slow dynamical variables [27,75,77,94].

D. Implicit Recurrence of Multi-compartment Neurons can Increase Computational Power
We now illustrate that RTRL applied to multicompartment neurons results in Jacobians H I,t that are block diagonal and, hence, efficient to train using approximate online algorithms. This can be formalized by extending the domain of U t to R mk , where m is the total number of compartments per neuron. Note, that in this formalism synaptic dynamical variables also count as compartments. For instance, the ones typically used to implement exponentially decaying conductance or current variables as for the experiments shown in (Fig. 3) (cf. Appendix A). In a typical multi-compartment model, S t ∈ R k is a function of only one compartment per neuron, which can be written: where P is a binary k × mk matrix that selects the spiking compartment. Without loss of generality, we can assume the first compartment is the spiking one, resulting in the following matrix P : Consequently, the dynamics are given by: Since only spiking compartments are assumed to be connected with other compartments, the weights V are matrices defined in R mk×k . Furthermore, we have that G t ∈ R mk×p . Following the approximate gradient computation where H E is ignored, G t becomes block diagonal (Fig. 2), resulting in pm non-zero entries. Thus, adding neuronal complexity by widening the neuronal state space does not change the time complexity of the learning algorithms and does not preclude the use of efficient online learning algorithms. However, such changes can have dramatic effects on the computational expressivity of the resulting network models and are reflected in the corresponding approximate learning rules [27]. These insights may partially explain why neurobiology uses a diversity of different neuron types with distinct internal dynamics and thus opens up new vistas for exciting future research.

E. Different Modes of Auto-differentiation in Biologically Inspired Learning
To train biologically inspired SNNs and to study the dynamical properties of biological networks discussed above in functional networks, many researchers rely on auto-differentiation frameworks. But because of the shortcomings of BP in the context of online learning for neuromorphic applications, increasing attention is given to the approximate online learning methods discussed in the previous sections. In the following, we give concrete examples of recent work on SNN training. However, it is important to bear in mind that the same algorithms also apply to non-spiking RNNs when the spiking nonlinearity is replaced with a smooth differentiable function.
Specifically, we will show one case of reverse mode AD (BPTT), one case of forward mode AD (RTRL), and one example of mixed-mode AD (Fig. 4). For all three examples, we consider a network consisting of LIF neurons which evolve according to the dynamics in Equation (5). To simplify the mathematical expressions and without loss of generality, we consider U rest = 0 and R = 1. We write the LIF dynamics in matrix form as follows: Here, we assume n input and k output neurons, and S t in to be the input spike train.
a) Training with BPTT (Fig. 4a): We first analyze the case of training this network with BPTT. The hidden state is U t and the dynamics and immediate Jacobian takes the form: where we have assumed the smooth surrogate function σ. Using the BPTT recursive expression for C t (which includes dynamics H t , Equation (3)), the gradients for W and V become: Thanks to the native support of reverse mode AD in machine learning frameworks, this spiking neuron model can be implemented and differentiated like RNNs. This approach has been applied successfully to train general SNNs models [15,18,76,77,90,95,96]. BPTT has the advantages that it does not restrict the dynamics, connectivity patterns, and loss function. However, these advantages come at the cost of a large memory footprint and temporal non-locality. Thus, for applications using small networks of some thousands of neurons that do not require online learning, BPTT is the method of choice. For larger networks, the memory overhead becomes impractical, and other modes of gradient computation described below may be necessary. b) Training with Sparse RTRL (Fig. 4b): As discussed above, it is possible to reduce the complexity of RTRL by keeping G t sparse. We demonstrate this in the case of the LIF neuron. Using the RTRL recursion (Equation (2)), the gradient of W obeys: Explicit rec.
If we neglect explicit recurrence in the above expression, all involved matrices remain block sparse as shown in (Fig. 2). The recursion can be compactly written in vector form by defining the following k-vector trace: where we used a different variable Q in to emphasize that it represents a vector, rather than a tensor. Since the Q in ∈ R n terms are dense n-vectors, for gradients with respect to W , the RTRL recursion is simplified from O(np) memory to O(n) memory. Gradients are then computed in the familiar three-factor form: and similarly for V gradients: where Q t ∈ R k is defined in analogy to Q t in , but replacing S t in with S t . Note the above factorization to the dense k-vector form is only valid for linear LIF neurons with instantaneous approximations of the loss function [19]. However, in the case of nonlinear neuronal dynamics, such as spike-triggered adaptation or for certain loss functions (see Section IV), O(k 2 ) traces may be necessary, leading to quadratic scaling with the number of neurons [17,27]. c) Mixed-mode: It is possible to combine elements of both BPTT and RTRL for training RNNs and SNNs. We call this situation mixed-mode AD (Fig. 4c). The portion of the graph from S t in to L t in (Fig. 4) may involve several steps. Such cases can occur for example in convolutional networks with pooling layers, linear readout layers and multiple recurrent layers [97]. In [19], for example, loss functions were defined based on random combinations of max-pooled outputs of a convolutional layer consisting of spiking neurons. If these steps are instantaneous, i.e., they do not explicitly depend on past states, the immediate Jacobians F t W and F t V can be computed online using BP within a single timestep. All other gradients can be accumulated over time by virtue of the RTRL recursion. Up to rounding errors, the numerical result will be exactly the same as sparse RTRL discussed earlier, but the memory footprint of the implementation differs and may be more favorable. The recent Deep Continuous Local Learning (DECOLLE) [19] is one example of a mixed-mode AD implementation using spatially and temporally local loss functions. It combines forward mode AD, to achieve temporal credit assignment, with reverse mode AD for spatial credit assignment. This allows the convenient use of existing autodifferentiation tools, while combining it with the more favorable scaling properties of BP in space. Bohnstingl et al. [97] lay the ground to extend this idea to multiple recurrent layers, and find that estimating gradients for layer l using solely the recurrence relation of that layer, and ignoring others results in good learning performance and low complexity.

IV. LOSS FUNCTIONS FOR ONLINE LEARNING
So far, we have focused on learning algorithms that permit efficiently computing loss gradients in an online manner to learn from temporal data. The applicability of these algorithms, however, hinges on the ability to also compute the loss functions in an online manner, which is not always possible. In the following, we discuss the desiderata of loss functions for online real-time learning and point out directions where future research is required.
Since loss functions are defined at the output of a network, let us briefly review what input-output paradigms exist for RNNs. We can coarsely divide neural network processing by specifying whether they require an input at every timestep or only once. Similarly, we can distinguish between networks that yield one output at the end of processing as opposed to networks that produce an output in every timestep. Networks that receive a single input at the beginning produce a single output at the end, are very similar to feed-forward neural networks in that they provide a one-to-one mapping (Fig. 5a). Similarly, we can construct networks that output a trajectory in response to a single command input, which corresponds to a one-to-many mapping. For streaming data, networks must be able to consume a temporal sequence of inputs. Thus, these networks have many inputs and can be further separated into many-to-many or many-to-one mappings.
For our above derivations pertaining to RTRL, we assumed a gradient of a loss function L defined as a sum over time L = t L t of temporally localized loss functions L t . We assumed further that these localized loss functions can be computed immediately after the corresponding network timestep is computed, i.e. online. It is easy to see how such local functions can be defined on the output of a many-to-many network (Fig. 5b). Similarly, this also covers the many-to-one case since we can simply set all losses for t < T to zero (L t = 0) and retain only one late value L T .
In the context of SNNs, several studies have focused on the many-to-many setting in which the network has to learn a predefined output trajectory. A classic example is FORCE learning which applies the recursive least squares algorithm which can be done forward in time and online [98,99]. A related approach is FOLLOW learning [100] which is another online learning approach for SNNs that relies on local learning rules to learn a forward model of arbitrary dynamical systems.
However, not all learning tasks require learning an output trajectory and hence temporal locality of the loss function is not the rule, but rather the exception. To solve classification problems, for example, one often considers a non-local loss that takes into account extended periods of network activity (Fig. 5b). For instance, in seminal work Gütig et al. [9,101] trained a spiking neuron, the Tempotron, as a binary classifier on input spike trains with a sparse temporal code. Crucially, the Tempotron can freely decide when to spike within a given temporal window. This has the advantage that targets or labels do not have to be given to a learning system with temporal acuity. To achieve this, the authors introduced a loss that depends on the maximum of the neural membrane potential during an entire trial. Hence, the loss function can only be evaluated once the trial is completed.
A similar approach was later extended to train multilayer SNNs by either computing the maximum over time or the sum over time of dedicated readout units [78,90] or by simply summing over the number of output spikes [15,18,102]. Although this approach can learn powerful classification models, non-local loss functions are not suitable for online learning since their gradient can only be evaluated after a trial is completed. Although similar to the late loss case, they are different in that their value depends explicitly all timesteps (Fig. 5b).
This temporal non-locality has far-reaching consequences for all classification and pattern detection tasks, because it means that all network activity has to be stored until the loss function can be evaluated. Thus gradient computation is locked until the last timestep of a given trial is evaluated. A similar situation arises when targets or labels are delayed because this also affects the corresponding loss function evaluations (Fig. 5b). While locking is the norm in the context of BPTT, online algorithms such as RTRL may lose their realtime character due to it. Therefore loss functions that can be evaluated online are crucial to successful online learning. Although research on this issue is still limited, one possible remedy to avoiding delays and locking is to fold the non-locality back into the network and to rely on a loss function that can be written as a sum of temporally localized losses.
One example builds on the van Rossum distance which was developed as a distance metric for spike trains [103]. As a spike train distance metric, it was recognized early on as a suitable loss function for training SNN to produce precisely timed output spikes [104], but as we will see in a moment, it is not limited to precisely timed output spike trains. For a given spike train S(t) = k δ(t k − t) and an associated target spike train S * (t), the van Rossum distance is given defined as For streaming data purposes, the many-to-one and many-to-many mapping in the gray shaded box are most relevant. b Illustration of different computational graphs used for computing loss functions. Online learning algorithms such as RTRL require temporally localized loss functions that can be evaluated at every timestep, which trivially generalizes to the regime in which many local losses are zero, e.g., L t for t < T . However, non-local loss functions such as the sum or the maximum over network outputs, as well as, delayed arrival of labels that cause the loss evaluation to be delayed pose challenges for online learning.
where * denotes the temporal convolution of the spike trains with the kernel function . The trick is now to define a filtered spike train Y = * S(t) as the network output and similarly a target network Y * = * S * (t). The kernel can be arbitrary, but choosing a causal kernel is critical for all online implementations. In practice, we choose a kernel that can be easily implemented as a simple dynamical system [17,104] that allows the online evaluation of the term in parenthesis (cf. Eq. (13)). Canonical choices are exponential or double exponential functions that are straightforward to implement with one or two additional ordinary differential equations for each output, e.g, τ ∂z ∂t = −z + S(t) with the kernel time scale τ . These manipulations allow us to write L = Finally, by further reducing the integral to a sum over discrete timesteps, this expression can be expressed as L = 1 As we already alluded to earlier, the van Rossum distance does not necessarily learn precisely time spikes. Rather, the extent to which the van Rossum distance punishes temporal misalignment can be smoothly adjusted by the time horizon of the kernel. For instance, if one chooses a small timescale τ for the exponential kernel function , this allows for learning precisely timed output spike trains. However, a large choice of τ will result in a reduction of temporal spike alignment of individual spikes with their targets. The distance then smoothly interpolates between spike-timing code and rate codes. Another interesting property of the van Rossum distance, in particular when used with causal kernels, is that causal kernels induce an effective delay, which acts as an eligibility trace [17]. Eligibility traces are found in neurobiology [105] and importantly they solve the distal reward problem by bridging the delay between a network output and the error feedback that arrives later in time [106].
Thus, at the expense of temporal precision, the van Rossum distance can introduce temporal memory and accommodate label delay during learning [17] while at the same time allowing to compute temporally localized losses. Finally, computing simple kernels such as exponential or double exponential for each network output corresponds to adding additional filtering layers to the network. This layer implies another set of diagonal Jacobians for which we have already seen how their gradients can be computed using RTRL.
While the van Rossum distance solves some challenges in the streaming data setting in which labels can only be assigned coarsely in time, a number of important issues remain open. Here future work, possibly building on aggregate label losses [101,107] or connectionist temporal classification (CTC) losses [43], provided these approaches can be made online-capable, may offer possible solutions.
In summary, online gradient algorithms based on RTRL require temporally localized losses. While such losses are the standard for learning output trajectories, there are situations in which temporally non-local loss functions are more natural choices (e.g., classification problems). Fortunately, a conversion to online-enabled losses is possible for some non-local loss functions, as we illustrated in the example of the van Rossum distance. But future research efforts are required to firmly establish online-capable loss functions for situations in which labels are only loosely aligned with streaming input.

V. IMPLEMENTATION STRATEGIES IN NEUROMORPHIC HARDWARE
How can the gradient-based learning strategies discussed in this article guide the development of a neuromorphic learning substrate? Memory is generally a limiting factor in neuromorphic hardware. In SNNs, the need for online learning in combination with such capacity limits makes forward mode accumulation particularly compelling. Sparse RTRL equates to one trace per parameter, which can be realized in hardware by replicating the circuits and storage for weight updates. For instance, for each connection in Intel Loihi Research Chip, up to two states that evolve as functions of the post-and pre-synaptic states are possible [5]. These states and their dynamics are implemented similarly to the weight update dynamics. Intel Loihi is thus in principle compatible with a sparse forward-mode AD learning scheme. However, several significant open challenges remain for its implementation. First, unlike weight (parameter) updates which can be carried out in an eventbased fashion, traces are dynamical states per connection that must be updated at every time step. Because scale in neuromorphic hardware and software simulations is often achieved by storing the synaptic traces at the neuronal level [108], computing traces per connection would significantly increase the computational burden. Second, the memory overhead becomes significant for large networks with shared weights, for instance in convolutional architectures, because weight sharing does not imply trace sharing. Although the idea of weight sharing is counter intuitive to a neuromorphic implementation, digital neuromorphic hardware can take advantage of it [4,5].
One solution to the above problems is to use the sparse approximation described above (cf. Equation (11)), previously implemented as part of DECOLLE [19] and subsequently for "LIF neurons" in [27], resulting to one trace per axon. In this approximation, synaptic traces use the same state as the synaptic currents for learning. This implies only constant (O(1)) memory overhead for learning, which significantly simplifies the neuromorphic hardware implementation. [109] describe a DECOLLE crossbar implementation that leverages the sharing of the learning and inference signals, while eliciting updates in a temporally sparse, error-driven fashion. Furthermore, the learning dynamics are potentially immune to mismatch in the synaptic dynamics, since the same signal is used for computing the forward pass and gradient dynamics. The gradient-based learning of SNNs induces a three factor rule (Equation (12)), comprising one term to compute the loss gradient ( ∂L ∂S t ) and two terms for the network states ( ∂S t ∂θ ). [109] exploits this factorization in a neuromorphic design comprising two types of cores: processing cores and neuromorphic cores. Processing cores are general-purpose processors that compute the loss gradients, and neuromorphic cores compute the network states and their gradients. A similar strategy was demonstrated on the Intel Loihi using the on-chip three Lakemont x86 cores [102]. This separation imparts significant flexibility to the hardware, as loss functions are often task dependent but network architectures tend to be generic.
Despite the advantages of online algorithms, reverse mode AD remains an important reference and a tool for training SNNs off-line and off-chip. One strategy for digital neuromorphic chips is to use a functional simulator of the dynamics [110], and train it using conventional deep learning techniques (GPUs and BPTT). Functional simulators and BPTT were also used to pretrain networks for subsequent learning on-chip [102]. Due to device-to-device variation, functional simulators of mixed signal hardware require calibration, for example by system identification [111,112]. This calibration scheme is costly and often imprecise, because the limited access to the chips' internal states dictates several approximations.
Hardware-in-the-loop approaches can partially overcome this limitation by using the hardware substrate to compute the forward pass and computing synaptic updates in software. Recent work demonstrated successful learning on an accelerated neuromorphic VLSI substrate using a chip-in-the-loop approach [113], as well as in phase-change memory [21]. While this approach can self-correct for any remaining device mismatch, it requires dedicated on-chip circuitry, e.g., sampling ADCs, to read out the internal voltages. Additionally, the external computation of updates poses a significant performance bottleneck which renders this strategy often too slow for real-time or accelerated learning. One solution that has not been fully explored is to pre-train networks on an approximate functional simulator of a mixed signal chip, and fine-tune on chip, as in [102]. Regardless of which training strategy is employed, the methods based on reverse mode AD are generally limited by memory. Hence, mixed mode AD and other advanced AD methods will presumably play a central role in reducing the memory footprint and thereby improving the performance and applicability of SNNs training to dedicated neuromorphic substrates.

VI. DISCUSSION
This article has reviewed common bottlenecks encountered when applying gradient-based learning to neuromorphic architectures that require online computation of gradients. Using the example of SNNs, we have shown that many recently proposed learning algorithms for online learning are approximate variants of RTRL. In its exact form, RTRL is an online algorithm to compute gradients in RNNs, but it is computationally expensive. However, when used in combination with temporally local losses and biologically inspired neural architectures such as LIF neurons, it is possible to find effective approximations that reduce RTRL's computational cost substantially while retaining good learning performance. We have shown that such approximations are inspired by biological neurons whose implicit recurrence structure induces block sparse or approximately block sparse Jacobians, allowing to speed up gradient computation while simultaneously reducing the memory footprint. These conceptual links expose a clear path forward toward building more efficient online learning algorithms for neuromorphic devices.
We further elaborated on the relationship between gradient-based learning in SNNs and the different modes of Automatic Differentiation. Due to sufficient memory and the lower computational cost on von Neumann computers, deep learning has primarily focused on reverse mode accumulation or BPTT. However, when combining appropriate architectures with approximate RTRL, we expect a renaissance of forward-mode AD to empower non-von Neumann computers and streaming applications requiring online learning. Mixed-mode AD is a particularly exciting direction for implementing learning in biological neural networks as it efficiently reduces complexity by exploiting the Jacobians' sparseness. This idea resonates with a widespread trend in deep learning accelerators: To trade-off compute against memory, as evidenced by advanced AD techniques on manycore processors the Cerebras CS-1 and the Graphcore IPU.

A. Spatial and Temporal Scales in Gradient-Based Learning
Computation is a physical process that extends across multiple spatial and temporal scales. Practical learning algorithms have to take this multi-scale behavior into account. The finite truncation length in BPTT defines the time span (memory) over which the learning algorithm can efficiently navigate between timescales, even when the network itself supports such dynamics (e.g., through working memory or gating mechanisms in LSTMs [114]). Thus, prematurely truncated gradients can be harmful. Memory and stability go hand-in-hand in dynamical systems, including RNNs. Miller and Hardt [46] argue that RNNs are trained to operate in the stable regime in which gradients vanish for stable learning. However, stability can often be at odds with longterm memory [115]. Therefore, many working memory models in neuroscience rely on multistability and attractor states to implement long-term memory [116][117][118], with regions of state space in which gradients either vanish [119] or become very large [120]. LSTM networks explicitly overcome this shortcoming by including long timescales in their architecture thereby avoiding vanishing gradients [115]. This twist, however, requires training procedures that can similarly bridge such long time horizons, which is where truncated BPTT can easily reach its limits. An interesting question is whether training with approximate forms of RTRL can offer advantages in training networks with such longer short-term memory. Thus, if remaining issues related to the bias of approximate variants of RTRL and stability can be addressed at scale, these findings will open new vistas in computing with non-von Neumann computers.

B. Solutions to the Weight Transport Problem
Another issue that plagues neuromorphic implementations of BP is that reverse accumulation requires access to the transposed weight matrices W and V . This requirement has also been referred to as the weight transport problem [121]. The weight transport problem implies a bidirectional flow of information, which is biologically implausible, and, crucially, difficult to realize in any physical system [122,123]. The problem is that to compute the gradient, error information needs to flow backward through the same connections that are used in the forward pass. This creates a major impediment for any physical system, be it biological or neuromorphic, in which information flow is directed. Overcoming this limitation often requires an explicit learning channel that implements these backward weights or weight transposes [122,123] and ideally keeps them synchronized across nodes or different physical locations of a physical network implementation. A number of studies have shown that such synchronization can be achieved, for instance, by learning the backward weights [124][125][126].
In any case, the mere existence of such backward connections for spatial credit assignment still has a real physical price as they require space on the chip and consume energy, thus raising the question of whether one could dispense with them entirely. Interestingly, RTRL and its approximate variants do suggest ways forward toward viable alternatives. Since RTRL preserves causality with respect to the forward dynamics, the temporal weight transport problem does not apply. Unfortunately, exact RTRL still requires synapses to know the state of all other neurons and synapses in the network, thus still posing a weight transport problem. The block sparse Jacobian approximations we discussed in this article are not straight forward to extend to learning across multiple layers. Nevertheless, the inherently causal nature of RTRL and the recent success of local loss functions for training SNNs [19,97,127,128] provide exciting avenues of future research to spatially assign credit and circumvent the weight transport problem in multilayer networks.
While a plethora of neuromorphic platforms, which mimic the brain's computational substrate, have matured over the years, seeing these solutions thrive in realworld applications will require them to learn. To tackle this challenge, we should turn once more to biology. Taking inspiration from the brain will allow us to develop neuromorphic learning algorithms toward tomorrow's neuromorphic computers.