Natural-gradient learning for spiking neurons

In many normative theories of synaptic plasticity, weight updates implicitly depend on the chosen parametrization of the weights. This problem relates, for example, to neuronal morphology: synapses which are functionally equivalent in terms of their impact on somatic firing can differ substantially in spine size due to their different positions along the dendritic tree. Classical theories based on Euclidean-gradient descent can easily lead to inconsistencies due to such parametrization dependence. The issues are solved in the framework of Riemannian geometry, in which we propose that plasticity instead follows natural-gradient descent. Under this hypothesis, we derive a synaptic learning rule for spiking neurons that couples functional efficiency with the explanation of several well-documented biological phenomena such as dendritic democracy, multiplicative scaling, and heterosynaptic plasticity. We therefore suggest that in its search for functional synaptic plasticity, evolution might have come up with its own version of natural-gradient descent.


Introduction
Understanding the fundamental computational principles underlying synaptic plasticity represents a long-standing goal in neuroscience. To this end, a multitude of top-down computational paradigms have been developed, which derive plasticity rules as gradient descent on a particular objective function of the studied neural network (Rosenblatt, 1958; Rumelhart et al., 1986; Pfister et al., 2006; D'Souza et al., 2010; Friedrich et al., 2011).
However, the exact physical quantity to which these synaptic weights correspond often remains unspecified. What is frequently simply referred to as w_ij (the synaptic weight from neuron j to neuron i) might relate to different components of synaptic interaction, such as calcium concentration in the presynaptic axon terminal, neurotransmitter concentration in the synaptic cleft, receptor activation in the postsynaptic dendrite or the postsynaptic potential (PSP) amplitude in the spine, the dendritic shaft or at the soma of the postsynaptic cell. All of these biological processes can be linked by transformation rules, but depending on which of them represents the variable with respect to which performance is optimized, the network behavior during training can be markedly different.
As an example, we consider the parametrization of the synaptic strength either as PSP amplitude in the soma, w^s, or as PSP amplitude in the dendrite, w^d (see also Fig. 1 and Section 2.1). Reparametrizing the synaptic strength in this way implies an attenuation factor for each single synapse, with different factors assigned to different positions on the dendritic tree. As a consequence, the weight vector will follow a different trajectory during learning depending on whether the somatic or dendritic parametrization of the PSP amplitude was chosen.
It could certainly be the case that evolution has favored one particular parametrization over all others during its gradual tuning of synaptic plasticity, but this would necessarily imply sub-optimal convergence for all but a narrow set of neuron morphologies and connectome configurations. An invariant learning rule, on the other hand, would not only be mathematically unambiguous and therefore more elegant, but could also improve learning, thus increasing fitness.

Figure 1: [...] (also shown in light blue). The resulting weight change leads to an increase Δw^s in the somatic EPSP after learning. The dark blue arrows track the calculation of the same gradient, but with respect to the dendritic EPSP (also shown in dark blue): 1) taking the attenuation into account in order to compute the error as a function of w^d, 2) calculating the gradient, followed by 3) deriving the associated change in Δw^s, again considering attenuation. Because the attenuation f(w) enters the calculation twice, the synaptic weight updates, as well as the associated evolution of a neuron's output statistics over time, differ under the two parametrizations.
In some respects, the question of invariant behavior is related to the principle of relativity in physics, which requires the laws of physics (in our case, the improvement of performance during learning) to be the same in all frames of reference. What if neurons sought to conserve the way they adapt their behavior regardless of, e.g., the specific positioning of synapses along their dendritic tree? Which equations of motion (in our case, synaptic learning rules) are able to fulfill this requirement?
The solution lies in following the path of steepest descent not in relation to a small change in the synaptic weights (Euclidean gradient descent), but rather with respect to a small change in the input-output distribution (natural gradient descent). This requires taking the gradient of the error function with respect to a metric defined directly on the space of possible input-output distributions, with coordinates defined by the synaptic weights. First proposed in Amari (1998), but with earlier roots in information geometry (Amari, 1987; Amari and Nagaoka, 2000), natural gradient methods (Yang and Amari, 1998; Rattray and Saad, 1999; Park et al., 2000; Kakade, 2001) have recently been rediscovered in the context of deep learning (Pascanu and Bengio, 2013; Martens, 2014; Ollivier, 2015; Amari et al., 2019; Bernacchia et al., 2018). Moreover, Pascanu and Bengio (2013) showed that the natural gradient learning rule is closely related to other machine learning algorithms. However, most of the applications focus on rate-based networks, which are not inherently linked to a statistical manifold and have to be equipped with Gaussian noise or a probabilistic output layer interpretation in order to allow an application of the natural gradient. Furthermore, a biologically plausible synaptic plasticity rule needs to make all of the required information accessible at the synapse itself, which is usually unnecessary and therefore largely ignored in machine learning.
The stochastic nature of neuronal outputs in vivo (see, e.g., Softky and Koch, 1993) provides a natural setting for plasticity rules based on information geometry. As a model for biological synapses, natural gradient combines the elegance of invariance with the success of gradient-descent-based learning rules. In this manuscript, we derive a closed-form synaptic learning rule based on natural gradient descent for spiking neurons and explore its implications. Our learning rule equips the synapses with more functionality compared to classical error learning by enabling them to adjust their learning rate to their respective impact on the neuron's output. It naturally takes into account relevant variables such as the statistics of the afferent input or their respective positions on the dendritic tree. This allows a set of predictions that are supported both by experimentally observed phenomena, such as dendritic democracy and multiplicative weight dynamics, and by theoretically desirable properties, such as Bayesian reasoning (Marceau-Caron and Ollivier, 2017). Furthermore, and unlike classical error-learning rules, plasticity based on the natural gradient is able to incorporate both homo- and heterosynaptic phenomena into a unified framework. While theoretically derived heterosynaptic components of learning rules are notoriously difficult for synapses to implement due to their non-locality, we show that in our learning rule they can be approximated by quantities accessible at the locus of plasticity. In line with results from machine learning, the combination of these features also enables faster convergence during supervised learning.

Naive Euclidean gradient is not parametrization-invariant
We consider a cost function C on the neuronal level that, in the sense of cortical credit assignment (see, e.g., Sacramento et al., 2017), can relate to some behavioral cost of the agent that it serves. The output of the neuron depends on the amplitudes of the somatic PSPs elicited by the presynaptic spikes. We denote these "somatic weights" by w^s, and may parametrize the neuronal cost as C = C_s[w^s]. However, dendritic PSP amplitudes w^d can be argued to offer a more direct representation of synaptic weights, so we might rather wish to express the cost as C = C_d[w^d]. These two parametrizations are related by an attenuation factor α (between 0 and 1): w^s = α w^d. In general, this attenuation factor depends on the synaptic position and is therefore described by a vector that is multiplied component-wise with the weights.
It may now seem straightforward to switch between the somatic and dendritic representation of the cost by simply substituting variables, for example

    C_d[w^d] = C_s[α w^d].

To derive a plasticity rule for the somatic and dendritic weights, we might consider gradient descent on the cost:

    Δw^d = −η ∂C_d/∂w^d = −η α ∂C_s/∂w^s = α Δw^s.    (1)

At first glance, this relation seems reasonable: dendritic weight changes affect the cost more weakly than somatic weight changes, so their respective gradient is shallower by the factor α. However, from a functional perspective, the opposite should be true: dendritic weights should experience a larger change than somatic weights in order to elicit the same effect on the cost. This inconsistency can be made explicit by considering that somatic weight changes are, themselves, attenuated dendritic weight changes: Δw^s = α Δw^d. Substituting this into Eqn. 1 leads to a contradiction: Δw^d = α² Δw^d. This reasoning is visualized in Fig. 1 for the general case where the somatic and dendritic weights are related by some arbitrary function f. To solve the conundrum, we need to shift the focus from changing the synaptic input to changing the neuronal output, while at the same time considering a more rigorous treatment of gradient descent (see also Surace et al., 2020).
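The inconsistency described above can be reproduced numerically. The following is a minimal sketch, assuming a toy quadratic cost and illustrative constants (`ALPHA`, `ETA`, `W_TARGET` and all function names are hypothetical and not taken from the manuscript). It performs one Euclidean gradient step in somatic coordinates and one in dendritic coordinates, and compares the resulting change at the soma.

```python
# Toy illustration: Euclidean gradient descent on the same cost,
# expressed in somatic (w_s) versus dendritic (w_d) coordinates.
# The quadratic cost and all constants are illustrative assumptions.

ALPHA = 0.5      # attenuation factor, w_s = ALPHA * w_d
ETA = 0.1        # learning rate
W_TARGET = 1.0   # somatic weight that minimizes the toy cost


def grad_cost_somatic(w_s):
    """dC_s/dw_s for the toy cost C_s = 0.5 * (w_s - W_TARGET)**2."""
    return w_s - W_TARGET


def step_somatic(w_s):
    """One Euclidean gradient step taken directly in somatic coordinates."""
    return w_s - ETA * grad_cost_somatic(w_s)


def step_dendritic(w_s):
    """One Euclidean gradient step in dendritic coordinates, mapped back to the soma."""
    w_d = w_s / ALPHA
    grad_d = ALPHA * grad_cost_somatic(ALPHA * w_d)  # chain rule: dC_d/dw_d
    w_d_new = w_d - ETA * grad_d
    return ALPHA * w_d_new


w0 = 0.0
dw_somatic = step_somatic(w0) - w0
dw_dendritic = step_dendritic(w0) - w0
ratio = dw_dendritic / dw_somatic   # the attenuation enters twice
print(dw_somatic, dw_dendritic, ratio)
```

In this sketch the somatic effect of the dendritic update is smaller than the somatic update by the factor `ALPHA**2`, which is the parametrization dependence that the argument around Eqn. 1 points out.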

Natural gradient plasticity rule
We consider a neuron with somatic potential V evoked by the spikes of n presynaptic afferents firing at rates r_i. The presynaptic spikes of afferent i cause a train of weighted dendritic potentials w^d_i x_i locally at the synaptic site, where x_i denotes the unweighted synaptic potential (USP) train, i.e., the low-pass-filtered spike train of afferent i. At the soma, each dendritic potential is attenuated by a potentially nonlinear function that depends on the synaptic location:

    w^s_i = f_i(w^d_i).

The somatic voltage thus reads as

    V = Σ_{i=1}^{n} f_i(w^d_i) x_i.

We further assume that the neuron's firing follows an inhomogeneous Poisson process whose rate φ_t(V) := φ(V_t) depends on the current membrane potential through a nonlinear transfer function φ. In this case, spiking in a sufficiently short interval [t, t + dt] is Bernoulli-distributed. The probability of a spike occurring in this interval (denoted as y_t = 1) is then given by

    p(y_t = 1 | x_t) = φ(V_t) dt,

which defines our generalized linear neuron model (Gerstner and Kistler, 2002). Here, we used x as a shorthand notation for the USP vector (x_i). In the following, we drop the time indices for better readability.
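The neuron model described above can be sketched in a few lines of discrete-time simulation. This is an illustration under stated assumptions rather than the simulation code of the manuscript: the exponential PSP kernel, the time step `DT`, the time constant `TAU`, the maximal rate `phi_max` and the example weights are all hypothetical choices, and the weights are taken to be somatic amplitudes.

```python
import math
import random

DT = 1e-3        # time step in seconds (assumption)
TAU = 10e-3      # time constant of the exponential PSP kernel (assumption)


def phi(v, phi_max=100.0):
    """Sigmoidal transfer function mapping the voltage to a firing rate in Hz."""
    return phi_max / (1.0 + math.exp(-v))


def simulate(weights, rates, steps, seed=0):
    """Simulate the neuron and return its output spikes (0 or 1 per time step)."""
    rng = random.Random(seed)
    usp = [0.0] * len(weights)          # unweighted synaptic potentials
    spikes = []
    for _ in range(steps):
        for i, r in enumerate(rates):
            pre = 1.0 if rng.random() < r * DT else 0.0   # presynaptic Poisson spike
            # Low-pass filter: exponential decay plus a jump of 1/TAU per spike.
            usp[i] += DT * (-usp[i] / TAU) + pre / TAU
        v = sum(w * x for w, x in zip(weights, usp))       # somatic voltage
        p_spike = min(1.0, phi(v) * DT)                    # Bernoulli spike probability
        spikes.append(1 if rng.random() < p_spike else 0)
    return spikes


out = simulate(weights=[0.01, -0.005], rates=[10.0, 50.0], steps=2000)
print(sum(out), "output spikes in", 2000 * DT, "s")
```

The Bernoulli draw per time step is the discrete-time counterpart of the inhomogeneous Poisson process, and it is only a good approximation while `phi(v) * DT` stays well below 1.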
Assuming that the neuron strives to reproduce a target firing distribution p*(y|x), plasticity may follow a supervised-learning paradigm based on gradient descent. In this case, the Kullback-Leibler divergence between the neuron's current and its target firing distribution represents a natural cost function which measures the error between the current and the desired output distribution in an information-theoretic sense. Minimizing this cost function by naive Euclidean gradient descent with respect to the synaptic weights (denoted by ∇^e_w) results in the well-known error-correcting rule (Pfister et al., 2006)

    ẇ^s = −η ∇^e_{w^s} C = η [Y* − φ(V)] (φ′(V)/φ(V)) x,    (4)

which is a spike-based version of the classical perceptron learning rule (Rosenblatt, 1958), whose multilayer version forms the basis of the error-backpropagation algorithm (Rumelhart et al., 1986). Here, Y* denotes a teacher spike train sampled from p*. On the single-neuron level, a possible biological implementation has been suggested by Urbanczik and Senn (2014), who demonstrated how a neuron may exploit its morphology to store errors, an idea that was recently extended to multilayer networks (Sacramento et al., 2017).

Figure 2: (A) During supervised learning, the error between the current and the target state is measured in terms of a cost function defined on the neuron's output space; in our case, this is the manifold formed by the neuronal output distributions p(y, x). As the output of a neuron is determined by the strength of incoming synapses, the cost C is an implicit function of the afferent weight vector w. Since the gradient of a function depends on the distance measure of the underlying space, Euclidean gradient descent, which follows the gradient of the cost as a function of the synaptic weights ∂C/∂w, is not uniquely defined, but depends on how w is parametrized. If, instead, we follow the gradient on the output manifold itself, it becomes independent of the underlying parametrization. Expressed in a specific parametrization, the resulting natural gradient contains a correction term that accounts for the distance distortion between the synaptic parameter space and the output manifold. (B-C) Standard gradient descent learning is suited for isotropic (B), rather than for non-isotropic (C) cost functions. For example, the magnitude of the gradient decreases in valley regions where the cost function is flat, resulting in slow convergence to the target. A non-optimal choice of parametrization can introduce such artefacts and therefore harm the performance of learning rules based on Euclidean gradient descent. In contrast, natural gradient learning will locally correct for distortions arising from non-optimal parametrizations.
However, as we argued above, learning based on Euclidean gradient descent is not unproblematic. It cannot account for synaptic weight (re)parametrization, as caused, for example, by the diversity of synaptic loci on the dendritic tree. Convergence of learning is therefore harmed by the slow adaptation of distal synapses compared to equally important proximal counterparts. With the multiplicative USP term x in Eqn. 4 being the only manifestation of presynaptic activity, there is no mechanism by which to take into account input variability, which can, in turn, also impede learning. Furthermore, when compared to experimental evidence, this learning rule cannot explain heterosynaptic plasticity, as it is purely presynaptically gated.
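The presynaptic gating of the error-correcting rule (Eqn. 4) can be made concrete with a small sketch. It assumes the standard log-likelihood gradient of a Poisson neuron, i.e., an update proportional to the error times φ′(V)/φ(V) times the USP; the sigmoidal transfer function, `PHI_MAX` and all numerical values are hypothetical.

```python
import math

PHI_MAX = 100.0  # maximal firing rate in Hz (assumption)


def phi(v):
    """Sigmoidal transfer function."""
    return PHI_MAX / (1.0 + math.exp(-v))


def phi_prime(v):
    """Derivative of the transfer function with respect to the voltage."""
    s = 1.0 / (1.0 + math.exp(-v))
    return PHI_MAX * s * (1.0 - s)


def euclidean_update(weights, usp, teacher_spike, dt, eta):
    """Weight increments for one time step of the error-correcting rule."""
    v = sum(w * x for w, x in zip(weights, usp))
    error = teacher_spike - phi(v) * dt          # teacher spike minus expected spike count
    sensitivity = phi_prime(v) / phi(v)
    return [eta * error * sensitivity * x for x in usp]


# First synapse is active (non-zero USP), second synapse is silent.
dw = euclidean_update(weights=[0.1, 0.1], usp=[2.0, 0.0],
                      teacher_spike=1, dt=1e-3, eta=0.01)
print(dw)
```

The silent synapse receives no update at all, whatever the error, which is why a rule of this form cannot produce heterosynaptic plasticity.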
In general, Euclidean gradient descent is well-known to exhibit slow convergence in non-isotropic regions of the cost function (Ruder, 2016), with such non-isotropy frequently arising or being aggravated by an inadequate choice of parametrization (see Ollivier, 2015, and Fig. 2). In contrast, natural gradient descent is, by construction, immune to these problems. The key idea of natural gradient as outlined by Amari is to follow the (locally) shortest path in terms of the neuron's firing distribution. Argued from a normative point of view, this is the only "correct" path to consider, since plasticity aims to adapt a neuron's behavior, i.e., its input-output relationship, rather than some internal parameter (Fig. 2).
For the concept of a locally shortest path to make sense in terms of distributions, we require the choice of a distance measure for probability distributions. Since a parametric statistical model, such as the set of our neuron's realizable output distributions, forms a Riemannian manifold (Rao, 1945; Amari and Nagaoka, 2000), a local distance measure can be obtained in the form of a Riemannian metric. The Fisher metric (Rao, 1945), an infinitesimal version of the D_KL, represents a canonical choice on manifolds of probability distributions, since it is generally the unique metric that remains invariant under sufficient statistics (Cencov, 1972). On a given parameter space, the Fisher metric may be expressed in terms of a bilinear product with the Fisher information matrix

    G(w) = E_{p_w}[ (∂ log p_w/∂w) (∂ log p_w/∂w)^T ].

The Fisher metric locally measures distances in the p-manifold as a function of the chosen parametrization. We can then obtain the natural gradient (which intuitively may be thought of as "∂C/∂p") by correcting the Euclidean gradient ∇^e_w C := ∂C/∂w with the distance measure above:

    ∇^n_w C := G(w)^{−1} ∇^e_w C.

The natural gradient learning rule is then given as ẇ = −η ∇^n_w C. Calculating the right-hand expression for the case of Poisson-spiking neurons (for details, see Supplementary Information, Sections S.1 and S.2), this takes the form

    ẇ = η γ_s [Y* − φ(V)] (φ′(V)/φ(V)) (1/f′(w)) [c x/r − γ_u + γ_w f(w)],    (7)

where w is an arbitrary weight parametrization that relates to the somatic amplitudes via a component-wise rescaling w^s = f(w). For easier reading, we use several shorthand notations: multiplications and divisions of vectors, scalar functions and additions of scalars to vectors apply component-wise. Eqn. 7 represents the complete expression of our natural gradient rule, which we discuss throughout the remainder of the manuscript.
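The invariance obtained by correcting the Euclidean gradient with the inverse Fisher information matrix can be checked on a toy example. The sketch below is not the spiking rule of Eqn. 7: it assumes a two-weight Bernoulli neuron with a logistic output, a small set of equiprobable inputs, and a component-wise attenuation `ALPHA`, all of which are hypothetical. It computes the Euclidean and natural gradients of the Kullback-Leibler cost in somatic and in dendritic coordinates and maps both back to the soma.

```python
import math

ALPHA = [1.0, 0.25]                       # per-synapse attenuation (assumption)
INPUTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # equiprobable toy inputs


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fisher_and_gradient(w, w_target, inputs):
    """Exact Fisher matrix G and Euclidean gradient g of the KL cost."""
    g = [0.0, 0.0]
    G = [[0.0, 0.0], [0.0, 0.0]]
    for x in inputs:
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        p_target = sigmoid(w_target[0] * x[0] + w_target[1] * x[1])
        for i in range(2):
            g[i] += (p - p_target) * x[i] / len(inputs)
            for j in range(2):
                G[i][j] += p * (1.0 - p) * x[i] * x[j] / len(inputs)
    return G, g


def solve_2x2(G, g):
    """Return G^{-1} g for a 2x2 matrix."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [(G[1][1] * g[0] - G[0][1] * g[1]) / det,
            (G[0][0] * g[1] - G[1][0] * g[0]) / det]


w_s, w_target_s = [0.2, -0.1], [1.0, 0.5]

# Somatic coordinates.
G_s, g_s = fisher_and_gradient(w_s, w_target_s, INPUTS)
nat_s = solve_2x2(G_s, g_s)

# Dendritic coordinates: w_s = ALPHA * w_d, so the effective inputs are ALPHA * x.
w_d = [w / a for w, a in zip(w_s, ALPHA)]
w_target_d = [w / a for w, a in zip(w_target_s, ALPHA)]
inputs_d = [tuple(a * xi for a, xi in zip(ALPHA, x)) for x in INPUTS]
G_d, g_d = fisher_and_gradient(w_d, w_target_d, inputs_d)
nat_d = solve_2x2(G_d, g_d)

# Map dendritic updates back to the soma (multiply by ALPHA) and compare.
nat_d_at_soma = [a * n for a, n in zip(ALPHA, nat_d)]
euclid_d_at_soma = [a * gi for a, gi in zip(ALPHA, g_d)]
print("natural:  ", nat_s, nat_d_at_soma)      # agree up to rounding
print("Euclidean:", g_s, euclid_d_at_soma)     # differ by ALPHA**2 per component
```

The natural-gradient directions coincide at the soma in both parametrizations, whereas the Euclidean ones differ by the squared attenuation of each synapse.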
Natural gradient learning conserves both the error term [Y * − φ(V )] and the USP contribution x from classical gradient-descent plasticity. However, by including the relationship between the parametrization of interest w and the somatic PSP amplitudes f (w), natural-gradient-based plasticity explicitly accounts for reparametrization distortions, such as those arising from PSP attenuation during propagation along the dendritic tree. Furthermore, natural-gradient learning introduces multiple scaling factors and new plasticity components, whose characteristics will be further explored in dedicated sections below (see also Supplementary Information, Sections S.3.1 and S.3.2 for more details).
First of all, we note the appearance of two scaling factors (more details in Section 2.5). On the one hand, the size of the synaptic adjustment is modulated by a global scaling factor γ_s, which adjusts synaptic weight updates to the characteristics of the output non-linearity, similarly to the synapse-specific scaling by the inverse of f′. Furthermore, γ_s also depends on the output statistics of the neuron, harmonizing plasticity across different states in the output distribution (see Supplementary Information, Section S.3.1). On the other hand, a second, synapse-specific learning rate scaling accounts for the statistics of the input at the respective synapse, in the form of a normalization by the afferent input rate, c/r, where c is a constant that depends on the PSP kernel (see Section 4.1). Unlike the global modulation introduced by γ_s, this scaling only affects the USP-dependent plasticity component. Just as for Euclidean-gradient-based learning, the latter is directly evoked by the spike trains arriving at the synapse. Therefore, the resulting plasticity is homosynaptic, affecting only synapses which receive afferent input.
However, in the case of natural-gradient learning, this input-specific adaptation is complemented by two additional forms of heterosynaptic plasticity (Section 2.6). First, the learning rule has a bias term γ_u which uniformly adjusts all synapses and may be considered homeostatic, as it usually opposes the USP-dependent plasticity contribution. The amplitude of this bias does not exclusively depend on the afferent input at the respective synapse, but is rather determined by the overall input to the neuron. Thus, unlike the USP-dependent component, this heterosynaptic plasticity component equally affects both active and inactive synaptic connections. Furthermore, natural gradient descent implies the presence of another plasticity component γ_w f(w), which adapts the synapses depending on their current weight. More specifically, connections that are already strong are subject to larger changes compared to weaker ones. Since the proportionality factor γ_w only depends on global variables such as the membrane potential, this component also affects both active and inactive synapses.
The full expressions for γ_s, γ_u and γ_w are complicated functions of the membrane potential, its mean and variance, as well as, for γ_u and γ_w, of the total input Σ_{i=1}^{n} x_i and the total instantaneous presynaptic rate Σ_{i=1}^{n} r_i. However, under reasonable assumptions such as a high number of presynaptic partners and for a large, diverse set of empirically tested scenarios, we have shown that these factors can be reduced to simple functions of variables that are fully accessible at the locus of individual synapses. The above learning rule, along with closed-form expressions for these factors (Supplementary Information, Sections S.3.1 and S.3.2), represents the main analytical finding of this paper.
We note that, while having used a standard sigmoidal transfer function throughout the paper, Eqn. 7 holds for every sufficiently smooth φ. Moreover, there exists a quadratic transfer function for which our learning rule becomes particularly simple, which we discuss in Section S.4 of the Supplementary Information.
In the following, we demonstrate that the additional terms introduced in natural-gradient-based plasticity confer important advantages compared to Euclidean gradient descent, both in terms of convergence and with respect to biological plausibility. More precisely, we show that our plasticity rule improves convergence in a supervised learning task involving an anisotropic cost function, a situation which is notoriously hard to deal with for Euclidean-gradient-based learning rules (Ruder, 2016). We then proceed to investigate natural-gradient learning from a biological point of view, deriving a number of predictions that can be experimentally tested, with some of them related to in-vivo observations that are otherwise difficult to explain with classical gradient-based learning rules.

Natural gradient speeds up learning
Non-isotropic cost landscapes can easily be provoked by non-homogeneous input conditions. In nature, these phenomena arise under a wide range of circumstances, for elementary reasons that boil down to morphology (neurons are not symmetrical geometric objects) and function (neurons receive input from multiple afferents that perform different computations and thus behave differently).

Figure 3: [...] We tested the performance of the natural gradient rule in a supervised learning scenario, where a single output neuron had to adapt its firing distribution to a target distribution, delivered in the form of spikes from a teacher neuron. The input consisted of Poisson spikes from n = 100 afferents, half of them firing at 10 Hz and the other half at 50 Hz. (B-C) Spike trains, PSTHs and voltage traces for teacher (orange) and student (red) neuron before (B) and after (C) learning with natural-gradient plasticity. During learning, the firing patterns of the student neuron align to those of the teacher neuron. (D-E) Exemplary weight evolution during Euclidean-gradient (D) and natural-gradient (E) learning given n = 2 afferents with the same two rates as before. Thick solid lines represent contour lines of the cost function C. The respective vector fields depict normalized negative Euclidean and natural gradients of the cost C, averaged over 2000 input samples. The thin solid lines represent the paths traced out by the input weights during learning. (F) Learning curves for n = 100 afferents using natural-gradient and Euclidean-gradient plasticity. The plot shows averages over 1000 trials with initial and target weights randomly chosen from a uniform distribution U(−1/n, 1/n). Fixed learning rates were tuned for each algorithm separately to exhibit the fastest possible convergence to a root mean squared error of 0.8 Hz in the student neuron's output rate.
To evaluate the convergence behavior of our learning rule and compare it to Euclidean gradient descent, we considered a very generic situation in which a neuron is required to map a diverse set of inputs onto a target output.
In order to induce a simple and intuitive anisotropy of the error landscape, we divided the afferent population into two equally sized groups of neurons with different firing rates (Fig. 3A). This resulted in an asymmetric cost function, as visible from the elongated contour lines (Fig. 3D,E). We further chose a realizable teacher by simulating a different neuron with the same input populations connected via a predefined set of target weights w * . Fig. 3B,C show that our natural-gradient rule enables the student neuron to adapt its weights to reproduce the teacher voltage V * and thereby its output distribution.
In the following, we compare learning in two student neurons, one endowed with Euclidean-gradient plasticity (Eqn. 4, Fig. 3D) and one with our natural-gradient rule (Eqn. 7, Fig. 3E). To better visualize the difference between the two rules, we used a two-dimensional input weight space, i.e., one neuron per afferent population. While the negative Euclidean gradient vectors are, by definition, perpendicular to the contour lines of C, the negative natural gradient vectors point directly towards the target weight configuration w*. Due to the anisotropy of C induced by the different input rates (see also Fig. 2B), Euclidean-gradient learning starts out by mostly adapting the high-rate afferent weight and only gradually begins learning the low-rate afferent. In contrast, natural gradient adapts both synaptic weights homogeneously. This is clearly reflected by the paths traced by the synaptic weights during learning.

Figure 4: Natural-gradient learning scales synaptic weight updates depending on their distance from the soma. We stimulated a single excitatory synapse with Poisson input at 5 Hz, paired with a Poisson teacher spike train at 20 Hz. The distance d from soma was varied between 1 µm and 10 µm and attenuation was assumed to be linear and proportional to the inverse distance from soma. To make weight changes comparable, we scaled dendritic PSP amplitudes inversely with d + 1 in order for all of them to produce the same PSP amplitude at the soma. (A) Example PSPs before (solid lines) and after (dashed lines) learning for two synapses at 3 µm and 7 µm. Application of our natural-gradient rule results in equal changes for the somatic PSPs. (B) Example traces of synaptic weights for the two synapses in (A). (C) Absolute and relative dendritic amplitude change after 5 s as a function of a synapse's distance from the soma.
Overall, this led to faster convergence of the natural-gradient plasticity rule compared to Euclidean gradient descent. To enable a meaningful comparison, learning rates were tuned separately for each plasticity rule to optimize their respective convergence speed. The faster convergence of natural-gradient plasticity is a robust effect, as evidenced in Fig. 3F by the average learning curves over 1000 trials.
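The effect of anisotropy on convergence can be illustrated with a deliberately simple toy problem. The sketch below does not reproduce the spiking simulation of Fig. 3: it assumes a quadratic cost whose curvature differs between two weight axes, and it uses that curvature as the metric for the natural-gradient-like update. The curvature values `H`, the tolerance, and both learning rates are hypothetical, with the natural-gradient rate chosen conservatively rather than tuned.

```python
# Toy comparison on an anisotropic quadratic cost
# C(w) = 0.5 * sum_i H[i] * (w[i] - W_TARGET[i])**2.
# H stands in for the distortion caused by unequal input rates (assumption).

H = [10.0, 50.0]
W_TARGET = [1.0, 1.0]


def run(metric, eta, tol=1e-3, max_steps=10000):
    """Count steps of w <- w - eta * grad / metric until every weight is within tol."""
    w = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        grad = [h * (wi - wt) for h, wi, wt in zip(H, w, W_TARGET)]
        w = [wi - eta * g / m for wi, g, m in zip(w, grad, metric)]
        if max(abs(wi - wt) for wi, wt in zip(w, W_TARGET)) < tol:
            return step
    return max_steps


# Euclidean descent with the best fixed learning rate for this quadratic.
steps_euclid = run(metric=[1.0, 1.0], eta=2.0 / (H[0] + H[1]))
# Metric-corrected descent: both axes contract at the same rate.
steps_natural = run(metric=H, eta=0.5)
print(steps_euclid, steps_natural)
```

Even with the optimal fixed learning rate, the Euclidean update is limited by the ratio of the two curvatures, whereas the metric-corrected update treats both directions alike.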
In addition to the functional advantages described above, natural-gradient learning also makes some interesting predictions about biology, which we address below.

Democratic plasticity
As discussed in the introduction, classical gradient-based learning rules do not usually account for neuron morphology. Since attenuation of PSPs is equivalent to weight reparametrization and our learning rule is, by construction, parametrization-invariant, it naturally compensates for the distance between synapse and soma. In Eqn. 7, this is reflected by a component-wise rescaling of the synaptic changes with the inverse of f′, the derivative of the attenuation function, which is induced by the Fisher information metric (see also Fig. 8 and the corresponding section in the Methods). Under the assumption of passive attenuation along the dendritic tree, we have w^s_i = α(d_i) w^d_i, where d_i denotes the distance of the ith synapse from the soma. More specifically, α(d) = e^{−d/λ}, where λ represents the electrotonic length scale. We can write the natural-gradient rule as

    ẇ^d = η γ_s [Y* − φ(V)] (φ′(V)/φ(V)) (1/α(d)) [c x/r − γ_u + γ_w α(d) w^d].

For functionally equivalent synapses (i.e., with identical input statistics), synaptic changes in distal dendrites are scaled up compared to proximal synapses. As a result, the effect of synaptic plasticity on the neuron's output is independent of the synapse location, since dendritic attenuation is precisely counterbalanced by weight update amplification.
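The compensation of attenuation by the update rescaling can be sketched as follows. This is a minimal illustration assuming purely passive, exponential attenuation and an arbitrary desired somatic change; `LAMBDA`, the distances and the function names are hypothetical.

```python
import math

LAMBDA = 200.0   # electrotonic length scale in micrometres (assumption)


def attenuation(d):
    """Passive attenuation alpha(d) = exp(-d / lambda)."""
    return math.exp(-d / LAMBDA)


def somatic_effect(somatic_gradient_step, d, democratic):
    """Somatic PSP change caused by a dendritic update at distance d."""
    alpha = attenuation(d)
    if democratic:
        dw_dendritic = somatic_gradient_step / alpha    # rescaling by the inverse attenuation
    else:
        dw_dendritic = somatic_gradient_step * alpha    # Euclidean gradient in w_d
    return alpha * dw_dendritic


for d in (50.0, 200.0, 400.0):
    print(d, somatic_effect(0.01, d, True), somatic_effect(0.01, d, False))
```

With the inverse-attenuation rescaling, the somatic effect is the same at every distance; with the Euclidean update in dendritic coordinates, it decays with the squared attenuation.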
We illustrate this effect with simulations of synaptic weight updates at different locations along a dendritic tree in Fig. 4. Such "democratic plasticity", which enables distal synapses to contribute just as effectively to changes in the output as proximal synapses, is reminiscent of the concept of "dendritic democracy" (Magee and Cook, 2000). These experiments show increased synaptic amplitudes in the distal dendritic tree of multiple cell types, such as rat hippocampal CA1 neurons; dendritic democracy has therefore been presumed to serve the purpose of giving distal inputs a "vote" on the neuronal output. Still, experiments show highly diverse PSP amplitudes in neuronal somata (Williams and Stuart, 2002). Our plasticity rule refines the notion of democracy by asserting that learning itself, rather than its end result, is rescaled in accordance with the neuronal morphology. Whether such democratic plasticity ultimately leads to distal and proximal synapses having the same effective vote at the soma depends on their respective importance towards reaching the target output. In particular, if synapses from multiple afferents that encode the same information are randomly distributed along the dendritic tree, then democratic plasticity also predicts dendritic democracy, as the scaling of weight changes implies a similar scaling of the final learned weights. Note, however, that the absence of dendritic democracy does not contradict the presence of democratic plasticity, as afferents from different cortical regions might target specific positions on the dendritic tree (see, e.g., Markram et al., 2004).

Input and output-specific scaling
In addition to undoing distortions induced by, e.g., attenuation, the natural gradient rule predicts further modulations of the homosynaptic learning rate. The factor γ_s in Eqn. 7 represents an output-dependent global scaling factor (for both homo- and heterosynaptic plasticity): it increases the learning rate in regions where the sigmoidal transfer function is flat (see also Section S.3.1). This directly reflects the philosophy of natural gradient descent, which finds the steepest path for a small change in output, rather than in the numeric value of some parameter. The desired change in the output requires scaling the corresponding input change by the inverse slope of the transfer function. Furthermore, synaptic learning rates are inversely correlated to the USP variance σ²(x) (Fig. 5). In particular, for the homosynaptic component, the scaling is exactly equal to σ²(x)^{−1} = c/r_i (see Eqn. 7 and Section 4.1). In other words, natural gradient learning explicitly scales synaptic updates with the (un)reliability of their input. To demonstrate this effect in isolation, we simulated the effects of changing the USP variance while conserving its mean. Moreover, to demonstrate its robustness, we independently varied two contributors to the input reliability, namely input rates (which enter σ²(x) directly) and synaptic time constants (which affect the PSP-kernel-dependent scaling constant c). Fig. 5 shows how unreliable input leads to slower learning, with an inverse dependence of synaptic weight changes on the USP variance. We note that this observation also makes intuitive sense from a Bayesian point of view, under which any information needs to be weighted by the reliability of its source (cf. also Aitchison and Latham, 2014, although our interpretation is different).
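The inverse dependence of the homosynaptic learning rate on the USP variance can be made concrete with Campbell's theorem, which states that the variance of a filtered Poisson spike train equals the rate times the integral of the squared kernel. The sketch below assumes a normalized exponential PSP kernel, for which this integral is 1/(2τ) and the scaling constant would be c = 2τ; the kernel shape, the time constant and the integration parameters are hypothetical and need not match the kernel used in the manuscript.

```python
import math


def kernel(t, tau):
    """Normalized exponential PSP kernel (assumption); integrates to 1."""
    return math.exp(-t / tau) / tau


def usp_variance(rate, tau, dt=1e-5, t_max_factor=20):
    """Campbell's theorem: variance = rate * integral of kernel(t)**2 dt."""
    n = int(t_max_factor * tau / dt)
    integral = sum(kernel((k + 0.5) * dt, tau) ** 2 for k in range(n)) * dt
    return rate * integral


def learning_rate_scale(rate, tau):
    """Homosynaptic scaling 1 / variance, which equals c / rate with c = 2 * tau here."""
    return 1.0 / usp_variance(rate, tau)


TAU = 10e-3
for r in (10.0, 50.0):
    print(r, usp_variance(r, TAU), learning_rate_scale(r, TAU), 2 * TAU / r)
```

Under these assumptions, a fivefold increase of the input rate increases the USP variance fivefold and reduces the homosynaptic learning rate by the same factor, while a longer synaptic time constant lowers the variance and thus raises the learning rate.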

Interplay of homosynaptic and heterosynaptic plasticity
One elementary property of update rules based on Euclidean gradient descent is their presynaptic gating, i.e., all weight updates are scaled with their respective synaptic input x . Therefore, they are necessarily restricted to homosynaptic plasticity, as studied in classical LTP and LTD experiments (Bliss and Lømo, 1973;Dudek and Bear, 1992). As discussed above, natural-gradient learning retains a rescaled version of this homosynaptic contribution, but at the same time predicts the presence of two additional plasticity components. Contrary to homosynaptic plasticity, these components also adapt synapses to currently non-active afferents, given a sufficient level of global input. Due to their lack of input specificity, they give rise to heterosynaptic weight changes, a form of plasticity that has been observed in hippocampus (Chen et al., 2013;Lynch et al., 1977), cerebellum (Ito and Kano, 1982) and neocortex (Chistiakova and Volgushev, 2009), mostly in combination with homosynaptic plasticity. A functional interpretation of heterosynaptic plasticity, to which our learning rule also alludes, is as a prospective adaptation mechanism for temporarily inactive synapses such that, upon activation, they are already useful for the neuronal output.
Our natural-gradient learning rule (Eqn. 7) can be summarized more compactly as ∆w = ∆w hom + ∆w hetu + ∆w hetw , where the three additive terms represent the variance-normalized homosynaptic plasticity, the uniform heterosynaptic plasticity and the weight-dependent heterosynaptic plasticity, respectively. All three share a common proportionality factor composed of the learning rate η, the output-dependent global scaling factor γ s , the postsynaptic error, a sensitivity factor and the inverse attenuation function. The effect of these three components is visualized in Fig. 6B. The homosynaptic term ∆w hom is experienced only by stimulated synapses, while the two heterosynaptic terms act on all synapses. The first heterosynaptic term ∆w hetu introduces a uniform adjustment to all components by the same amount, depending on the global activity level. For a large number of presynaptic inputs, it can be approximated by a constant (see Section S.3.2). Furthermore, it usually opposes the homosynaptic change, which we address in more detail below.

Figure caption (fragment): To avoid singularities in the learning rule, the other five received extremely weak input at 0.01 Hz. In addition, we assumed the presence of tonic inhibition as a balancing mechanism for keeping the neuron's output within a reasonable regime. Afferent stimulus was paired with teacher spike trains at 20 Hz and plasticity at both stimulated and unstimulated synapses was evaluated in comparison with their initial weights. For simplicity, initial weights within each group were assumed to be equal.
In contrast, the contribution of the second heterosynaptic term ∆w hetw is weight-dependent, adapting all synapses in proportion to their current strength. This explains experimental results such as Loewenstein et al. (2011), which found in-vivo weight changes in the neocortex to be proportional to the spine size, which itself is correlated with synaptic strength (Asrican et al., 2007). Our simulations show that ∆w hetw is roughly a linear function of the membrane potential (more specifically, its deviation with respect to its baseline). Since the latter can be interpreted as a scalar product between the afferent input vector and the synaptic weight vector, it implies that input transmitted by strong synapses has the largest impact on this heterosynaptic plasticity component. In comparison, input from weak synapses only has a small effect, thus requiring persistent and strong stimulation of these synapses to induce significant changes and "override the status quo" of the neuron. Since, following a period of learning, afferents connected via weak synapses can be considered uninformative for the neuron's target output, this mechanism ensures a form of heterosynaptic robustness towards noise.
The homo- and heterosynaptic terms exhibit an interesting relationship. To illustrate the nature of their interplay, we simulated a simple experiment (Fig. 7A) with varying initial synaptic weights for both active and inactive presynaptic afferents. Stimulated synapses (Fig. 7B) are seen to undergo strong potentiation (LTP) for very small initial weights; the magnitude of weight changes decreases for larger initial amplitudes until the neuron's output matches its teacher, at which point the sign of the postsynaptic error term flips. For even larger initial weights, potentiation at stimulated synapses therefore turns into depression (LTD), which becomes stronger for higher initial values of the stimulated synapses' weights. This is in line with the error-learning paradigm, in which changes in synaptic weights seek to reduce the difference between a neuron's target and its output.
For unstimulated synapses (Fig. 7C), we observe a reversed behavior. For small weights, the negative uniform term ∆w hetu dominates and plasticity is depressing. As for the homosynaptic case, the sign of plasticity switches when the weights become large enough for the error to switch sign. Therefore, in the regime where stimulated synapses experienced potentiation, unstimulated synapses are depressed and vice-versa. This reproduces various experimental observations: on one hand, homosynaptic potentiation has often been found to be accompanied by heterosynaptic depression (Lynch et al., 1977), such as in the amygdala (Royer and Paré, 2003) or the visual cortex (Arami et al., 2013); on the other hand, when the postsynaptic error term switches sign, depression at unstimulated synapses transforms into potentiation (Wöhrl et al., 2007;Royer and Paré, 2003).
While plasticity at stimulated synapses is unaffected by the initial state of the unstimulated synapses, plasticity at unstimulated synaptic connections depends on both the stimulated and unstimulated weights. In particular, when either of these grows large enough, the proportional term ∆w hetw overtakes the uniform term ∆w hetu and heterosynaptic plasticity switches sign again. Thus, for very large weights (top right corner of Fig. 7C), heterosynaptic potentiation transforms back into depression, in order to more quickly quench excessive output activity. This behavior is useful in both supervised and unsupervised learning scenarios; for the latter, Zenke and Gerstner (2017) showed that pairing Hebbian terms with heterosynaptic and homeostatic plasticity is crucial for stability.
In summary, we can distinguish three plasticity regimes for natural-gradient learning (Fig. 7D-G). In two of these regimes, heterosynaptic and homosynaptic plasticity are opposed (O1, O2), whereas in the third, they are aligned and lead to depression (S). The two opposing regimes are separated by the zero-error equilibrium line, at which plasticity switches sign.

Discussion
As a consequence of the fundamentally stochastic nature of evolution, it is no surprise that biology withstands confinement to strict laws. Still, physics-inspired arguments from symmetry and invariance can help uncover abstract principles that evolution may have gradually discovered and implemented into our brains. Here, we have considered parametrization invariance in the context of learning, which, in biological terms, translates to the fundamental ability of neurons to deal with diversity in their morphology and input-output characteristics. This requirement ultimately leads to various forms of scaling and heterosynaptic plasticity that are experimentally well-documented, but can not be accounted for by classical paradigms that regard plasticity as Euclidean gradient descent. In turn, these biological phenomena can now be seen as a means to jointly improve and accelerate error-correcting learning.
Inspired by insights from information geometry, we applied the framework of natural-gradient descent to biologically realistic neurons with extended morphology and spiking output. Compared to classical error-correcting learning rules, our plasticity paradigm requires the presence of several additional ingredients. First, a global factor adapts the learning rate to the particular shape of the voltage-to-spike transfer function and to the desired statistics of the output, thus addressing the diversity of neuronal response functions observed in vivo (Markram et al., 2004). Second, the homosynaptic component of plasticity is normalized by the variance of presynaptic inputs, which provides a direct link to Bayesian frameworks of neuronal computation (Aitchison and Latham, 2014; Jordan et al., 2020). Third, our rule contains a uniform heterosynaptic term that opposes homosynaptic changes, downregulating plasticity and thus acting as a homeostatic mechanism (Chen et al., 2013; Chistiakova et al., 2015). Fourth, we find a weight-dependent heterosynaptic term that also accounts for the shape of the neuron's activation function, while increasing its robustness towards noise. Finally, our natural-gradient-based plasticity correctly accounts for the somato-dendritic reparametrization of synaptic strengths.
These features enable faster convergence on non-isotropic error landscapes, in line with results for multilayer perceptrons (Yang and Amari, 1998;Rattray and Saad, 1999) and rate-based deep neural networks (Pascanu and Bengio, 2013;Ollivier, 2015;Bernacchia et al., 2018). Importantly, our learning rule can be formulated as a simple, fully local expression, only requiring information that is available at the locus of plasticity.
We further note an interesting property of our learning rule, which it inherits directly from the Fisher information metric that underlies natural gradient descent, namely invariance under sufficient statistics. This is especially relevant for biological neurons, whose stochastic firing effectively communicates information samples rather than explicit distributions. Thus, downstream computation is likely to require a reliable sample-based, i.e., statistically sufficient, estimation of the afferent distribution's parameters, such as the sample mean and variance. This singles out our natural-gradient approach from other second-order-like methods as a particularly appealing framework for biological learning.
Many of the biological phenomena predicted by our invariant learning rule are reflected in existing experimental results. Our "synaptic democracy" can give rise to dendritic democracy, as observed by Magee and Cook (2000). Our plasticity rule requires heterosynaptic plasticity, which has been observed in neocortex, as well as in deeper brain regions such as amygdala and hippocampus (Lynch et al., 1977;Engert and Bonhoeffer, 1997;White et al., 1990;Royer and Paré, 2003;Wöhrl et al., 2007;Chistiakova and Volgushev, 2009;Arami et al., 2013;Chen et al., 2013;Chistiakova et al., 2015), often in combination with homosynaptic weight changes. Moreover, we find that heterosynaptic plasticity generally opposes homosynaptic plasticity, which qualitatively matches many experimental findings (Lynch et al., 1977;White et al., 1990;Royer and Paré, 2003;Wöhrl et al., 2007) and can be functionally interpreted as an enhancement of competition. For very large weights, heterosynaptic plasticity aligns with homosynaptic changes, pushing the synaptic weights back to a sensible range (Chistiakova et al., 2015), as shown to be necessary for unsupervised learning (Zenke and Gerstner, 2017). In supervised learning it helps speed up convergence by keeping the weights in the operating range.
A further prediction that follows from our plasticity rule is the normalization of weight changes by the presynaptic variance. We would thus anticipate that increasing the jitter in presynaptic spike trains should reduce LTP in standard plasticity induction protocols. Also, we expect to observe a significant dependence of synaptic plasticity on neuronal response functions and output statistics. For example, flatter response functions should correlate with faster learning, in contrast to the inverse correlation predicted by classical learning rules derived from Euclidean gradient descent. These propositions remain to be tested experimentally.
By following gradients with respect to the neuronal output rather than the synaptic weights themselves, we were able to derive a parametrization-invariant error-correcting plasticity rule on the single-neuron level. Error-correcting learning rules are an important ingredient in understanding biological forms of error backpropagation (Sacramento et al., 2017). In principle, our learning rule can be directly incorporated as a building block into spike-based frameworks of error backpropagation such as those of Sporea and Grüning (2013) and Schiess et al. (2016). Based on these models, top-down feedback can provide a target for the somatic spiking of individual neurons, towards which our learning rule could be used to speed up convergence. Explicitly and exactly applying the natural gradient at the network level does not appear biologically feasible due to the existence of cross-unit terms in the Fisher information matrix G. However, methods such as the unit-wise natural-gradient approach (Ollivier, 2015) could be employed to approximate the natural gradient using a block-diagonal form of G. For spiking networks, this would reduce global natural-gradient descent to our local rule for single neurons.

Neuron model
We chose a Poisson neuron model whose firing rate depends on the somatic membrane potential V above the resting potential V rest = −70 mV. The two are related via a sigmoidal activation function

$$\phi(V) = \frac{\phi_{\max}}{1 + \exp\left(-\beta (V - \theta)\right)}$$

with β = 0.3, θ = 10 mV and a maximal firing rate φ max = 100 Hz. This means the activation function is centered at −60 mV, saturating around −50 mV. Note that the derivation of our learning rule does not depend on the explicit choice of activation function, but holds for arbitrary monotonically increasing, positive functions that are sufficiently smooth. For the sake of simplicity, refractoriness was neglected. For the same reason, we assumed synaptic input to be current-based, such that incoming spikes elicit a somatic membrane potential above baseline given by

$$V(t) = \sum_i w_i \, x_i^\epsilon(t),$$

where

$$x_i^\epsilon(t) = \sum_f \epsilon\left(t - t_i^f\right)$$

denotes the unweighted synaptic potential (USP) evoked by a spike train of afferent i, and $w_i$ is the corresponding synaptic weight. Here, $t_i^f$ denote the firing times of afferent i, and the synaptic response kernel is modeled as

$$\epsilon(t) = \frac{\epsilon_0}{\tau_m - \tau_s} \left( e^{-t/\tau_m} - e^{-t/\tau_s} \right) \Theta(t),$$

where $\Theta$ is the Heaviside step function and $\epsilon_0$ is a scaling factor with units mV ms; unless specified otherwise, we chose $\epsilon_0$ = 1 mV ms. For slowly changing input rates, mean and variance of the stationary unweighted synaptic potential are then given as (Petrovici, 2016)

$$E\left[x_i^\epsilon\right] = \epsilon_0 r_i, \qquad \mathrm{Var}\left[x_i^\epsilon\right] = c_\epsilon^{-1} r_i,$$

with $c_\epsilon = \left( \int_0^\infty \epsilon^2(t)\, dt \right)^{-1}$. Unless indicated otherwise, simulations were performed with a membrane time constant τ m = 10 ms and a synaptic time constant τ s = 3 ms. Hence, USPs had an amplitude of 60 mV and were normalized with respect to area under the curve and multiplied by the synaptic weights. Initial and target weights were chosen such that the resulting average membrane potential was within operating range of the activation function. As an example, in Fig. 3F, the average initial excitatory weight was 0.005, corresponding to an EPSP amplitude of 300 µV.
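The neuron model can be explored numerically with a short sketch. It assumes a logistic form for the sigmoid and a double-exponential PSP kernel normalized to unit area times the scaling factor (called `eps0` here); both functional forms, as well as all function and variable names, are illustrative assumptions rather than the authors' code. Times are taken in ms.

```python
import numpy as np

# Parameters quoted in the text.
beta, theta, phi_max = 0.3, 10.0, 100.0   # slope, threshold (mV above rest), Hz
tau_m, tau_s, eps0 = 10.0, 3.0, 1.0       # ms, ms, mV ms

def phi(v):
    """Assumed logistic transfer function; v is the potential above rest in mV."""
    return phi_max / (1.0 + np.exp(-beta * (v - theta)))

def kernel(t):
    """Assumed double-exponential PSP kernel with area eps0 (t in ms)."""
    t = np.asarray(t, dtype=float)
    k = eps0 / (tau_m - tau_s) * (np.exp(-t / tau_m) - np.exp(-t / tau_s))
    return np.where(t >= 0.0, k, 0.0)

# Numerical integrals of the kernel and of its square.
t = np.arange(0.0, 300.0, 0.01)            # ms
step = t[1] - t[0]
area = np.sum(kernel(t)) * step            # should be close to eps0
c_eps = 1.0 / (np.sum(kernel(t) ** 2) * step)

# Stationary USP statistics for a Poisson afferent firing at 10 Hz.
rate = 0.01                                # spikes per ms
usp_mean = eps0 * rate                     # mean = rate * integral of kernel
usp_var = rate / c_eps                     # variance = rate * integral of kernel^2
```

For this assumed kernel, the integral of the squared kernel has the closed form `eps0**2 / (2 * (tau_m + tau_s))`, so `c_eps` comes out near 26 for the default time constants.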
In Fig. 5, the scaling factor 0 was additionally normalized proportionally to the input rate r i at the synapse in order to keep the mean USP constant and allow a comparison based solely on the variance.

Derivation of the somatic natural gradient learning rule
The choice of a Poisson neuron model implies that spiking in a small interval [t, t + dt] is governed by a Poisson distribution. For sufficiently small interval lengths dt, the probability of having a single spike in [t, t + dt] becomes Bernoulli with parameter φ (V t ) dt. The aim of supervised learning is to bring this distribution closer to a given target distribution with density p * , delivered in form of a teacher spike train. We measure the error between the desired and the current input-output spike distribution in terms of the Kullback-Leibler divergence given in Eqn. 3, which attains its single minimum when the two distributions are equal. Note that while the D KL is a standard measure to characterize "how far" two distributions are apart, its behavior can sometimes be slightly unintuitive since it is not a metric. In particular, it is not symmetric and does not satisfy the triangle inequality.
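The asymmetry of the Kullback-Leibler divergence mentioned above is easy to see for the Bernoulli case. The following is a minimal sketch; the bin width and the two firing rates are illustrative values chosen here, not parameters taken from the simulations.

```python
import numpy as np

def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    p, q = float(p), float(q)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

dt = 5e-4               # bin width in s (illustrative)
p_target = 20.0 * dt    # spike probability per bin for a 20 Hz target
p_neuron = 5.0 * dt     # spike probability per bin for a 5 Hz output

kl_forward = bernoulli_kl(p_target, p_neuron)
kl_reverse = bernoulli_kl(p_neuron, p_target)
kl_self = bernoulli_kl(p_target, p_target)

print(f"D_KL(target || neuron) = {kl_forward:.6f}")
print(f"D_KL(neuron || target) = {kl_reverse:.6f}")
```

The two directions give different values, and the divergence vanishes only when the two spike probabilities coincide.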
Classical error learning follows the Euclidean gradient of this cost function, given as the vector of partial derivatives with respect to the synaptic weights. A short calculation (Supplementary Information, Section S.1) shows that the resulting Euclidean gradient descent learning rule is given by Eqn. 4. By correcting the vector of partial derivatives for the distance distortion between the manifold of input-output distributions and the synaptic weight space, given in terms of the Fisher information matrix G (w), we obtain the natural gradient (Eqn. 6). We then followed an approach by Amari (Amari, 1998) to derive an explicit formula for the product on the right hand side of Eqn. 6.
In Section S.1 of the Supplementary Information, we show that given independent input spike trains, the Fisher information matrix defined in Eqn. 5 can be decomposed with respect to the vector of input rates r and the current weight vector w as Here, Σ usp = diag{c −1 r i } n i=1 is the covariance matrix of the unweighted synaptic potentials, and c 1 , c 2 , c 3 are coefficients (Eqn. S32 for their definition) depending on the mean µ V and variance σ 2 V of the membrane potential, and on the total rate q = c 2 0 ∑ i r i . Through repeated application of the Sherman-Morrison-Formula, the inverse of G (w) can be obtained as Here the coefficients γ s , g 1 , g 2 , g 3 , g 4 , which are defined in Eqn. S52, are again functions of mean and variance of the membrane potential, and of the total rate. Consequently, the natural gradient rule in terms of somatic amplitudes is given byẇ Note that the formulas for arise from the product of the inverse Fisher information matrix and the Euclidean gradient, using 1 T x = ∑ n i=1 x i and w sT x = V . Due to the complicated expressions for g 1 , . . . , g 4 (Eqn. S32), Eqn. 25 only provides limited information about the behavior of γ u and γ w . Therefore, we performed an empirical analysis based on simulation data ( Supplementary Information, Section S.3.2, Fig. S2). In a stepwise manner we first evaluated g 1 , . . . , g 4 under various conditions, which revealed that the products with g 2 and g 3 in Eqn. 25 are neglible in most cases compared to the other terms, hence γ u ≈ c 2 2 0 ∑ n i=1 x i and γ w ≈ g 4 V . Furthermore, for a sufficient number of input afferents, we can approximate g 1 ≈ −q −1 . Since q = c 0 E[∑ n i=1 x i ], by the central limit theorem we have γ u ≈ c u c for large n and c u = 0 = 1. Moreover, while the variance of g 4 across weight samples increases with the number and firing rate of input afferents, its mean stays approximately constant across conditions. 
This led to the approximation γ w ≈ c w V , where c w is constant across weights, input rates and the number of input afferents.
To evaluate the quality of our approximations, we tested the performance of learning in the setting of Fig. 3 when γ u and γ w were replaced by their approximations (Eqn. S60). The test was performed for several input patterns (Section 4.4.6, Section S.3.3, Fig. S3). It turned out that a convergence behavior very similar to natural-gradient descent could be achieved with c u = 0.95, which worked much better in practice than c u = 1. For c w , a choice of 0.05, which was close to the mean of c w , worked well.
For these input rate configurations and choices of constants, we additionally sampled the negative gradient vectors for random initial weights and USPs (Section 4.4.6, Supplementary Information, Section S.3.3, Fig. S3) and compared the angles and length difference between natural gradient vectors and the approximation to the ones between natural and Euclidean gradient.

Reparametrization and general natural gradient rule
To arrive at a more general form of the natural-gradient learning rule, we consider a parametrization w of the synaptic weights which is connected to the somatic amplitudes via a smooth component-wise coordinate change f = (f 1 , . . . , f n ), such that $w_i^s = f_i(w_i)$. A Taylor expansion shows that small weight changes then relate via the derivative of f:

$$\Delta w_i^s = f_i'(w_i)\, \Delta w_i. \qquad (26)$$

On the other hand, we can also express the cost function in terms of w, with $C[w] = C^s[f(w)]$, and directly calculate the Euclidean gradient of C in terms of w. By the chain rule, we then have

$$\frac{\partial C}{\partial w_i} = f_i'(w_i)\, \frac{\partial C^s}{\partial w_i^s}.$$

Plugging this into Eqn. 26, the somatic change induced by Euclidean-gradient descent in w is $f_i'(w_i)^2$ times the change obtained from Euclidean-gradient descent performed directly in $w^s$, i.e., we obtain the contradiction $\Delta w_i^s = f_i'(w_i)^2\, \Delta w_i^s$. Hence the predictions of Euclidean-gradient learning depend on our choice of synaptic weight parametrization (Fig. 1).
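The parametrization dependence of the Euclidean gradient, and its absence for the natural gradient, can be checked numerically on a toy problem. The sketch below assumes a linear attenuation $f_i(w_i) = \alpha_i w_i$, a random gradient vector and a random symmetric positive definite matrix as a stand-in for the Fisher information; these are hypothetical choices for illustration only. The metric is transformed with the standard rule $G_w = J^T G_s J$, where J is the Jacobian of f.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 4, 0.1

grad_s = rng.normal(size=n)               # stand-in for dC^s/dw^s
A = rng.normal(size=(n, n))
G_s = A @ A.T + n * np.eye(n)             # SPD stand-in for the somatic Fisher matrix
alpha = np.array([1.0, 0.5, 0.25, 0.1])   # assumed attenuation factors f_i'(w_i)
J = np.diag(alpha)                        # Jacobian of the coordinate change

# Updates computed directly in the somatic parametrization.
dws_euclid = -eta * grad_s
dws_natural = -eta * np.linalg.solve(G_s, grad_s)

# Updates computed in the dendritic parametrization ...
grad_w = J.T @ grad_s                     # chain rule
G_w = J.T @ G_s @ J                       # metric in the new coordinates
dw_euclid = -eta * grad_w
dw_natural = -eta * np.linalg.solve(G_w, grad_w)

# ... and mapped back to somatic amplitudes via dw^s = f'(w) dw.
dws_from_euclid = J @ dw_euclid
dws_from_natural = J @ dw_natural

print("Euclidean ratio:", dws_from_euclid / dws_euclid)   # equals alpha**2
print("natural ratio:  ", dws_from_natural / dws_natural)  # equals 1
```

The Euclidean update picks up the factor $f'(w)^2$ per synapse, while the natural-gradient update yields the same somatic change in both coordinate systems.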
In order to obtain the natural-gradient learning rule in terms of w, we first express the Fisher information matrix in the new parametrization, starting with Eqn. S11. Inserting both Eqn. 27 and Eqn. 30 into Eqn. 6, we obtain the natural-gradient rule in terms of w (Eqn. 7). As illustrated in Fig. 8, unlike for Euclidean-gradient descent, the result is consistent with Eqn. 26.

Figure caption (fragment): [...] (also shown in light blue). The resulting weight change leads to an increase ∆w s in the somatic EPSP after learning. The dark blue arrows track the calculation of the same gradient, but with respect to the dendritic EPSP (also shown in dark blue): 1) taking the attenuation into account in order to compute the error as a function of w d , 2) calculating the gradient, followed by 3) deriving the associated change in ∆w s , again considering attenuation. Unlike for Euclidean-gradient descent (Fig. 1), the factor f ′ (w) 2 is compensated, since its inverse enters via the Fisher information. This leads to the synaptic weight updates, as well as the associated evolution of a neuron's output statistics over time, being equal under the two parametrizations.

Simulation Details
All simulations were performed in Python using the NumPy and SciPy packages. Differential equations were integrated using a forward Euler method with a time step of 0.5 ms.

Supervised Learning Task
A single output neuron was trained to spike according to a given target distribution in response to incoming spike trains from n independently firing afferents. To create an asymmetry in the input, we chose one half of the afferents' firing rates as 10 Hz, while the remaining afferents fired at 50 Hz. The supervision signal consisted of spike trains from a teacher that received the same input spikes. To allow an easy interpretation of the results, we chose a realizable teacher, firing with rate φ (V * ), where V * = w * T x for some optimal set of weights w * . However, our theory itself does not include assumptions about the origin and exact form of the teacher spike train.
For the learning curve in Fig. 3F, initial and target weight components were chosen randomly from a uniform distribution U(−1/n, 1/n), corresponding to maximal PSP amplitudes between −600 µV and 600 µV for the simulation with n = 100 input neurons. Learning curves were averaged over 1000 initial and target weight configurations. We did not enforce Dale's law, thus about half of the synaptic input was inhibitory at the beginning but sign changes were permitted. This means that the mean membrane potential above rest covered a maximal range of [−30 mV, 30 mV]. Learning rates were optimized as $\eta_n = 6 \times 10^{-4}$ for the natural-gradient descent algorithm and $\eta_e = 4.5 \times 10^{-7}$ for Euclidean-gradient descent, providing the fastest possible convergence to a residual root mean squared error in output rates of 0.8 Hz. Per trial, the expectation over USPs in the cost function was evaluated on a randomly sampled test set of 50 USPs that resulted from input spike trains of 250 ms. The expectation over output spikes was calculated analytically.
The vector plots in Fig. 3D For the plots in Fig. 3B-C, we used initial, final, and target weights from a sample of the learning curve simulation. We then randomly sampled input spike trains of 250 ms length and calculated the resulting USPs and voltages according to Eqn. 17 and Eqn. 16. The output spikes shown in the raster plot were then sampled from a discretized Poisson process with dt = 5. * 10 −4 . We then calculated the PSTH with a bin size of 12.5 ms.
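Sampling output spikes from a discretized Poisson process and binning them into a PSTH can be sketched as follows. The sketch assumes that the time step is given in seconds, uses a constant placeholder rate instead of the voltage-dependent rate of a trained neuron, and picks an arbitrary number of trials; none of these choices are taken from the original simulation code.

```python
import numpy as np

rng = np.random.default_rng(2)

dt = 5e-4                  # time step, assumed to be in s (0.5 ms)
T = 0.25                   # trial length in s (250 ms)
n_trials = 200             # illustrative number of repetitions
n_steps = int(round(T / dt))

# Placeholder for the instantaneous rate phi(V_t) in Hz; constant here.
rate = np.full(n_steps, 40.0)

# For small dt, a Poisson process is approximated by one Bernoulli draw
# per time step with spike probability rate * dt.
spikes = rng.random((n_trials, n_steps)) < rate * dt

# PSTH with 12.5 ms bins: count spikes per bin, average over trials,
# and convert to Hz.
bin_size = 0.0125
steps_per_bin = int(round(bin_size / dt))
n_bins = n_steps // steps_per_bin
counts = spikes[:, : n_bins * steps_per_bin]
counts = counts.reshape(n_trials, n_bins, steps_per_bin).sum(axis=2)
psth = counts.mean(axis=0) / bin_size

print("mean PSTH rate (Hz):", psth.mean())
```

Replacing `rate` with a time-varying array of the same length gives an inhomogeneous process without further changes.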

Distance dependence of amplitude changes
A single excitatory synapse received Poisson spikes at 5 Hz, paired with Poisson teacher spikes at 20 Hz. The distance from the soma was varied between 1 µm and 10 µm. Learning was switched on for 5 s with an initial weight corresponding to 0.05 at the soma, corresponding to a PSP amplitude of 3 mV. Initial dendritic weights were scaled up with the proportionality factor α(d) −1 depending on the distance from the soma, in order for input spikes to result in the same somatic amplitude independent of the synaptic position. Example traces are shown for α(d) −1 = 3 and α(d) −1 = 7.

Variance dependence of amplitude changes
We stimulated a single excitatory synapse with Poisson spikes, while at the same time providing Poisson teacher spike trains at 80 Hz. To change USP variance independently from its mean, unlike in the other experiments, the input kernel in Eqn. 19 was additionally normalized by the input rate. USP variance was varied by either keeping the input rate at 10 Hz while varying the synaptic time constant τ s between 1 ms and 20 ms, or fixing τ s at 20 ms and varying the input rate between 10 Hz and 50 Hz.
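The effect of the rate normalization can be illustrated with the stationary mean and variance of a Poisson-driven USP. The sketch below assumes a double-exponential PSP kernel, for which the integral of the squared kernel is `area**2 / (2 * (tau_m + tau_s))`, and it expresses the normalization relative to a hypothetical reference rate of 10 Hz; the function name and the reference rate are illustrative assumptions.

```python
import numpy as np

def usp_mean_var(rate_hz, tau_m_ms, tau_s_ms, eps0=1.0, ref_rate_hz=10.0):
    """Stationary USP mean and variance with a rate-normalized kernel.

    The kernel area eps0 (mV ms) is divided by rate / ref_rate, so the mean
    stays fixed while the variance shrinks with increasing rate.
    """
    r = rate_hz * 1e-3                           # spikes per ms
    area = eps0 / (rate_hz / ref_rate_hz)        # rate-normalized kernel area
    mean = area * r                              # rate * integral of kernel
    var = r * area ** 2 / (2.0 * (tau_m_ms + tau_s_ms))  # rate * integral of kernel^2
    return mean, var

# Vary the input rate at a fixed synaptic time constant.
m10, v10 = usp_mean_var(10.0, 10.0, 20.0)
m50, v50 = usp_mean_var(50.0, 10.0, 20.0)

# Vary the synaptic time constant at a fixed input rate.
_, v_fast = usp_mean_var(10.0, 10.0, 1.0)
_, v_slow = usp_mean_var(10.0, 10.0, 20.0)

print("means:", m10, m50)
print("variance ratio 10 Hz / 50 Hz:", v10 / v50)
```

Under these assumptions the mean is identical for both rates, the variance scales inversely with the rate, and slower synaptic kernels yield a smaller variance.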

Comparison of homo-and heterosynaptic plasticity
Out of n = 10 excitatory synapses of a neuron, we stimulated 5 by Poisson spike trains at 5 Hz, together with teacher spikes at 20 Hz, and measured weight changes after 60 s of learning. Initial weights for both unstimulated and stimulated synapses were varied between 1/n and 5/n. For reasons of simplicity, all stimulated weights were assumed to be equal, and tonic inhibition was assumed by a constant shift in baseline membrane potential of −5 mV. Example weight traces are shown for initial weights of 1.6/n, 2.5/n, and 4.5/n for both stimulated and unstimulated weights. The learning rate was chosen as η = 0.01.

Approximation of learning rule coefficients
We sampled the values for g 1 , . . . , g 4 from Eqn. S32 for different afferent input rates. The input rate r was varied between 5 Hz and 55 Hz for n = 100 neurons. The coefficients were evaluated for randomly sampled input weights (20 weight samples of dimension n, each component sampled from a uniform distribution U(−5/n, 5/n)).
In a second simulation, we varied the number n of afferents between 10 and 200 for a fixed input rate of 20 Hz, again for randomly sampled input weights (20 weight samples of dimension n, each component sampled from a uniform distribution U(−5/n, 5/n)).
In a next step, we compared the sampled values of g 1 as a function of the total input rate n r to the values of the approximation given by g 1 ≈ −q −1 (r between 5 Hz and 55 Hz, n between 10 and 200 neurons, 20 weight samples of dimension n, each component sampled from a uniform distribution U(−5/n, 5/n)).
Afterwards, we plotted the sampled values of γ u as a function of the approximation s (Eqn. S59, r between 5 Hz and 55 Hz, n = 100, 20 weight samples of dimension n, each component sampled from a uniform distribution U (−5 n, 5 n), 20 USP-samples of dimension n for each rate/weight-combination).
Next, we investigated the behavior of γ w as a function of g 4 V (r between 5 Hz and 55 Hz, n = 100, 20 weight samples of dimension n, each component sampled from a uniform distribution U(−5/n, 5/n), 20 USP samples of dimension n for each rate/weight combination), and, in a last step, as a function of c w V with a constant c w = 0.05.

Evaluation of approximated natural gradient rule
We evaluated the performance of the approximated natural-gradient rule in Eqn. S62 (with c u = 0.95 and c w = 0.05) compared to Euclidean-gradient descent and the full rule in Eqn. 7 in the learning task of Fig. 3 under different input conditions (n = 100; Group 1/Group 2: 10 Hz/30 Hz, 10 Hz/50 Hz, 20 Hz/20 Hz, 20 Hz/40 Hz). The learning curves were averaged over 1000 trials with input and target weight components randomly chosen from a uniform distribution U(−1/n, 1/n). Learning rate parameters were tuned individually for each learning rule and scenario according to Table 1. All other parameters were the same as for Fig. 3F.
For the angle histograms in Fig. S3A-B, we simulated the natural, Euclidean and approximated natural weight updates for several input and initial weight conditions. Similar to the setup in Fig. 3, we separated the n = 100 input afferents into two groups firing at different rates (Group 1/Group 2: 10 Hz/10 Hz, 10 Hz/30 Hz, 10 Hz/50 Hz, 20 Hz/20 Hz, 20 Hz/40 Hz).
For each input pattern, 100 Initial weight components were sampled randomly from a uniform distribution U (−5 n, 5 n), while the target weight was fixed at w * = 0.15 n , 0.15 n T . For each initial weight, 100 1 s-long input spike trains were sampled and the average angle between the natural gradient weight update and the approximated natural gradient weight update at t = 1 s was calculated. The same was done for the average angle between the natural and the Euclidean weight update.
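The angle comparison described above reduces to computing the angle between pairs of update vectors and averaging over samples. A minimal sketch follows; the helper name and the random toy vectors are hypothetical stand-ins for the simulated weight updates.

```python
import numpy as np

def angle_deg(a, b):
    """Angle between two vectors in degrees."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(4)
n, n_samples = 100, 100

# Toy stand-ins: a reference update and a perturbed version of it per sample.
reference = rng.normal(size=(n_samples, n))
perturbed = reference + 0.1 * rng.normal(size=(n_samples, n))

angles = np.array([angle_deg(u, v) for u, v in zip(reference, perturbed)])
print("average angle (deg):", angles.mean())
```

Length differences between the two updates can be compared in the same loop via `np.linalg.norm`.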

S.1 Detailed derivation of learning rule
Here, we summarize the mathematical derivations underlying our natural-gradient learning rule (Eqn. 7). While all derivations in Section S.1 and Section S.2 are made for the somatic parametrization and can then be extended to other weight coordinates as described in Section 4.3, we drop the index s in w s for the sake of readability.
Supervised learning requires the neuron to adapt its synapses in such a way that its input-output distribution approaches a given target distribution with density p * . For a given input spike pattern x, at each point in time, the probability for a Poisson neuron to fire a spike during the interval [t, t + dt] (denoted as y t = 1) follows a Bernoulli distribution with a parameter φ t dt = φ (V t ) dt, depending on the current membrane potential. The probability density of the binary variable y t on {0, 1}, describing whether or not a spike occurred in the interval [t, t + dt], is therefore given by and we have where p usp denotes the probability density of the unweighted synaptic potentials x t . Measuring the distance to the target distribution in terms of the Kullback-Leibler divergence, we arrive at Since the target distribution does not depend on the synaptic weights, the negative Euclidean gradient of the D KL equals We may then calculate where Eqn. S7 follows from the fact that y t ∈ {0, 1} and for Eqn. S8 we neglected the term of order dt 2 which is small compared to the remainder. Plugging Eqn. S7 into Eqn. S4 leads to the Euclidean-gradient descent online learning rule, given byẇ Here, Y * t = ∑ f δ t − t f is the teacher spike train. We obtain the negative natural gradient by multiplying Eqn. S9 with the inverse Fisher information matrix, since with the Fisher information matrix G (w) at w being defined as (S11) since p usp does not depend on w, we can insert the previously derived formula Eqn. S8 for the partial derivative of the log-likelihood. Hence, using the tower property for expectation values and the definition of p w (Eqn. S1, Eqn. S2), Eqn. S11 transforms to (S18) In order to arrive at an explicit expression for the natural gradient learning rule, we further decompose the Fisher information matrix, which will then enable us to find a closed expression for its inverse.
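The derivative of the Bernoulli log-likelihood that enters the derivation can be verified numerically. The sketch below assumes a logistic transfer function and writes the log-likelihood of a single time bin as y log(φ dt) + (1 − y) log(1 − φ dt); it then compares a finite-difference gradient with the exact analytical expression and with the leading-order form (y − φ dt) φ′/φ x obtained when higher-order terms in dt are dropped. All names and numerical values are illustrative.

```python
import numpy as np

beta, theta, phi_max = 0.3, 10.0, 100.0   # assumed logistic parameters
dt = 5e-4                                  # bin width in s (illustrative)

def phi(v):
    return phi_max / (1.0 + np.exp(-beta * (v - theta)))

def dphi(v):
    p = phi(v)
    return beta * p * (1.0 - p / phi_max)

def log_lik(w, x, y):
    """Log-probability of observing y in {0, 1} in one time bin."""
    p = phi(w @ x) * dt
    return y * np.log(p) + (1 - y) * np.log(1.0 - p)

def grad_exact(w, x, y):
    v = w @ x
    p = phi(v)
    return (y - p * dt) * dphi(v) / (p * (1.0 - p * dt)) * x

def grad_leading_order(w, x, y):
    v = w @ x
    return (y - phi(v) * dt) * dphi(v) / phi(v) * x

def grad_fd(w, x, y, h=1e-6):
    """Central finite differences, one weight component at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (log_lik(w + e, x, y) - log_lik(w - e, x, y)) / (2.0 * h)
    return g

rng = np.random.default_rng(3)
w = rng.uniform(-0.5, 0.5, size=5)         # toy weights
x = rng.uniform(0.0, 20.0, size=5)         # toy USP values in mV

for y in (0, 1):
    print(y, grad_fd(w, x, y), grad_leading_order(w, x, y))
```

The leading-order form differs from the exact gradient only by the factor 1 − φ dt, which approaches 1 for small time bins.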
Inspired by the approach in Amari (1998), we exploit the fact that a positive semi-definite matrix is uniquely defined by its values as a bilinear form on any basis of R n . Choosing a basis for which the bilinear products with G (w) are of a particularly simple form, we are able to decompose the Fisher information matrix by constructing a sum of matrices whose values as a bilinear form on the basis are equal to those of G (w). Due to the structure of this particular decomposition, we may then apply well-known formulas for matrix inversion to obtain G (w) −1 .
Consider the basis $B = \{w, b_1, \ldots, b_{n-1}\}$ such that the vectors $\Sigma_{\mathrm{usp}}^{1/2} w, \Sigma_{\mathrm{usp}}^{1/2} b_1, \ldots, \Sigma_{\mathrm{usp}}^{1/2} b_{n-1}$ are orthogonal to each other. Here, $\Sigma_{\mathrm{usp}}$ denotes the covariance matrix of the USPs, which in the case of independent Poisson input spike trains is given by $\Sigma_{\mathrm{usp}} = \mathrm{diag}(c_\epsilon^{-1} r_i)$. In this case, the matrix square root reduces to the component-wise square root. Note that for any $b, b' \in B$ with $b \neq b'$, the random variables $b^T x_t$ and $b'^T x_t$ are uncorrelated, since
$$\mathrm{Cov}\left(b^T x_t,\, b'^T x_t\right) = b^T \Sigma_{\mathrm{usp}}\, b' = \left(\Sigma_{\mathrm{usp}}^{1/2} b\right)^T \left(\Sigma_{\mathrm{usp}}^{1/2} b'\right) = 0 .$$

We make the mild assumptions of having small afferent populations firing at the same input rate, and that the basis $B$ is constructed in such a way that the basis vectors are not too close to the coordinate axes, so that the products $b^T x_t$ are not dominated by a single component. Then, for sufficiently large $n$, every linear combination of the random variables $b^T x_t$ and $b'^T x_t$ is approximately normally distributed, and thus the two random variables follow a joint bivariate normal distribution. Furthermore, uncorrelated random variables that are jointly normally distributed are independent. Since functions of independent random variables are also independent, this allows us to factorize the expectations of all products of the form

into products of expectations. Taking into account that $V(t) = w^T x_t$ and $E[x] = r$, we arrive at

Here, $I_1, I_2, I_3$ denote the generalized voltage moments given as

The integral formulas follow from the fact that for a large number of input afferents, and under the mild assumption that all synaptic weights are roughly of the same order of magnitude, the membrane potential approximately follows a normal distribution with mean

Based on the above calculations, we construct a candidate $\tilde{G}(w)$ for a decomposition of $G(w)$. We start with the matrix $c_1 \epsilon_0^2\, r r^T + \Sigma_{\mathrm{usp}}$, since it is easy to see that

Exploiting the orthogonality properties according to which we constructed $B$, we carefully add further terms such that the other identities in Eqn. S23 also hold, and arrive at

Here,

To check that indeed $\tilde{G}(w) = G(w)$, it suffices to check the values of $\tilde{G}$ as a bilinear form on the basis $B$.
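The basis construction can be illustrated numerically. The following is a minimal sketch, assuming a diagonal $\Sigma_{\mathrm{usp}}$ and using Gram-Schmidt in the inner product $\langle a, b \rangle = a^T \Sigma_{\mathrm{usp}} b$; the function name `sigma_orthogonal_basis`, the rates and the constant `c_eps` are illustrative choices and not taken from the original simulations.

```python
import numpy as np

def sigma_orthogonal_basis(w, sigma_diag, rng):
    """Return a basis {w, b_1, ..., b_{n-1}} as columns with b^T Sigma b' = 0 for b != b'.

    Gram-Schmidt in the inner product <a, b> = a^T diag(sigma_diag) b,
    starting from w and completing the basis with random directions.
    """
    n = w.size
    basis = [w.astype(float)]
    while len(basis) < n:
        v = rng.standard_normal(n)              # random candidate direction
        for b in basis:                         # remove Sigma-projections onto earlier vectors
            v = v - (b @ (sigma_diag * v)) / (b @ (sigma_diag * b)) * b
        basis.append(v)
    return np.stack(basis, axis=1)

rng = np.random.default_rng(0)
n = 6
rates = rng.uniform(5.0, 55.0, size=n)          # illustrative input rates
c_eps = 2.0                                     # hypothetical constant, chosen for illustration
sigma_diag = rates / c_eps                      # diagonal of Sigma_usp
w = rng.uniform(0.0, 1.0, size=n) / n           # illustrative weight vector

B = sigma_orthogonal_basis(w, sigma_diag, rng)
gram = B.T @ (sigma_diag[:, None] * B)          # entries b^T Sigma_usp b'
off_diag = gram - np.diag(np.diag(gram))        # covariances between distinct projections
print(np.max(np.abs(off_diag)))                 # numerically zero
```

The vanishing off-diagonal entries of `gram` are exactly the covariances of the projections $b^T x_t$ and $b'^T x_t$ for distinct basis vectors, so the projections onto this basis are pairwise uncorrelated.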

S.2 Inverse of the Fisher Information Matrix
As the expectation of the outer product of a vector with itself, $G(w)$ is by construction symmetric and positive semidefinite. From the previous calculations, it follows that for the elements $b$ of a basis $B$ of $\mathbb{R}^n$, the products $b^T G(w)\, b$ are strictly positive. Hence $G(w)$ is positive definite and thus invertible. We showed that

We introduce the notation
$$\bar{w} = \Sigma_{\mathrm{usp}}\, w \quad \text{and} \quad \bar{r} = \epsilon_0\, \Sigma_{\mathrm{usp}}^{-1}\, r . \tag{S33}$$
Then,

For the following calculations, we will repeatedly use the identities

By the Sherman-Morrison-Woodbury formula, the inverse of an invertible rank-1 correction $(A + uv^T)$ of an invertible matrix $A$ is given by
$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} .$$
To invert $G(w)$, we first consider the term $M_1 + M_2$ and identify $M_1 = A$ and $M_2 = uv^T$. Its inverse is given by

with

Applying the Sherman-Morrison-Woodbury formula a second time, this time with $M_1 + M_2 = A$ and $M_3 = uv^T$, we obtain

Here,

Plugging in the definitions of $M_3$ and $M_1$, and using that

we arrive at

After rearranging and grouping the terms, we obtain the inverse of the Fisher information matrix as

with
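The two-step inversion can be checked numerically. The sketch below applies the rank-1 update formula twice to a sum of an invertible diagonal matrix and two rank-1 terms. The matrices `M1`, `M2`, `M3` are random stand-ins with the same structure, not the actual terms of the decomposition of $G(w)$, and the helper name `sherman_morrison` is chosen here for illustration.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Inverse of (A + u v^T), given A^{-1}, via the rank-1 update formula."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(1)
n = 5
M1 = np.diag(rng.uniform(1.0, 2.0, size=n))    # invertible diagonal part (stand-in)
u2 = rng.standard_normal(n)
u3 = rng.standard_normal(n)
M2 = np.outer(u2, u2)                          # first rank-1 term (stand-in)
M3 = np.outer(u3, u3)                          # second rank-1 term (stand-in)

M1_inv = np.diag(1.0 / np.diag(M1))            # a diagonal matrix is inverted entry-wise
inv_12 = sherman_morrison(M1_inv, u2, u2)      # (M1 + M2)^{-1}
inv_123 = sherman_morrison(inv_12, u3, u3)     # (M1 + M2 + M3)^{-1}

print(np.allclose(inv_123, np.linalg.inv(M1 + M2 + M3)))
```

Because the starting matrix is diagonal, no general matrix inversion is needed at any step; each update costs only matrix-vector products.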

S.3 Analysis of the learning rule coefficients
In order to gain an intuitive understanding of Eqn. 7 and to judge its suitability as an in-vivo plasticity rule, we require insights into the behavior of the coefficients γ s , γ u , and γ w under various circumstances.

S.3.1 Global scaling factor
The global scaling factor $\gamma_s$ is given as

Figure S1: Global learning rate scaling $\gamma_s$ as a function of the mean membrane potential. We sampled the global learning rate factor $\gamma_s$ (blue) for various conditions. In line with Eqn. S56, $\gamma_s$ is boosted in regions where the transfer function is flat, i.e., $\phi'(V)$ is small. The global scaling factor is additionally increased in regions where the transfer function reaches high absolute values.

The above formula reveals that $\gamma_s$ is closely tied to the firing nonlinearity of the neuron, as well as to the statistics of its output. The scaling by the inverse slope of the output nonlinearity amplifies the synaptic update in regions where a small change in weight, and thus in the membrane potential, would not lead to a noticeable change in the output distribution. A further scaling with $\phi$ additionally amplifies the synaptic change in high-output regimes (Fig. S1). This is in line with the spirit of the natural gradient, which ensures that the size of the synaptic weight update is homogeneous in terms of the change in $D_{\mathrm{KL}}$ rather than in absolute weight terms. Furthermore, the rescaling is based on the average output statistics, which the synapse might have access to via backpropagating action potentials (Stuart et al., 1997), rather than on an instantaneous value.

S.3.2 Empirical analysis of $\gamma_u$ and $\gamma_w$
While the formula for $\gamma_s$ provides a rather good intuition for this coefficient's behavior, the derivations in the previous sections make clear that such a straightforward interpretation is not readily available from the formulas for $\gamma_u$ and $\gamma_w$. Defined as

and $\gamma_w = c_\epsilon \epsilon_0\, g_2 \sum_{i=1}^{n} x_i + g_4 V$, where $g_1, \ldots, g_4$ are given in Eqn. S52, these coefficients depend, apart from the membrane potential and its first and second moments, on the total input and its mean, the total rate. This raises the question of whether the synapse, which in general has only limited access to global quantities, can implement $\gamma_u$ and $\gamma_w$. We therefore used an empirical analysis through simulations to obtain more detailed insight.
As a starting point, we sampled $g_1, \ldots, g_4$ for various input conditions (Fig. S2A-D, refer to Section 4.4.5 for simulation details). Here, we varied the afferent input rate $r$ between 5 Hz and 55 Hz for $n = 100$ neurons and evaluated the value of the respective coefficient for a randomly sampled input weight. In a second simulation (Fig. S2E-H, Section 4.4.5), we varied the number $n$ of afferent inputs between 10 and 200 neurons for a fixed input rate of 20 Hz, again with randomly chosen input weights. This revealed an approximately inversely proportional relationship between the average input rate $r$ and each of $g_1$, $g_2$ and $g_3$. Furthermore, these coefficients seemed to be approximately inversely proportional to the number $n$ of afferent inputs. However, this was not true for $g_4$, whose mean seemed to stay approximately constant across input rates, although the scatter around the mean value (for different weight samples) increased.
While the value of the membrane potential stays bounded even for a large number of inputs (due to the normalization of the synaptic weight range with $\frac{1}{n}$), the total sum of USPs increases with an increasing number of inputs. Therefore, for large enough $n$, the term $g_3 V$ can be neglected, so $\gamma_u \approx c_\epsilon^2 \epsilon_0^2\, g_1 \sum_{i=1}^{n} x_i$. A closer look at the behavior of $g_1$ shows that, for a sufficiently large number of neurons or high input rates, $g_1 \approx -q^{-1}$, which we verified by simulations (Fig. S2I, Section 4.4.5). In consequence,

To verify this approximation, we sampled the values of $\gamma_u$ for various conditions (Fig. S2J, Section 4.4.5) and compared them against the approximation in Eqn. S59; the sampled values agreed with it. Since the input rate is the mean value of the USP, assuming large enough populations with the same input rate and a sufficient number of input afferents, by the central limit theorem we have
$$\gamma_u \to c_u\, c_\epsilon \quad \text{for } n \to \infty , \tag{S60}$$
with $c_u = \epsilon_0$. However, in practice, for $\epsilon_0 = 1$ mV ms, a learning behavior much closer to natural gradient was obtained when $c_u$ was slightly smaller than 1, such as $c_u = 0.95$ (cf. Section S.4).
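The limiting argument rests on the total input concentrating around its mean as the number of afferents grows. The sketch below illustrates only this concentration step and is not a reproduction of the simulations in Section 4.4.5. It assumes, for illustration, that each USP can be replaced by a Gaussian sample whose mean is the input rate times a constant `eps0` and whose variance is the rate divided by a constant `c_eps`; both names and all numerical values are hypothetical.

```python
import numpy as np

eps0 = 1.0     # hypothetical scaling between input rate and mean USP
c_eps = 50.0   # hypothetical constant setting the USP variance

def usp_rate_ratio(n, rng, rate=20.0):
    """Ratio of total sampled input to total rate for n afferents firing at the same rate.

    Gaussian surrogate for the USPs: mean eps0 * rate, variance rate / c_eps.
    """
    rates = np.full(n, rate)
    x = rng.normal(loc=eps0 * rates, scale=np.sqrt(rates / c_eps))
    return x.sum() / rates.sum()

rng = np.random.default_rng(2)
for n in (10, 1000, 100000):
    # The ratio approaches eps0, with fluctuations shrinking like 1 / sqrt(n).
    print(n, usp_rate_ratio(n, rng))
```

Under these assumptions, any coefficient that depends on the inputs only through the ratio of total input to total rate becomes effectively constant for large afferent populations, which is the property that makes a fixed $c_u$ plausible.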
As a starting point to approximate $\gamma_w$, we noticed that the mean of $g_4$ stayed approximately constant when varying the input rate or the number of input afferents. On the other hand, $g_2$ rapidly tends to zero in those cases, so we assumed that $g_2 \sum_{i=1}^{n} x_i$ either stays constant or goes to zero in the limit of large $n$. Since $c_\epsilon \epsilon_0\, g_2$ seemed to be rather small compared to $g_4$, we hypothesized $\gamma_w \approx g_4 V$, which was confirmed by simulations (Fig. S2K, Section 4.4.5). As a second step (Fig. S2L, Section 4.4.5), since $g_4$ seemed to be constant in the mean, we approximated
$$\gamma_w \approx c_w V , \tag{S61}$$
where simulations with $c_w = 0.05$, close to the mean of $g_4$, showed a learning behavior close to natural gradient learning (Fig. S3C-F). Replacing $\gamma_u$ and $\gamma_w$ in Eqn. 7 by the expressions in Eqn. S60 and Eqn. S61, we obtain the approximated natural gradient rule for $\dot{w}$.

S.3.3 Performance of the approximated learning rule
Simulations of natural, Euclidean and approximated natural gradient weight updates for several input patterns and randomly sampled initial conditions (Section 4.4.6) showed that the average angles (both in the Euclidean metric and in the Fisher metric) between the true and the approximated natural gradient weight updates were small compared to the average angle between the Euclidean and the natural gradient weight updates (Fig. S3A-B). This was confirmed by the learning curves for several tested input conditions in the setting of Fig. 3, where the performance of the approximation lay between that of the natural and that of the Euclidean gradient (Fig. S3C-F, simulation details: Section 4.4.6). The approximation can hence be regarded as a trade-off between optimal learning speed, parameter invariance and biological implementability.
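For concreteness, the angle between two weight updates with respect to a metric can be computed as sketched below. This is a generic illustration of angles under an inner product $\langle u, v \rangle_G = u^T G v$, with a toy metric `G` standing in for the Fisher information matrix; the function name `angle_deg` and the example vectors are hypothetical and not taken from Section 4.4.6.

```python
import numpy as np

def angle_deg(u, v, metric=None):
    """Angle in degrees between u and v under the inner product u^T metric v.

    With metric=None, the Euclidean inner product is used.
    """
    if metric is None:
        metric = np.eye(u.size)
    inner = u @ metric @ v
    norm_u = np.sqrt(u @ metric @ u)
    norm_v = np.sqrt(v @ metric @ v)
    cos = np.clip(inner / (norm_u * norm_v), -1.0, 1.0)   # guard against rounding
    return np.degrees(np.arccos(cos))

G = np.diag([1.0, 4.0])            # toy positive definite metric (stand-in for the Fisher matrix)
u = np.array([1.0, 0.0])           # illustrative update direction
v = np.array([1.0, 1.0])           # second illustrative update direction

euclidean_angle = angle_deg(u, v)
fisher_angle = angle_deg(u, v, G)
print(euclidean_angle, fisher_angle)
```

The same pair of vectors encloses different angles under the two metrics, which is why both are reported when comparing the approximated rule with the true natural gradient.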
so the Fisher information matrix is additive.
Then, under the assumption that T is small and firing rates are approximately constant on [0, T ], for the Fisher information matrix, we have