Stochastic variational learning in recurrent spiking networks

The ability to learn and perform statistical inference with biologically plausible recurrent networks of spiking neurons is an important step toward understanding perception and reasoning. Here we derive and investigate a new learning rule for recurrent spiking networks with hidden neurons, combining principles from variational learning and reinforcement learning. Our network defines a generative model over spike train histories and the derived learning rule has the form of a local Spike Timing Dependent Plasticity rule modulated by global factors (neuromodulators) conveying information about “novelty” on a statistically rigorous ground. Simulations show that our model is able to learn both stationary and non-stationary patterns of spike trains. We also propose one experiment that could potentially be performed with animals in order to test the dynamics of the predicted novelty signal.


INTRODUCTION
Humans and animals are able to learn complex behavioral tasks and memorize events or temporally structured episodes. Most likely, learning and memory formation are intimately linked to changes in the synaptic connection strength between neurons. Long-term potentiation and depression of synapses can be induced by many different experimental protocols, and depend on voltage (Artola and Singer, 1993;Ngezahayo et al., 2000), spike-timing (Markram et al., 1997;Bi and Poo, 2001) as well as a subtle combination of timing, voltage, and frequency (Sjöström et al., 2001;Clopath et al., 2010). Spike-Timing Dependent Plasticity (STDP) has intrigued theoreticians, because it provides a local Hebbian learning rule for spiking neurons; local, here, means that the dynamics of the synapses is of the form d dt w ij ∝ h(post i , pre j ), where pre j is the set of pre-synaptic variables of neuron j (e.g., spike timing) and post i is the set of post-synaptic variables of neuron i (e.g., spike times and voltage) and h is an arbitrary functional.
With relation to neuroscience, unsupervised learning is most commonly related to developmental plasticity (Miller et al., 1989), formation of receptive fields (Song and Abbott, 2001) or cortical rewiring (Young et al., 2007). Indeed most early applications of unsupervised STDP concern the learning of feedforward connections and the formation of receptive fields (Gerstner et al., 1996;Kempter et al., 1999;Song et al., 2000;Song and Abbott, 2001). Unsupervised STDP will tune to the earliest spikes (Song and Abbott, 2001;Gerstner and Kistler, 2002;Guyonneau et al., 2005) and can perform Independent Component Analysis (Clopath et al., 2010;Savin et al., 2010).
On the level of behavioral neuroscience, human performance approaches in many psychophysical learning paradigms Bayes optimality, i.e., the best statistical model cannot perform better than humans do (Knill and Pouget, 2004;Körding and Wolpert, 2004). This supports the hypothesis that the brain is performing approximate inference, which implies that the brain has access to prior and posterior distribution of possible explanations of the observed data (Berkes et al., 2011).
These findings lead to the idea that the spiking activity of the brain constitutes a generative model, that is, a model of the joint distribution of percepts (observed spike trains induced by sensors) and hidden causes in the world (hidden spike train generated by neurons that are not directly affected by sensor spikes). The ability to model hidden causes in the sensory data is important for both stationary and non-stationary situations. For stationary distributions of spike trains, hidden neurons are important to encode potential interpretations or explanations of spike patterns observed at the sensory neurons (visible neurons). For non-stationary sequences, hidden neurons are fundamental for representing the non-observed dynamics and to form long-term memories.
Various abstract Bayesian models have been proposed to account for this phenomenon (Körding and Wolpert, 2004;Deneve, 2008;Nessler et al., 2009). However, it remains an open question whether optimization in abstract Bayesian models can be translated into plausible learning rules for spiking neurons. If one considers only stationary input patterns, an explicit relation between Bayesian inference and synaptic plasticity has been suggested (Habenschuss et al., 2012). Moreover, it has been suggested recently (Pecevski et al., 2011;Shao, 2012) that spiking networks with biologically plausible dynamics can produce stationary samples from Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) and more general Bayesian networks. In the following we drop the limitations of stationary spatial patterns and consider a recurrent network of stochastically spiking neurons as a generative model of spatio-temporal spiking patterns.
Learning the synapses between hidden neurons in recurrent models has been recognized as a difficult problem, as these synapses only contribute indirectly to the activity of the visible neurons and as recurrence leads to the "vanishing gradient" problem (Bengio et al., 1994). Conversely, in machine learning the algorithms known to be efficient for learning graphical and recurrent models are typically non-local (Jaakkola and Jordan, 2000;Bhatnagar et al., 2007;Sutskever et al., 2008;Salakhutdinov and Hinton, 2009). That is, in order to efficiently perform parameter updates in these models, their learning rules either take into account the entire state of the model or it requires that information to propagate in a non-causal maner through the synapses.
Therefore, the locality constraints imposed by biology constitutes one of the major challenges in transforming those algorithms into biologically plausible learning rules.
Even in biology, however, certain types of non-local signals participate in the learning process, notably through neuromodulators which can convey information about the global state of the network or external information (e.g., reward or surprise) (Izhikevich, 2007;Schultz, 2008;Fremaux et al., 2010).
Here, we derive a principled learning rule for unsupervised learning in recurrent spiking networks that relies only on quantities that are locally available at the synapse: pre-synaptic activity, post-synaptic activity and a global modulatory signal.
The key innovation of our model compared to earlier studies (Brea et al., 2011;Jimenez Rezende et al., 2011) lies in the computation of a global modulating signal which is a linear superposition of local terms and can therefore be interpreted as the diffusion of a neuromodulator in the extra-cellular medium.
Furthermore our global neuromodulating signal conveys information about novelty or surprise on statistically rigorous grounds, providing interesting links with findings relating surprise and plasticity (Gu, 2002;Ranganath and Rainer, 2003;Yu and Dayan, 2005).
We show with simulations based on synthetic data that our proposed learning mechanism is capable to capture complex hidden causes behind the observed spiking patterns and is able to replicate, in its spontaneous activity, the statistics of the observed spike trains.
Finally, we provide an application of our model to a hypothetical novelty detection task where a simulated agent (e.g., a rat) is inserted into a maze with specific properties (views, rooms, topology of the maze). Our model, simulating the "brain" of this agent, successfully captures the statistical properties of this environment. We show that, after learning, our model is capable of distinguishing the original environment from another environment that differs only in its topology (relative location of the rooms). Additionally, we predict the "expected dynamics" of a neuromodulator signaling "novelty" as the agent traverses the virtual maze. We propose that this hypothetical experiment could be developed into a real experiment.

NEURON MODEL
The neuron model used in our simulations is a generalized linear model (GLM) in the form of a Spike Response Model (SRM) with escape noise (Gerstner and Kistler, 2002;Jolivet et al., 2006). The spike train of a neuron j for times t > 0 is denoted as We model the membrane potential of a neuron i as τ adapt X j (s) is the adaptation potential, w ij is the synaptic strength between neurons i and j and φ j (t) is the potential evoked by an incoming spike from neuron j. The evoked potentials are modeled by a simple exponential filter where τ is the time constant of the membrane potential. The spikes are generated by a conditioned Poisson process with exponential escape rate (Jolivet et al., 2006). That is, the conditional instantaneous firing intensity ρ i (t) is taken to be where ϑ and u are physical constants of the neuron. However, we will keep ρ i (t) as an arbitrary function of u i (t), ρ i (t) = g(u i (t)), in all our derivations and we specify the exponential form (Equation 3) only when performing the simulations (see further below). Equations (1-3) capture the simplified dynamics of a spiking neuron with stochastic spike firing.
In the following simulations we assume that two different neurons i and j can have at most two common synapses, w ij and w ji . The neuron model and the potentials contributing to its activation are illustrated in Figure 1.
In the following sections, we will first introduce the theoretical framework in which we derive our learning rule. The learning mechanism is then derived in several steps, followed by simulations showing that our model and learning rules are capable of capturing complex spatio-temporal features in the input spike trains, reproduce them in its spontaneous activity and perform statistical inference on the hidden causes of provided data.

A PRINCIPLED FRAMEWORK FOR LEARNING
In the following we consider a stochastic, fully connected network M composed of two sets of neurons which are functionally distinct. The first group which we call observed or visible neurons (denoted by V) represents an ensemble of neurons which receive data in the form of spike trains. The spike trains of the visible neurons will be referred to as X V . These neurons are exposed to the external world. The second set, which we call hidden neurons (denoted by H) does not directly receive data from the external world. The spike trains of the hidden neurons will be referred to as X H . Their role is to provide "compressed explanations" for the observed data. The topology of this network is illustrated in Figure 2C. In the absence of external drive, the spontaneous activity of the hidden neurons will contribute to the firing of the observed neurons. The spike trains of the entire network comprising both observed and hidden neurons will be indicated simply as X.
Our network defined in this way constitutes a generative model of spike trains with hidden neurons. In the following we interpret synaptic potentiation and depression as a form of optimization of this generative model. More precisely, we assume that synaptic plasticity (the "learning rule") is trying to increase the likelihood of the observed spike trains under the model.
In what follows, we review the calculation of the complete-data log-likelihood log p(X V , X H ) for a recurrent network of pointprocess neurons.  From Equations (1-3) it follows that the probability of producing a spike in the infinitesimal interval [t, t + t] by the ith neuron conditioned on the past activity of the entire network X(0 . . . t) and provided that ρ i (t) t 1 is given by and the probability of producing no spike in the same interval is given by Therefore, by discretizing a finite time interval [0, T] into N sufficiently small bins [t k , t k + t] with k = 1 . . . N so that there is at most one spike in each bin and assuming independence between neurons within the infinitesimal bins we can write the probability for all neurons in the network as where k s i and k ns i labels the bins with one spike and bins without spike from neuron i, respectively. Taking the limit N → ∞ of Equation (6) divided by the volume t K , where K is the total number of spikes in the interval [0, T] we obtain the probability density The log-likelihood corresponding to Equation (7) can be compactly written as It should be stressed that the log-likelihood (Equation 8) is not a sum of independent terms, since the instantaneous firing rate of each neuron depends on the entire past activity of all the other neurons through Equations (1-3).
In the following sections we derive a plasticity rule that will attempt to maximize the log-likelihood of the observed data log p(X V ). For this, we briefly review the calculation of the gradient of the observation likelihood in absence of hidden neurons and then we introduce our method for approximating its gradient when there are hidden neurons.

FULLY OBSERVED NETWORK
The gradient of the log-likelihood of fully observed networks of SRM neurons and similar point-processes have been studied in detail in Paninski (2004) and Pfister et al. (2006). In what follows we review the calculation of the gradients of the data loglikelihood (Equation 8) with respect to the synaptic weights and then discuss a simulation revealing the limitations of the model.
In a network without hidden neurons as illustrated in Figures 2A,B, the data log-likelihood (Equation 8) reduces to a simple form In Paninski (2004) and Pillow et al. (2004) it is shown that the optimization problem defined by the log-likelihood (Equation 9) is convex. Therefore its global maximum can be found by gradient ascent.
The gradient of Equation (9) with respect to the synaptic efficacies w ij is given by where the gradients are obtained by differentiating the firing rate function (Equation 3): We conclude that the gradient ∇ w ij log p(X V ) can be calculated in a purely local manner. More precisely, an update of the weights according to gradient ascent w ij ∝ ∇ w ij log p(X V ) yields a learning rule that is simply a "trace" of a product of two factors: A first factor φ j t that depends only on the presynaptic activity and a second factor g (uk(t )) g(u k (t )) X k t − ρ k t that depends on the state of the postsynaptic neuron. Moreover, the gradient (Equation 10) has been shown to yield a simplified form of STDP (Pfister et al., 2006). The notion of "trace" (i.e., a temporal average of some quantity) will return for other learning rules later in this paper.
In order to expose the weakness of the fully observed model described above, we test its performance on a task which consists of learning "stair patterns" involving 3 groups of 10 visible neurons, which are probabilistically activated using low (1 Hz) and high (700 Hz) firing rates. The activations generate a sequential pattern where each group remains active for a duration drawn from a Gaussian distribution with mean 30 ms and standard deviation 10 ms (truncated at positive values), ( Figure 3A). This benchmark is interesting firstly because it requires the formation of memories on the order of three times the membrane time constants of the single neuron dynamics and secondly because it requires the ability to learn the appropriated transition probabilities. Our simulations show that a fully observed network is not capable of learning such patterns, as can be seen by looking at the samples produced from the learned network ( Figure 3B). However, a model with 50 hidden neurons (with the learning rule discussed further below) can learn the distribution (Figures 3C,D). Moreover, even if the fully observed network could learn the data distribution it would not provide any useful representation of the data, while a network with hidden neuron naturally forms higher-level representations of the incoming data ( Figure 3E)(top). For precise details concerning the numerical simulations and evaluations, see further below.

Frontiers in Computational
This simulation corroborates the intuition that a network consisting of visible neurons only is rather limited in scope. We therefore turn in the following to a more general network consisting of both visible and hidden neurons.

PARTIALLY OBSERVED NETWORK
In the following we first introduce our model that includes hidden neurons. We derive our learning rule and introduce a few modifications to improve its performance. Finally, we show with simulations that the resulting model can learn spiking patterns that couldn't be learned by a model without hidden neurons.
In a model that includes hidden neurons (that is, neurons not directly connected to the incoming data), the marginalized likelihood of the visible neurons is obtained by integrating over all possible hidden spike trains X H , In the following we introduce the variational approximation scheme for approximating Equation (12). The variational approach consists of approximating a complex distribution p by a simpler distribution q and provides a flexible generalization of the expectation-maximization (EM) algorithm (see Jaakkola and Jordan, 2000;Beal and Ghahramani, 2006). We are interested in approximating the posterior p(X H |X V ) by another distribution over spike trains q(X H |X V ). We optimize the parameters of the distribution q(X H |X V ) by minimizing the KL-divergence where f (X) p = DXf (x)p(x). The first term in Equation (13) is known in statistical physics as the Helmholtz free energy (Landau et al., 1980), The second term in Equation (13) is simply the data loglikelihood. Since the KL-divergence KL(q; p) between two distributions q and p is non-negative (Gibbs and Su, 2002), the free energy (Equation 14) is an upper bound on the negative loglikelihood. Therefore, we can redefine the problem of maximizing the data log-likelihood log p(X V ) with respect to the parameters of the generative model p as the double optimization problem of minimizing the free energy F with respect to the parameters of q and with respect to the parameters of the original model p.
The attractiveness of such an approach for deriving biologically plausible rules comes from the fact that the distribution q is arbitrary (as long as it has the same support as the true distribution). Therefore we can either choose it in order to simplify the calculations or to improve the model's compliance with known biological constraints, such as causality and locality of the weight updates. These interesting properties of the variational approximation have been explored by a diversity of models (Dayan, 2000;Friston and Stephan, 2007;Jimenez Rezende et al., 2011;Brea et al., 2013;Nessler et al., 2013).
In what follows, we postulate that the posterior distribution of the hidden spike trains of the network M can be well approximated by another recurrent network of spiking neurons which we call the network Q. This network is composed of the same neurons as the original model, except that it has different synaptic connections. Its connectivity is depicted in Figure 2D.
The assumption of an "inference" network Q is analogous to the recognition model introduced in Dayan (2000) for the Helmholtz machine.
Our model differs from the Helmholtz machine, as introduced in Dayan (2000), in two key aspects: (1) The Helmohtz machine is a model for stationary data, i.e., it cannot readily model temporal sequences; (2) Although the recognition network is introduced in a variational framework, the proposed learning rule is not attempting to minimize a free energy (there are different cost functions for the generative and recognition models) whereas our learning rule is explicitly attempting to minimize the free energy associated to the model.
In other words, our model consists of a single set of neurons with two sets of synaptic weights that can be turned on and off independently (possibly through the action of specific neuromodulators). The first set of weights, which we will refer to as w M ij , parameterizes the original generative model M. The second set of weights, which we will refer to as w Q ij parameterizes the network Q. The topology of the network Q is restricted and excludes connections toward the visible neurons from hidden and other visible neurons.
In the following, all the quantities (e.g., membrane potentials and firing rates) that are computed using the parameters w M ij (i.e., with Q turned off) will have the superscript M. Analogously, all the quantities that are computed using the parameters w Q ij (i.e., with M turned off) have the superscript Q.
When driven solely by the observed spike trains and by the weights w Q ij , the activity of the hidden neurons X H is, by construction, an approximated sample from true the posterior distribution p(X H |X V ) provided by the distribution q(X H |X V ). Therefore, if we want our model to perform approximated Bayesian inference on the most likely hidden causes of an observed spike train X V we just have to run it with the synapses w M ij turned off. Conversely if we want to produce a sample from the learned generative model, we have to run it with the synapses w Q ij turned off.
In the next sections we derive stochastic estimators of the gradients of the free energy F with respect to the synaptic weights w M ij and w Q ij in a biologically plausible manner. We show that the naive gradients obtained for w Q ij are problematic since their variance grows quadratically with the size of the network. We then introduce a simple modification to reduce the variance of the obtained gradients based on techniques from reinforcement learning. Finally we modify the gradient estimators so that they turn into "on-line" parameter updates.

STOCHASTIC GRADIENTS
The complete data log-likelihood L M of the model M and L Q of the model Q are given by and respectively. The free energy (Equation 14) corresponding to the loglikelihoods (Equations 15, 16) is given by In the following we simplify the notation and write • q instead of • q(X H |X V ) . We wish to write the learning equations for both w M ij and w Q ij as simple gradient descent on the free energy: where μ M and μ Q are learning rates for the networks M and Q, respectively. The exact gradients ∇ w M ij F and ∇ w Q ij F are difficult to evaluate analytically since we cannot compute the required expectations. Therefore we resort to unbiased stochastic approximations of those gradients. The calculation of the gradient ∇ w M ij F is analogous to the fully observed case and is given by where we have used the fact that where we have defined the "instantaneous free energy" F τ as Note that during learning the activity of the visible neurons is driven purely by the observed spike trains while the activity of the hidden neurons and related quantities (e.g., ρ Q i and ρ M i ) is driven by the Q-network. Expanding the remaining gradients in Equations (20, 21) using the chain rule we obtain the "batch mode" learning equationṡ Note that the learning rule (Equation 24) is the same as in the fully observed network case in Equation (10). The learning rule (Equation 25) for the network Q is similar, but contains an additional modulation factor, the point estimate of the free energyF , which appears as a global signal that modulates the learning of the network Q. SinceF provides a lower bound on the data loglikelihood, the free energy measures how much the recent history of observed spike trains "fits" the generative model defined by the network M. The assumption of a globally available signal conveying information about reward or surprise is standard in the reinforcement learning literature. The naive stochastic gradients (Equations 24, 25) are not efficient in practice. Even though they constitute unbiased estimators of the true gradients, their variance is prohibitively high. We address this problem below.

REDUCING THE VARIANCE OF THE GRADIENTS
Stochastic gradients of the form Equation (25) have been extensively studied in the reinforcement learning literature and several approaches with different levels of complexity have been proposed for reducing their variance (Munos, 2005;Bhatnagar et al., 2007). In the following, we sketch some theoretical arguments for why gradients of the form of Equation (25) should be expected to scale badly with the size of the network.
Let N be the number of neurons in the network and T be the time window on which we compute the relevant quantities. Equation (25) contains the free energyF which is, according to Equation (22), a sum of traces (integrals on the interval [0, T]) for each neuron. Therefore, in a weak-coupling scenario, the estimator (Equation 25) is the integral on the interval [0, T] of a sum of N weakly correlated terms in the free energy, that we assume to have some typical variance σ 2 0 and mean m. Under these assumptions, the variance of the gradient (Equation 25) scales approximately as That is, as the size of the network grows, the variance of the gradient (Equation 25) grows with the square of the number of neurons. A naive solution would be to decrease the learning rate as μ Q ∝ 1/N but this would make learning too slow for larger networks.
In the following, we adopt a simple baseline removal approach to reduce the variance of our gradient estimator. That is, we simply subtract the meanF of the free energyF , calculated as a moving average across several previous batches of length T, from the current valueF (T). This yields the learning rulė where we have introduced the "free energy error signal" e(T) = F(T) −F . With this simple change, we can see that the variance of our gradient scales as The baseline removal trick is not a solution to the problem but it drastically reduces the variance of our estimator when the size of the network is increased. Note that the quadratic or linear growth of the gradient's variance with the number of neurons is not just an artifact of our variational approximation. The arguments presented in this section apply to any learning rule that has the form of local Hebbian traces modulated by global signals that are composed of several weakly correlated terms. For instance, the learning rule proposed in Brea et al. (2011) also falls into this category.

ONLINE vs. BATCH LEARNING
The gradients that define our learning scheme (Equations 24, 26) are given in terms of quantities accumulated over a given time interval [0, T]. In this sense, we have derived a "batch" learning rule and we would have to wait until the end of the interval [0, T] in order to apply the changes in the parameters of our model.
However, compatibility with biology requires that we have an online version of our algorithm. This can be approximately achieved if, instead of accumulating the traces required for calculating the gradients on time interval [0, T] and applying the parameters updates in the end, we replace the traces by moving averages and apply the parameter updates at every time step. The modified learning rules can be formulated as follows. At each synapse, we have "Hebbian traces" H M,Q ij (t) that keep track of pre-and post-synaptic activity and evolve according to where we have introduced a time constant τ G controlling the time-scale of the moving averages. Similarly, the online estimate of the error signal e N (T) is obtained by replacing time integrals in the interval [0, T] with moving averages That is,F is a "short term" moving average of the instantaneous free energy F t (Equation 23) with time-scale τ G whileF is a "long term" moving average of F t with a longer time-scale depending on τ G and τ baseline . Note that the "error signal" e N (T) can also be interpreted as an instantaneous surprise measure relative to the slow "background" surprise levelF . The updates of the weights uses the Hebbian traces and fixed learning rates μ M and μ Qẇ Thus the update of the M-network is given by a "Hebbian" rule whereas the update of the Q-network follows a three-factor rule with the surprise as a global factor. This architecture is illustrated in Figure 4A and the three-factor rule for the Q-synapses is illustrated in Figure 4C. Another way of interpreting the learning rule (Equation 32) is that it is simply proportional to the covariance between the Hebbian trace H Q ij and the moving average of the free energyF . If they are uncorrelated, the expected change in the parameters will be zero and the synaptic weights will just perform a centered random walk.

A SIMPLIFIED MODEL
Since the variational distribution q in Equation (13) can be arbitrary, one could imagine a model simpler than the one derived in the previous sections which consists in approximating the posterior distribution of the hidden neurons directly by the forward dynamics of the generative model. In practice this approximation amounts to constraining the synaptic weights w Q ij of the Q-network to be equal to the synaptic weights w M ij of the M-network for i ∈ H and j ∈ V ∪ H.
Under this constraint, the learning equations (31) and (32) and the instantaneous free energy simplifies to a sum over the observed neurons only The architecture of this model is illustrated in Figure 4B The idea of using the forward-dynamics as a proposal distribution for the posterior has been used in Brea et al. (2011) and Brea et al. (2013), where the proposals are then weighted by an importancesampling scheme to better represent the true posterior distribution over the hidden activity. Further below (see results) we show that, at least in the context of the variational learning discussed here, this approximation does not outperform our more general model. The values of the parameters used in this study are reported in Table 1.
The initial synaptic weight of both M and Q were sampled from a Gaussian distribution with mean zero and standard deviation of 0.01.
For all experiments, the training data consists of binary arrays with ones indicating spikes and zeros indicating no-spike in the corresponding time bin. The training data-sets are organized in batches of 200 ms which are sequentially presented to the model. During our training sessions, each learning epoch corresponds to 500 presentations of data batches. In other words, each epoch corresponds to a total of 100 s of spiking data. During the learning phase, the visible neurons are exactly driven by the data spike trains, that is, they are forced to spike or to not spike in the exact same way as the data samples at each time bin. During the spontaneous activity phase the networks are running without any external drive.
The log-likelihood of test data was estimated by an importance sampling procedure. Given a generative model with density p(x v , x h ) over observed variables x v and hidden variables x h , importance sampling allow us to estimate the density p(x v ) of a data point x v as whereF (x v , x h ) is a point-estimator of the free energy. Using Equation (36) we estimate the log-likelihood of a observed spike-train by sampling several times from the network Q and taking the average of the exponentiated point-estimator of the free energy (Equation 22). For our applications, we have generated 500 samples of duration 100 ms from the network Q per estimation.

RESULTS
The learning rules derived above (see Materials and Methods) come in four different variants. First, a batch-based naive gradient descent rule. Second, a variant that reduces the variance of the gradient estimator. Third an online version of the variancereduced rule that is biologically more plausible than a batch rule. Finally, a simplified version of the model where the Q-network is merged with the M-network in a specific manner.

VARIANCE REDUCED RULE vs. NAIVE RULE
Naive gradient descent on the free energy yields the batch learning rule (Equation 25). Tested on the stairs-patterns ( Figure 3A) using a network of 30 visible neurons and 30 hidden neurons generates very slow learning. In Figure 5A we show that the learning rule with the variance reduction (Equation 26) performs substantially better than the naive gradient (Equation 25). Both networks have the same number of visible and hidden neurons, but the first network is trained with the gradient given in Equation (25) whereas the second network is trained with the gradient defined in Equation (26). The data log-likelihood of both networks is approximated by importance sampling after every learning epoch.
In Figure 5A, along the horizontal axis we plot, across several epochs the log-likelihood of the model using the learning rule with variance reduction and along the vertical axis the loglikelihood of the model without variance reduction (i.e., the naive batch gradient rule). We find that the log-likelihood of the variance reduced model is consistently above the log-likelihood of the naive model, indicating an increase in learning speed of more than a factor 100.

ONLINE ALGORITHM vs. VARIANCE REDUCED BATCH ALGORITHM
Online algorithms are biologically more plausible than batch algorithms which require storage of intermediate results.
Here we show that the online learning rules (Equations 31, 32) induce only minor impairments of the performance compared to the batch rules (Equations 24,26).
Both versions of the model were trained on the same "stair pattern" task and the results are shown in Figure 5B. Along the horizontal axis we indicate the log-likelihood of the batch model and along the vertical axis that of the online model, across several epochs of learning. Both performances are strongly correlated, indicating that the online approximation does not introduce any major impairment in the model for this particular dataset.

FORWARD DYNAMICS vs. THE Q-NETWORK
Here we show that the simplified model defined by the Equations (33, 34) does not reach the performance of the more general model defined by Equations (31,32).
Both versions of the model were trained on the same "stair pattern" task and the results are shown in Figure 5C. Along the horizontal axis we indicate the log-likelihood of our model with Q-network and along the vertical axis that of the simplified model, across several epochs of learning. Both performances are correlated. However, the simplified model has clearly lower loglikelihoods than the complete model. This result suggests that the forward dynamics of the generative model may provide a poor approximation to the true posterior distribution of the hidden neurons compared to having an independently parameterized inference network.

HIDDEN REPRESENTATIONS AND INFERENCE
To show that our model is not only learning a prior that represents the data but can also form interesting hidden representations and perform inference on the hidden explanations of the incoming data, we use the Q-network of a model with 50 hidden neurons, trained on the stairs dataset Figure 3A. The activity of the hidden neurons of the model during sampling (or "dreaming") are shown on Figure 3C (top). As we can see just by visually inspecting the activity of the hidden neurons, they form an unambiguous representation of the activity of the visible neurons in this simple example. Conversely, by running the model on "inference mode" (i.e., with the synapses w Q activated) can also see that the model is capable of performing inference on the causes of the incoming data Figure 3E.

THE ROLE OF THE NOVELTY SIGNAL
The resulting online rule given in Equation (32) for is a Hebbiantype plasticity rule modulated by a global novelty signal e N (t) where e N (t) is a measure of surprise relative to a slow moving average of the free energy. We repeat the learning rule for the Q-network from Equation (32), where H Q ij (t) is a synaptic trace that keeps track of "Hebbian" coincidence between pre-and post-synaptic activity.
The Hebbian trace is activated by the product of a voltage-dependent post-synaptic and a presynaptic EPSP caused by presynaptic spike arrival.
It is known that the brain is able to detect novelty and to broadcast novelty related signal across large brain regions (Gu, 2002;Ranganath and Rainer, 2003). In order to illustrate the dynamics of the novelty signal in our model we describe a task which could be transformed into a real animal experiment.
We place an agent (e.g., a rat) in a maze and let it explore it. During the exploration, the agent will learn the topology of the maze through a combination of visual and proprioceptive information. If the agent is suddenly transported to another maze with many similar but a few different parts we expect this change to trigger some novelty signal in the brain of the agent whenever it encounters the parts of the new maze that differ from the learned maze. In order for the agent's brain to detect a change in the environment, it must first learn a sufficiently accurate model of the environment. We hypothesized that our recurrent neuronal network learns the structure of the environment and at the same time provides an online surprise signal when it encounters a "novel" situation. Such a novelty signal could be compared to recordings of different neuromodulators.
We created two virtual mazes composed of 16 "rooms" arranged as a square lattice where only neighboring rooms are accessible from each other: a first one which the agent will learn, the target maze, indicated in Figure 6A (left) and the test maze Figure 6A (right) which is similar to the first but with a few, randomly chosen rooms, replaced as indicated in Figure 6A by the yellow squares. Note that the only way to detect the difference between the two mazes in this simulation is to actually learn the exact connectivity graph of the rooms, because the rooms are themselves identical in both mazes.
The views corresponding to each room were generated by randomly choosing images of handwritten digits from the MNIST dataset (LeCun and Cortes, 2010). The MNIST images are 28 × 28 gray-scale images of handwritten digits. We converted the pixel-values to firing rates in the range [0.01 Hz, 9 Hz]. To keep the simulations simple, time was considered in abstract units for these simulations (with time steps of 100 ms instead of 1 ms).
For the "brain" of the agent we used a recurrent binary network with 30 hidden neurons and 28 × 28 visible neurons. In order to train this network, data batches where produced by recoding the activity of the visible neurons while the agent performs random trajectories of 100 time-steps in the target maze.

FIGURE 6 | Novelty-gated Hebbian learning. (A)
The target maze (left) and test maze (right). Each one of the 16 rooms in the maze is represented by a random MNIST digit. Transitions are only possible between neighbor rooms. The test maze was generated from the target one by randomly replacing a few rooms (indicated with yellow borders). A sample trajectory is shown with the green arrows. (B) Evolution of the slow moving average of the free energyF as a function of the amount of observed data for the target maze (blue) and the mean free energy of the same model when "teleported" to the control maze (red) every 500 s. (C) Fine-grained temporal details of the fast free energy traces e N (t) while traversing the sample trajectory for the target maze (blue) and for the test maze (red). Visited rooms are indicated by the annotated coordinates at each change-point. During the intervals between two consecutive change-points the agent remains in the same room.
Each learning epoch corresponds to 500 presentations of these data-batches to our model.
In Figure 6B we plot the "slow" moving average of the free energyF as a function of the learning time during the exploration phase for the target maze (blue curve) and for the test maze (red dots). The reported free energy for the control maze was measured every 5000 s for a path of length 50 s. As we can see, in the beginning of the learning the model is unable to distinguish between both mazes (both have high free energy). But as the model learns the target maze, it successfully identifies the test maze as "unfamiliar" (attributing low free energy to the target maze and much higher free energy to the test maze).
In Figure 6C we plot for both mazes the free energy error signal e N (t) for the sample trajectory shown in Figure 6A. From Figure 6C (blue) we can see that e N (t) fluctuates around zero for the learned maze but deviates largely from zero for the test maze.
This result suggests that if animals have a neurophysiological correlate of the "free energy error signal" e(T) introduced in Equation (26) we should look for activity bursts when the animals traverses unexpected situations (e.g., when traversing the test maze from position (2, 1) to position (2, 2)).
Moreover we should expect a substantial increase in the variance of the changes in synaptic weights when moving from a learned maze to an unfamiliar maze due to the change in the baseline of the surprise levels.

DISCUSSION
We have proposed an alternative to the learning algorithms previously proposed in Brea et al. (2011) andJimenez Rezende et al. (2011) for learning a generative model of spike trains defined by recurrent spiking networks. Our new model combines techniques from variational learning and reinforcement learning to derive a new efficient synaptic plasticity rule.
The resulting (see Materials and Methods) online rule for synapses is a Hebbian-type trace modulated by a global novelty signal. The Hebbian terms are traces of products of pre-synaptic terms (EPSPs) and post-synaptic terms. Similar gradients have been studied in Pfister et al. (2006) where they are found to yield STDP-like dynamics. Importantly, the global modulating signal, Equation (23) is a linear superposition of terms locally computed by each neuron so we can interpret it as the diffusion of a neuromodulator in the extra-cellular medium.
The original feature of the proposed model is that it uses an auxiliary recurrent spiking network in order to approximate the posterior distribution of the hidden spiking activity given the observed spike trains in an on-line manner. Using this auxiliary network allowed us to derive a learning rule which is based on local gradients modulated by slow non-local factors conveying information about "novelty. " We have shown that naive stochastic gradients derived in such a framework are not viable in practice due to their high variance (which may grow quadratically with the number of hidden neurons). Deriving viable learning rules thus requires finding low-variance and unbiased estimators of the gradients defined in Equations (24,25). In this paper we have only reduced this problem as our learning rule (Equation 26) still has a variance that grows (linearly) with the number of hidden neurons.
Our proposed learning algorithm, has potential applications for finding functional networks from recorded neurophysiological data. Since it can learn a recurrent spiking network that approximatively "explains" the data. Taking into account external currents injected into the network would be straightforward.
We also provide an on-line neural estimator of novelty/surprise and make experimentally testable predictions about its dynamics. The estimator is a quantity that naturally emerges from the statistical principles behind our framework instead of being an ad hoc quantity.
From the biological literature, it is known that novelty or surprise is, at least partly, encoded in neuromodulators such as ACh (Ranganath and Rainer, 2003;Yu and Dayan, 2005). Moreover, neuromodulators are known to affect synaptic plasticity (Gu, 2002). We suggested a hypothetical animal experiment that could test predictions concerning a novelty signal and its potential relation to plasticity.

RELATED WORK
The framework in which we derive our model is very general and has been used in different ways in previous works (Dayan, 2000;Friston and Stephan, 2007;Jimenez Rezende et al., 2011;Brea et al., 2013;Nessler et al., 2013). The uniqueness of the present works relies on the combination of methods from variational learning and reinforcement learning yielding a learning rule that is both biologically plausible and efficient.
The main difference between our model and models that exploit more analytical properties of the variational approximation (explicitly computing expectations and covariances terms in the free energy) (Friston and Stephan, 2007;Jimenez Rezende et al., 2011) is that, although these analytical approximations may provide gradient estimators with much lower variance and typically yield more scalable algorithms, these methods are intrinsically non-local and further approximations are required in order to obtain a biologically plausible learning rule.
An interesting learning algorithm for recurrent spiking networks which approximates the gradients of the KL-divergence (Equation 13) using an importance sampling technique has been proposed in Brea et al. (2011) and Brea et al. (2013). Their algorithm does not use an auxiliary network to approximate the activity of the hidden dynamics conditioned on the observed spike trains. Instead, the generative forward dynamics of the model is used as a proposal distribution which is then weighted by an importance-sampling approximation.
Another important family of algorithms proposed in Habenschuss et al. (2012) and Nessler et al. (2013), heavily relies on approximating recurrent neural networks with soft winer-takes-all (WTA) dynamics. One advantage of assuming a WTA dynamics is that the computation of the gradients of the KL-divergence (Equation 13) greatly simplifies, yielding a simple local learning rule. A limitation of their approach is that the model does not take into account the full temporal dynamics during inference.

FUNDING
Financial support was provided by the Swiss National Science Foundation (SystemsX) as well as by the European Research Council (Grant Agreement no. 268 689), and the European Community's Seventh Framework Program (Grant Agreement no. 269921, BrainScaleS).