Forward Models in the Cerebellum using Reservoirs and Perturbation Learning

The cerebellum is thought to be able to learn forward models, which allow to predict the sensory consequences of planned movements and adapt behavior accordingly. Although classically considered as a feedforward structure learning in a supervised manner, recent proposals highlighted the importance of the internal recurrent connectivity of the cerebellum to produce rich dynamics, as well as the importance of reinforcement-like mechanisms for its plasticity. Based on these models, we propose a neuro-computational model of the cerebellum using an inhibitory reservoir architecture and biologically plausible learning mechanisms based on perturbation learning. The model is trained to predict the position of a simple robotic arm after ballistic movements. Understanding how the cerebellum is able to learn forward models might allow elucidating the biological basis of modelbased reinforcement learning.


Introduction
Recent models have described the recurrent connectivity of the cerebellum in terms of dynamical reservoirs (Yamazaki & Tanaka, 2007;Rössert, Dean, & Porrill, 2015). Granule cells and Golgi cells indeed form a recurrent inhibitory network, actively storing information about the history of recent cortical inputs. This information is encoded in the changing activity of Granule cells and forwarded via the parallel fibres to the Purkinje cells and ultimately to output neurons in the deep cerebellar nuclei. As in reservoir computing (Jaeger, 2001;Maass, Natschläger, & Markram, 2002), the connectivity of the reservoir is left unchanged and learning takes place by adapting the connections between the reservoir and the read-out neurons. This is in agreement with classical theories of cerebellar learning, which suggest that synaptic plasticity between parallel fibres and Purkinje cells is the main mechanism of motor learning in the cerebellum (e.g. Albus, 1971;Ito, 2000).
The classical Marr-Albus-Ito hypothesis is based on supervised learning and long-term depression at parallel fibre-Purkinje cell synapses (Albus, 1971;Ito, 2000). More recently, Bouvier et al. (2018) proposed an alternative framework of cerebellar learning which borrows from reinforcement learning and relies on both long-term depression and potentiation at the synapses between parallel fibres and Purkinje cells. In this framework, climbing fibres, which reach Purkinje cells from the inferior olive, do not just provide teaching signals but are also the source of random perturbations. These perturbations are used to vary the cerebellar output and explore different responses by trial and error. A binary teaching signal indicates whether the perturbations improved or worsened the cerebellar response, adapting the synapses accordingly.
We propose a neuro-computational model combining the reservoir model of cerebellar interneurons of Rössert et al. (2015) with the perturbation learning rule for Purkinje cells proposed by Bouvier et al. (2018), in order to learn a forward model of an effector. We focus here on a simple 2-joint planar arm to demonstrate the capacity of the network to learn using biologically plausible learning rules.

Methods
The task of the forward model is to predict the next position of an end effector (x next , y next ) given its previous position (x prev , y prev ) and a change in joint angles ∆θ 0 , ∆θ 1 representing the motor command. The positions (x prev , y prev ) are obtained from the joint angles (θ 0 , θ 1 ) (initially taken randomly in [0, π]) with a simple mechanical model of the arm (both segments have a length of 0.5). The motor commands ∆θ 0 and ∆θ 1 are sampled from U(− π 18 , π 18 ) to obtain (x next , y next ), producing large enough movements to make the prediction problem non-linear.

Network Architecture
Pons To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0 two population model by Rössert et al. (2015) serves as the reservoir, consisting of N z = 1000 granule cells (GC) and N g = 100 Golgi cells (GoC). The convergence of granule to Golgi cells is limited to c = 10 and the variability of excitatory weights is increased to v u = 4 to improve performance (please refer to (Rössert et al., 2015) for the meaning of these parameters). The baseline of mossy-fibre activity I 0i is chosen randomly from N (1.2, 0.1). The mean of excitatory weights is set to u = 50 and the mean of inhibitory weights is set to w = 0.05/τ w to keep the network dynamics close to the edge of chaos. Contrary to Rössert et al. (2015), who trained their model on a single input x, our task uses four different input signals (x prev , y prev , ∆θ 0 , ∆θ 1 ). Each of the granule and Golgi cells is randomly assigned a single input. Figure 2 shows the response of 20 granule cells to a random input signal presented for 20 ms. All other reservoir parameters are identical to Rössert et al. (2015).  The reservoir is read out by additional feedforward layers consisting of N p = 20 Purkinje cells (PC), N d = 2 projection neurons in the dentate nucleus (DN) and N i = 2 neurons in the inferior olive (IO). The firing rate p i (t) of a PC is defined by: D and H are binary matrices describing the connectivity between the granule cells of rate z j (t), respectively IO cells of rate l j (t), and the Purkinje cells. The PC's firing rate itself is not dynamic, but each synapse integrates exponentially its inputs, leading to a similar effect. The connectivity is chosen so that 10 Purkinje cells project to the same DN neuron and receive input from the same IO neuron. The plastic weights connecting GCs and PCs are initialized with m = [N (0.1, 0.05)] + . The climbing fibre weights w IO are all set to 1.0. τ m and τ IO are the time constants of the different synapses and are both set to 1 ms. l j refers to the firing rate of the j-th inferior olive neuron and represents the perturbations due to climbing fibre input.
How Purkinje cells modulate cerebellar output is defined by the following equation for the firing rate n i (t) of the N d DN neurons: Like GC and GoC, DN neurons are randomly connected to a single mossy fibre input I i (t), with the push-pull mechanism described in (Rössert et al., 2015). G is a binary matrix and represents the connectivity between PC and DN neurons, as described in the previous paragraph.
The corresponding connection weights w PC are fixed and set to 0.1. The time constant for synaptic integration is τ PC = 1 ms. As the firing rate is restricted to be positive, x next and y next are translated and scaled to fit in the range [ I 0i 4 , 3I 0i 4 ]. The complete neural network was implemented using ANNarchy (Vitay, Dinkelbach, & Hamker, 2015), a neural simulator for distributed rate-coded and spiking networks, with a step size of 1 ms. The code is available at github.com/kimschmi/CerebellumForwardModel.

Perturbation-based Learning Rule
As in Bouvier et al. (2018), each synapse between the i-th GC and the j-th PC maintains an eligibility trace e i j (t) according to: The eligibility trace integrates the product of the presynaptic activity and the postsynaptic perturbation over time. As a consequence, only those synapses whose presynaptic neurons were active when postsynaptic perturbations occurred will be updated. In the current model, the eligibility trace does not decay over time and relies on the precise timing of perturbations. It is hypothesized that perturbations only occur during the response period, when the network's response to the input is computed. An alternative would be to consider a decaying eligibility trace, which would however limit the response period to a short time before the error feedback.
As the information about the network error becomes available, the weight updates are computed by: ∆m i j = −η c e i j (4) η = 2e−4 is the learning rate, e i j the eligibility trace and c signals whether the error for the current input-output pair improved or got worse relative to previous trials. This binary 620 signal is given by: The current error ε µ for an input µ is the Euclidean distance between the network output and the target value. The average of recent errors for a specific input pattern I µ is estimated by a linear combination of the GC outputs I µ = ∑ i v i z i (t), where z i (t) is the firing rate of the i-th GC and the coefficient v i is updated at the end of each trial according to: The firing rate of the granule cells is averaged over the response period [70, 75] (see next section) and η = 2e−4 is the learning rate. The absolute value of both weight updates ∆m and ∆v is not allowed to exceed 0.5 to ensure stability.
In the original version by Bouvier et al. (2018), the synaptic weights between parallel fibres and Purkinje cells are not restricted to be positive, which they explain by the fact that the GC firing rate may be relayed to Purkinje cells via inhibitory interneurons. During learning, synaptic weights can therefore change their sign, which could correspond to the creation of new inhibitory synapses and the pruning of existing excitatory synapses or vice versa. This form of structural plasticity cannot be explained by current observations of long-term depression and long-term potentiation at these synapses. We restrict here the weights to positive values only.

Training Procedure
The network is presented with the four input signals x prev , y prev , ∆θ 0 and ∆θ 1 for 20 ms. After a delay of 50 ms, the network response is read out. During this response period of 5 ms, the Purkinje cells receive random perturbations that vary the cerebellar output. Perturbations are generated randomly and independently by each IO neuron with a mean rate of 50 Hz (the probability that a DN neuron receives a perturbation during the response period is 0.25) and an amplitude of 0.1. Finally, the network response is evaluated, compared with the desired positions, and the parallel fibre-Purkinje cell weights are updated. Between the presentation of different input samples, the firing rate of granule cells and Golgi cells and the results of synaptic integration are reset to values chosen randomly from U(0.0,0.1). The network is trained on a set of 5,000 random samples for 2,000 epochs and its performance is evaluated on a test set of another 5,000 random samples. Figure 3 shows that the mean remaining error between the prediction and the arm's target position decreases rapidly during learning. Figure 4 shows the time course of the activity of the two projection neurons when the network is presented with a random test sample after learning. Figure 5 illustrates how the corresponding predicted arm position evolves during learning for that particular movement. The initial prediction is completely wrong, as the weights are initialized randomly, reaches quite quickly a mean target position around (0, 0.5) and finally converges quite slowly to the target value (note the remaining error). The same pattern can be observed for most learned movements, as can be seen in Figure 6, where the variance of the distance between this mean position and the predicted positions first decreases sharply, before increasing again. Eventually, learning converges to a mean Euclidean distance of 0.079 over both the training and test sets. Removing the positivity constraint on the weights between the parallel fibres and the PC slightly reduces the mean error to 0.069 on both sets. Figure 7 visualizes the distribution of the network prediction errors in the arm's workspace (test set). The accuracy of the predictions is generally better around the mean position

Discussion
The proposed model combines the recurrent dynamics of the GC-GoC excitatory-inhibitory network proposed by Rössert et al. (2015) with the perturbation-based learning rule for parallel fibres-PC synapses proposed by Bouvier et al. (2018). The model is able to learn a simple non-linear prediction task on a 2D simulated arm, although still imprecisely. Contrary to the classical supervised approach requiring complete error signals, the model learns from a binary teaching signal indicating whether the prediction error has improved compared to baseline performance. This allows to learn forward models with a cerebellar model: the IO mainly receives low-level motor and proprioceptive information, so it can only drive supervised learning of inverse models (motor adaptation). For supervised forward models, the IO would need to compare cortical sensory representations in order to compute the teaching signal, what seems to be a very challenging task for such a small nucleus. By relying on a much simpler reinforcement-like teaching signal, the proposed model could learn forward models even if the predicted sensory space is high dimensional. Many aspects of the proposed model need to be further studied, such as its precision, its use for high dimensional sensory spaces or longer prediction delays. An efficient and bi- ologically realistic model of the cerebellum for the acquisition of forward sensory models would be an important step for the study of many cognitive functions, including model-based reinforcement learning, planning, body awareness or the sense of agency. The integration of this model into broader neural architectures involving the cerebral cortex and the basal ganglia might allow to both understand better the brain at the systems-level and create adaptive cognitive agents.