A model of operant learning based on chaotically varying synaptic strength

Operant learning is learning based on reinforcement of behaviours. We propose a new hypothesis for operant learning at the single neuron level based on spontaneous fluctuations of synaptic strength caused by receptor dynamics. These fluctuations allow the neural system to explore a space of outputs. If the receptor dynamics are altered by a reinforcement signal, the neural system settles to better states, i.e., states matched to the environmental dynamics that determine reward. Simulations show that this mechanism can support operant learning in a feed-forward neural circuit, a recurrent neural circuit, and a spiking neural circuit controlling an agent learning in a dynamic reward and punishment situation. We discuss how the new principle relates to existing learning rules and to observed phenomena of short- and long-term potentiation.


Introduction

Operant learning (also called operant conditioning or instrumental conditioning) is a type of learning in which a new behaviour is increased, or an existing behaviour is suppressed, by pairing it with reward or punishment. For example: (a) In a Skinner box, when a rat occasionally presses a lever, it gets some food. After a while, it increases the rate of lever pressing (Jensen, 1963). (b) In a flight simulator, a fruit fly is heated when it generates yaw torque to one side and released from heat when it generates yaw torque to the other side. Within minutes the fly learns to maintain its torque in the range that is without punishment (Wolf and Heisenberg, 1991). (c) When an Aplysia produces a bite, the esophageal nerve can be stimulated in vivo to mimic the food signal. After training, it produces more bites than a yoked control that has received the same stimulation without the coupling to its own actions (Cash and Carew, 1989; Brembs, 2003).
Some of this research, e.g. in Aplysia (see review in Nargeot and Simmers (2011)), implies that mechanisms at the single neuron level can play important roles in operant learning. There are some existing single neuron or synapse models intended to account for operant learning. For example, the Hedonistic Synapse is a spike-based synapse model with stochastic synaptic transmission, where the probability of transmitter release (the synaptic strength) is updated continuously according to the correlation between the transmitter fluctuation and a reward signal (Seung, 2003). Learning models based on modulated spike-timing-dependent plasticity (MSTDP) have also been applied to operant learning, using a reward signal to alter the weight of synapses that have been tagged by STDP as contributing to the output that produced the reward (for a review, see Frémaux et al. (2010)). These models only apply to spiking neural networks, and moreover, they have to introduce some arbitrary mechanism, such as a random number generator, to explore output space (i.e. generate different actions). Use of random number generators leads to exploration of discrete output spaces with ever-present unpredictability.
An alternative option for generating exploration of the output space is chaos. Chaotic motion, which is a type of irregular motion that can exist in simple systems, has very complex, unpredictable and ergodic solutions (Tél et al., 2006; Eckmann and Ruelle, 1985). Chaos is widely found in biological systems (for a review, see Cavalieri and Koçak (1994)), including neurons and neural circuits. In a neuron, the dynamics of membrane potential and ion flows can be chaotic, as has been verified in several models, such as Nobukawa et al. (2014), Storace et al. (2008) and Canavier et al. (1990), and observed in the Nitella internodal cell (Hayashi et al., 1983). Simulations of neural circuits also show chaos can exist at the circuit level, e.g. Sussillo (2014) and Angulo-Garcia and Torcini (2014). A chaotic system can thus be a source of unpredictable, continuous and ergodic actions for operant learning or reinforcement learning. This idea has been applied to algorithms for robot learning, such as a fish-catching robot that uses a chaotic generator for unpredictable motion planning so that fish cannot adapt to repetitive motions (Inukai et al., 2015), and a hexapod robot with a chaotic Central Pattern Generator (CPG) that produces chaotic signals for exploration of new motions to free its leg from a hole in the floor (Steingrube et al., 2011).
The signals generated by a chaotic process are more continuous, and therefore more suitable for controlling a robot's (or animal's) interaction with the physical world, than the signals generated by a random number generator, which are usually discrete white noise. Chaos in a physical system usually results in more continuous and smooth variation of states than a random system. This property tolerates a transient delay between action and reward or modulator, which is common in learning in the real world. In principle, continuous and smooth trajectories can be obtained from a random number generator using interpolation, but, unlike chaos, the system will be predictable during the interpolation.
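To make this contrast concrete, the following sketch (our own illustration, not from the original study) compares the step-to-step smoothness of a trajectory from a standard chaotic system, the Lorenz attractor, with white noise of the same variance:

```python
import numpy as np

def lorenz_trajectory(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Euler-integrate the Lorenz system; its x-coordinate serves as a
    continuous, unpredictable exploration signal."""
    x, y, z = 1.0, 1.0, 1.0
    xs = np.empty(n_steps)
    for i in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dx * dt, y + dy * dt, z + dz * dt
        xs[i] = x
    return xs

chaotic = lorenz_trajectory(20000)
# White noise with the same standard deviation as the chaotic signal.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, chaotic.std(), size=chaotic.size)

# Mean step-to-step change: the chaotic signal moves smoothly between
# samples, while the white noise jumps by roughly its full amplitude.
smooth_step = np.abs(np.diff(chaotic)).mean()
noisy_step = np.abs(np.diff(noise)).mean()
print(smooth_step < 0.1 * noisy_step)
```

Both signals are unpredictable over long horizons, but only the chaotic one is smooth at every scale, which is the property exploited in the text.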
Although chaos is widely found in biological systems, the potential for chaos in synaptic dynamics, and how this could support learning, has not been previously considered. Here, we hypothesise that the following 'Dynamic Synapse' mechanism could underlie operant learning (Fig 1).

Figure 1: (Left): A neuron has multiple inputs, and its output is the sum of the inputs multiplied by the synaptic strengths, passed through a non-linear function. Because the synapses are dynamic, their values continuously change, and thus the output will explore a space of possible outputs. A value function on the output controls the release of a modulator which alters the synaptic strengths. (Right): Illustrating the dynamic synaptic strength of one synapse. During learning, the centre of synaptic strength oscillation is shifted towards the instantaneous synaptic strength that coincides with increased modulator; e.g., as illustrated, the modulator (red) is high when the instantaneous strength (green) is high, so the centre of synaptic strength is gradually increased (blue). The modulator also affects the damping of the oscillation, so the amplitude of oscillation decreases, and the learning can converge. An observer can infer the effective synaptic strength by low-pass filtering the instantaneous synaptic strength (black), but note this is only an approximation of the actual centre of oscillation, which cannot be directly observed.

Is there a plausible biological mechanism that could produce the hypothesised synaptic strength fluctuation? The number of neurotransmitter receptors (from now on we will refer simply to receptors) embedded in the membrane of a post-synaptic dendritic spine is a key factor in synaptic strength (Sheng and Hoogenraad, 2007). Enlargement of a dendritic spine increases its capacity for anchoring structure, including scaffold proteins and cytoskeleton, and thus the number of neurotransmitter receptors it can accommodate (Allison et al., 1998). However, the size and the capacity are not closely coupled (Cingolani and Goda, 2008).

Figure 2: Decoupling between changes in spine size and synaptic strength under certain conditions. The membrane is formed mainly by the lipid bilayer and proteins. The cytoskeleton supports the shape of the dendritic spine. There are two forms of receptor trafficking: lateral movement of receptors, observed as Brownian motion on the membrane, and endosomal trafficking, which carries receptors driven by motor proteins along the cytoskeleton. Scaffold proteins can help receptors to anchor, increasing the capacity of the dendritic spine to hold receptors. On the left, the size of the dendritic spine stays the same, but the synaptic strength (number of receptors) varies. On the right, the size of the dendritic spine varies, but the synaptic strength stays the same. Modified from Cingolani and Goda (2008).

The number of receptors in the membrane of a spine is also affected by two broad types of movement between synaptic and non-synaptic pools: lateral movement, which is mainly passive diffusion on the cell membrane, and endosomal trafficking, which is active transportation (Lau and Zukin, 2007). The lateral movement is affected by the cytoskeleton, which restricts or guides the diffusion (Jaqaman et al., 2011). In particular, the actin cytoskeleton actively contributes to the regulation of postsynaptic receptor mobility both in and out of synapses (Cingolani and Goda, 2008). The endosomal trafficking includes endocytosis of receptors from the cell membrane to the endosome, intracellular transportation of the endosome, and exocytosis of receptors from the endosome to the cell membrane (Roth et al., 2017). Endosomal trafficking can recycle receptors, transporting them between different regions (Petrini et al., 2009). There are also ongoing processes of receptor synthesis and degradation (Triller and Choquet, 2005).
The timescale of these receptor dynamics can be relatively fast. Receptors move from synaptic to extrasynaptic regions and vice versa, usually with periods of up to a few minutes (Triller and Choquet, 2005). The size of a post-synaptic dendritic spine and the amount of actin in it oscillate on a timescale from tens of seconds (in an immature dendritic spine) to half an hour (in a mature synapse) (Koskinen and Hotulainen, 2014; Honkura et al., 2008). Receptors anchored to the actin cytoskeleton (Hausrat et al., 2015) can move with the actin flow (Sergé et al., 2003). Post-synaptic receptor dynamics have been modelled at a mesoscopic level, treating the regulation of the numbers of receptors and scaffold proteins as quasi-equilibrium based on thermodynamic theory (Sekimoto and Triller, 2009). The model proposed in Haselwandter et al. (2011) describes formation and stability of synaptic receptor domains as a reaction-diffusion system. We note these models are dynamic, but not chaotic. We propose i) that the complexity of post-synaptic dynamics (Choquet and Triller, 2013), especially receptor trafficking (Triller and Choquet, 2005), can support chaos, and ii) that this can provide a mechanism for operant learning as described in Fig 1. It is notable that dopamine has been shown to affect the same receptor trafficking dynamics (Sun et al., 2008). This supports the possibility that, in an operant learning paradigm, the relationship between the current synaptic strength (changing chaotically due to receptor trafficking) and a reward (signalled by neurotransmitter release) is a basis for learning. The possible role of alterations in postsynaptic receptor distribution and the size of dendritic spines in learning (particularly in short-term and long-term potentiation (STP & LTP) protocols) is well established (Isaac et al., 1995; Kauer et al., 1988; Shepherd and Huganir, 2007). Shouval et al. (2002) proposed a thermodynamic model of AMPA receptor endosomal trafficking to explain bi-directional synaptic strength variation during LTP and long-term depression (LTD). Xie et al. (1997) proposed a synapse-level model in which AMPA receptors are attracted toward NMDA receptors during STP, and some of the AMPA receptors become anchored near the NMDA receptors while others diffuse again during LTP. The plausibility that such changes in receptor distribution could alter synaptic efficiency has also been demonstrated (Allam et al., 2015).
In the learning model presented here, we do not include any Hebbian process (see Discussion). Instead, we allow chaotic synapses in a neuron to explore possible synaptic strengths; the neuron thus becomes a function of its inputs with chaotic coefficients, generating unpredictable output signals to explore action spaces. If the consequences of the action are reflected in a reinforcement signal delivered to the synapses, the parameters of the chaos can be altered to centre around synaptic strengths that optimise the output.
We show through simulation the learning functionality of such a system in several different scenarios.

Results
Our model simplifies the structure of a neuron to consist of multiple input synapses and a dendrite, which together comprise the dendritic tree (Fig 3).
We do not model the soma and axon of the neuron, but simply calculate the soma's input as the sum (across the dendritic tree) of the synaptic inputs multiplied by their respective synaptic strengths, then calculate the soma's output by passing the input through a non-linear function. The number of receptors in a synapse represents the synaptic strength of the synapse. Receptors in the dendrite do not contribute any synaptic strength. Because of the receptor trafficking dynamics, the synaptic strength fluctuates spontaneously. In the Methods we provide an abstracted mathematical model for receptor trafficking, but summarise here the key properties needed to support learning: 1. the synaptic strength w_i varies spontaneously and smoothly around an oscillation centre w_ci; 2. the phases of the oscillations are not locked; 3. the oscillation centre w_ci and the amplitude depend on properties of the dendritic tree that can be altered by a learning signal.
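A minimal sketch of the neuron summarised above, with a sinusoid standing in for the chaotic receptor dynamics and a sigmoid as the non-linearity (both stand-ins are our illustrative choices, not the paper's equations):

```python
import numpy as np

def neuron_output(inputs, w_centre, amplitude, phases, omega, t):
    """Output of the simplified neuron: inputs weighted by spontaneously
    oscillating synaptic strengths, passed through a non-linearity."""
    # Instantaneous strengths oscillate around their centres with unlocked phases.
    w = w_centre + amplitude * np.sin(omega * t + phases)
    return 1.0 / (1.0 + np.exp(-np.dot(w, inputs)))  # sigmoid non-linearity

inputs = np.array([0.2, 0.5, 0.9])
w_centre = np.array([0.5, 0.5, 0.5])
amplitude = np.array([0.2, 0.2, 0.2])
phases = np.array([0.0, 2.1, 4.2])   # phases are not locked to each other
omega = 2 * np.pi / 60.0             # one oscillation per 60 time units

outputs = [neuron_output(inputs, w_centre, amplitude, phases, omega, t)
           for t in range(600)]
print(min(outputs) < max(outputs))   # the output explores a range of values
```

Because the strengths drift with unlocked phases, the fixed input vector is mapped to a whole range of outputs over time, which is the exploration the learning rule relies on.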
When a neuron or network of neurons with such synapses produces output in a way that meets a specific requirement (given by a value function), modulator representing reward is released. The modulator affects the centre of synaptic strength oscillation, which shifts towards the instantaneous synaptic strength at the time of the modulator release. The simplest way to implement this is as a learning rule that depends only on the current centre of synaptic strength oscillation, the instantaneous synaptic strength and the amount of the modulator:

dw_ci/dt = k_w n_M (w_i − w_ci),

where n_M is the amount of the modulator, and k_w is a coefficient controlling the learning rate. By this learning rule, a circuit with dynamic synapses can conduct operant learning, as the instantaneous synaptic strength is near or in the range that satisfies the criterion when modulator is released (note in the experiments that follow we use a slightly altered rule (equation 23 in Methods) to compensate for a biased drift in synaptic strength). To allow learning to converge, the learning rule should also reduce the oscillation amplitude (equation 24). Conceptually, we relate the centre of oscillation to the capacity of a dendritic spine to hold receptors (Fig 2), and the amplitude of oscillation to the damping of the receptor movement dynamics. We assume these can result from changes in spine size or in the scaffold-cytoskeleton complex, but do not model these explicitly.
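The centre-shift rule can be seen in a toy run, again with a sinusoid standing in for the chaotic oscillation and a hypothetical criterion (instantaneous strength above 0.7) triggering the modulator; all constants here are illustrative:

```python
import numpy as np

# Toy run of the learning rule: the oscillation centre w_c shifts towards
# the instantaneous strength w whenever modulator is present, and the
# oscillation amplitude is damped, so learning converges.
k_w, k_damp, dt = 0.05, 0.02, 0.1
w_c, amp = 0.4, 0.35
omega = 2 * np.pi / 30.0
history = []
for step in range(20000):
    t = step * dt
    w = w_c + amp * np.sin(omega * t)   # instantaneous synaptic strength
    n_M = 1.0 if w > 0.7 else 0.0       # modulator released when the
                                        # (hypothetical) criterion is met
    w_c += k_w * n_M * (w - w_c) * dt   # shift centre towards current strength
    amp *= 1.0 - k_damp * n_M * dt      # damp the oscillation amplitude
    history.append(w_c)

print(round(history[0], 2), round(history[-1], 2))
```

The centre starts at 0.4 and is pulled up towards the rewarded region while the amplitude shrinks, so the synapse settles near strengths that satisfy the criterion instead of wandering indefinitely.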

Simulation of a dendrite tree
That is, when the reward is delivered, the state of the synapse should still be near the state that caused the action that resulted in reward. However, the timescale cannot be too long, or else the generation of new actions will be limited and the learning might converge to a local minimum. We note there may be other factors that produce unpredictable synaptic strengths, such as Brownian movement of receptors due to thermal noise, but suggest that these may be subsumed within the higher-level dynamics described above, and it is not necessary to include them as a source of noise to support learning.

Applying learning in a simple linear example
In this experiment we test learning in a single neuron with reward provided when the output is higher than a threshold and increasing. The neuron releases modulator as

n_m = k_m1 (y − y_0) when y > y_0 and y is increasing, and n_m = 0 otherwise,

where n_m is the amount of modulator, k_m1 a coefficient, y the output of the neuron, and y_0 a threshold of y to trigger the release of modulator. The trajectory starts by exploring a large volume, then gradually converges. Note that the statistical output value starts to increase after an unstable initial fluctuation. At the end of the learning, the centre of oscillation of the synaptic strengths has shifted so that the order of strengths is the same as the order of the input values, and the synaptic strength of the synapse with the highest input value increased while the others declined, which is the most efficient way to get a higher output with conservation of the total number of receptors.


Tuning the period of a central pattern generator

A Central Pattern Generator (CPG) is a type of Recurrent Neural Network (RNN) which exists in many animals to control rhythmic motions, such as walking and heartbeat. It is also applied in legged robot control as an alternative to explicit motion planning (Ijspeert, 2008; Xia et al., 2017). In the model, the dynamic synapses scale the CPG weights as

w_i^CPG = w_i0 β^{w_i},

where w_i^CPG is the ith CPG synaptic weight, w_i0 the ith initial synaptic weight of the CPG, w_i the state of the ith dynamic synapse, and β a base of exponentiation that scales the weights. As the CPG is symmetric, in the model, the state of the dynamic synapses of one neuron is a mirror of the other one.

Figure 7: A CPG with the learning rule. Two neurons with spontaneous firing inhibit each other's firing alternately. The simulation aims to tune the period of oscillation, using the same operant learning rule to alter the synaptic strengths.

When the output of the CPG crosses zero, the error between the target period and the actual period is calculated, and the modulator is released at a rate proportional to the decline of the error compared with the previous error; if the error increased, no modulator is released:

ε_i = |ω_i − ω_obj|,  n_mI ∝ max(0, ε_{i−1} − ε_i),

where ω_i is the period of the CPG from the ith to the (i+1)th zero crossing, ω_obj the target period, ε_i the error between them, and n_mI the amount of modulator released.
The CPG originally had a period of about 0.5 seconds. The target of training is to alter the period to 2 seconds by tuning the synaptic strengths. The results are shown in Fig 8. Using the same operant learning rule as before, the period of the CPG converges to the target period.
The relationship between the period of the CPG output and the synaptic strengths is nonlinear, and the dynamic synapses have no prior knowledge of the CPG, but the simple neural circuit still finds and learns the target parameters effectively.
The experiment shows that the Dynamic Synapse can be applied to an RNN without requiring any specific analysis of the properties of the network.
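The zero-crossing rule can be illustrated in a toy setting (our construction; the real simulation uses the spiking CPG): a parameter p sets an oscillator period T = 1/p, and we tune T from about 0.5 towards 2 by releasing "modulator" only when the period error shrinks:

```python
import math

T_obj = 2.0                  # target period
p_c, amp = 2.0, 0.3          # oscillation centre and amplitude of the parameter
k, k_damp = 0.5, 0.005
eps_prev = None
for i in range(4000):
    # Instantaneous parameter value (sinusoid stands in for the chaotic synapse).
    p = max(p_c + amp * math.sin(0.7 * i), 0.05)
    eps = abs(1.0 / p - T_obj)           # period error at this "zero crossing"
    if eps_prev is not None and eps < eps_prev:
        n_m = eps_prev - eps             # modulator ∝ decline of the error
        p_c += k * n_m * (p - p_c)       # shift centre towards the current value
        amp *= 1.0 - k_damp * n_m        # damp the oscillation
    eps_prev = eps

print(round(p_c, 2))  # drifts from 2.0 towards 0.5, i.e. the period towards 2
```

No gradient of the period with respect to the parameter is ever computed; the rule only compares consecutive errors, which is what lets the same mechanism work on a network it knows nothing about.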

Reinforcement learning in Puckworld
The Dynamic Synapse model was tested in a game named PuckWorld, available as part of the PyGame Learning Environment. The game has a planar environment with three agents (Fig 9): a player that is controlled by a reinforcement learning algorithm, a reward source that changes its location after a specific period, and a punishment source that chases the player and decreases the reward if the player is within a specific range of the punishment source.

Figure 9: The environment of PuckWorld. The green point is the reward source, the blue point is the player, the red point is the punishment source, and the dark magenta circle is the range the punishment source affects.
In the game, the player can move in 4 directions: left, right, down and up.
The states of the player and the environment can be observed (Fig 10). The states are the velocity of the player, the location of the player, the position of the reward source and the position of the punishment source. The states are pre-processed and then used as sensory input. In this instance, the sensory inputs are the velocity of the player, the distance to the reward source, and the shortest distance from the player to the edge of the range of the punishment source (the distance to escape). As the game codes the states in an absolute coordinate system, the player does not have an orientation. To transform the potentially negative values and directions of distance information in absolute coordinates into positive sensor values, the player is assumed to have sensors in 4 directions that correspond to the positive and negative directions of the x- and y-axes of the coordinate system; the sensor on the side from which the information comes takes a positive value, while the sensor on the other side reads zero (Fig. 10). As the player has a symmetric structure, the neural circuits are designed with a symmetric structure: four integrate-and-fire motor neurons control the motion in the four directions, respectively. Each neuron gets the three types of sensory inputs (as outlined above) in the four directions.
Each sensory input feeds into the neuron through a dynamic synapse. Because of the symmetry of the structures and motions, to simplify and accelerate the training, the dynamic synapses of each motor neuron from sensors in the same direction relative to that motor neuron are treated as the same (they share the same dynamics and parameters during the learning).

Figure 10: There are four sets of neural circuits in the player; because the neural circuits, agent and environment are symmetric, all homologous synapses are assumed to share the same dynamics and synaptic strengths to accelerate the learning. (c) The sensors indicate distances by orthogonal decomposition; when a measured object is in a direction that can be projected onto the positive direction of a sensor, the sensory value is positive, otherwise 0.
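The orthogonal-decomposition scheme for the sensors can be sketched as follows (our illustration of the pre-processing described above):

```python
def directional_sensors(dx, dy):
    """Decompose a displacement vector into four non-negative sensor
    values (x+, x-, y+, y-): the sensor facing the measured object reads
    the projected distance, the opposite sensor reads zero."""
    return {
        "x+": max(dx, 0.0), "x-": max(-dx, 0.0),
        "y+": max(dy, 0.0), "y-": max(-dy, 0.0),
    }

# A target 3 units to the right and 4 units below the player:
print(directional_sensors(3.0, -4.0))
# {'x+': 3.0, 'x-': 0.0, 'y+': 0.0, 'y-': 4.0}
```

This keeps every sensory input positive, which matches the convention that synaptic strengths scale non-negative input magnitudes.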
The motor neurons are integrate-and-fire:

dv/dt = Σ_i w_i s_i; when v exceeds v_threshold the neuron fires and v is reset to v_rest,

where v is the membrane potential, s_i the ith sensory input, w_i the corresponding synaptic strength, v_rest the resting membrane potential and v_threshold the firing threshold.
The reward of the game is a weighted sum of the normalised distance to the reward source and the normalised distance into the range of the punishment source:

R = k_1 d̂_r + k_2 d̂_e,

where R is the reward, d_r the distance between the player and the reward source, d_e the distance between the player and the edge of the punishment range (hats denote normalisation), and k_1, k_2 are weighting coefficients with signs such that approaching the reward source and escaping the punishment range both increase R.
The reward is fed into a firing-rate neuron with an adaptive current, which releases the modulator. With the adaptive current, the neuron is sensitive to changes of the reward but insensitive to the absolute value of the reward. The adaptation speed factor from low to high is higher than the adaptation speed factor from high to low, so the neuron tends to increase its expectation of the reward:

dI_adapt/dt = k_adapt1 (k_R R − I_adapt) if k_R R > I_adapt, and k_adapt2 (k_R R − I_adapt) otherwise, with k_adapt1 > k_adapt2,

where I_adapt is the current intensity, k_R a factor from reward to current intensity, and k_adapt1 and k_adapt2 are factors of adaptation speed. The modulator amount n_m is then given by:

n_m = k_mI max(0, k_R R − I_adapt),

where k_mI is a factor to map the current after adaptation to an appropriate range.
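One way the asymmetric adaptation could be realised is sketched below; the equation forms and parameter values are our reading of the description, not the paper's code:

```python
def run_modulator(rewards, k_R=1.0, k_adapt_up=0.2, k_adapt_down=0.01,
                  k_mI=1.0, dt=1.0):
    """Firing-rate neuron with an adaptive current: fast upward adaptation,
    slow downward adaptation, so it signals *increases* in reward rather
    than the reward's absolute level."""
    I_adapt, n_m_trace = 0.0, []
    for R in rewards:
        drive = k_R * R
        rate = k_adapt_up if drive > I_adapt else k_adapt_down
        I_adapt += rate * (drive - I_adapt) * dt
        n_m = k_mI * max(0.0, drive - I_adapt)  # above-expectation part only
        n_m_trace.append(n_m)
    return n_m_trace

# Reward steps up at t=50 and stays constant: the modulator responds to the
# change, then fades as the adaptive current catches up.
trace = run_modulator([0.0] * 50 + [1.0] * 150)
print(trace[50] > trace[-1])
```

A sustained high reward therefore stops releasing modulator once it becomes "expected", which is what keeps learning driven by improvements rather than by the current score.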
As this is a single-layer circuit, the ability of a player controlled by the circuit is simple and limited. Hence, we can analyse the possible best solution of the synaptic strengths and compare it with the solution obtained by operant training with dynamic synapses. Treating the single-layer circuit as a linear function, the whole system can be interpreted as a second-order system. For an appropriate solution, the interactions of the elements in the system should work as though (1) there is an extension spring connecting the player and the reward source; (2) the punishment range is an elastic ball that pushes the player away; and (3) the elastic coefficient of the elastic ball is higher than the elastic coefficient of the spring, so the player will avoid punishment even when the reward is inside the punishment range. Because of (1), the synaptic strengths of the positive y distance-to-reward input should be higher than the synaptic strengths of the negative y distance-to-reward input; because of (2), the synaptic strengths of the positive y distance-to-escape input should be higher than the synaptic strengths of the negative y distance-to-escape input; and because of (3), the synaptic strengths of the positive escape input should be higher than the synaptic strengths of the positive reward input. The positive y velocity (line 3) is also higher than the negative y velocity (line 2), which means the agent tends to accelerate. These appear to be two strategies to avoid being chased by the punishment source.

Discussion
We have proposed a model of operant learning based on continuous unpredictable synaptic strength fluctuations, with dynamics that are altered in response to a reinforcement signal.We illustrate the application of this principle to optimise the output, for given inputs, first in a simple linear neuron model, then to tune a recurrent CPG network to a target period, and finally to enable a spiking neural circuit embedded in an agent to improve performance in a continuous environment with dynamic reward and punishment.
An important property of our approach is that the source of variation that supports operant learning is continuous, unlike reinforcement learning algorithms that are based on random number generators, which have either discrete random outputs, or are partially predictable because of interpolation.
By defining a system that has chaotic dynamics, we can generate continuous motion without interpolation, so the unpredictability is continuous at any scale. An additional advantage over alternative synapse-level models for operant learning, such as the Hedonistic Synapse (Seung, 2003), is that the applications are not limited to a specific type of neural circuit or neural network. We have shown we can use our Dynamic Synapse in both spiking and firing-rate neural circuits, and the method can also be suitable for general online parameter optimisation, as it acts to scale the synaptic strength values to suitable ranges. It can also be applied to discrete systems by adjusting the time step to an appropriate range or by sampling. We plan to further explore the application of this model to a range of problems in robot learning and reinforcement learning.
A key difference between our model and previous models is that our model learns in parameter space, not action space. Previous models usually alter the synaptic strength based on the pattern of synapse activities. Our model can also be extended so that transmitter release transiently increases the instantaneous synaptic strength:

dw_i/dt ← dw_i/dt + k_w1 n_T,

where n_T is the amount of the synaptic transmitter and k_w1 is a coefficient. In this extended model, when neurotransmitter is released, the instantaneous synaptic strength (the number of receptors) tends to increase, resulting in STP. When the instantaneous synaptic strength is higher than the centre of the oscillation, if modulator is released, the capacity of the synapse to contain receptors will increase. Because of the oscillation of the amount of receptors in the synapse, some of the receptors diffuse away again; but because the capacity is increased, more receptors are held in the synapse, resulting in LTP.
The model in this paper represents postsynaptic dynamics in a simplified form, at the statistical level of receptor trafficking, allowing it to emulate some features of receptor flow dynamics and synapse dynamics. Modelling individual receptors is out of the scope of this study because it would not be relevant at the level of learning. However, the mathematical functions for the receptor dynamics in our model are not exclusive: as long as the receptor dynamics have the features of chaotic oscillation, and the centre of oscillation is controllable by our learning rule, the learning rule could work for alternative formulations. The model could be extended to include more detail.
For example, the receptor trafficking within the dendrite is assumed to be fast enough (compared to dendrite-to-synapse trafficking) to ignore its time constant. In reality, variations of AMPA receptor numbers on neighbouring dendritic spines are usually in the same direction (Zhang et al., 2015). This phenomenon could be modelled by taking account of the speed of receptor trafficking in the dendrite, which would have the consequence that neighbouring synapses would tend to have a similar concentration of receptors in the dendrite. Hence the receptor oscillations in neighbouring synapses would have a higher probability of being in similar phases than in distant synapses. Thus several predictions arise from our model which we hope may be tested in future experiments.
However, the key concept presented here is not crucially dependent on the details of receptor trafficking. Other models of chaotic neurons or neural circuits suggest chaos exists in the membrane potential, and alternative chaotic processes in an animal could also possibly contribute to the generation of actions and learning with the same desirable properties of continuous unpredictability. Rather, the key properties are that the learning mechanism is entirely local to the synapse, and does not require an explicit tag for the Hebbian correlation of pre- and post-synaptic activity, but rather allows this property to emerge from the behavioural or output consequences caused by the recent state of the circuit. That is, synapses that contribute to obtaining reward are strengthened; but this does not depend on the firing of either the pre- or post-synaptic neuron, except insofar as this is necessary to cause behavioural outputs that result in reward.
It is nevertheless interesting to consider a simple variation on the learning rule we have used, to make synapses with active presynaptic neurons (neurons that have released neurotransmitter, indicating they have fired) learn actively (c.f. Eqs. 1 and 24):

dw_ci/dt = k_w n_T n_M (w_i − w_ci),

where n_T is the amount of the synaptic transmitter. With n_T, the variation of the synaptic strength of a synapse is proportional to the presynaptic neuron's activity, which can help to improve the pertinence of learning to the inputs. For example, a neuron gets multiple inputs but only a small set of them is activated by a specific stimulus; with this rule, the synaptic plasticity only applies between the neuron and these activated inputs. Note this is a 3-factor learning rule, depending on the correlation between the amount of the synaptic transmitter, the amount of modulator, and the difference between the instantaneous synaptic strength and the centre of the oscillation. When the absolute value of the correlation is higher, the variation of the centre of the oscillation is more significant.
However, another possible learning rule could use the weighted average, rather than the product, of the synaptic transmitter and the instantaneous synaptic strength:

dw_ci/dt = k_w n_M (q (k_w4 n_T + α) + (1 − q) w_i − w_ci),

where k_w4 is a coefficient to fit the amount of transmitter to synaptic strength, q a proportion representing the relative weighting of these two factors, and α a constant. Notably, this rule can potentially account for Pavlovian classical conditioning, where the stimulus and reinforcer (neuromodulator) are presented together irrespective of the output. When q = 1, the learning rule is Pavlovian learning; when q = 0, the learning rule is operant learning. When q is close to 1, the learning process might look like classical conditioning with noise. Thus, classical and operant learning may coexist in the same neuron and even in the same synapse.

Figure 13: Justification for a continuous representation of the effects of receptor location between dendrite and synapse. The boundary between a synapse and dendrite can be considered wide and smooth, and as a receptor approaches the synapse, it can receive more neurotransmitter and contribute more to the synaptic strength. Rather than model the boundary area explicitly, we associate synaptic strength with the 'amount' of receptors a synapse contains, treated as a continuous variable.
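One way to realise the weighted-average rule described above (a sketch under our reading of the rule; all constants are illustrative):

```python
def centre_update(w_c, w_inst, n_T, n_M, q, k_w=0.1, k_w4=1.0, alpha=0.0):
    """One step of the weighted-average rule: the oscillation centre moves
    towards a target that blends a transmitter-derived value (Pavlovian,
    q=1) with the instantaneous strength (operant, q=0)."""
    target = q * (k_w4 * n_T + alpha) + (1.0 - q) * w_inst
    return w_c + k_w * n_M * (target - w_c)

# With q=1 the update ignores the instantaneous strength entirely:
pav_a = centre_update(0.5, w_inst=0.2, n_T=0.9, n_M=1.0, q=1.0)
pav_b = centre_update(0.5, w_inst=0.8, n_T=0.9, n_M=1.0, q=1.0)
print(pav_a == pav_b)  # True: purely stimulus (transmitter) driven

# With q=0 the update tracks the instantaneous strength (operant learning):
op = centre_update(0.5, w_inst=0.8, n_T=0.9, n_M=1.0, q=0.0)
print(op > 0.5)
```

Intermediate q blends the two targets, which is how classical and operant learning could coexist in one synapse.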

Mathematical model
When the number of receptors per synapse is sufficiently large, their dynamics can be modelled statistically using differential equations (Holcman and Triller, 2006), much like a gas, which consists of free-moving molecules with uncertain intermolecular distances. However, even for a smaller number of receptors per synapse, we note their contribution to synaptic strength can depend on their distance from the centre of the synaptic cleft, due to diffusion of neurotransmitter (Fig 13). Thus, rather than explicitly representing discrete receptors and their positions, we represent the number of receptors in a synapse that currently contribute to its synaptic strength as a continuous 'amount'.
In the following equations, constants are represented by normal font and variables by italics (except v for membrane potential of integrate-and-fire neurons).The meanings of the symbols are shown in Table 1.The unit of time is millisecond.
The model assumes that the capacity of the dendrite to contain receptors is proportional to the number of synapses: Where V d is the capacity of a dendrite, N the number of synapses, and V s a constant factor, which is the average capacity of a dendrite per synapse.
The concentration of receptors in the dendrite, c d , is given by: where w total is the (fixed) total amount of receptors in the dendrite tree; w i is the amount of the receptors in the ith synapse; and V d is the capacity of the dendrite.
We model the continuous flow of receptors between synapses and dendrite as a movement rate times the concentration of receptors on the source side: where w_i is the amount of receptors of the ith synapse, w_i/V_i is the concentration of receptors in the ith synapse, c_d the concentration of receptors in the dendrite, and v_i is the bidirectional movement rate, which is affected by lateral diffusion, endosomal trafficking and friction as described in the overview: where v_i is the bidirectional movement rate from dendrite to synapse (the direction from dendrite to synapse is positive); r is the movement rate inertia, which represents factors (e.g. properties of actin) that drive receptors to keep their direction of flow; V_i is the capacity of the ith synapse, which is affected by w_ci; c_d − w_i/V_i is a term that represents the concentration difference between synapse and dendrite, which causes motion of receptors by diffusion; a·sign(v_i)·√|v_i| is the positive feedback term of the movement, with positive feedback coefficient a; and −b·v_i is a damping term which represents friction during the motion, with damping factor b.
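As a concreteness check, the flow and rate dynamics just described can be sketched numerically. This is a minimal illustrative sketch, not the paper's implementation: the placement of the inertia constant r, the Euler integration scheme, and all parameter values are assumptions.

```python
import numpy as np

def step(w, v, V_syn, w_total, V_d, r=10.0, a=0.2, b=0.5, dt=0.01):
    """One Euler step of the synapse/dendrite receptor exchange (sketch)."""
    c_d = (w_total - w.sum()) / V_d   # receptor concentration in the dendrite
    c_s = w / V_syn                   # receptor concentration per synapse
    # Flow = movement rate times the concentration on the source side
    # (dendrite when v_i > 0, synapse when v_i <= 0).
    dw = np.where(v > 0, v * c_d, v * c_s) * dt
    # Rate dynamics: diffusion drive + positive feedback - friction,
    # divided by the inertia r (the placement of r is an assumption).
    dv = ((c_d - c_s) + a * np.sign(v) * np.sqrt(np.abs(v)) - b * v) / r * dt
    return w + dw, v + dv

rng = np.random.default_rng(0)
N, V_s = 6, 1.0
w = rng.uniform(0.2, 0.8, N)    # receptor amount per synapse
v = rng.uniform(-0.1, 0.1, N)   # bidirectional movement rates
w0 = w.copy()
w_total = 2.0 * w.sum()         # fixed total amount of receptors
for _ in range(5000):
    w, v = step(w, v, V_syn=np.ones(N), w_total=w_total, V_d=N * V_s)
```

Under reinforcement, the capacity and damping of each synapse would additionally be modulated by the learning rules described in the text.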
As shown in Fig 12, the receptors also move between neighbouring dendrite regions by diffusion: where q_d is a coefficient from concentration difference to concentration variation rate. In practice, we found that when the number of synapses is less than 33, modelling this diffusive process has little effect. Hence, in the simulations in this paper, the diffusion is treated as instantaneous. For larger numbers of synapses, neglecting the dendritic diffusion can result in collapse of the chaotic dynamics, but these can be recovered if we run simulations with limited diffusion (results not included here). As described in the Results section, a simple learning rule for this system is: where n_M is the amount of a neuromodulator that represents reward, and k_w is a coefficient controlling the learning rate. If the centre of oscillation changes only within a small range, the rate of bias can be approximated as a constant. To compensate for it, a learning rule with compensation can be applied: where k_wc is a constant factor to compensate the bias. However, if the centre of oscillation changes over a larger range, the bias is variable, and cannot be compensated using the above rule. In our model, this bias is towards positive values for a centre of oscillation above 0.5, and negative values below 0.5. As a consequence there can be a positive feedback effect that accelerates learning.
To allow learning to converge, the learning rule should also reduce the oscillation amplitude. When the modulator is present, the damping factor also increases: where b is the damping factor and k_b a coefficient.
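The two learning updates just described (shifting the centre of oscillation toward the instantaneous strength, and increasing the damping factor, whenever the modulator is present) can be sketched as follows. The explicit update forms, the compensation placement, and all parameter values are illustrative assumptions.

```python
def learn_step(w_i, w_ci, b, n_M, k_w=0.05, k_wc=0.0, k_b=0.01, dt=1.0):
    """One Euler step of the centre-of-oscillation and damping updates (sketch)."""
    w_ci = w_ci + k_w * n_M * (w_i - w_ci - k_wc) * dt   # shift centre toward w_i
    b = b + k_b * n_M * dt                               # increase damping
    return w_ci, b

# Usage: if modulator release coincides with high instantaneous strength,
# the centre of oscillation drifts upward and damping accumulates.
w_ci, b = 0.5, 0.2
for t in range(200):
    w_i = 0.5 + 0.3 * (1 if t % 20 < 10 else -1)   # toy square-wave oscillation
    n_M = 1.0 if w_i > 0.5 else 0.0                # modulator when strength high
    w_ci, b = learn_step(w_i, w_ci, b, n_M)
```

After this loop the centre sits near the rewarded strength (close to 0.8) and the damping factor has grown, which in the full model would shrink the oscillation amplitude and let learning converge.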

Acknowledgments
This work was supported by FP7 FET-Open project Minimal. We thank Matthieu Louis for discussions of earlier versions of this work.

A neuron (Fig 1 (left)) has multiple input synapses, for which the synaptic strengths spontaneously fluctuate with uncorrelated phases (Fig 1 (right), green curve) around the centre of oscillation (Fig 1 (right), blue curve). We argue in more detail below that this could be caused by receptor trafficking. The neuron receives inputs (e.g. from sensors or other neurons), and the inputs are multiplied by the synaptic strengths, summed up and passed through a non-linear function to determine the output. The output of the neuron causes some outcome (e.g. for an agent in an environment) which results in release of a neuromodulator according to a value function (Fig 1 (right), red curve). The modulator acts to bias the centre of the synaptic strength oscillation towards the instantaneous synaptic strength, and to decrease the amplitude of oscillation. Thus the synaptic strengths will converge to match the input-output properties of the neuron to the value function.

Figure 1 :
Figure 1: Basic concept of how operant learning works with a Dynamic Synapse. (Left): A neuron has multiple inputs, and its output is the sum of the inputs multiplied by the synaptic strengths, passed through a non-linear function. Because the synapses are dynamic, their values continuously change, and thus the output will explore a space of possible outputs. A value function on the output controls the release of a modulator which alters the synaptic strengths. (Right): Illustrating the dynamic synaptic strength of one synapse. During learning, the centre of synaptic strength oscillation is shifted towards the instantaneous synaptic strength that coincides with increased modulator; e.g., as illustrated, the modulator (red) is high when the instantaneous strength (green) is high, so the centre of synaptic strength is gradually increased (blue). The modulator also affects the damping of the oscillation, so the amplitude of oscillation decreases, and the learning can converge. An observer can infer the effective synaptic strength by low-pass filtering the instantaneous synaptic strength (black), but note this is only an approximation of the actual centre of oscillation, which cannot be directly observed.
As shown in Fig 2, under certain conditions, synaptic strength can change without changes in spine size, and spine size can change without changes in synaptic strength.

Figure 3 :
Figure 3: (Left) A dendrite tree consists of a dendrite (in dark brown) and multiple synapses (in light brown). (Right) A schematic diagram of the dendrite tree. Receptors can move between dendrite and synapse to dynamically modify the synaptic strength w_i around some centre w_ci.
In Fig 4, we show in simulation that our receptor trafficking model produces apparently chaotic and unpredictable oscillation of the synaptic weights. The simulated dynamic synapse system has six synapses, and the trajectory of the first three is plotted: it can be seen that it samples relatively evenly in the space of synaptic weight values. Fig 4 (right) shows how the range of exploration can be controlled. If the damping factor of a synapse increases, the oscillation in the corresponding dimension of the plot will be narrower. If the capacity of a synapse changes, the centre of oscillation of the corresponding dimension in the plot will translate. These properties are the basis of the principle by which the system can learn and converge. In this example, the periods of the oscillations are from 10 s to 20 s. With different parameters, the periods can be in a different range, such as tens of minutes or hours, and the oscillations still appear chaotic after the equivalent of several days of simulated time. It is important for learning in our model that the synaptic dynamic timescale matches the causal dynamics of the learning situation.
The neuron is a linear neuron, i.e. its output is the sum of the products of the input values and their synaptic strengths. During the simulation, the input values of the neuron are constants ranging from 0 to 5, as shown in Fig 5. The reward function is:
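The display equation for the reward function is not reproduced here; per the Fig 6 caption, the modulator is released when the output is above a threshold and increasing. Together with the linear neuron, this can be sketched as follows (the threshold value and function names are illustrative assumptions):

```python
import numpy as np

def neuron_output(inputs, weights):
    """Linear neuron: weighted sum of the constant inputs."""
    return float(np.dot(inputs, weights))

def modulator(output, prev_output, threshold=5.0):
    """Release modulator when output exceeds a threshold and is increasing
    (per the Fig 6 caption); threshold value is illustrative."""
    return 1.0 if (output > threshold and output > prev_output) else 0.0

inputs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # constant inputs in 0-5
weights = np.array([0.5] * 6)                        # instantaneous strengths
out = neuron_output(inputs, weights)                 # 0.5 * (0+1+2+3+4+5) = 7.5
n_M = modulator(out, prev_output=7.0)                # above threshold & rising
```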

Fig 6
Fig 6 (a) shows the instantaneous synaptic strengths; the labels of the lines show the constant input value of the corresponding synapses. The equilibrium synaptic strengths, which are also the average synaptic strengths, are shown in Fig 6 (b). Note that the later equilibrium synaptic strengths have the same ordering from highest to lowest as the input values. The neuron has a fixed total of receptors, for which it finds an efficient distribution across the synapses to maximise the output. Fig 6 (c) shows the output of the neuron. In the first half of the learning process, the output decreased a little because the initial value is high but not stable. In the second half, the output gradually increased.

Figure 4 :
Figure 4: Trajectories of synaptic strengths. (Left): all synapses have the same damping factors. (Right): synapse one has a higher damping factor than the others. (a) & (b) show the change over time of the synaptic strengths (the proportional number of receptors in each synapse); (c) & (d) plot the trajectory formed by the first three synapses (for (d) the synapse on the X-axis has higher damping); (e) & (f) are Poincaré maps, i.e., sections of (c) and (d) where the instantaneous synaptic strength passes the plane defined by the centre of oscillation for one synapse (blue and green are for two different directions, and the time of intersection is indicated by the intensity). It can be seen that synaptic strength oscillates chaotically and unpredictably, tracing out a search space. With higher damping factors, the amplitude of the oscillation for that synapse is decreased, reducing the search space. The periods of the oscillations can be different with different parameters.

Figure 5 :
Figure 5: A linear neuron with dynamic synapses and several constant inputs. Its output is the sum of the inputs, each weighted by the respective synaptic strength.
Training of a CPG is difficult. People often have to tune it by hand or by offline parameter optimisation, such as brute-force search or genetic algorithms. Our approach has a potential advantage in tuning or training a CPG because it can train a CPG online. This experiment shows an example of tuning a CPG to change its period. The CPG model is modified from the model described in Mori et al. (2004). The CPG is symmetric, and the synapses are replaced by Dynamic Synapses (as shown in Fig 7). The initial values of the dynamic synaptic strengths were set to the original synaptic strengths, and the initial amplitudes of oscillation of the synaptic strengths are scaled by an exponential function to be of a similar order of magnitude to the original synaptic strengths.
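One ingredient of such online tuning is the error between the CPG's current oscillation period and the target period (cf. Fig 8 (e)). A minimal sketch of estimating the period from upward zero crossings of the output trace, using a toy sinusoidal trace in place of a real CPG output; how the error then gates modulator release is an assumption not detailed here:

```python
import numpy as np

def estimate_period(trace, dt):
    """Mean interval between upward zero crossings, in the units of dt."""
    signs = np.sign(trace)
    ups = np.where((signs[:-1] <= 0) & (signs[1:] > 0))[0]
    if len(ups) < 2:
        return None                       # too few crossings to estimate
    return float(np.mean(np.diff(ups)) * dt)

dt = 1.0                                  # ms per sample
t = np.arange(0, 5000, dt)
trace = np.sin(2 * np.pi * t / 500.0)     # toy output with a 500 ms period
period = estimate_period(trace, dt)
error = abs(period - 2000.0)              # target period 2000 ms, as in Fig 8
```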

Figure 6 :
Figure 6: Simulation results of the simple linear example. The value function determining modulator release is that the output is higher than a threshold and increasing. (a) The instantaneous synaptic strengths; the labels of the lines show the input value of the corresponding synapses. (b) The central synaptic strengths. (c) The output value of the neuron. (d) The trajectory of the first three synaptic strengths. Note that the statistical output value starts to increase after unstable initial fluctuation. At the end of the learning, the centre of the oscillation of the synaptic strength shifts so that the order of strengths is the same as the order of the input values, and the synaptic strength of the synapse with the highest input value increased while the others declined, which is the most efficient way to get higher output with conservation of the total number of receptors.

Figure 8 :
Figure 8: Results of tuning the CPG with Dynamic Synapses. (a) Before learning, the period of oscillation is about 500 ms. (b) After learning, the period of oscillation is about 2000 ms. (c) The instantaneous synaptic strengths before scaling by the exponential function. As the model is symmetric, the two neurons share the same states of dynamic synapses; hence, only two synapses are plotted. The same applies in (d) and (e). (d) The centre of synaptic strength oscillation before scaling by the exponential function. (e) The error between the period of the output of the CPG and the target period during simulation. (f) The trajectory of chaotic exploration of the synaptic strength, which converged at the bottom left.

Figure 10 :
Figure 10: Sensors and neural circuits model for PuckWorld. (a) Velocity (v) sensors, distance-to-reward-source (d_r) sensors and distance-to-escape (d_e) sensors get input from four directions; a motor neuron receives all of the sensory inputs through Dynamic Synapses. (b) There are four sets of neural circuits in the player; because the neural circuits, agent and environment are symmetric, all homologous synapses are assumed to share the same dynamics and synaptic strengths to accelerate the learning. (c) The sensors indicate distances by orthogonal decomposition; when a measured object is in a direction that can be projected onto the positive direction of a sensor, the sensory value is positive, otherwise it is 0.
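The rectified orthogonal decomposition described in (c) can be sketched as follows; the vector layout and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# The four sensor directions: +x, -x, +y, -y.
DIRECTIONS = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)

def sensor_values(displacement):
    """Rectified projections of a 2-D displacement onto the four sensors:
    each sensor reports the positive part of its projection, 0 otherwise."""
    proj = DIRECTIONS @ np.asarray(displacement, dtype=float)
    return np.maximum(proj, 0.0)

# Usage: reward source 3 units to the right and 4 units above the agent.
vals = sensor_values([3.0, 4.0])   # -> [3, 0, 4, 0]
```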
The simulation results are shown in Fig 11, and were largely consistent with the analysis above, as shown in Fig 11 (a) and (c). Surprisingly, however, the synaptic strength for the negative-x distance-to-reward input (line 4 in Fig 11 (a)) is higher than the others, which means the agent would go forward when the reward source is on its left side.

Figure 11 :
Figure 11: The simulation results of Dynamic Synapse in PuckWorld. The relationships between the labelled synapse numbers and the sensors the synapses connect to are: 0, 1: x-velocity; 2, 3: y-velocity; 4, 5: d_r in x; 6, 7: d_r in y; 8, 9: d_e in x; 10, 11: d_e in y; in each case odd numbers are the inputs in the positive direction, as explained in the text. (a) Instantaneous synaptic strengths of the 12 synapses. (b) The trajectory of the first three synaptic weights; the explored range gradually converges. (c) The centres of synaptic strength oscillations. (d) The damping factors of instantaneous synaptic strength oscillation; all lines overlap. (e) A Poincaré map of the Dynamic Synapse: a section of (b) where the instantaneous synaptic strength passes its centre of oscillation. Each point is an intersection of the trajectory and the plane defined by the centre of oscillation. The blue and green points show the intersections from two different directions, and the intensity of colour indicates the time of intersection. (f) The reward R, the adaptation current I_adapt, and the reward after adaptation.
(i.e. those conveying signals that led to reward), but our model directly learns the synaptic strengths that led to reward. As the synapse dynamics reflect recent states of the synapse, exploring parameter space enables our model to solve the credit assignment problem without an eligibility trace, which is necessary for some previous models, such as the extended STDP models of Izhikevich (2007) and Gurney et al. (2015). As the time scale of synaptic strength fluctuations is longer than the synapse activity dynamics, the model can function with temporally distant reward. Exploring parameter space means that the learning concerns the overall function instead of the specific outputs of the neural circuits, so our model allows remodelling of synaptic connections independently from the action potentials of neurons, which is a potentially powerful tool for neural computation. We have proposed a possible grounding for the chaotic dynamics in the phenomena of receptor movement in dendritic spines. The model is inspired by recent evidence concerning the extent and mechanisms of these dynamics, but abstracted from the level of individual proteins to the level of the receptor flows between a dendrite and synapses as an integrated system. By focussing on postsynaptic receptor dynamics, our model can be related to synaptic mechanisms of short- and long-term potentiation and depression (STP/LTP, STD/LTD). For example, the relations between STP and LTP, as well as STD and LTD, are similar to the relation in our model between the instantaneous synaptic strength and the centre of synaptic strength oscillation. The model can be expanded to explicitly explain some phenomena during STP, LTP, STD or LTD. For example, in the STP-LTP model proposed in Xie et al.
(1997), AMPA receptors are attracted toward the activated NMDA receptors when neurotransmitter is released, then a proportion of the AMPA receptors diffuse away again. This learning rule can be implemented by adding a term k_w1 · n_T into the function describing the change of the amount of receptors in a synapse. Our model depends on several hypothetical assumptions, such as the form of the dynamics of receptor trafficking, the dynamics of the capacity to contain receptors, and the equilibrium point of receptor oscillation, which are not yet directly supported by biological research. To understand the dynamics of receptor trafficking requires continuous observation of the collective motion of receptors and of the concentration change of receptors in dendrites and synapses on timescales from seconds to hours. Similarly, understanding the dynamics of the capacity to contain receptors requires continuous observation of actin flow between synapses and dendrites, size change of synapses and size change of the postsynaptic density on similar timescales. Both types of observation are difficult but becoming experimentally more feasible; e.g. the video microscopy approaches of Zhang et al. (2015) and Esteves da Silva et al. (2015) continuously recorded the motions of proteins that can be observed as a group, enabling the concentrations and flows to be understood. Observation of the phase relations between the oscillations of the receptors or structural components would be helpful for validating our model. In our model, we assume that the instantaneous weight leads the change of the equilibrium point of receptor oscillation when the modulator is present. This could be tested by transplanting receptors to or from a synapse and giving a modulator treatment, then observing whether the synapse size or postsynaptic density changes.

Figure 12 :
Figure 12: Schematic Diagram and Symbols of Dynamic Synapse. A schematic diagram of the dendrite tree; the main variables and parameters of the model are indicated. For the meaning of the symbols, see Table 1.

We first give a verbal description of how our model represents the alteration of synaptic strength in terms of the dynamic movement of receptors, and then provide a precise mathematical formulation of the principle. Two forms of receptor trafficking can move receptors between the synapses and the dendrite. Lateral diffusion creates a passive flow along a gradient from a high-concentration region to a lower-concentration region. Endosomal trafficking acts as an active flow that can move receptors against the gradient. The active flow is formed by endosome transportation, which carries numbers of receptors. Our model has a minimal form to capture the key phenomena. Endosomal trafficking is active transportation and is modelled with a positive feedback term, which provides motive force, and two negative feedback terms, which limit the speed of transportation. The negative feedback terms are the receptor concentration gradient, which is proportional to the concentration difference between a synapse and the dendrite, and the friction of endosome transportation, which is proportional to the endosome transportation speed. These properties together produce dynamic oscillation of the number of receptors in each synapse. Because of the concentration gradient, the equilibrium point of the dynamics of endosome transportation of a single synapse is where the concentration of receptors in the synapse is the same as the concentration in the dendrite. This is also the equilibrium point of lateral diffusion. Note that because the effects of receptor synthesis and degradation on receptor concentration are slower than receptor trafficking, they are assumed to have a negligible contribution to the dynamics. The proportion of receptors in endosomes is also ignored. Hence, in our model the total number of receptors in a dendritic tree is
constant. There are two factors in addition to receptor trafficking that could affect the concentration of receptors in each synapse: the size of the synapse and the number of receptors per unit area the synapse can accommodate. The size of the synapse is affected by the activity of actin. The number of receptors per unit area a synapse can accommodate is affected by the scaffold-cytoskeleton complex. The two factors are not distinguished in the model but are jointly represented as the 'capacity' of the region to hold receptors. Thus, the equilibrium point of receptor motion can be altered by altering the capacity. The mechanism of learning in our model is to alter the capacity according to the following rule: whenever a neuromodulator signalling reinforcement is present, the instantaneous number of receptors in a synapse determines a change in its effective capacity, establishing a new equilibrium point nearer to that instantaneous value.
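The capacity-update rule just described can be sketched as follows. The explicit update form and the rate constant are assumptions for illustration: the capacity V_i is nudged toward the value at which the current instantaneous amount w_i would be the equilibrium (i.e. where w_i / V_i equals the dendritic concentration c_d).

```python
def capacity_update(V_i, w_i, c_d, n_M, k_V=0.1, dt=1.0):
    """Nudge the capacity so the equilibrium point moves toward the
    instantaneous receptor amount w_i when the modulator n_M is present
    (sketch; update form and k_V are assumptions)."""
    V_target = w_i / c_d   # capacity at which w_i is the equilibrium amount
    return V_i + k_V * n_M * (V_target - V_i) * dt

# With modulator present, capacity moves toward w_i / c_d = 1.6:
V_new = capacity_update(V_i=1.0, w_i=0.8, c_d=0.5, n_M=1.0)   # -> 1.06
# Without modulator, nothing changes:
V_same = capacity_update(V_i=1.0, w_i=0.8, c_d=0.5, n_M=0.0)  # -> 1.0
```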

Figure 14 :
Figure 14: The bias of oscillation at different centres of oscillation. The curves are instantaneous synaptic strengths, which oscillate around the centres of synaptic strength oscillation (shown as straight lines).
In practice we need to slightly modify this rule to compensate for a biased drift in synaptic strength. If, during an oscillation period, the integrated values of the differences between the instantaneous synaptic strength and the centre of oscillation on each side are not equal (shown in Fig 14 as the sizes of adjacent yellow and blue coloured areas), uncorrelated modulator release (e.g. the release experienced by a synapse that is not making any useful contribution to satisfying the value function) can cause the centre of oscillation to become biased during long training times.