Learning continuous-time working memory tasks with on-policy neural reinforcement learning

An animal’s ability to learn how to make decisions based on sensory evidence is often well described by Reinforcement Learning (RL) frameworks. These frameworks, however, typically apply to event-based representations and lack the explicit and fine-grained notion of time needed to study psychophysically relevant measures like reaction times and psychometric curves. Here, we develop and use CT-AuGMEnT (Continuous-Time Attention-Gated MEmory Tagging), a biologically plausible continuous-time RL scheme, to study these behavioural quantities. We show how CT-AuGMEnT implements on-policy SARSA learning as a biologically plausible form of reinforcement learning with working memory units using ‘attentional’ feedback. We show that the CT-AuGMEnT model efficiently learns tasks in continuous time and can learn to accumulate relevant evidence through time. This allows the model to link task difficulty to psychophysical measurements such as accuracy and reaction times. We further show how the implementation of a separate accessory network for feedback allows the model to learn continuously, also in the case of significant transmission delays between the network’s feedforward and feedback layers, and even when the accessory network is randomly initialized. Our results demonstrate that CT-AuGMEnT represents a fully time-continuous, biologically plausible reinforcement learning framework.


Introduction
The environment we live in presents a stream of information, where new events have to be recognised, some elements have to be maintained in memory, and behavior has to be adapted to optimally respond to the perceived state of the environment. Reinforcement Learning (RL) is the theoretical framework for learning from interaction with the environment, and it is deeply linked with neuroscience and psychology [1,2]. RL has been used to explain how an agent can solve complex problems, learning from very sparse and often delayed signals: rewards or punishments [3]. In many tasks, the behaviorally relevant state of the environment includes past events, like past road signs determining a current turn; learning what to remember is then crucial to constructing a compact state representation on which to act. While RL combined with modern deep learning has demonstrated impressive results in various game settings [4,5], from a biological perspective the question is how the brain accomplishes such working memory tasks.
A line of recent work has proposed a central role for attentionally gated feedback in biologically plausible deep reinforcement learning [6][7][8][9][10]. In particular, [6,7] propose AuGMEnT (Attention-Gated MEmory Tagging) as a biologically plausible neural network RL framework that implements SARSA. In AuGMEnT, a feedforward pass through the neural network computes Q-values for the various available actions from sensory inputs, and an action is selected as a function of the Q-values. Then, attentional feedback from the action selection stage is used for spatial credit assignment: feedback signals highlight only those weights that are responsible for the selection of the winning action. These connections are subsequently modified according to a globally released neuromodulator, implementing a biologically plausible form of error-backpropagation. AuGMEnT includes working memory units to store relevant sensory information, similar to Long Short-Term Memory (LSTM) [11,12].
Still, in most such models of RL, the representation of time is abstracted into discrete events sampled in order of presentation: the agent is given only behaviourally meaningful new observations to elicit an update of the agent's state and the selection of a new action. Events are thus externally defined and effectively provide the agent with the information on when a decision has to be made. Alternatively, for example in problems involving sequences of video frames as in video games, complicated frame-selection schemes are used to limit the number of actions selected [4], as standard RL methods scale poorly with the fine-grained timescales imposed by higher framerates. To map closely to typical psychophysically meaningful measures like psychometric curves and reaction times, we need to take the issue of continuous time in biologically plausible reinforcement learning seriously.
Work on powerful yet plausible models of continuous-time reinforcement learning is sparse: Bellec et al. [13] use approximate backpropagation through time (BPTT) to implement proximal policy optimization in a reinforcement learning setting with spiking neurons, for tasks which require only limited memory, and Zambrano et al. [14] developed a first continuous-time version of AuGMEnT. This continuous-time RL framework, CT-AuGMEnT, solves working memory RL problems from machine learning in continuous time through a continuous-time version of SARSA reinforcement learning coupled to an action-selection system modelled after the basal ganglia model developed by [15].
Here we expose and expand the CT-AuGMEnT framework to include effective exploration strategies, and we show how this CT-AuGMEnT decision-making framework can learn time-continuous versions of classical cognitive tasks from the literature much more efficiently than a fine-grained time-stepped version of AuGMEnT. The framework thus allows us to study an important open question in neuroscience: what is the role of time in reinforcement learning? Specifically, we study how networks trained with CT-AuGMEnT can be used to obtain reaction times and psychometric curves for classical RL tasks studied in neuroscience.
Evidence integration in continuous time has been studied in monkeys during decision-making tasks, such as motion-discrimination tasks where the optimal integration of sensory information is critical for an accurate response. We show that CT-AuGMEnT allows networks to learn to accumulate the relevant evidence by tuning the memory units to the appropriate perceptual inputs, with a performance that is comparable to that of the animals. Indeed, networks trained with CT-AuGMEnT predict how task difficulty affects both performance and reaction time: these aspects can only be modelled when continuous time is considered.
We also study a related issue in continuous-time learning: the phased nature of learning in neural networks. In standard neural networks, a feedforward 'inference' phase that computes the network's outputs (or Q-values) is alternated with a feedback phase where credit is assigned to synapses for learning [16]. We show how non-alternating, continuous learning can be implemented via a separate accessory network that carries the feedback signal [17,18]. The use of a separate accessory network to carry feedback also allows us to study networks with either symmetrical or asymmetrical feedforward and feedback weights, where the former is problematic from a biological perspective [19]. We find that an asymmetric accessory network incurs little cost in terms of network convergence.
Since, in a biological setting, such an accessory network will cause feedforward and feedback activity to be out of phase due to inevitable transmission delays between network layers, we investigate to what degree our CT-AuGMEnT networks can cope with delays in the propagation of information between layers: we find that the trained networks perform well, even for significant delays in an accessory network.
The paper is organised as follows: in Section 2, we summarize CT-AuGMEnT and introduce all the relevant components and the learning rule. In Section 3, we demonstrate the CT-AuGMEnT framework by illustrating continuous-time implementations of a number of standard working memory tasks from the neuroscience literature that expose different aspects of complexity in task learning, and show how we can model reaction-time experiments within the CT-AuGMEnT framework. In Section 4, we examine the impact of delays between layers when feedback is carried by a separate and randomly initialized accessory network. Finally, in Section 5, we discuss our findings and their context.

Related work
RL algorithms are typically derived as a solution of the Bellman equation [3] and aim to find policies for agents that optimize the obtained sum of (discounted) future rewards in an environment where the agent can select actions in a succession of specific state transitions. Reinforcement learning algorithms exist in on-policy and off-policy flavours, where on-policy algorithms like SARSA use only the experienced state-action transitions to update their policy, in contrast to off-policy algorithms like Q-learning [3]. Both SARSA and Q-learning are value-based RL algorithms, as they aim to estimate the value of a state-action pair as the expected sum of future rewards, the so-called action-values. On-policy RL algorithms like SARSA result in more conservative policies, and monkey studies have provided evidence that animal behaviour is only compatible with on-policy algorithms [20][21][22]; experimental work by [23] suggests that working memory comprises an intrinsic and crucial part of RL in humans.
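The contrast between the two update rules can be made concrete with a minimal tabular sketch in Python (the learning rate `alpha` and the toy Q-table are our illustrative choices, not taken from the models discussed here):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the TD target uses the action actually taken next (a_next)."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the TD target uses the greedy (best available) next action."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```

When the action taken in the next state is not the greedy one, the two rules produce different TD errors, which is exactly why SARSA is sensitive to the risks of its own exploration while Q-learning is not.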
For event-based and discrete-time optimization problems, reinforcement learning has been used to successfully train deep [4,5] and recurrent neural networks [12,8]. For working memory tasks, [12] demonstrated that LSTMs can be trained with the RL Advantage Learning algorithm.
In a biological context, Todd et al. [24] used a tabular actor-critic representation, where working memory is explicitly represented by a second, gating actor, which augments the current observation with past observations. The gating actor can choose to maintain its memory element or replace it with the current observation. Lloyd et al. [25] extend the gating model by comparing two learning algorithms, Actor-Critic and SARSA, with learning patterns in rats: the authors suggest that only SARSA provides the faster learning seen in animals. In this model, the motor actor is only used in the final stage of the task. Song et al. [26] proposed a working memory neural network model for decision making trained with an actor-critic off-policy algorithm called REINFORCE [27,28]. Their model shows results comparable to those presented in this paper on similar tasks, but the learning algorithm is off-policy and still formulated in discrete time. In [29] the authors proposed a biologically plausible continuous-time approximation of gradient-based supervised learning to train a recurrent neural network. To satisfy the locality constraint, the algorithm builds on the feedback alignment theory proposed in [30].
The AuGMEnT framework [6,7] implements the SARSA RL algorithm in a neural network with working memory using a biologically plausible local learning rule.
For continuous time (or very fine time-steps), RL algorithms can be obtained by solving the continuous-time equivalent of the Bellman equation, the Hamilton-Jacobi-Bellman (HJB) equation [31][32][33]. For learning action-values, Baird demonstrated that RL using the off-policy Q-learning algorithm [34] or the on-policy SARSA algorithm is theoretically infeasible in continuous time: when the time resolution increases, the effect of a single infinitesimal action on the total reinforcement becomes undetectable [35]. Advantage Learning has been proposed as a continuous-time formalisation of Q-learning [36,35]; Advantage Learning, however, is still an off-policy method that computes updates using the best available action rather than the action actually taken, and is therefore insensitive to large negative rewards (potentially fatal) during exploration [37,3, Chapter 6.5]. Continuous-time actor-critic architectures have also been proposed, where the control of actions is computed separately from an estimate of the value of the current state [33], and Bellec et al. [13] developed a spiking neural network version that implements proximal policy optimization [38]. [39] proposed a neural network model with working memory units but without hidden layers, trained with continuous-time TD learning. However, a continuous-time solution for SARSA has not been developed yet, and given the evidence for on-policy RL in the brain, this is an important gap.
In continuous-time formulations of RL, the process of decision making has to be addressed, as potentially noisy perceptual input needs to be integrated across time to make optimal decisions. In a decision-making process, the sensory-motor mapping is thought to involve cortical and subcortical structures that contribute to sensory processing, decision making and action selection. Evidence suggests that the basal ganglia contribute to the action selection process [40,41]. [42] demonstrate how an architecture composed of an evidence accumulator implemented in the cortex, together with an action selection system modelled by the basal ganglia model of [15], can optimally solve the Multiple Sequential Probability Ratio Test (MSPRT), a multi-hypothesis version of the Sequential Probability Ratio Test [43] often used to explain the brain's decision-making process. However, the decision-making model from [42] does not include learning what perceptual evidence should be integrated; [44] proposed an actor-critic architecture for learning to make decisions, but this model lacks working memory and is not defined in continuous time. Rao [45] studies a combination of Bayesian inference, Partially Observable Markov Decision Processes (POMDPs) and TD learning and shows how this approach can also solve MSPRT problems. CT-AuGMEnT differs from this work in that it formulates TD learning in continuous time and studies its implications for the tasks, and in that it adds an explicit representation of the action-value functions typical of the on-policy SARSA learning framework. CT-AuGMEnT thus serves as a model for studying how decision-making is learned in the brain based on reinforcement learning and the integration of sensory evidence within working memory.

Continuous-time action-value functions
The CT-AuGMEnT algorithm [14] is a continuous-time formalisation of the on-policy SARSA neural reinforcement learning framework described in [6,7]. The working memory units in CT-AuGMEnT employ a linear memory principle similar to the Constant-Error-Carousel in LSTMs [11,12] but substitute gating mechanisms with rectified derivative inputs [6,7] for lower learning complexity; CT-AuGMEnT is also formulated strictly in terms of RL. The model framework for CT-AuGMEnT [14] is described below for discrete time-steps of size dt: by decreasing this time-step, the model approximates continuous time.
We consider a POMDP as a continuous-time dynamical system, $f(t)$, with a discrete state set $S$ and a discrete action set $A$. At every time-step $t$, the system is in a state $s \in S$, an action $a \in A$ is selected, and the system receives a reward $r$ as a function of the current state and the selected action. The goal is to find a state-dependent policy for selecting actions, $\mu(t)$, that maximises the discounted future reward; the corresponding action-value function $Q^{\mu}$ then satisfies the self-consistency condition

$$\dot{Q}^{\mu}(t) = \frac{1}{\tau} Q^{\mu}(t) - r(t),$$

where $\tau$ is the time constant of reward discounting. This condition holds for any policy, including the optimal policy $Q^{*}(s,a)$ (see Eq. (4)), and can be used to compute the so-called Temporal Difference (TD) error [3] as:

$$\delta(t) = r(t) - \frac{1}{\tau} Q^{\mu}(t) + \dot{Q}^{\mu}(t). \qquad (7)$$

By combining the backward Euler approximation $\dot{Q}(t) = \left(Q(t) - Q(t - dt)\right)/dt$ with Eq. (7), we can derive the following discrete TD update:

$$\delta(t) = r(t) + \left(1 - \frac{dt}{\tau}\right) Q(t) - Q(t - dt),$$

where the reward $r(t)$ has been rescaled as $r(t)/dt$. If $dt = 1$ and the discount factor is $\gamma = 1 - \frac{dt}{\tau}$, we obtain the standard formulation of the discrete-time TD error. Note that the previous equation is exact when $Q^{\mu}$ is differentiable over $t$; for abrupt changes in the state or action an error may therefore occur. These abrupt changes, however, do not exist in real systems. For example, in brains the perceptual system acts as a filter for unexpected events in the environment.
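This derivation can be checked numerically with a small sketch (function and variable names are ours): rescaling the reward by dt and multiplying the continuous-time TD error by dt recovers the discrete update with gamma = 1 - dt/tau.

```python
def ct_td_error(r, q_t, q_prev, dt, tau):
    """Continuous-time TD error with backward Euler derivative:
    delta(t) = r(t) - Q(t)/tau + (Q(t) - Q(t-dt))/dt."""
    return r - q_t / tau + (q_t - q_prev) / dt

def discrete_td_error(r, q_t, q_prev, dt, tau):
    """Discrete TD update: r + (1 - dt/tau) * Q(t) - Q(t-dt),
    i.e. the standard TD error with gamma = 1 - dt/tau when dt = 1."""
    gamma = 1.0 - dt / tau
    return r + gamma * q_t - q_prev
```

Multiplying `ct_td_error(r/dt, ...)` by dt gives exactly `discrete_td_error(r, ...)`, term by term.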

Continuous-time AuGMEnT
In CT-AuGMEnT, we use as a function approximator an artificial neural network (ANN) composed of three layers of units connected by modifiable synapses (see Fig. 1), plus an action layer, the Z-layer, which specifies the currently selected action. The ANN is an abstracted representation of neural computation in the brain: in the input layer, the sensory neurons represent the stimuli; in the association layer, the stimuli are processed further; and in the Q-layer, action-values are computed. The stimuli are represented in the input layer by instantaneous (''instant" in Fig. 1) and transient (on/off) units, mimicking the behaviour of cells found in the early stages of visual cortex [46]. Instantaneous units, $x$, are active as long as the stimulus is present, while transient units $x^{+}/x^{-}$ represent positive and negative changes in sensory input:

$$x^{+}(t) = \left[\dot{x}(t)\right]^{+} = \frac{1}{dt}\left[x(t) - x(t - dt)\right]^{+}, \qquad x^{-}(t) = \left[\dot{x}(t)\right]^{-} = \frac{1}{dt}\left[x(t - dt) - x(t)\right]^{+},$$

where $[\cdot]^{+}$ is a threshold operation that returns 0 for negative values; as before, we assume a backward Euler approximation of the time derivatives for small $dt$. The instantaneous units $i$ of the input layer are fully connected to the regular (R) units $j$ in the association layer, while the transient units, written as $x^{\pm}(t) = [x^{+}(t), x^{-}(t)]$ for the on/off inputs, project to the memory (M) units. Association units apply the sigmoidal activation function $\sigma(inp) = \frac{1}{1 + \exp(\theta - inp)}$, where $\theta$ is a threshold parameter set to 2.5, with derivative $\frac{\partial \sigma(t)}{\partial inp(t)} = \sigma(t)\left(1 - \sigma(t)\right)$. The memory units are modeled as perfect integrators of their on/off inputs: their persistent activity mimics the behaviour of cells found, for example, in frontal cortex or in area LIP of the parietal cortex [47][48][49]. The second part of Eq. (10) is derived from the temporal gradient of this integration. The Q-layer receives input from the association layer through the connections $w^{R}_{jk}$ and $w^{M}_{mk}$: every neuron $q_k$ in this layer computes the action-value $Q(s, k)$ of action $k$ in the current state $s$.
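A minimal sketch of this feedforward pass (all weight shapes and values are illustrative assumptions, and the memory update is simplified to a perfect integrator of the on/off units):

```python
import numpy as np

def transient_units(x, x_prev, dt):
    """On/off units: rectified positive and negative temporal derivatives."""
    x_on = np.maximum(x - x_prev, 0.0) / dt
    x_off = np.maximum(x_prev - x, 0.0) / dt
    return x_on, x_off

def sigmoid(inp, theta=2.5):
    """Sigmoidal activation with threshold theta = 2.5."""
    return 1.0 / (1.0 + np.exp(theta - inp))

def forward(x, x_prev, mem, W_R, W_M, W_RQ, W_MQ, dt, theta=2.5):
    """One feedforward step: regular units, integrating memory units, Q-layer."""
    x_on, x_off = transient_units(x, x_prev, dt)
    y_R = sigmoid(x @ W_R, theta)                          # regular units
    mem = mem + np.concatenate([x_on, x_off]) @ W_M * dt   # perfect integration
    y_M = sigmoid(mem, theta)                              # memory units
    q = y_R @ W_RQ + y_M @ W_MQ                            # linear action-values
    return q, y_R, y_M, mem
```

The 1/dt factor in the transient units cancels against the dt in the memory integration, so the accumulated memory input is independent of the chosen time-step.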

Action selection
In CT-AuGMEnT, actions associated with the estimated action-values computed in the Q-layer continuously compete for control over behaviour. An action selection mechanism typically resolves this competition by selecting the action with the highest action-value by default and by occasionally randomly sampling from lower-valued actions. Such exploration/exploitation strategies allow the agent to find novel, more rewarding paths in the state space [3]. CT-AuGMEnT uses the Max-Boltzmann strategy.

Fig. 1. CT-AuGMEnT architecture with the action selection system (feedforward disinhibition) in the output layer. Higher activity is depicted in dark grey: these cells' activity is proportionally responsible for the action selection, and their connections are tagged correspondingly for later updating. Synaptic Tags (blue pentagons) and synaptic Traces (purple circles) are also shown. Inset: during exploration an extra input is added from the explorative Q-neuron, and consequently to all the units of the motor layer. (b) Summary of equations for the AuGMEnT versus CT-AuGMEnT models.
D. Zambrano, P.R. Roelfsema and S. Bohte, Neurocomputing xxx (xxxx) xxx

With probability $\epsilon$, a random exploration action is instead selected by sampling from the Boltzmann distribution over the Q-values:

$$P(a) = \frac{\exp\left(q_a(t)\right)}{\sum_{a'} \exp\left(q_{a'}(t)\right)}. \qquad (13)$$

This action selection rule, however, cannot be directly applied to a continuous-time setting. For action-values, [35] already demonstrated that reducing the time-step duration negatively affects the convergence rate. Intuitively, a shorter time-step corresponds to more state-action transitions, and thus to a smaller effect of each action on the final reward. Moreover, function approximators such as neural networks introduce their own imprecision into the action-value computation, exacerbating the problem. It also seems intuitively incorrect that the duration of an action, and thus its effect on the environment, should depend on the size dt of the algorithm's update.
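A sketch of Max-Boltzmann selection under these assumptions (unit Boltzmann temperature; `epsilon` is the exploration probability):

```python
import numpy as np

def max_boltzmann(q, epsilon, rng):
    """With probability 1 - epsilon pick the greedy action; otherwise sample
    from the Boltzmann distribution over the Q-values."""
    if rng.random() >= epsilon:
        return int(np.argmax(q))
    p = np.exp(q - np.max(q))   # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q), p=p))
```

With epsilon = 0 this reduces to pure greedy selection; with epsilon = 1 every action keeps a non-zero selection probability, weighted by its Q-value.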
Starting from the observation that actions in the real world have their own time requirements, CT-AuGMEnT uses an action selection system that dynamically resolves the competition among the actions in the form of a simplified model of basal ganglia operation [15]. The basal ganglia inhibit all the non-selected actions and disinhibit the selected action, i.e. the action with the highest action-value. In CT-AuGMEnT, this is achieved by connecting the Z-layer to the Q-layer with off-centre and on-surround connectivity: each neuron in the Q-layer transmits a disinhibitory signal to the corresponding neuron in the Z-layer through the connection w À z (the negative sign in Eq. (15) represents the inhibitory contribution), and it transmits a positively valued signal to inhibit all the other neurons in the Z-layer through the weights w þ z (see the connections between the Q and Z layers in Fig. 1).
The activity of the action selection system (Z) is computed from the input

$$u^{Q}_{i}(t) = -w^{-}_{z}\, q_i(t) + w^{+}_{z} \sum_{j \neq i}^{n} q_j(t). \qquad (15)$$

Here, the balance between disinhibition and inhibition has been set as $w^{+}_{z} = w^{-}_{z}/(n-1)$, where $n$ is the number of actions (i.e. the optimal solution; [15]): in the case of equal Q-values for all actions, the sum of the positive inhibitory input exactly balances the negative disinhibitory input. The activity of the Z-layer (action selection), $a_Z$, is modelled as a leaky integrator,

$$a_Z(t) = \rho\, a_Z(t - dt) + (1 - \rho)\, u^{Q}(t), \qquad (14)$$

where the constant $\rho = \exp\left(-\frac{dt}{\tau_q}\right)$ depends on the time constant $\tau_q$ of that action. This equation can be viewed as an exponential filter: as $\tau_q$ approaches 0, there is no filtering and the output equals the new input; in that case, the output action follows any variation of the Q-values. As the time constant becomes large, transient inputs are ignored. In principle, different actions can have different time constants; here, however, we endow all actions with an identical $\tau_q$. Winner-take-all behaviour in the Z-layer is guaranteed if the action with minimal activation (maximally disinhibited) is selected at every $dt$. In our model, this action is selected for a continuous period before it can be interrupted by the next action. Exploration takes place with probability $\epsilon\, dt$, and the explorative action is selected from the Boltzmann distribution of the actions' expected values, according to Eq. (13). The exploration mechanism overrides the selection mechanism by adding an external input; this strategy is compatible with the evidence observed in humans [51], where prefrontal regions associated with high-level control are active during a behavioural switch from an exploitation to an exploration strategy.
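The off-centre/on-surround input and the leaky-integrator Z-layer can be sketched as follows (the value of `w_minus` is an illustrative assumption; the balance w_plus = w_minus/(n-1) follows from requiring that equal Q-values cancel):

```python
import numpy as np

def z_input(q, w_minus=3.0):
    """Off-centre / on-surround: each action is disinhibited by its own
    Q-value and inhibited by all the others."""
    n = len(q)
    w_plus = w_minus / (n - 1)
    return -w_minus * q + w_plus * (q.sum() - q)

def z_step(a_z, q, dt, tau_q, w_minus=3.0):
    """Leaky-integrator (exponential filter) update of the Z-layer."""
    rho = np.exp(-dt / tau_q)
    return rho * a_z + (1.0 - rho) * z_input(q, w_minus)

def selected_action(a_z):
    """Winner-take-all: the maximally disinhibited (minimal) activation wins."""
    return int(np.argmin(a_z))
```

With equal Q-values the net input is exactly zero for every action, so no action is preferred until the Q-values diverge.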
In the model, an external input $I_{ex}$ is added to the explorative action in case of exploration (in Eq. (15); see the inset in Fig. 1):

$$I_{ex} = \left(\max_k q_k - \min_k q_k\right) + \left(\max_k q_k - q_{ex}\right).$$

The magnitude of $I_{ex}$ guarantees that the selected action, with Q-value $q_{ex}$, overcomes the inhibition provided by the highest action-value: $I_{ex}$ is the sum of the difference between the highest and the lowest Q-values and the difference between the highest-valued action and the selected exploratory action. The signal $I_{ex}$ is added for a fixed amount of time $T_{ex}$, set to $T_{ex} = 3\tau_q$: given the step response of the first-order linear system of Eq. (14), this corresponds to the time needed to reach 95% of the maximum activation. The exploration mechanism in our model takes into account the time constant of the selected action: a longer action time constant implies longer exploration; here we used identical time constants for all actions, for simplicity.
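The magnitude of the explorative input, as described above, can be sketched as:

```python
def exploration_input(q, a_ex):
    """I_ex = (max(q) - min(q)) + (max(q) - q[a_ex]): enough external drive
    for the explorative action a_ex to overcome the current winner."""
    return (max(q) - min(q)) + (max(q) - q[a_ex])
```

When the explorative action already has the highest Q-value, the second term vanishes and the drive reduces to the Q-value range.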

Feedback connections gate learning
In CT-AuGMEnT [6,7], two factors modulate the network's plasticity during learning: the TD-error, which in the brain would be signalled by a global neuromodulatory signal (such as the global release of dopamine or acetylcholine), and an attentional feedback signal that is propagated from the response selection stage back to earlier processing levels and gates the plasticity; both signals contribute to learning. Since the action-value function is estimated by a function approximator, a convenient way to reduce the TD-error is gradient descent on the squared prediction error [52,53,7]:

$$\delta(t) = r(t) - \frac{1}{\tau} Q^{\mu}\left(s(t), a(t); \mathbf{w}\right) + \frac{\partial Q^{\mu}\left(s(t), a(t); \mathbf{w}\right)}{\partial t},$$

where $\mathbf{w}$ is the vector of the ANN's parameters and $\delta$ is the TD-error, also known as the reward prediction error. The feedback is provided by the unit that encodes the selected action $a$, and it makes the synapses responsible for the current selection eligible for plasticity by creating synaptic tags (blue pentagons in Fig. 1). As defined in [6], synaptic tags are equivalent to eligibility traces, which, as in [33], have the form:

$$\dot{Tag}(t) = -\frac{1}{\phi} Tag(t) + \frac{\partial Q^{\mu}\left(s(t), a(t); \mathbf{w}\right)}{\partial \mathbf{w}},$$

where $0 < \phi < \tau$ is the time constant of the tag decay. Thus, by discretising Eq. (19), the tag updates for the synapses connecting the regular (R) and memory (M) units to the Q-layer are defined as:

$$Tag_{jk}(t) = \lambda\gamma\, Tag_{jk}(t - dt) + y_j(t)\, z_k(t),$$

with $z_k = 1$ for the selected action ($k = a$) and $z_k = 0$ elsewhere ($k \neq a$), as defined by Eq. (16) (Appendix B gives the full derivation of these updates). Hence, the selected action $a$ provides feedback and thereby enables the plasticity of connections to the winning output unit $a$. Note that to be fully local, the winning action's activity has to be visible to the connections in the Q-layer. This can be achieved through an accessory feedback network as described in Section 4, combined with a fast winner-take-all circuit [54]. In the discrete-time AuGMEnT the tag decay was defined as $\alpha = (1 - \lambda\gamma)$; here, to be consistent, we define $\lambda = \left(1 - \frac{dt}{\phi}\right)/\left(1 - \frac{dt}{\tau}\right)$.
As a result of the tag update equation, we observe that the association units that provided stronger input to the winning action $a$ also receive stronger feedback: they will be held responsible for the outcome of the action, and their connections increase in strength if $\delta(t)$ is positive but decrease in strength if $\delta(t)$ is negative. Equivalently, tags on connections between instantaneous units in the input layer and regular units in the association layer depend on the activity of the input-layer units themselves and on the feedback activity from the selected action to the regular unit in the association layer:

$$Tag_{ij}(t) = \lambda\gamma\, Tag_{ij}(t - dt) + x_i(t)\, \sigma'_j(t)\, w'^{R}_{aj},$$

where $x_i$ is the presynaptic activity in the input layer, $\sigma'_j$ depends on the postsynaptic activity of the regular association-layer unit, and $w'^{R}_{aj}$ is the feedback weight from the winning output unit to unit $j$; all three signals are locally available at the synapse (see also the full derivation in Appendix B). In this formulation, the feedback connections $w'^{R}_{aj}$ and feedforward connections $w^{R}_{ja}$ have the same strength, though as pointed out in [18] this is not a necessary requirement and symmetry can emerge during the learning process.
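A sketch of this tag update from its three locally available factors (matrix shapes are illustrative; `fb_w` holds the feedback weights from the winning action):

```python
import numpy as np

def sigmoid_deriv(y):
    """Derivative of the sigmoid expressed through its output: y * (1 - y)."""
    return y * (1.0 - y)

def update_input_tags(tags, x, y_R, fb_w, lam, gamma):
    """Decay existing tags by lam * gamma, then grow them from the product of
    presynaptic activity x_i, the postsynaptic sigmoid derivative, and the
    feedback weight from the winning action (all three local to the synapse)."""
    return lam * gamma * tags + np.outer(x, sigmoid_deriv(y_R) * fb_w)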
As in AuGMEnT [7], we use synaptic traces (purple circles in Fig. 1) on the connections between sensory units and memory cells for working memory learning: the traces build up over time whenever the presynaptic unit is active. Traces are transformed into tags by feedback from the response selection stage, just as is the case for the tags on the connections between the input layer and the regular units in the association layer.
The plasticity of all synapses (to either R or M units) follows

$$w(t + dt) = w(t) + \beta\, \delta(t)\, Tag(t),$$

where $\beta$ is the learning rate; this shows that only when tags are formed do the synapses become susceptible to the TD-error $\delta(t)$, as encoded by the neuromodulator.
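A sketch of the resulting three-factor synaptic update (beta = 0.15 as in the simulations reported below; names are ours):

```python
import numpy as np

def update_weights(w, tags, delta, beta=0.15):
    """Three-factor rule: weights change only where tags exist, scaled by the
    globally broadcast TD error delta and the learning rate beta."""
    return w + beta * delta * np.asarray(tags)
```

A positive delta strengthens tagged synapses, a negative delta weakens them, and untagged synapses are untouched regardless of delta.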
Since the weight update uses the current estimate of the error $\delta(t)$ and the value of the tags, the traces and tags have to be updated after the weights. As is common in discrete-time RL, memory units, tags and traces are reset to zero at the end of every trial, and the transition to the terminal state generates a $\delta$ error computed with the expected reward set to zero. In the brain, tags and traces would reset either passively, through temporally spaced trials, or actively, for example using internal reset actions [55]. For CT-AuGMEnT, an equivalent method is to update the network for an entire unit of time ($1/dt$ timesteps of size $dt$) with the $\delta$ error computed with $Q\left(s(t), a(t)\right) = 0$, where $T_{end}$ denotes the final event or timestep of the respective task. In Appendix B we show how these learning rules minimise the expected squared prediction error.
Summarising the adaptation of AuGMEnT to continuous time, the equations that map the forward neural computation in AuGMEnT to CT-AuGMEnT are shown in Fig. 2.

Solving continuous-time working memory RL tasks
First, we compare the CT-AuGMEnT model to the event-based version of the AuGMEnT method to demonstrate the limits of time-stepped representations for neuroscience modelling and the need for effective decision-making circuits and related exploration mechanisms. For this, we study three classical tasks from the neuroscience literature: a Saccade/anti-Saccade task, a Delayed Match to Category task, and a two- and four-choice Motion Discrimination Task. The Saccade/anti-Saccade (SaS) task as presented in [7] models a classical problem [48] which requires learning a non-linear XOR-like mapping. The Delayed Match to Category (DMC) task introduced in [56] demonstrates continuous-time evidence collection and decision making, while the two- and four-choice Motion Discrimination Task (MDT) allows us to study the link between continuous-time learning and psychophysical measurements like reaction times (RT) and performance. In Appendix C, two other tasks are described: the Motion-or-Colour (MoC) task from [57], which combines continuous-time evidence integration with non-linear mapping, and the T-Maze (TM) task from the machine learning literature, where an agent has to reach the end of a corridor of length N and then make a decision; for the latter task, we compare CT-AuGMEnT with a continuous-time version of LSTM. The meta-parameters for all the simulated tasks are set to $\beta = 0.15$, $\lambda = 0.20$, $\gamma = 0.90$, $\epsilon = 0.025$ and $\theta = 2.5$. To compare with the same parameters used in [7], we use $\tau$ and $\phi$ computed for $dt = 1$, and $\lambda$ and $\gamma$ are then scaled accordingly with respect to $dt$. The initial synaptic weights are drawn from a uniform distribution $U[-0.25, 0.25]$. At the end of the learning phase, we test the network by evaluating the accuracy of the responses for every condition with $\beta$ and $\epsilon$ set to 0 (so that learning and exploration are switched off). The accuracy for the Saccade/anti-Saccade task is reported in Table 2.
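The dt-scaling of the meta-parameters can be sketched as follows, using gamma = 1 - dt/tau for the discount and lambda * gamma = 1 - dt/phi for the tag decay (our reading of the definitions above):

```python
def scaled_params(gamma_1, lambda_1, dt):
    """Recover tau and phi from the dt = 1 values of gamma and lambda,
    then rescale gamma and lambda for an arbitrary time-step dt."""
    tau = 1.0 / (1.0 - gamma_1)               # from gamma = 1 - dt/tau at dt = 1
    phi = 1.0 / (1.0 - lambda_1 * gamma_1)    # from lambda*gamma = 1 - dt/phi at dt = 1
    gamma = 1.0 - dt / tau
    lam = (1.0 - dt / phi) / (1.0 - dt / tau)
    return gamma, lam, tau, phi
```

For gamma = 0.90 and lambda = 0.20 at dt = 1 this gives tau = 10, so at dt = 0.1 the effective per-step discount becomes gamma = 0.99.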
For the Delayed Match to Category and the two- and four-choice tasks, due to the presence of noise in the input, we report the number of networks that achieve the convergence criterion. Table 1 summarises the full implementation details; all results in this section are reported for networks with symmetric feedback weights, as in the original AuGMEnT implementation. As in AuGMEnT, the network architectures use relatively few neurons, which proved sufficient to learn the tasks.
The Saccade/anti-Saccade (SaS) task, [6,7,14] is used to study working memory in monkeys [48] and requires the maintenance of a presented cue in working memory and a cue-dependent action to be taken at the go signal. This task is an example of a task where a non-linear mapping between the state and action space has to be computed so that the network needs to have at least one hidden layer. The continuous-time implementation demonstrates how the learning process depends on the timed phases of the task, such as the time duration of the cue or the delay phase.
As illustrated in Fig. 3a, the SaS task starts with an empty screen; then the Fixation phase begins, during which the respective mark is shown: a small reward r fix is given if the agent succeeds in maintaining fixation for 2 s. Then, a cue appears on either the left or the right of the fixation mark: the Cue phase (red circle in Fig. 3a). The cue is presented for 1 s, after which a Delay phase of 1 s follows, during which only the fixation mark is presented. Any interruption of fixation during this phase (e.g. an eye movement towards the red cue) terminates the task without reward. The disappearance of the fixation mark signals the Go phase, where an eye movement is requested.
In the SaS task, the type of fixation mark determines the strategy to adopt: a cross mark requires a pro-saccade decision, while a triangle mark requires an anti-saccade. In the pro-saccade condition, the agent has to move its eyes toward the remembered location of the cue, while in the anti-saccade it has to move its eyes in the opposite direction. Only the correct choice is rewarded with r fin .
The AuGMEnT and CT-AuGMEnT networks comprise 4 input units: two signal the presence of the fixation marks (cross or triangle, signalling the pro-saccade or anti-saccade condition, respectively) and the other two signal the two possible cues (left or right). The networks contain 4 regular and 4 memory units in the association layer, and 3 output neurons corresponding to the 3 actions they can take: fixate in the centre of the display, move the eyes left, or move the eyes right.
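A rough sketch of these dimensions, with AuGMEnT-style memory units that integrate input on/off transients, may make the architecture concrete. The weight scales, sigmoid choice, and random initialisation below are illustrative assumptions, not the trained parameters:

```python
import numpy as np

# Sketch of the SaS network: 4 inputs, 4 regular + 4 memory association
# units, 3 Q-value outputs. Memory units accumulate input *transients*
# (on/off cells), as in AuGMEnT; all numerical choices are illustrative.
rng = np.random.default_rng(0)
n_in, n_reg, n_mem, n_out = 4, 4, 4, 3

V_reg = rng.normal(0, 0.1, (n_in, n_reg))      # input -> regular units
V_mem = rng.normal(0, 0.1, (2 * n_in, n_mem))  # on/off transients -> memory
W = rng.normal(0, 0.1, (n_reg + n_mem, n_out)) # association -> Q-values

def step(x, x_prev, mem):
    """One feedforward pass; mem persistently integrates transient drive."""
    dx = x - x_prev
    trans = np.concatenate([np.maximum(dx, 0), np.maximum(-dx, 0)])  # on/off
    y_reg = 1.0 / (1.0 + np.exp(-(x @ V_reg)))  # instantaneous units
    mem = mem + trans @ V_mem                    # integration over time
    y_mem = 1.0 / (1.0 + np.exp(-mem))
    q = np.concatenate([y_reg, y_mem]) @ W       # action values
    return q, mem
```

Because `mem` only changes on input transients, the memory units keep a trace of the cue through the delay phase while the input itself is constant.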
The results for the SaS task are plotted in Fig. 3b,c. We see much better scaling behavior for CT-AuGMEnT as compared to AuGMEnT (Fig. 3b): AuGMEnT quickly fails to converge for smaller dt, whereas CT-AuGMEnT successfully learns the task for every dt and for both action time-constants τ_q, with similar learning curves for the two time-constants (Fig. 3c). CT-AuGMEnT learns the task with a moderate increase in trials for increasing time resolution (Table A3.3), which is likely due to the structure of the task: there are specific moments when exploration is most effective, and these moments become less likely to be selected for explorative actions as the time resolution increases. For example, an explorative action taken at the beginning of the Go phase is much more effective than the same choice made earlier or later in time, which necessarily results in a break of fixation that aborts the trial without reward. In event-based representations, explorative or exploitative actions are chosen at these specific and crucial task-related moments.
The Delayed-Match-to-Category (DMC) task was originally introduced in [56]. This is a decision-making task in which sensory evidence has to be collected and memorised to be compared to subsequent sensory inputs; the decision is made as soon as enough evidence from the second motion direction is collected. The agent has to map 12 motion directions of a cloud of moving dots onto two categories (red or blue arrows in the inset of Fig. 4a). In [7], every motion direction was modelled as a different input signal for one time-step; here, the same task is modelled in continuous time. The agent has 12 input units, each tuned to one of 12 different motion directions (from 0° to 330°, spaced by 30°) with a Gaussian tuning function (μ = i·30°, where i is between 0 and 11; σ = 30°) and a receptive field including all dots, so that each input unit receives input from all the moving dots. At every dt, 100 dots are generated, representing one of the motion directions to be categorised, and Gaussian noise (μ = 0°, σ = 15°) is added to the dots' motion. This process is shown in the inset of Fig. 4a. Here, the amount of noise has been modelled for the task with dt = 0.03, which is similar to the update frequency of motion-dots tasks for monkeys. Note that we consider the motion as a property of the dot that is perceived within the receptive field of the input neurons; accordingly, the dot motion value is affected by a measurement error, which is the type of noise we modelled.

Fig. 4. a DMC task. The agent has to discriminate whether the second motion direction belongs to the same or the opposite category by making an eye movement to the right or left, respectively. In the inset: at every dt a new set of dots is presented, each representing the current motion direction plus noise. The motion direction belongs to one of the two categories (blue or red arrows) and is chosen from 12 possible directions (from 0° to 330°, with a step size of 30°), with a category boundary separating the two categories. b Comparison of the event-based version of AuGMEnT (blue line, event-based) and CT-AuGMEnT (red line τ_q = 0.5 and yellow line τ_q = 0.1). We plot the number of trials needed to reach convergence for the task; the abscissa denotes the effective size of dt used for the simulations. c Comparison of the learning curves for the two action time-constants (red line τ_q = 0.5 and yellow line τ_q = 0.1) at dt = 0.1.

D. Zambrano, P.R. Roelfsema and S. Bohte
Neurocomputing xxx (xxxx) xxx
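The input encoding just described can be sketched directly: 100 noisy dot directions per frame, read out by 12 Gaussian-tuned units. Averaging the tuning responses over the dots and the RNG seed are illustrative assumptions:

```python
import numpy as np

# Sketch of the DMC input encoding: 100 dots per dt with Gaussian
# direction noise (sigma = 15 deg), read out by 12 units with Gaussian
# tuning (mu = i*30 deg, sigma = 30 deg). Normalisation is assumed.
rng = np.random.default_rng(1)
PREF = np.arange(12) * 30.0          # preferred directions, degrees
SIGMA_TUNE, SIGMA_NOISE, N_DOTS = 30.0, 15.0, 100

def circ_dist(a, b):
    """Smallest angular difference in degrees."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def input_activity(direction_deg):
    """Population response of the 12 tuned units to one frame of dots."""
    dots = direction_deg + rng.normal(0.0, SIGMA_NOISE, N_DOTS)  # per-dot noise
    d = circ_dist(dots[:, None], PREF[None, :])                  # (100, 12)
    g = np.exp(-0.5 * (d / SIGMA_TUNE) ** 2)
    return g.mean(axis=0)  # each unit's receptive field covers all dots
```

For a 90° stimulus, the unit tuned to 90° (index 3) responds most strongly, with graded activity in its neighbours; the per-dot noise implements the measurement error mentioned in the text.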
To speed up learning, and as in [18], we introduce a pre-training session that teaches the agent to group each motion direction into one of the two categories: the agent has to fixate a fixation mark in the centre of the display for 1 s, receiving a shaping reward of r_fix = 0.2; the dots are then presented for 650 ms while the agent has to maintain fixation. Next, the stimulus is removed and the agent has to select one of the two categories by moving its eyes to the left or right. Learning ends when at least 80% of the answers were correct in the last 50 trials for each category. After the pre-training phase, the task begins: the agent has to fixate the fixation mark for 1 s (worth r_fix), after which a first phase of dots is presented for 650 ms, with the direction of the dots chosen from one of the 12 directions. If the agent selects an action other than Fixate, the trial is aborted without reward. After a delay phase of 1 s, another motion phase follows, with the dots moving in a new direction: the agent has to indicate whether the two motion directions belong to the same category by selecting the action Right, or the action Left when they do not (see the time line of events in Fig. 4a).
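The stopping rule of the pre-training phase (at least 80% correct over the last 50 trials, for each category) amounts to per-category sliding-window bookkeeping; a minimal sketch, with the class name and structure as assumptions:

```python
from collections import deque

# Sketch of the pre-training stopping rule: >= 80% correct over the
# last 50 trials *for each category*. Window and threshold follow the
# text; the bookkeeping class itself is an illustrative assumption.
class ConvergenceMonitor:
    def __init__(self, window=50, threshold=0.8, categories=(0, 1)):
        self.threshold = threshold
        self.hist = {c: deque(maxlen=window) for c in categories}

    def record(self, category, correct):
        self.hist[category].append(1 if correct else 0)

    def converged(self):
        # Require a full window AND the accuracy threshold per category.
        return all(
            len(h) == h.maxlen and sum(h) / len(h) >= self.threshold
            for h in self.hist.values()
        )
```

The same structure, with a 90% threshold measured across coherence conditions, matches the criterion used later for the full MDT task.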
As shown in Fig. 4b,c, CT-AuGMEnT successfully learns the DMC task within considerably fewer trials than AuGMEnT (Fig. 4b); AuGMEnT completely fails to learn the task for smaller dt (Table A3.3), whereas CT-AuGMEnT requires only a small increase in the number of trials needed to reach convergence. Note also the limited dependence of the learning curves on the action time-constant (Fig. 4c).
We illustrate the integration of evidence by the working memory units of fully trained CT-AuGMEnT networks for the DMC task in more detail in Fig. 5, and compare this to neurophysiological data. We plot the activity of a number of example neurons that have been recorded in area LIP of the parietal cortex of monkeys (top row) and of a trained model with dt = 0.03 and τ_q = 0.5 (bottom row). Motion directions near the category boundary between the two motion categories are more difficult to discriminate in the presence of noise. As a result, the accuracy of the model is lower near the category boundary (Fig. 5a for the original data from [56], b for CT-AuGMEnT). Fig. 5c and d show the activity of example memory cells for all 12 motion directions in the different phases of the task: just like hand-wired models of decision-making with working memory [58], the memory cells learn to maintain sensory evidence about the category of the sample during the delay period. Similar to the neuronal data recorded in LIP of parietal cortex (Fig. 5c), the activity of memory cells in the model is specific for the category of the motion (Fig. 5d, blue or red lines). We also find that the response of individual memory cells ramps according to the amount of evidence encountered in different conditions, just as LIP neurons do. For motion directions close to the category boundary (dashed lines) the category selectivity is less pronounced. When increasing the number of memory units, the category specialization still holds (see Figs. A5.2 and A5.3). In contrast to the LIP data, in CT-AuGMEnT the evidence for one category is integrated through time but does not exhibit the initial transient response visible in LIP; this transient response may be related to neural adaptation processes, which are not modelled in CT-AuGMEnT. Fig. 5f plots the tuning of two memory cells to motion direction: memory cells have become selective for the category of the motion stimulus, just like the neurons in area LIP (Fig. 5e).
In the two- and four-choice Motion Discrimination Tasks (MDT), the accuracy and speed of the choices of monkeys are measured to evaluate how multiple alternatives affect the decision process [59]. This decision-making task allows us to study how different degrees of motion coherence affect both the accuracy and the reaction times of the agent. The sequence of task events is shown in Fig. 6a. The agent has to fixate on the fixation mark for 1 s (worth r_fix = 0.2); then 2 or 4 targets are shown for another second (blue diamonds in Fig. 6a). Next, 100 moving dots are presented, similar to the DMC task (inset in Fig. 6a). The coherence of the dots' motion is varied across trials: a fraction [0%, 3%, 6%, 9%, 26%, 51%, 72%, 76%] of the dots moves coherently in the target direction, while the other dots move in random directions (chosen uniformly between 0° and 360°). At every dt = 0.03, new dots are presented. The agent has to respond as quickly as possible by making an eye movement in the direction cued by the dots' motion. Reaction times are measured from the dots' presentation until the model selects one of the saccade targets in the Z-layer. In the four-choice condition the motion directions are 90° apart, while in the two-choice condition they are 180° apart. As in the DMC task described above, we model the tuning curve of units in the input layer with a Gaussian function with mean centred on one of the 12 motion directions and σ = 30°. For the full task, training is interrupted as soon as the model reaches an accuracy of 90% over the last 50 trials, measured across all conditions with at least 51% motion coherence. As shown in Fig. 6b, CT-AuGMEnT learns the task for both action time-constants at approximately the same rate of improvement. CT-AuGMEnT also achieves good convergence (Table A3.3), though substantially better and faster for the faster action time-constant.

Fig. 6. a MDT task. The agent has to discriminate the dots' motion direction as soon as possible by making an eye movement to one of the two or four targets shown (blue diamonds; the green diamond is the one selected by the model). In the inset: at each dt a new set of dots is presented. A fraction C moves coherently, toward the target; the others move randomly. b The learning curves for the two action time-constants (red line τ_q = 0.5 and yellow line τ_q = 0.1) at dt = 0.1.

Fig. 7. Comparison between monkey data [59] (left) and model results (right) for the two- and four-choice MDT task. Similar to the monkey data, accuracy increases with increased coherence. In both the two- and four-choice conditions, the model and the monkey learn to choose an action directed to a target rather than to maintain fixation, even in the absence of coherent motion (top row). In the bottom row, we show that the model approximately matches the reaction times observed in the monkey data. More evidence results in faster reaction times and more choices result in slower responses.
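The coherence manipulation described above can be sketched as a stimulus generator: a fraction C of the dots shares the target direction, the rest are uniform. This mirrors the text but is not the authors' stimulus code; the seed is an assumption:

```python
import numpy as np

# Sketch of the MDT stimulus: at each dt, a fraction `coherence` of the
# 100 dots moves toward the target direction and the remainder move in
# uniformly random directions (0-360 deg).
rng = np.random.default_rng(2)

def mdt_frame(target_deg, coherence, n_dots=100):
    """Return per-dot motion directions (degrees) for one frame."""
    n_coh = int(round(coherence * n_dots))
    coherent = np.full(n_coh, float(target_deg))
    random_dirs = rng.uniform(0.0, 360.0, n_dots - n_coh)
    return np.concatenate([coherent, random_dirs])
```

At 0% coherence every dot direction is random, so the agent can only guess (0.5 or 0.25 accuracy for two and four choices); at 51% and above, the coherent subset dominates the population response of the tuned input units.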
The MDT task allows us to study the ability of the CT-AuGMEnT model to commit to a choice when evidence is provided continuously through time. Monkeys exhibit an increase in reaction times (RTs) as a consequence of a larger number of alternatives [59]. In CT-AuGMEnT, the discount factor γ encourages the model to make decisions as quickly as possible. We compare the measured reaction times in the trained models to the original data from the MDT choice task. Fig. 7 illustrates the similarities between the monkey data (left) and the trained model (right). When more choices are possible, the task becomes more difficult: the number of correct trials decreases and the reaction times increase. When the motion coherence is 0%, the probability of a correct choice is 0.5 for the two-choice condition (blue line) and 0.25 for the four-choice condition; the agent always responds correctly for high motion coherence (>75%). Importantly, CT-AuGMEnT correctly predicts an increase in reaction time when the number of choices increases. For the MDT task, we also find that the action-values are closer together at the start of the four-choice condition, demonstrating higher uncertainty in the action selection and thus longer reaction times.
As was done for the DMC task, a qualitative comparison can be made for the MDT between our model predictions and electrophysiological data (Fig. 8). Again, note that the recorded neural activity only matches our model in the ordering of activity magnitudes, as we do not explicitly model neuron dynamics.
Overall, for the two working memory tasks (SaS, DMC), we find that reducing the time-step size dt affects AuGMEnT both in the number of trials to reach the convergence criterion and in the percentage of networks that correctly learn the task (Table A3.3, first column). In particular, for dt = 0.1 none of the AuGMEnT networks reached convergence on any of the tasks. As CT-AuGMEnT with shorter action time-constants behaves more like AuGMEnT, we find that this indeed results in somewhat lower convergence rates (Table A3.3, second and third columns); for the SaS task the action time-constant also affects the number of trials to reach convergence, in that decreasing the action time-constant in CT-AuGMEnT increases the number of trials needed. This is mainly due to the effect of the action time-constant on the duration of exploration: longer time-constants imply longer explorative actions, which have a higher impact on the task (more credit is assigned). For the DMC task, however, the agent reaches the convergence criterion faster with a shorter action time-constant. The reason is likely the nature of the task: the DMC is effectively a decision-making task under uncertain information, and a faster action time-constant corresponds to a lower threshold in the decision-making process (since a faster action time-constant induces a more rapid switch of actions, and thus a smaller amount of evidence is needed to make a decision); this effect can also be seen in the MDT decision-making task.
Comparing the speed of learning in CT-AuGMEnT to that of animals, we note that monkeys typically require tens of thousands of trials (about 1,200 trials per day, for weeks to months) to learn a task of a complexity similar to the DMC [60]. Learning in networks trained with CT-AuGMEnT is therefore substantially faster than learning in experimental animals. Finally, learning in animals seems compatible with the RL framework. Specifically, the activity of dopaminergic neurons in specific regions of the midbrain has been correlated with the reward-prediction error (RPE) hypothesis [1]. However, during working memory tasks as modelled in this paper, and more importantly during learning, dopaminergic neuron activity has, to the best of our knowledge, not been recorded. CT-AuGMEnT can predict, in accordance with other TD-learning-based models, what this activity should look like. The RPE is generally high at the beginning of training, when an unexpected reward is provided (Fig. 9). During learning, the RPE shifts toward the first cue that is correlated with the upcoming reward. This behavior conforms with the data recorded by [1].

Fig. 8. Data from [59] for three motion strengths. Top row is for two choices and bottom row for four choices. e-h Averaged memory-neuron activity during a 10 k random test-set MDT for similar motion coherences. The same memory unit was considered. Time is shown referenced to the motion onset. CT-AuGMEnT shows several similarities with the electrophysiological data: 1) memory units increase their activity during the dots presentation (e,g); 2) the ordering for different motion strengths is preserved (e,g); 3) the four-choice condition has lower activity (g). One-to-one matching is not possible, however, as neural dynamics are not modelled here.

An accessory feedback network
While CT-AuGMEnT can successfully learn tasks in continuous time, the algorithm requires a feedforward phase followed by a feedback phase after action selection for credit assignment, which is then combined with the reward-prediction error to determine the synaptic changes. The weight updates result in a biologically plausible implementation of a learning rule that approximates error backpropagation in standard ANNs, provided that there is enough time for the feedforward and feedback interactions.
As communication between neurons, and between layers of neurons, is not instantaneous, we investigate the possible influence of delays in the feedforward and feedback pathways. [18] already suggested that feedback can be carried by a separate accessory network, similar to the feedback networks proposed by [17] for assigning credit to synapses in lower layers. Additionally, learning should still converge even when the weights of the accessory network differ from those of the feedforward network [18]. Here, we investigate the impact of transmission delays, which create a (partial) temporal mismatch between the forward and backward phases.
We implement an accessory network as shown in Fig. 10a. The accessory network is composed of two layers of neurons, described with the superscript S, connected by randomly initialised weights that differ from those in the feedforward network (the orange feedback network in Fig. 10a). The accessory network takes as input the executed action determined by the Z-layer, and it carries the feedback signal needed for the weight updates of the feedforward network (red arrows in Fig. 10). The neuron with the weakest inhibition is the only unit that provides the feedback signal that gates plasticity, $z^S_a = \delta_{aK}$, where $\delta_{aK}$ is the Kronecker delta and $K$ is the selected action. The association-layer activity in the accessory network is then computed by propagating this signal through the accessory weights, where we distinguish between the units that carry feedback to the regular association units $j$ and those that carry feedback to the memory units $m$. For the accessory network, we adopt a linear transfer function for the neural activation. Although using a linear activation function might be justified for small signals, it is not a strict requirement for CT-AuGMEnT to learn; other, saturating functions can be used as well [61]. This network can be used to compute the weight updates; however, due to the transmission delay D, the feedback signal can potentially be associated with an earlier stimulus than the one currently processed by the feedforward network (see Fig. 10b). Both forward and backward weights are modified through learning. We use the same update as for the equivalent feedforward neurons, as shown in [18], thus Eqs. (20) and (24) apply, where the index J denotes either regular or memory units in the association layer.
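The core mechanics, a one-hot action signal propagated through random linear feedback weights and arriving D steps late, can be sketched with a FIFO delay line. The weight shapes, sizes, and the deque-based delay line are illustrative assumptions:

```python
import numpy as np
from collections import deque

# Sketch of the accessory feedback pathway: the executed action enters as
# a one-hot vector (the Kronecker delta delta_{aK}), passes through
# randomly initialised linear feedback weights, and reaches the
# association layer only after a transmission delay of D time-steps.
rng = np.random.default_rng(3)
n_actions, n_assoc, D = 3, 8, 9  # D = 9 steps of 10 ms = 90 ms delay

W_fb = rng.normal(0.0, 0.1, (n_actions, n_assoc))  # random accessory weights
delay_line = deque([np.zeros(n_assoc)] * D, maxlen=D)

def feedback_step(selected_action):
    """Push this step's feedback in; pop the D-steps-delayed signal out."""
    z = np.zeros(n_actions)
    z[selected_action] = 1.0   # delta_{aK}: only the chosen action K fires
    fb_now = z @ W_fb          # linear transfer, as in the text
    delayed = delay_line[0]    # the signal emitted D steps ago
    delay_line.append(fb_now)  # maxlen deque drops the oldest entry
    return delayed
```

During the first D steps the association layer receives no (zero) feedback; thereafter each feedback vector gates plasticity for a stimulus the feedforward network processed D steps earlier, which is exactly the mismatch studied below.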
Effect of transmission delays. We examined how this continuous feedback architecture with an accessory network can learn the previously introduced tasks in the presence of transmission delays.
The same meta-parameters and network architectures are used as before, but now the weights in the feedforward and feedback networks are randomly and differently initialised. We introduce delays between the layers of the feedforward and feedback networks: from the sensory layer to the association layer, from the association layer to the Q-layer of the feedforward network, and between the layers of the accessory network (see Fig. 10, D). The simulation step-size was set to dt = 0.01 (corresponding to 10 ms per time step) for all tasks. We evaluate the architecture with delays D = [0, 3, 9, 25, 50] time-steps of 10 ms each: the maximum duration of a ''round-trip" of activity through the network is then 1.5 s. Note that there is no delay between the Q-layer and the action-selection model: action selection has its own dynamics, which we fix for all simulations (τ_q = 0.5), except for the MDT task where we used τ_q = 0.1. We report results based on runs with 100 randomly initialised networks. For the TM and SaS tasks, the same convergence criterion is used as for the standard implementations, and every network is given 2.5 × 10^4 trials to learn the task; for the DMC and MDT tasks we allowed networks a maximum of 5 × 10^4 trials to reach the convergence criterion (see Table 3).
We find that CT-AuGMEnT is still able to learn all tasks when an accessory network is used to carry the feedback activity of the output layer, both without delays and when small delays are introduced between the network's layers: Fig. 11 plots the percentage of converging networks for the three tested tasks. A reduction in convergence rate is evident after about 100 ms of total delay. This is likely due to the mismatch between the forward processing and the feedback information provided by the accessory network during learning. The development of the feedback activity during learning is detailed in Appendix E.

Discussion
We presented CT-AuGMEnT, a biologically plausible framework for continuous-time neural SARSA reinforcement learning with working memory, formulated in the limit of small time-steps in a discrete-time model. For the weight update, we exploited the same principles as in the original time-step formulation of AuGMEnT: a combination of a neuromodulatory signal coding for the temporal difference and an 'attentional' feedback signal that tags those synapses that contributed to action selection [7]; for a detailed review of the neurobiological plausibility of the tagging mechanism and its relation to the ''synaptic tagging and capture" theory [62,63], see also [8]. In the final layer, CT-AuGMEnT includes an action-selection system that implements a winner-take-all mechanism in continuous time, based on [15]. This action-selection system is a simplified neural architecture modelled after the basal ganglia; several studies have suggested a role for the basal ganglia in action selection and in reward-based learning [64][65][66]15]. Here, the dynamics of the action-selection system, expressed by the action time-constant, help stabilise action execution by avoiding rapid switches between actions. The time-scale of the action dynamics should depend on the environment of the agent, like the speed of muscle recruitment, and decouples these dynamics from the time-step in the network. The action-selection system is endowed with a built-in exploration mechanism that is linked to the action dynamics. Exploration overrides the presently selected action by providing an additional input to the explorative action, related to what has been reported in humans [51]; the exploration duration depends on the action time-constant, allowing more exploration for longer-duration actions. This in turn helps the learning algorithm by ensuring that, during the weight updates, the correct amount of credit is assigned to the responsible weights in the neural network.
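A leaky winner-take-all over the Q-values captures the qualitative behaviour described here: slower time-constants resist rapid action switches, and exploration can be injected as an extra input. This is an illustrative stand-in for the basal-ganglia model of [15], not that model itself, and the exploration bonus is a simplified assumption:

```python
import numpy as np

# Sketch of continuous-time action selection: activations relax toward
# the Q-values with time constant tau_q; the most active unit wins.
# `explore_bonus` maps a time-step to an action receiving extra input,
# a simplified version of the exploration mechanism in the text.
def select_actions(q_trace, tau_q=0.5, dt=0.1, explore_bonus=None):
    """q_trace: (T, n_actions) Q-values over time; returns winner per step."""
    a = np.zeros(q_trace.shape[1])
    winners = []
    for t, q in enumerate(q_trace):
        drive = q.copy()
        if explore_bonus is not None and t in explore_bonus:
            drive[explore_bonus[t]] += 1.0  # extra input to explorative action
        a += (dt / tau_q) * (drive - a)     # leaky integration toward drive
        winners.append(int(np.argmax(a)))
    return winners
```

With tau_q = dt the activations track the Q-values instantly (AuGMEnT-like switching), while larger tau_q makes the currently selected action persist for several steps after the Q-values change, which is the stabilising effect discussed above.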
CT-AuGMEnT correctly learns the various cognitive tasks we presented as the time resolution of the simulation increases. We compared CT-AuGMEnT to AuGMEnT on three working-memory tasks that probe various aspects of task difficulty and decision making. Our results demonstrate that CT-AuGMEnT reaches the convergence criterion for every time-step duration we tested, whereas AuGMEnT usually did not reach convergence; CT-AuGMEnT needed a constant or only moderately increasing number of trials to reach convergence as the time resolution increased.
We further demonstrated the ability of CT-AuGMEnT to train networks to learn to make a decision. The CT-AuGMEnT architecture can be seen as a simplified version of the principal structures involved in the decision-making process in the brain: cortex and basal ganglia. Since we use the basal ganglia model of [15], this results in a (slightly) sub-optimal decision-making model compared to the model for the optimal multi-hypothesis sequential probability ratio test (MSPRT) derived in [42]; we found empirically that, in the tasks we studied, using this MSPRT model led to slightly but insignificantly worse performance. We speculate this may be related to the model of action dynamics we use, or, alternatively, that the tasks we study are simply not very sensitive to the difference.
Importantly, CT-AuGMEnT is a plausible explanation of how working memory units learn what to accumulate during reinforcement learning, which is necessary to learn decision-making tasks. Moreover, unlike the approach in [65], CT-AuGMEnT learns in continuous time, which is a necessary feature when modelling time-dependent variables such as RTs, and enables CT-AuGMEnT to model RT patterns in decision-making problems with multiple alternatives. The dynamics of the action-selection system also allow for the prediction of RTs without the need to set an arbitrary threshold, as in the standard race models typically used to study this behaviour [67,65,42].
Our results show a good match between the accuracy and reaction times of the monkeys and the network's performance. In this context, the working memory units learn to act as integrators of the perceptual evidence. Importantly, we did not modify the structure of the network, which was able to learn what to accumulate using only the reward signal given at the end of the trial, highlighting the efficiency of the credit assignment process.
We studied the same tasks when a separate accessory network was used to carry the feedback signals [17,30]. We demonstrated that CT-AuGMEnT correctly learns all tasks when the feedback accessory network is used, even when the forward and feedback networks were randomly initialised to different values. Using a fixed random network, as in [30], results in a drop in performance for hard tasks [68]. Our formulation requires the same update in corresponding forward and backward connections, which can be implemented with biological networks [68,10]. We introduced transmission delays between the layers of the two networks to understand the limits of feedback during continuous-time RL problems in biological settings, where neural transmission and response dynamics introduce such delays. Our results show that CT-AuGMEnT is still able to learn the tasks when small delays are introduced (from 90 to 270 ms total delay, compatible with biological delays [69]); for larger delays, the mismatch between the feedforward and feedback signals affects the network's performance.
The present study focused on the learning process, and we did not specifically model the neuronal interactions responsible for maintaining a scalar value in working memory, which has been addressed in previous work [70]. One of these models, by [58], designed a mechanism that allows the same network to store the memory for a sensory stimulus during a delay and to commit to a decision at a later point in time. The present study goes beyond these previous findings by demonstrating that CT-AuGMEnT can discover a similar mechanism during trial-and-error learning. Still, our model does not fully capture the dynamics of the working memory cells, as done, for example, in the attractor dynamics model proposed by [58]. In their work, working memory and the decision are represented by one single variable, while here they are represented by two different layers of neurons. This implies that, with respect to the biological cells, our working memory units do not reproduce the commitment to the decision, although the decision indirectly affects their value through the learning rule. It would be interesting for further studies to combine the attractor dynamics model with the CT-AuGMEnT learning rule to fully explain this behaviour. Finally, in the present study, working memory units are not explicitly controlled and they have to be reset at the end of each episode. Other works have addressed this problem [24,25], and a similar mechanism can be implemented in CT-AuGMEnT as suggested in [55]. Our model only implicitly deals with perceptual uncertainty, unlike explicit approaches like Bayesian networks [45]. An extension of our model toward Bayesian inference could possibly be implemented using dropout sampling [71,72] or related sampling methods [73].

Table 3. Summary of convergence results when the transmission delay is varied, measured as the number of trials required for learning the task. Delay between layers is measured in units of time-steps of dt = 0.01 (10 ms).
The model as presented is to the best of our knowledge the first example of a plausible end-to-end neural network model for learning decision making tasks using SARSA. Such a formulation makes it possible to consider implementations based on spiking neurons and compare such spike-based models directly to measurements of the time-course of neural activity.
CRediT authorship contribution statement

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
where $s$ is the state at time $t$, and $a$ is the selected action. Here we are interested in the time derivative of the Q-values; thus, since both $s$ and $a$ depend on time, we make the variable change $x(t) = [s(t), a(t)]$. According to the principle of optimality, the integral in Eq. (3) can be divided into the two parts $[t, t+dt]$ and $[t+dt, \infty)$ as:

$$Q^*(x(t)) = \int_t^{t+dt} \gamma^{u-t}\, r(x(u))\, du + \gamma^{dt}\, Q^*(x(t+dt)). \quad (29)$$

For a small $dt$, the first term is approximated as:

$$\int_t^{t+dt} \gamma^{u-t}\, r(x(u))\, du \approx r(x(t))\, dt,$$

while, by expanding through Taylor and applying the chain rule, the second term is:

$$\gamma^{dt}\, Q^*(x(t+dt)) \approx (1 + dt \ln\gamma)\left(Q^*(x(t)) + \frac{dQ^*(x(t))}{dt}\, dt\right).$$

By substituting them into (29) and collecting $Q^*$ on the left-hand side, we have an optimality condition for $[t, t+dt]$ as:

$$-\left(\ln\gamma\; Q^*(x(t)) + \frac{dQ^*(x(t))}{dt}\right) dt = r(x(t))\, dt + O(dt^2).$$

By dividing both sides by $dt$, taking $dt$ to zero, and reapplying the variable change, we have the condition for the optimal action-value function as:

$$\frac{dQ^*(s(t), a(t))}{dt} = -\ln\gamma\; Q^*(s(t), a(t)) - r(s(t), a(t)).$$

We can show that the continuous-time formulation of AuGMEnT outlined above reduces the reward-prediction error (RPE) as the original AuGMEnT formulation does. The objective function is defined as:

$$E(t) = \tfrac{1}{2}\,\delta(t)^2,$$

where $\delta(t)$ is the RPE. Given (18) and (7), the gradient of the objective function with respect to the weights $w^R_{ja}$ becomes:

$$\Delta w^R_{ja} = -\beta\, \frac{\partial E(t)}{\partial w^R_{ja}}, \quad (35)$$

where $\beta$ is the learning rate.
Since the boundary condition for the Q-function, defined in (3), is given at $t \to \infty$, it is more appropriate to update the past estimates without affecting the future estimates, as in [33]. Thus, recalling (8) and discretising (35), a reduction of the gradient is guaranteed if the update of each synapse is proportional to the RPE times its tag, $\Delta w^R_{ja}(t) = \beta\, \delta(t)\, \mathrm{Tag}_{ja}(t)$. Note that in the latter equation the update of the synapses has to be consistent with the neuron activity at the previous $dt$, which is stored in the Tags (see (20) and (21)); thus, the Tags have to be updated after the weights. Gradient descent for the weights $v^R_{ij}$ is computed similarly, where we assume for simplicity that the strength of the feedback from the motor layer back to the association layer, $w^R_{aj}$, is equal to $w^R_{ja}$ and, analogously, $w^M_{am} = w^M_{ma}$.
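The optimality condition above has a simple fixed point for a constant reward: the discounted return $\int_0^\infty \gamma^u r\, du = -r/\ln\gamma$. A quick numerical check of this relation (an illustrative verification, not part of the original derivation):

```python
import math

# Riemann-sum check that the continuous discounted return for constant
# reward r equals -r / ln(gamma), the fixed point of
# dQ/dt = -ln(gamma) * Q - r.
def discounted_return(r, gamma, dt=1e-3, horizon=200.0):
    """Left Riemann sum of the continuous-time discounted return."""
    q, t = 0.0, 0.0
    while t < horizon:
        q += (gamma ** t) * r * dt
        t += dt
    return q

r, gamma = 0.5, 0.9
analytic = -r / math.log(gamma)  # closed-form fixed point
```

The numerical sum and the closed form agree to the order of the step size, which is the same small-dt limit used in the derivation.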

Appendix C. Additional tasks
The Motion-or-Colour (MoC) task has been used to train monkeys to study decision-making under two different contexts [57]. Here, the agent has to attend to one of two features of the same random-dot stimulus, either the colour or the motion direction, based on the context indicated by the shape of the fixation mark. This task combines the continuous collection of evidence at various degrees of motion coherence (as in the two- and four-choice task) with a non-linear mapping between inputs and outputs (as in the SaS task). The task phases are shown in Fig. A3.1. The fixation mark, indicating the context, is shown for 300 ms; then the two targets are presented for 350 ms. While the agent maintains fixation on the fixation mark (for which it receives r_fix = 0.2), the stimulus is presented for 750 ms. The dots have a colour coherence in [−50, −16, −6, 6, 16, 50] and a motion coherence in [−50, −16, −6, 6, 16, 50], where a colour value of −50 denotes a clearly distinguishable red and 50 a clearly distinguishable green, with values closer to zero having less coherence. Similarly, −50 and 50 denote a strong motion signal to the left and right, respectively. After the stimulus presentation, a delay phase of 1 s follows. The disappearance of the fixation mark indicates the ''Go" signal, requiring the agent to make one of two responses. In this task, the network uses 10 input units: 2 for the fixation marks, 2 indicating the colour stimulus, 2 for the motion stimulus, and 4 for the targets. The network also contains 5 regular units, 5 memory units and 3 output units. We pre-trained the model on the colour and motion tasks separately, with full coherence [−50, 50]. The pre-training stopped when the agent performed correctly on 90% of the last 50 trials for each condition. The full task is stopped when the agent reaches 85% accuracy over the last 50 trials in the conditions with a coherence of [−50, −16, 16, 50] (see Table A3.1).
The results obtained for the motion-or-colour task also reproduce many aspects of the monkeys' data (see Fig. A3.2). After learning, the agent correctly discriminates between the two features (colour or motion) depending on the current context. Fig. A3.2 shows that the non-relevant feature does not affect the performance, whereas the coherence of the relevant feature determines the agent's accuracy.
The T-Maze (TM) task is a working memory RL task adapted from [12,55], where an agent has to reach a goal position at the end of a corridor, and the location of the goal depends on the location of the road-sign observed at the start of the task (Fig. A3.3a). Hence, the agent has to learn to keep information in memory for multiple time-steps; the difficulty of the task can be adjusted by changing the corridor length.
The agent's position is defined by a two-dimensional coordinate system [x, y]. The agent starts from position [0, 0] and updates its position as AgentPosition += ds · [Up − Down, Right − Left], where the position is changed by a step size ds proportional to dt, with ds = dt (e.g. with dt = 0.1 it takes 10 steps to move 1 cell). To ensure a consistent comparison between the event-based version of AuGMEnT and CT-AuGMEnT, the task is adapted to be identical for both algorithms, and the length of the corridor was fixed at N = 10. Walls are hit when the x position is ≥ 1 or ≤ −1. The agent has 3 sensory inputs, where 1 represents a wall and 0 an empty space: in the corridor it thus sees [1, 0, 1]. For the first second, the agent observes [2, 0, 1] or [1, 0, 2], where a 2 denotes the road sign. An attempted move through the wall returns a negative reward r_w; to avoid excessive collection of negative rewards when dt decreases, movement into the wall returns one r_w per second, i.e. this punishment is proportional to the time spent moving into the wall. At the T-junction, the agent is rewarded with r_g = 4 if it moves in the direction indicated by the road sign. The agent's x position determines its choice: as soon as it crosses +1 or −1 its decision is evaluated. We imposed a time-restriction condition proportional to the task difficulty N, namely 1.5N + 2 time-steps; if the network did not reach the correct corridor within this time, the trial was aborted and no reward was obtained (see Table A3.2 for details).

(Fig. A3.1 caption) Motion-or-colour task. The agent has to discriminate the colour or the motion of the presented dots, depending on the context cue (a cross for motion and a hexagon for colour), by making an eye movement to one of the two targets (blue or red diamond). In the inset: the dots have a motion direction and a colour.
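The continuous-time T-Maze mechanics described above can be sketched as a small environment class. The step size ds = dt, the wall threshold, the per-second wall punishment, and the time restriction follow the text; the class and method names, the mapping of [Up − Down, Right − Left] onto the [y, x] axes, and the observation bookkeeping are our own assumptions:

```python
import numpy as np

class ContinuousTMaze:
    """Illustrative sketch of the continuous-time T-Maze, not the paper's code."""

    def __init__(self, N=10, dt=0.1, r_w=-0.1, r_g=4.0):
        self.N, self.dt, self.ds = N, dt, dt            # step size ds = dt
        self.r_w, self.r_g = r_w, r_g
        self.max_steps = int((1.5 * N + 2) / dt)        # time restriction ∝ N
        self.reset()

    def reset(self):
        self.pos = np.zeros(2)                          # [x, y], start at [0, 0]
        self.t, self.steps = 0.0, 0
        self.sign_left = np.random.rand() < 0.5         # side of the road sign
        return self.observe()

    def observe(self):
        obs = np.array([1.0, 0.0, 1.0])                 # corridor: wall, free, wall
        if self.t < 1.0:                                # road sign visible for 1 s
            obs[0 if self.sign_left else 2] = 2.0
        return obs

    def step(self, up, down, left, right):
        self.pos += self.ds * np.array([right - left, up - down])
        self.t += self.dt
        self.steps += 1
        x, y = self.pos
        if y >= self.N and abs(x) >= 1.0:               # decision at the junction
            correct = (x <= -1.0) == self.sign_left
            return self.observe(), self.r_g if correct else 0.0, True
        if abs(x) >= 1.0:                               # wall hit in the corridor
            self.pos[0] = np.clip(x, -1 + self.ds, 1 - self.ds)
            return self.observe(), self.r_w * self.dt, False  # r_w per second
        if self.steps >= self.max_steps:                # time-out: trial aborted
            return self.observe(), 0.0, True
        return self.observe(), 0.0, False
```

Because the wall punishment is r_w · dt per step, halving dt doubles the number of wall-contact steps but leaves the total punishment per second of contact unchanged, matching the dt-invariance argument in the text.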
For the T-Maze task, the number of trials needed to converge for both AuGMEnT and CT-AuGMEnT is plotted in Fig. A3.3b. The number of trials needed to reach convergence rapidly increases for event-based AuGMEnT with increasing time resolution, and AuGMEnT quickly fails to converge at all (see Table A3.3). This illustrates the problem with learning action-values already noted by [35]: as the time resolution increases, the effect of a single infinitesimal action on the total reinforcement becomes undetectable. However, CT-AuGMEnT successfully learns the task for every dt and for two different values of the action time-constant τ_q. When reducing dt, for CT-AuGMEnT the number of trials needed to reach convergence remains constant and convergence remains near 100%. Fig. A3.3c plots the learning curves for the T-Maze task for different action time-constants: the learning curves are highly similar, with a slight advantage for the longer action time-constant. Effectively, these results show that CT-AuGMEnT is independent of the size of dt.
We also trained CT-AuGMEnT with a strict softmax action selection policy [51]: we observe that CT-AuGMEnT then has a lower convergence rate and needs more trials, on average, to reach the convergence criteria (see also Appendix D). Since the network needs to learn proper action-selection from randomly initialized weights, a more conservative exploration strategy like Max-Boltzmann seems beneficial. To compare our results to a non-biologically-inspired algorithm, we trained an LSTM-based network that uses Advantage Learning [12] in the continuous-time RL setting. We find that such a CT-LSTM does not converge.

Validation experiments on TM. Here we give the results of additional experiments on the T-Maze task, to investigate the effect of our action selection policy and to compare CT-AuGMEnT to continuous-time LSTM. To examine the effect of the action selection policy, we replaced the Max-Boltzmann selection rule by a softmax rule. We trained 100 networks with Eq. (13) changed to: with the temperature parameter Temp set to 5·10^-3. We found, however, that this softmax rule prevents the algorithm from converging. Indeed, at the beginning of the task, the network needs to learn the input representation, and a large amount of exploration, due to very similar q-values, is counterproductive. Therefore, for the first 150 trials we used the Max-Boltzmann action selection policy and then switched to softmax. All other settings remained unchanged. The results are shown in Table A3.4, left column. The softmax CT-AuGMEnT exhibits a lower convergence rate and requires more trials, on average, than CT-AuGMEnT to reach the convergence criteria. The second set of experiments uses LSTM with Advantage Learning as reported in [12]. Since Advantage Learning can approximate continuous time, these experiments are an effective validation of our algorithm. Following [12], we used 12 standard units and 3 LSTM units, a learning rate α = 0.0002, γ = 0.98, λ = 0.8, κ = 0.1, and we trained 10 networks for 500k trials. Moreover, we scaled the Eligibility Traces with an approach similar to the one used for our Tags (see Eq. (19)); not scaling the Traces yielded similar negative results. As shown in Table A3.4, despite the large number of trials, CT-LSTM does not converge for small dt, likely due to the lack of extended action-duration in the learning algorithm.
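The difference between the two selection rules compared above can be sketched directly: Max-Boltzmann is greedy most of the time and falls back to softmax exploration only occasionally, whereas pure softmax always samples from the Boltzmann distribution. The ε value and function names below are our own illustrative choices; only the temperature 5·10^-3 comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q, temp=5e-3):
    """Pure softmax (Boltzmann) selection over Q-values."""
    z = (q - q.max()) / temp               # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q), p=p))

def max_boltzmann_policy(q, epsilon=0.025, temp=5e-3):
    """Max-Boltzmann: greedy with prob. 1 - epsilon, else softmax exploration.
    epsilon = 0.025 is a placeholder, not the paper's setting."""
    if rng.random() < epsilon:
        return softmax_policy(q, temp)
    return int(np.argmax(q))
```

With randomly initialized weights the q-values are nearly equal, so pure softmax explores almost uniformly on every step, which is the counterproductive early-training behaviour described above; Max-Boltzmann instead commits to the current greedy action most of the time.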

Appendix D
To visualize how the accessory network develops during training, we show the activity of the neurons in the association layer (see Eq. (26)) for the Saccade-Anti-Saccade task, similar to [74]. Fig. A4.1 shows the sum of the feedback activity for the units y^S_m(t) that carry the feedback from the selected action in the Q-layer to the memory units. This activity is collected at the end of every trial, and shown for one network. We see that the feedback activity increases during training and then stabilizes. The graph demonstrates that learning changes the attentional feedback during training. Similarly, Fig. A4.2 shows that the difference between incorrect responses and the correct choice also grows during training, as is to be expected since the average feedforward response for correct choices also increases relative to incorrect choices. The difference is computed between the actual feedback (incorrect) and what would have been the correct one (left or right) in the pro-saccade (top) and anti-saccade (bottom) conditions. As expected, the number of incorrect responses decreases during training. Note that an incorrect response could also be a fixation, or even the correct action selected at an incorrect time: too early, breaking the fixation, or too late, when the time-out condition is applied. In this case, the difference is approximately zero. Overall, these figures demonstrate that the network learns the attentional feedback to use during the task (see Fig. A5.1).

Appendix E. Extended analysis on the DMC and MDT tasks
This appendix provides additional insights into the DMC and MDT tasks.