Learning precise spatiotemporal sequences via biophysically realistic circuits with modular structure

The ability to express and learn temporal sequences is an essential part of learning and memory. Learned temporal sequences are expressed in multiple brain regions, and as such there may be a common design in the circuits that mediate this expression. This work proposes a substrate for such representations, via a biophysically realistic network model that can robustly learn and recall discrete sequences of variable order and duration. The model consists of a network of spiking leaky-integrate-and-fire model neurons placed in a modular architecture designed to resemble cortical microcolumns. Learning is performed via a learning rule with "eligibility traces", which hold a history of synaptic activity before being converted into changes in synaptic strength upon neuromodulator activation. Before training, the network responds to incoming stimuli but contains no memory of any particular sequence. After training, presentation of only the first element in a sequence is sufficient for the network to recall an entire learned representation of that sequence. An extended version of the model also demonstrates the ability to successfully learn and recall non-Markovian sequences. This model provides a possible framework for biologically realistic sequence learning and memory, and is in agreement with recent experimental results, which have shown sequence-dependent plasticity in sensory cortex.

Introduction

In a chain structure, there is nothing encoding start times, stop times, or durations of the individual elements of the sequence. In general, such chain structures can learn and encode the order of sequence elements, but not other temporal properties of the sequence. Some models have attempted to address these issues via ad-hoc solutions such as variable adaptation time constants 15 , but these typically require the network to have a priori information about the sequence it will be representing.
Unlike explicit feed forward chain models, RNNs learn sequences by leveraging rich, dynamical representations to approximate target outputs. However, common learning rules for these models, such as backpropagation through time (BPTT) or FORCE are biologically unrealistic, as they require some combination of non-local information, precise and symmetrical feedback structures, and/or explicit feedback about the targets at every time point 21 . Interestingly, recent work has shown that networks can under some conditions learn inputs via random feedback connections rather than backpropagation 22 , but this random feedback is less effective than BPTT for learning sequences with long timescales 23 .
Our model takes elements from both chain and recurrent models in order to establish a new, hybrid framework for sequence learning which avoids the pitfalls of either method when used in isolation. The model's modular structure enables it to learn both the order and duration of sequence elements, and to do this with biophysically plausible learning rules.

Learning
In order to learn a sequence, a network needs to learn both the order of elements as well as the duration of each element. In a traditional "chain-like" network with Hebbian learning, order of presented stimuli can be readily encoded by learning directional feed forward connections between populations. However, this simple architecture proves insufficient for representing the duration of presented stimuli and their specific start and end times, as intrinsic time constants determine the speed at which the signal travels through the network 14,15 .
This issue is illustrated in Figure 1. In this schematic model, each module responds to a specific external stimulus. An external sequence activates these different modules in a given order and with externally determined durations for each element (Fig. 1a). Learning can change the feedforward synaptic efficacies between modules to reflect the order of the presented stimuli. Upon recall, triggered by activating the first stimulus, each of the encoded stimuli is activated in the correct order. However, the duration of activation within the module, as well as the timing of the activation of the subsequent module, are determined by the intrinsic time constants of the network (Fig. 1b), causing a temporal compression of the recalled sequence.

Figure 1. Sequence representation in networks. a) A network composed of different populations of cells; each population is activated by a specific stimulus, and there are plastic connections between and within these populations. Initially these connections are random and weak. Upon presentation of a sequence of stimuli (filled circles, left), the populations will become activated for the duration and in the order in which they are stimulated (right). b) After many presentations of a particular sequence, successful asymmetric Hebbian learning encodes the order of the stimuli into the synaptic weights of the network. After training, upon presentation of the first element of the sequence (filled circle, left), the network can recall (right) the order of presentation, but the timing and duration of each element is lost. In a generic network such as this, the timing of recall is determined by intrinsic time constants of the system and not the duration in the sequence that was presented.
While certain time constants, asymmetrical inhibition, or adaptation with slow time constants could be manually placed in the network to facilitate recall of the particular sequence in Figure 1a, there does not yet exist a biophysically realistic, robust, and general solution to this problem.
In this work we present a network of spiking neurons that can learn both the order and duration of sequence elements. This model uses biophysically realistic learning rules, in contrast to most RNN models [17][18][19][20] . As in the "chain-like" models, the order is learned by modifying the feed-forward synaptic efficacies between modules. Unlike those models, the duration of each element is learned via modification of the recurrent connections within each module [11][12][13] . However, these two components are still not sufficient in order to avoid sequence compression during recall. In order to solve this problem, we assume additional structure within each module and in the allowed connections between modules. This additional structure allows us to avoid compression during recall while using relatively simple biophysically realistic learning rules. The cellular response types generated by this network are consistent with experimental observation 6,7 , as described below.
Our network is composed of different modules that are selectively activated via feed-forward connections by different external stimuli. Within each module there are two populations of excitatory cells, as well as inhibitory cells (Fig. 2a). The excitatory cells in both populations are identical in their intrinsic properties but differ in their learned and fixed connections with other cells within the module and in different modules. We name these two excitatory populations "Timers" and "Messengers" - the reason for these terms will become clear below as we describe their roles within the network. The Timer cells have strong recurrent connections with other Timer cells within the module, and they also excite the inhibitory cells (Inh) and the Messenger cells. Inhibitory cells in the module decay more quickly than their Timer counterparts, thanks to shorter time constants for synaptic activation (80ms for excitatory, 10ms for inhibitory) and small Timer-to-Inhibitory weights (there are, in general, a number of degenerate sets of parameters which can produce quickly decaying inhibitory cells). The Messenger cells receive excitatory input from Timer cells and inhibitory input from inhibitory cells within the module, and they also learn to project feed-forward connections to Timer cells in other modules.
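The late onset of the Messenger response implied by this fixed circuitry can be illustrated with a toy rate model (a sketch under assumed parameters, not the spiking implementation used in the paper): the Timer plateau reaches the Messengers through a slow excitatory synapse and, via the inhibitory cells, a fast inhibitory one, so a window of net excitation opens only once the plateau ends.

```python
import numpy as np

def simulate_module(duration=500, dt=1.0, tau_e=80.0, tau_i=10.0, t_max=800):
    """Sketch of why Messengers fire at the *end* of a Timer response.

    The Timer population is idealized as a plateau of activity lasting
    `duration` ms (standing in for learned recurrent persistence).  The
    Messengers see that activity through a slow excitatory synapse
    (tau_e = 80 ms) and, via the inhibitory cells, a fast inhibitory
    synapse (tau_i = 10 ms).  While the plateau lasts, the two inputs
    cancel; when it ends, inhibition decays first and a brief excess of
    excitation drives the Messengers.  The time constants are those
    quoted in the text; everything else here is illustrative.
    """
    steps = int(t_max / dt)
    timer = np.array([1.0 if t * dt < duration else 0.0 for t in range(steps)])
    exc = np.zeros(steps)   # slow synaptic trace of Timer activity
    inh = np.zeros(steps)   # fast synaptic trace (via inhibitory cells)
    for t in range(1, steps):
        exc[t] = exc[t - 1] + dt * (timer[t - 1] - exc[t - 1]) / tau_e
        inh[t] = inh[t - 1] + dt * (timer[t - 1] - inh[t - 1]) / tau_i
    messenger = np.clip(exc - inh, 0.0, None)  # rectified difference
    return timer, messenger
```

Running this, Messenger activity is zero throughout the 500 ms plateau and peaks a few tens of milliseconds after stimulus offset, consistent with the description of Messenger cells activating towards the end of the stimulus duration.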
In this paper we use a previously described reinforcement learning rule based on two competing, Hebbian-activated eligibility traces 24,25 : one for long-term potentiation (LTP) and one for long-term depression (LTD). In addition, we have assumed that on every transition between stimuli a 'neuromodulator' is released globally, and this serves as a reinforcement signal (see Methods). We have used this rule because it can solve the temporal credit assignment problem, allowing the network to associate events distal in time, and because it reaches fixed points in both recurrent and feedforward learning tasks 24 . The assumption of a temporally precise but spatially widespread neuromodulatory signal might seem at odds with common notions of temporally broad neuromodulator release, but it is consistent with recent recordings in several neuromodulatory systems 26,27 .
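A minimal single-synapse sketch of such a two-trace rule, with placeholder time constants and saturation levels (not the paper's Supplementary Table 1 values), shows how the neuromodulator converts the competing traces into a weight change whose sign depends on the delay between Hebbian activity and reinforcement:

```python
import numpy as np

def run_ttl(hebb, reward, dt=1.0,
            tau_rise=(50.0, 25.0), tau_fall=(600.0, 400.0),
            sat=(1.0, 0.8), eta=0.1):
    """Sketch of two-trace learning (TTL) at one synapse.

    `hebb` is a binary array marking coincident pre/post activity and
    `reward` marks neuromodulator release.  Each of the LTP and LTD
    traces rises toward its saturation level while Hebbian activity is
    present and decays otherwise; on neuromodulator release the weight
    changes by eta * (LTP - LTD) and both traces are consumed.  All
    constants are illustrative placeholders.
    """
    T = np.zeros(2)          # [LTP trace, LTD trace]
    w, history = 0.0, []
    for h, r in zip(hebb, reward):
        for a in range(2):
            if h:   # Hebbian activity: saturating rise
                T[a] += dt * (sat[a] - T[a]) / tau_rise[a]
            else:   # no activity: slow decay
                T[a] -= dt * T[a] / tau_fall[a]
        if r:       # neuromodulator converts traces into a weight change
            w += eta * (T[0] - T[1])
            T[:] = 0.0       # traces are consumed
        history.append(w)
    return w, np.array(history)
```

With these placeholder parameters, the fast-saturating LTD trace dominates if reinforcement arrives very early during Hebbian activity, while the slower-decaying LTP trace dominates if reinforcement arrives after a delay, giving the delay-dependent sign needed to associate temporally distal events.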
Using this rule and the described architecture within a single module, we find ( Figure 2a) that the network naturally generates cells that learn to be active for the duration of the stimulus (Timers), and other cells that activate towards the end of the stimulus duration (Messengers). The Timer cells learn the duration of the stimulus via changes in the excitatory synaptic efficacies within the Timer population. The Messenger cells evolve their specific temporal profile due to the temporal profiles of their fixed excitatory and inhibitory inputs from the Timer cells and the inhibitory cells.
Since the activity of the inhibitory cells decays at a quicker rate than the Timer cells, the Messenger cells arise at the end of the decay phase, when excitation from the Timer cells dominates. These results are consistent with experimental observations in V1 circuits that learn the duration between a stimulus and a reward [6][7][8] , as shown in Figure 2b. These results replicate our previous results 28 which used a learning rule with a single trace 29 .
This modular microcircuit, in which both Timer and Messenger cells emerge from learning, acts as a "Core Neural Architecture" (CNA) 28 , an elemental package of basic temporal neural responses which can then be used within the larger network to create more complicated representations. This CNA functions as the basic microcircuit for our sequence model, and each "element" of an input sequence is represented by one of these CNA circuits. One must note that the individual CNAs are plastic, as their temporal properties are not fixed but adapt to the environment. In a structure akin to the microcircuits found in cortical columns 9 , this model places distinct CNAs sensitive to particular visual stimuli in a columnar structure, as shown in Figure 3a.
When presenting a sequence of visual stimuli, different CNAs in turn become activated in sequence. Timer cells within those CNAs learn the duration of their particular stimuli via recurrent connections, while order is learned via feed forward connections from Messengers in one column to Timers in a subsequently presented column. We will show that this modular architecture can overcome the problem of sequence compression encountered by "chain-like" models.

Learning and Recalling the Order and Variable Duration of Presented Sequences
With the described architecture and learning rule (see Methods), the network is capable of robustly learning sequences of temporal intervals. Figure 3 shows the network learning a sequence of four different elements, of duration 500ms, 1000ms, 700ms, and 1800ms. The different stimuli here are labeled by color (blue, green, red, orange), and there are four corresponding columns in the network (out of ten) which are sensitive to these stimuli. In this example, the inputs are modeled after LGN responses to spot stimuli 30,31 (see Supplementary   Figure 1), but more generally one may consider these stimuli to be oriented gratings, specific natural images in a movie, or non-visual stimuli such as pitches of sounds in a song.
Before learning, presentation of the blue stimulus only produces a transient response in the CNA microcircuit housed in the blue-sensitive column (Fig. 3b). During learning (Fig. 3c,d), a sequence of stimuli is presented, and microcircuits in their respective columns learn to represent the duration of the stimulus which activates them. The different microcircuits also learn to "chain" together in the order in which they were stimulated. After learning, presentation of the blue stimulus triggers a 'recall' of the entire sequence (Fig. 3e): blue for 500ms, red for 1000ms, green for 700ms, and orange for 1800ms. The network is capable of learning sequences of temporal intervals where the individual elements can be anywhere from ~300ms to ~1800ms.

Figure 3. a) During learning, columns are stimulated in the sequence indicated by the colorbars (500, 1000, 700, and 1800 ms for blue, red, green, and orange, respectively). b) Before learning, the stimulation of a particular column only causes that column to be transiently active. c) During the first trial of learning, all columns in the sequence become activated by the stimuli, but have not yet learned to represent duration (through recurrent learning of Timer cells) or order (through feed forward learning of the Messenger cells). d) After many trials, the network learns to match the duration and order of presented stimuli. e) After learning, presenting the first element in the sequence is sufficient for recall of the entire sequence, and recall is appropriately temporally delayed.

While the results shown in this work were obtained using a two-trace learning rule (described below), the network can also be trained with a learning rule based on a single trace 29,32 .

For the network to learn and recall non-Markovian sequences, it must somehow keep track of its history, and use this history to inform transitions during sequence learning and recall. To this end, we include two additional "stages" to the network (Figure 5). The first is a fixed (non-learning) recurrent network, sometimes called a "reservoir" (as in reservoir computing) or a "liquid" (as in a liquid state machine) 34,35 , which receives inputs from the Messenger cells in the main columnar network. Owing to these inputs and to its strong recurrent connectivity, this network possesses a long-term memory of the state of the columnar network. The second additional stage is a high-dimensional, sparse, non-linear network which receives input from the reservoir, serving to project the reservoir states into a space where they are highly separated and non-overlapping.
The result is that a given "pattern" in this sparse network at time t uniquely identifies the history of the main network up to and including time t. Since these patterns are highly non-overlapping, when a given pattern reappears during recall (which occurs if the dynamic history of the network is the same as it was during training), the Timer cells which are thereby stimulated by that pattern will be the same ones which were externally stimulated following that pattern during training.
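The separation step can be illustrated with a random expansion followed by winner-take-all sparsification; the top-k rule and all dimensions below are our assumptions for illustration, not the model's actual sparse net:

```python
import numpy as np

def sparse_code(x, W, k):
    """Project a reservoir state x through random weights W and keep only
    the k most strongly driven units active (a crude stand-in for the
    sparse, non-linear pattern net)."""
    y = W @ x
    pattern = np.zeros_like(y)
    pattern[np.argsort(y)[-k:]] = 1.0
    return pattern

rng = np.random.default_rng(0)
n_res, n_sparse, k = 50, 2000, 20
W = rng.normal(size=(n_sparse, n_res))

# Two correlated reservoir states, e.g. same recent input, different history
a = rng.normal(size=n_res)
b = a + 0.3 * rng.normal(size=n_res)

pa, pb = sparse_code(a, W, k), sparse_code(b, W, k)
input_overlap = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
pattern_overlap = (pa * pb).sum() / k
```

Even for reservoir states that are strongly correlated, the resulting sparse patterns share fewer active units, illustrating the trade-off noted below: sparser codes are more separated but, with fewer active units, also more sensitive to noise.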
We test the ability of this three-stage network to learn long, non-Markovian sequences with repeated elements by presenting a sequence blue-green-blue-red-blue-orange (BGBRBO) during training (Figure 6). Figure 7a shows recall upon stimulation of blue, while Figure 7b shows recall upon stimulation of green. In both cases, the sequence makes the correct transition from red to the appropriate third element. Recall of both sequences is also robust to perturbations in the initial state of the reservoir, as shown in Supplementary Figure 8.
Of note in these two simulations is that the "incorrect" transitions do get slightly activated, as the sparse patterns responsible for these transitions are not completely non-overlapping. There is an interplay here between the robustness and uniqueness of these patterns: the sparser they are, the more non-overlapping they are, but also the more sensitive they are to noise. Other parameters in the model, such as the level of recurrence in the reservoir, or the projection strengths from the reservoir to the sparse net, also affect the robustness and uniqueness of the patterns. A full quantification of the robustness, capacity, basin size, etc. of the patterns is beyond the scope of this work. For the simulations shown, a set of parameters (Supplementary Table 2) was empirically chosen so that a handful of non-Markovian transitions could be reliably encoded.
The addition of the reservoir and sparse network is essential for the ability to encode and replay non-Markovian sequences, but they alone are not sufficient for sequence learning and recall. One must note that the reservoir receives its input from the modular network, not from the environment, and therefore if the modular network does not learn, the inputs to the reservoir would not be generated appropriately. If instead, we supplied input directly to the reservoir, recall would not be possible because input to the reservoir on learning and recall trials would be very different and therefore the reservoir dynamics would completely differ.
Other groups have approached non-Markovian sequences by including additional hierarchy or long synaptic time constants 36,37 , and solutions like RNNs handle such sequences natively since they are continuous systems with a dependence on a relatively long history when compared to the intrinsic time constants. We combine these two methods, using highly recurrent networks in the context of a larger architecture, and still maintaining biophysical realism by using Hebbian learning rules.

Spiking Dynamics and Two-Trace Learning (TTL) Rule
The network is comprised of microcircuits of both excitatory and inhibitory spiking leaky-integrate-and-fire (LIF) neurons placed in a modular architecture akin to that observed in cortical columns.
The following equation describes the membrane dynamics for each model neuron i:

$$\tau_m \frac{dV_i}{dt} = (E_L - V_i) + g_E\, s_{E,i}\,(E_E - V_i) + g_I\, s_{I,i}\,(E_I - V_i) + \sigma\,\eta_i(t)$$

Here, subscripts L, E, and I refer to leak, excitatory, and inhibitory, respectively; g refers to the corresponding conductance, and E to the corresponding reversal potential. V_i and s_i are the membrane potential and synaptic activation, respectively, of neuron i, and σ is a random noise term.
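For concreteness, a forward-Euler update of a conductance-based LIF neuron of this form might look as follows (a sketch; all numerical values are generic placeholders, not the paper's parameters, and the refractory period is omitted):

```python
import numpy as np

def step_lif(V, sE, sI, dt=0.1, tau_m=20.0, E_L=-70.0, E_E=0.0, E_I=-75.0,
             g_E=0.05, g_I=0.1, V_th=-50.0, V_reset=-70.0, sigma=0.5,
             rng=None):
    """One Euler step of a conductance-based LIF membrane equation:
    leak, excitatory, and inhibitory conductances each pull V toward
    their reversal potential, plus additive noise.  Returns the updated
    voltages and a boolean spike mask."""
    rng = np.random.default_rng() if rng is None else rng
    dV = (E_L - V) + g_E * sE * (E_E - V) + g_I * sI * (E_I - V)
    V = V + dt * dV / tau_m + sigma * np.sqrt(dt) * rng.normal(size=np.shape(V))
    spiked = V >= V_th
    V = np.where(spiked, V_reset, V)   # reset after spike
    return V, spiked
```

With sustained excitatory drive the membrane crosses threshold and fires repeatedly; with no synaptic input it fluctuates around the leak reversal potential without spiking.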
Once the membrane potential reaches a threshold V_th, the neuron spikes and enters a refractory period. Spiking activity is low-pass filtered into a firing rate:

$$\tau_u \frac{du_i}{dt} = -u_i + \sum_k \delta(t - t_i^k), \qquad r_i = \xi(u_i)$$

where τ_u is the characteristic time constant, r_i is the resulting firing rate for neuron i, and ξ is a piecewise nonlinear transfer function 38 .

In place of a pair-based spike-timing-dependent rule or a rate-based Hebbian rule, which fail to solve the temporal credit assignment problem, the network learns based on "eligibility traces" for both long-term potentiation (LTP) and long-term depression (LTD) 28,39 . These eligibility traces are synapse-specific markers that are activated via a Hebbian coincidence of activity between the pre- and post-synaptic cells. At a maximally allowed activation level these traces saturate, and in the absence of Hebbian activity they decay. The LTP and LTD traces are distinct in that they activate, decay, and saturate at different rates and levels. These dynamics are described by:

$$\frac{dT_{ij}^a}{dt} = \eta_a H_{ij}\left(T_a^{max} - T_{ij}^a\right) - \frac{T_{ij}^a}{\tau_a}, \qquad a \in \{p, d\}$$

where ⟨H_ij⟩, the mean of the Hebbian activity during saturation, sets the saturation level in cases of prolonged and steady pre- and post-synaptic activation 24 . Equations 8-9, using parameters as in Supplementary Table 1, create traces with a fast, quickly saturating rising phase in the presence of constant activity, and a long, slow falling phase in the absence of activity. The trace dynamics in the rising and falling phases can be approximated by simple exponential forms:

$$T_a^R(t) = \bar{T}_a\left(1 - e^{-t/\tau_a^r}\right), \qquad T_a^F(t) = \bar{T}_a\, e^{-(t - t_{stim})/\tau_a^f}$$

where T_a^R(t) represents the trace dynamics in the rising phase and T_a^F(t) the trace dynamics in the falling phase.
In this rule, synaptic weights are updated upon presentation of a global neuromodulatory signal R(t). In general, this signal can be either a traditional "reward" signal or a "novelty" signal that accompanies the presentation of a new stimulus. This neuromodulatory signal converts eligibility traces into synaptic efficacies, consuming them in the process, via a simple subtraction of the LTD trace from the LTP trace:

$$\frac{dW_{ij}}{dt} = \eta\, R(t)\left(T_{ij}^p - T_{ij}^d\right)$$

Here, W_ij is the synaptic efficacy between pre- and post-synaptic cells j and i, R(t) is the neuromodulatory signal, and η is the learning rate. Following presentation of the neuromodulatory signal, the eligibility traces are "consumed" and reset to zero, and their activation is set into a short refractory period (25ms). Synaptic efficacies reach a steady state when the LTP and LTD traces are equal at the time of neuromodulator release. To solve for the falling-phase fixed point, we first assume that the traces are saturated when the underlying Hebbian activity is above a certain threshold, H_th. We call the time when Hebbian activity crosses below that threshold t_th. Then, combining equations 12 and 16 (substituting t_reward in for t and t_th in for t_stim), we can solve for D = t_reward − t_th at the fixed point:

$$D = \frac{\ln\left(\bar{T}_d / \bar{T}_p\right)}{1/\tau_d^f - 1/\tau_p^f}$$

The objective of learning, then, is to move D from its starting value to the value in Eq. 17, which is determined by the parameters of the network. The parameters chosen for the traces are displayed in Supplementary Table 1. For the recurrent traces, the parameters follow the restrictions derived from analysis in an earlier publication 24 . The corresponding analysis for the feed-forward traces, however, is not necessarily applicable in the context of sequence learning, since multiple neuromodulator signals are active during each learning epoch. As a result, the parameters for the feed-forward traces were derived empirically.
Multiple different sets were found to work, but the set in Supplementary Table 1 was the one used for all simulations included in this paper.
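As a consistency check on the falling-phase fixed point, one can verify numerically that at the delay D = ln(T̄_d/T̄_p)/(1/τ_d^f − 1/τ_p^f) the two exponentially falling traces are equal, so the neuromodulated weight update vanishes. The amplitudes and time constants below are illustrative, not the Supplementary Table 1 values:

```python
import numpy as np

# Illustrative trace levels at the saturation threshold t_th and
# falling-phase time constants (ms); placeholders, not the paper's values.
Tp, Td = 0.8, 1.0
tau_p, tau_d = 1200.0, 800.0

# Delay at which the falling LTP and LTD traces cross
D = np.log(Td / Tp) / (1.0 / tau_d - 1.0 / tau_p)

# At delay D the two traces are equal, so dW ∝ (LTP − LTD) = 0
ltp = Tp * np.exp(-D / tau_p)
ltd = Td * np.exp(-D / tau_d)
```

Because the LTD trace starts higher but decays faster in this example, the traces cross at a positive delay D, and a reward arriving exactly D after the traces begin to fall produces no net weight change.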
The above eligibility trace learning rules are hereafter referred to as "two-trace" learning or TTL.
As described in a previous publication 24 , and demonstrated throughout this work, TTL allows for both feed-forward and recurrent learning, and as such can robustly encode temporally dependent input patterns. TTL is supported by recent experiments which have found evidence for eligibility traces in cortex 25 , striatum 40 , and hippocampus 41,42 . Other eligibility trace rules, such as the one-trace rule demonstrated in earlier work 29,32 , can also replicate the results of this paper. In general, any rule which can associate distal events and rewards, thereby solving the temporal credit assignment problem, would be likely to work with this network model. TTL was chosen for its biological realism, but the novel capabilities of this model (its ability to learn and recall both the duration and order of elements in a sequence) primarily result from the network architecture.

Network Architecture
In a totally unstructured network, the above learning rules would be insufficient for the task of sequence learning and memory. In exchange for the freedom of online, biophysically realistic reinforcement learning, we must presuppose some restrictions (still biophysically realistic) on the macro structure of the network. The structure imposed on the network has two components: a modular columnar structure with restricted connections, and the weight distribution of the non-plastic synaptic efficacies. The weight distribution results in the emergence of distinct Timer and Messenger cell types, and the columnar structure places populations of these cells into stimulus-specific modules.
The network consists of ten different "columns", each containing 100 excitatory Timer cells, 100 inhibitory Timer cells, 100 excitatory Messenger cells, and 100 inhibitory Messenger cells, all assembled in a CNA microcircuit. Each column is tuned to be selective for a particular stimulus from the input layer, such that a sequence of external stimuli (e.g. ABCD or "red green orange blue") will trigger a corresponding sequence of input-selective columns.
To simulate presentation of visual cues, the network is stimulated by an input layer designed to mimic spiking inputs from the lateral geniculate nucleus (LGN). Each stimulus is represented in the input layer by a Poisson spike train with a 50ms pulsed peak, in accordance with the observed cellular responses in LGN 30,31 . In general, this spike train can also include a decaying exponential tail, but short pulses were chosen both for analytical simplicity and to match experimental data 30,31 .
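Such an input layer can be sketched as a set of inhomogeneous Poisson spike trains with a 50 ms pulsed peak; the peak and baseline rates below are assumptions for illustration, not the paper's values:

```python
import numpy as np

def lgn_input(rate_peak=200.0, rate_base=5.0, pulse_ms=50, total_ms=500,
              n_cells=100, dt=1.0, rng=np.random.default_rng(0)):
    """Poisson spike trains with a brief pulsed peak at stimulus onset,
    a crude stand-in for the LGN-like input layer.  Returns a boolean
    (n_cells, time bins) spike array."""
    steps = int(total_ms / dt)
    rate = np.full(steps, rate_base)              # baseline firing (Hz)
    rate[: int(pulse_ms / dt)] = rate_peak        # 50 ms transient peak
    p_spike = rate * dt / 1000.0                  # per-bin spike probability
    return rng.random((n_cells, steps)) < p_spike[None, :]
```

Each column's Timer population would receive such a train only when its preferred stimulus is presented, so a sequence of stimuli activates the corresponding columns in order.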

Discussion
In this work, we demonstrate the ability of a biophysically realistic network to learn and recall sequences, correctly reporting the duration and order of the individual elements. In combining modular, heterogeneous structure with a learning rule based on eligibility traces, the model can accurately learn and recall sequences of up to at least 8 elements, with each element anywhere from ~300ms to ~1800ms in duration. We have also shown a modified architecture that is capable of learning and recalling non-Markovian sequences with multiple history-dependent transitions.
The capabilities, construction, and rules of our model stand in contrast to many contemporary and historical models of sequence learning, such as synfire chains and RNNs. In particular, we are not aware of another model that uses biophysically realistic spiking networks and learning rules, along with experimentally observed cell responses, to robustly learn both the order and duration of presented sequences of stimuli. Previous approaches which have attempted to learn both duration and order of elements 44 have been highly sensitive to changes in parameters, and were not based on experimentally observed cell responses. Other types of hybrid models combine chain structures with additional hierarchy/functionality 14,15 , but either require manually set adaptation time constants that are specific for a set duration to represent the duration of elements, or do not treat the duration of elements as variable at all. Recently, another model has been proposed, based on a periodic and essentially deterministic timing network, which also uses biologically plausible learning rules to learn to represent single non-Markovian sequences 45 .
However, it cannot learn arbitrarily long (and self-terminating) sequences, such as those presented in Figure 6, nor can it learn several different sequences and replay them separately, such as those presented in Figure 7.
The ability of our model to accurately learn and recall sequences is only possible because of the functionally heterogeneous cell types in the CNA microcircuit being given different roles in learning: it is the job of the Timer cells to encode the duration of a particular element, and the job of the Messenger cells (which only fire at the end of the duration) to pass along this information and encode the order in which the elements are presented. Even though this work also provides further validation of Two-Trace Learning, TTL itself is not required, since the model also retains its functionality when learning with an eligibility trace rule with only one trace 29 .
The modified three-stage model is capable of learning non-Markovian sequences because of inclusion of a non-learning recurrent network that stores memory over longer durations than the modular network. However, the model's ability to generate learned sequences depends crucially on the backbone of the modular network -because of this, we can use a non-plastic, highly recurrent reservoir and maintain biophysically realistic learning rules in the network. This modified model also differs in functionality from traditional RNNs, as in our model, the stimulus presented during training (the entire sequence) is different from the stimulus presented after learning in order to trigger recall (usually the first element is sufficient). In general, the number of elements necessary to trigger recall is the same as the number of elements needed to disambiguate the sequence (i.e. "A" would be sufficient to trigger ABCD in a network embedded with learned sequences ABCD and EFGH, but "AB" would be required to trigger ABCD in a network embedded with learned sequences ABCD and AFGH). In a way, our network acts as an autoencoder, both learning a reduced encoding (e.g. "A" instead of "ABCD"), and reconstructing the original external input from this reduced encoding.
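The disambiguation rule described above is easy to state as a toy computation over symbol strings (purely illustrative; the model itself performs this implicitly through its dynamics):

```python
def recall_cue_length(target, learned):
    """Return the number of leading elements needed to uniquely identify
    `target` among the `learned` sequences, mirroring the disambiguation
    argument in the text."""
    others = [s for s in learned if s != target]
    for n in range(1, len(target) + 1):
        prefix = target[:n]
        # the cue suffices once no other learned sequence shares it
        if not any(s[:n] == prefix for s in others):
            return n
    return len(target)
```

For example, "A" suffices to trigger ABCD when the network has learned ABCD and EFGH, but "AB" is needed when it has learned ABCD and AFGH.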
The reservoir and sparse network components of our three-stage model could be thought to arise from a projection from other cortical or subcortical areas. Functionally similar networks (ones that take complex, multi-modal, and dynamic context and repackage it into sparse, separated patterns) have been observed in the dentate gyrus 46,47 and the cerebellum 48,49 . However, these model components could also be thought of as part of the same cortical network, partially segregated in function but not necessarily by location. Pattern separation is likely a common neural computation that might occur in many brain areas, so we make no particular predictions about the locations of these network components.
Our model suggests that the types of single-cell responses in V1 observed in delayed reward tasks (Timers and Messengers) 6-8 will also be present when learning a visual sequence. Furthermore, we predict that distinct populations of these cells are sensitive to, and can learn the durations of, the individual elements of the sequence.

Our model makes several testable predictions. It predicts that:
1. After learning a sequence, Messenger cells will more strongly connect to Timer cells that represent subsequent stimuli than to Timers which represent previous stimuli, or to other Messengers.
2. Sequence learning will increase lateral connection efficacies between Timers within the same module.
3. There will be a population of inhibitory cells within each module that have firing properties similar to Timers but that decay more quickly (Fig 2a).