Stability of working memory in continuous attractor networks under the control of short-term plasticity

Continuous attractor models of working-memory store continuous-valued information in continuous state-spaces, but are sensitive to noise processes that degrade memory retention. Short-term synaptic plasticity of recurrent synapses has previously been shown to affect continuous attractor systems: short-term facilitation can stabilize memory retention, while short-term depression possibly increases continuous attractor volatility. Here, we present a comprehensive description of the combined effect of both short-term facilitation and depression on noise-induced memory degradation in one-dimensional continuous attractor models. Our theoretical description, applicable to rate models as well as spiking networks close to a stationary state, accurately describes the slow dynamics of stored memory positions as a combination of two processes: (i) diffusion due to variability caused by spikes; and (ii) drift due to random connectivity and neuronal heterogeneity. We find that facilitation decreases both diffusion and directed drifts, while short-term depression tends to increase both. Using mutual information, we evaluate the combined impact of short-term facilitation and depression on the ability of networks to retain stable working memory. Finally, our theory predicts the sensitivity of continuous working memory to distractor inputs and provides conditions for stability of memory.


Author summary
The ability to transiently memorize positions in the visual field is crucial for behavior. Models and experiments have shown that such memories can be maintained in networks of cortical neurons with a continuum of possible activity states, that reflects the continuum of positions in the environment. However, the accuracy of positions stored in such networks will degrade over time due to the noisiness of neuronal signaling and imperfections of the biological substrate. Previous work in simplified models has shown that synaptic short-term plasticity could stabilize this degradation by dynamically up-or downregulating the strength of synaptic connections, thereby "pinning down" memorized positions. Here, we present a general theory that accurately predicts the extent of this "pinning with noise-free [38] and, as shown parallel to this study, noisy [45] rate neurons. In simulations of continuous attractors implemented with spiking neurons for a fixed set of parameters, facilitation was reported to cause slow drift [46,47] and a reduced amount of diffusion [47]. However, despite the large number of existing studies, several fundamental questions remain unanswered. What are the quantitative effects of short-term facilitation in more complex neuronal models and across facilitation parameters? How does short-term depression influence the strength of diffusion and drift, and how does it interplay with facilitation? Do phenomena reported in rate networks persist in spiking networks? Finally, can a single theory be used to predict all of the effects observed in simulations?
Here, we present a comprehensive description of the effects of short-term facilitation and depression on noise-induced displacement of one-dimensional continuous attractor models. Extending earlier theories for diffusion [39,40,45] and drift [38], we derive predictions of the amount of diffusion and drift in ring-attractor models of randomly firing neurons with shortterm plasticity, providing, for the first time, a general description of bump displacement in the presence of both short-term facilitation and depression. Our theory is formulated as a rate model with noise, but since the gain-function of the rate model can be chosen to match that of integrate-and-fire models, our theory is also a good approximation for a large class of heterogeneous networks of integrate-and-fire models as long as the network as a whole is close to a stationary state. The theoretical predictions of the noisy rate model are validated against simulations of ring-attractor networks realized with spiking integrate-and-fire neurons. In both theory and simulation, we find that facilitation and depression play antagonistic roles: facilitation tends to decrease both diffusion and drift while depression increases both. We show that these combined effects can still yield reduced diffusion and drift, which increases the retention time of memories. Importantly, since our theory is, to a large degree, independent of the microscopic network configurations, it can be related to experimentally observable quantities. In particular, our theory predicts the sensitivity of networks with short-term plasticity to distractor stimuli.

Results
We investigated, in theory and simulations, the effects of short-term synaptic plasticity (STP) on the dynamics of ring-attractor models consisting of N excitatory neurons with distancedependent and symmetric excitation, and global (uniform) inhibition provided by a population of inhibitory neurons ( Fig 1A). For simplicity, we describe neurons in terms of firing rates, but our theory can be mapped to more complex neurons with spiking dynamics. An excitatory neuron i with 0 � i < N is assigned an angular position y i ¼ 2p N i À p 2 À p; p ½ Þ, where we identify the bounds of the interval to form a ring topology (Fig 1A). The firing rate ϕ i (in units of Hz) for each excitatory neuron i (0 � i < N − 1) is given as a function of the neuronal input: . Excitatory-to-excitatory (E-E) connections (red lines) are distance-dependent, symmetric, and subject to short-term plasticity (facilitation and depression, see Eq (3)). Inhibitory (I) neurons (blue circles) project to all E and I neurons (blue lines) and receive connection from all E neurons (gray lines). Only outgoing connections from shaded neurons are displayed. In simulations with integrate-and-fire neurons, each neuron also receives noisy excitatory spike input generated by independent homogeneous Poisson processes. B1 Example simulation: E neurons fire asynchronously and irregularly at low rates until (dotted line) a subgroup of E neurons is stimulated (external cue), causing them to spike at elevated rates (red dots, input was centered at 0, starting at t = 2s for 1s). During and after (dashed line) the stimulus, a bump state of elevated activity forms and sustains itself after the external cue is turned off. The spatial center of the population activity is estimated from the momentary firing rates (red line, plotted from t = 2.5s onward). Inset: Activity profile in the bump state, centered at 0. where w ij s j (t) describes the total activation of synaptic input from the presynaptic neuron j onto neurons i. The maximal strength w ij of recurrent excitatory-to-excitatory connections is chosen to be local in the angular arrangement of neurons, such that connections are strongest to nearby excitatory neurons (Fig 1A, red lines). The momentary input depends also on the synaptic activation variables s j , to be defined below. Finally, connections to and from inhibitory neurons are assumed to be uniform and global (all-to-all) (Fig 1A, blue lines), thereby providing non-selective inhibitory input J inh to excitatory neurons.
As a model of STP, we assume that excitatory-to-excitatory connections are subject to short-term facilitation and depression, which we implemented using a widely adopted model of short-term synaptic plasticity [49]. The outgoing synaptic activations s j of neuron j are modeled by the following system of ordinary differential equations: The synaptic time scale τ s governs the decay of the synaptic activations. The timescale of recovery τ x is the main parameter of depression. While the recovery from facilitation is controlled by the timescale τ u , the parameter 0 < U � 1 controls the baseline strength of unfacilitated synapses as well as the timescale of their strengthening. For fixed τ u , we consider smaller values of U to lead to a "stronger" effect of facilitation, and take U = 1 as the limit of non-facilitating synapses.
As a reference implementation of this model, we simulated networks of spiking conductance-based leaky-integrate-and-fire (LIF) neurons with (spike-based) short-term plastic synaptic transmission (Fig 1B1, see Spiking network model in Materials and methods for details). For these networks, under the assumption that neurons fire with Poisson statistics and the network is in a stationary state, neuronal firing can be approximated by the input-output relation F of Eq (1) [50,51] (see Firing rate approximation in Materials and methods), which allows us to map the network into the general framework of Eqs (1) and (2). In the stationary state, synaptic depression will lead to a saturation of the synaptic activation variables s j at a constant value as firing rates increase. This nonlinear behavior enables spiking networks to implement bi-stable attractor dynamics with relatively low firing rates [46,52] similar to saturating NMDA synapses [11,47]. Since we found that without depression (for τ x ! 0) the bump state was not stable at low firing rates (in agreement with [52]), we always keep the depression timescale τ x at positive values.
Particular care was taken to ensure that networks display nearly identical bump shapes (similar to Fig 1B1, inset; see also S1 Fig), which required the re-tuning of network parameters (recurrent conductance parameters and the width of distance-dependent connections; see deviation from the mean rate with variance proportional to the rate. The symmetric weighting factors (blue lines show for varying U) are non-zero at the flanks of the firing rate profile. Stronger short-term depression and weaker facilitation increase the magnitude of weighting factors. E Deterministic drift is calculated as a weighted sum (see Eq (7)) of systematic deviations of firing rates from the bump state (frozen noise): a large positive firing rate deviation in the left flank (red line) will cause movement of the center position to the left (red arrow) because the weighting factors ( Optimization of network parameters in Materials and methods) for each combination of the STP parameters above.
Simulations with spiking integrate-and-fire neurons generally show a bi-stability between a non-selective state and a bump state. In the non-selective state, all excitatory neurons emit action potentials asynchronously and irregularly at roughly identical and low firing rates ( Fig  1B1, left of dotted line). The bump state can be evoked by stimulating excitatory neurons localized around a given position by additional external input (Fig 1B1, red dots). After the external cue is turned off, a self-sustained firing rate profile ("bump") emerges (Fig 1B1, right of dashed line, and inset) that persists until the network state is again changed by external input. For example, a short and strong uniform excitatory input to all excitatory neurons causes a transient increase in inhibitory feedback that is strong enough to return the network to the uniform state [11].
During the bump state, fast fluctuations in the firing of single neurons transiently break the perfect symmetry of the firing rate profile and introduce small random displacements along the attractor manifold, which become apparent as a random walk of the center position. If the simulation is repeated for several trials, the bump has the same shape in each trial, but information on the center position is lost in a diffusion-like process. We additionally included varying levels of biologically plausible sources of heterogeneity (frozen noise) in our networks: random connectivity between excitatory neurons (E-E) and heterogeneity of the single neuron properties of the excitatory population [36], realized as a random distribution of leak reversal potentials. Heterogeneities makes the bump drift away from its initial position in a directed manner. For example, the bump position in the randomly connected (p = 0.5) network of Fig  1B1 shows a clear upwards drift towards center positions around 0. Repeated simulations of the same attractor network with bumps initialized at different positions provide a more detailed picture of the combined drift and diffusion dynamics: bump center trajectories systematically are biased towards a few stable fixed points ( Fig 1B2) around which they are distributed for longer simulation times (histogram in Fig 1B2, t = 13.5s). The theory developed in this paper aims at analyzing the above phenomena of drift and diffusion of the bump center.

Theory of diffusion and drift with short-term plasticity
To untangle the observed interplay between diffusion and drift and investigate the effects of short-term plasticity, we derived a theory that reduces the microscopic network dynamics to a simple one-dimensional stochastic differential equation for the bump state. The theory yields analytical expressions for diffusion coefficients and drift fields, that depend on short-term plasticity parameters, the shape of the firing rate profile of the bump, as well as the neuron model chosen to implement the attractor.
First, we assume that the system of Eq (3) together with the network Eqs (1) and (2) has a 1-dimensional manifold of meta-stable states, i.e. the network is a ring-attractor network as described in the introduction. This entails, that the network dynamics permit the existence of a family of solutions that can be described as a self-sustained and symmetric bump of firing rates ϕ 0,i (φ) = F(J 0,i (φ)) with corresponding inputs J 0,i (φ) (for 0 � i < N). Importantly, the center φ of the bump can be located at any arbitrary position is also a solution with input . This solution is illustrated in Fig 1C for a bump centered at φ = 0. Second, we assume that the number N of excitatory neurons is large (N ! 1), such that we can think of the possible positions φ as a continuum. Third, we assume that network heterogeneities are small enough to capture their effect as a linear (first order) perturbation to the stable bump state. Our final assumption is that neuronal firing is noisy, with spike counts distributed as Poisson processes, and that we are able to replace the shot-noise of Poisson spiking by white Gaussian noise with the same mean and autocorrelation, similar to earlier work [39,53]; see Diffusion in Materials and methods, and Discussion. Under these assumptions, we are able to reduce the network dynamics to a one-dimensional Langevin equation, describing the dynamics of the center φ(t) of the firing rate profile (see Analysis of drift and diffusion with STP in Materials and methods): Here, η(t) is white Gaussian noise with zero mean and correlation function hη(t), η(t 0 )i = δ(t − t 0 ). The first term is diffusion characterized by a diffusion strength B, which describes the random displacement of bump center positions due to fluctuations in neuronal firing. For A(φ) = 0 this term causes diffusive displacement of the center φ(t) from its initial position φ(t 0 ), with a mean (over realizations) during an initial phase, increases linearly with time [14,54,55], before saturating due to the circular domain of possible center positions [39]. Our theory shows (see Diffusion in Materials and methods) that the coefficient B can be calculated as a weighted sum over the neuronal firing rates ( Fig 1D) B where dJ 0;i dφ is the change of the input to neuron i under shifts of the center position ( Fig 1C,  orange line), and S is a normalizing constant that tends to increase additionally with the synaptic time constant τ s .
The analytical factors C i express the spatial dependence of the diffusion coefficient on the short-term plasticity parameters through In Brownian motion, the diffusion constant is usually defined as D = B/2. The dependence of the single summands in Eq (5) on short-term plasticity parameters is visualized in Fig 1D, where we see that: a) due to the squared spatial derivative dJ 0;i dφ of the bump shape and the squared factors C i /S, the important contributions to the sum arise primarily from the flanks of the bump; b) for a fixed bump shape, summands increase with stronger short-term depression (larger τ x ) and decrease with stronger short-term facilitation (smaller U, larger τ u ).
The second term in Eq (4) is the drift field A(φ), which describes deterministic drifts due to the inclusion of heterogeneities. For heterogeneity caused by variations in neuronal reversal potentials and random network connectivity, we calculate (see Frozen noise in Materials and methods) systematic deviations Δϕ i (φ) of the single neuronal firing rates from the steady-state bump shape that depend on the current position φ of the bump center. In Drift in Materials and Methods, we show that the drift field is then given by a weighted sum over the firing rate deviations: with weighing factors depending on the spatial derivative of the bump shape dJ 0;i dφ and the parameters of the synaptic dynamics through the same factors C i /S. This is illustrated in Fig 1E: in contrast to Eq (5) summands are now asymmetric with respect to the bump center, since the spatial derivative is not squared.

Analytical considerations
To calculate the diffusion and drift terms of the last section, we assume the number of neurons N to be large enough to treat the center position φ as continuous: this allows us (similar to [39]) to derive projection vectors (see Projection of dynamics onto the attractor manifold in Materials and methods) that yield the dynamics of the center position. However, the actual projection yields sums over the system size N, whose scaling we made explicit (see System size scaling in Materials and methods). For the diffusion strength ffi ffi ffi B p (cf. Eq (5)) we find a scaling as as 1= ffi ffi ffi ffi N p , in agreement with earlier work [11,14,36,39,46]. For drift fields caused by random connectivity, we find a scaling with the connectivity parameter p and the system size N to leading order as 1=ð ffi ffi ffi p p NÞ, whereas drift fields due to heterogeneity of leak potentials (and other heterogeneous single-neuron parameters) will scale as 1= ffi ffi ffi ffi N p , both in accordance with earlier results [16,36,38,46].
In addition to reproducing the previously known scaling with the system size N, our theory exposes the scaling of both drift and diffusion with the parameters τ x , τ u , and U of short-term depression and facilitation via the analytical pre-factors C i /S appearing in Eqs (5) and (7). Our result extends the calculation of the diffusion constant [39] to synaptic dynamics with shortterm plasticity: In the limiting case of no facilitation and depression (U ! 1, τ x ! 0ms), the pre-factor reduces to C i = 1 and the normalization factor simplifies to is the derivative of the firing rate of neuron i at its steady-state input J 0,i .
For static synapses we thereby recover the known result for diffusion [39, Eq. S18], but also add an analogous relation for the drift A static ðφÞ ¼ P Our approach relies on the existence of a stationary bump state (which is stable for large noise-free homogeneous networks), around which we calculate drift and diffusion as perturbations. Following earlier work [11,50,52], we use in our simulations with spiking integrate-and-fire neurons a slow synaptic time constant (τ s = 100ms) as an approximation of recurrent (NMDA mediated) excitation. While our theory captures the effects of changing this time constant τ s in the pre-factors C i /S, we did not check in simulations whether the bump state remains stable and whether our theory remains valid for very short time constants for τ s . Finally, two limiting cases are worth highlighting. First, for strong facilitation (U ! 0) we ing that (i) this limit will leave residual drift and diffusion which (ii) will both be controlled by the time constants for facilitation (τ u ) and synaptic transmission (τ s ), with no dependence upon depression. Second, for vanishing facilitation (U ! 1 and τ u ! 0) we find that the normalization factor S will tend to zero if the depression time constant τ x is increased to a finite value τ x,c . Through the pre-factors C i /S this, in turn, yields exploding diffusion and drift terms (see S8 Fig). While this is a general feature of bump systems with short-term depression, the exact value of the critical time constant τ x,c depends on the firing rates and neural implementation of the bump state (see section 6 in S1 Text): for the spiking network investigated here, we find a critical time constant τ x,c = 223.9ms (see S8 Fig). In networks with both facilitation and depression, the critical τ x,c increases as facilitation becomes stronger (see S8 Fig).

Prediction of continuous attractor dynamics with short-term plasticity
To demonstrate the accuracy of our theory, we chose random connectivity as a first source of frozen variability. Random connectivity was realized in simulations by retaining only a random fraction 0 < p � 1 (connection probability) of excitatory-to-excitatory (EE) connections. The uniform connections from and to inhibitory neurons are taken as all-to-all, since the effects of making these random or sparse would have only indirect effects on the dynamics of the bump center positions. Our theory accurately predicts the drift-fields A(φ) (see Eq (7)) induced by frozen variability in networks with short-term plasticity (Fig 2). Briefly, for each neuron 0 � i < N, we treat each realization of frozen variability as a perturbation Δ i around the perfectly symmetric system and use an expansion to first order of the input-output relation F to calculate the resulting changes in firing rates (see Frozen noise for details): The resulting terms are then used in Eq (7) to predict the magnitude of the drift field A(φ) for any center position φ, which will, importantly, depend on STP parameters. The same approach can be used to predict drift fields induced by heterogeneous single neuron parameters [36] (see next sections) and additive noise on the E-E connection weights [16,38]. We first simulated spiking networks with only short-term depression and without facilitation (Fig 2A, left, same network as in Fig 1B1), for one instantiation of random (p = 0.5) connectivity. Numerical estimates of the drift in spiking simulations (by measuring the displacement of bumps over time as a function of their position, see Spiking simulations in Materials and methods for details) yielded drift-fields in good agreement with the theoretical prediction ( Fig 2B, left). At points where the drift field prediction crosses from positive to negative values (e.g. Fig 2B, left, φ ¼ p 2 ), we expect stable fixed points of the center position dynamics in agreement with simulation results, which show trajectories converging to these points. Similarly, unstable fixed points (negative-to-positive crossings) can be seen to lead to a separation of trajectories (e.g. Fig 2A, left, φ ¼ À p 2 ). In regions where the positional drifts are predicted to lie close to zero (e.g. Fig 2A, left φ = 0) the effects of diffusive dynamics are more pronounced. Finally, numerical integration of the full 1-dimensional Langevin equation Eq (4) with coefficients predicted by Eqs (5)-(7), produces trajectories with dynamics very similar to the full spiking network (Fig 2C, left). When comparing the center positions after 13.5s of delay activity between the full spiking simulation and the simple 1-dimensional Langevin system, we found very similar distributions of final positions (Fig 2D, left, compare to Fig 1B1, histogram). Our theory thus produces an accurate approximation of the dynamics of center positions in networks of spiking neurons with STP, thereby reducing the complex dynamics of the whole network to a simple equation. It should be noted that, in regions with strong drift or steep negative-to-positive crossings, the numerically estimated drift-fields deviate from the theory due to under-sampling of these regions as trajectories move quickly through them, yielding fewer data points. In Short-term plasticity controls drift we additionally show that the theory, as it relies on a linear expansion of the effects of heterogeneities on the neuronal firing rates, tends to generally over-predict drift-fields as heterogeneities become stronger.
Introducing strong short-term facilitation (U = 0.1) reduces the predicted drift fields ( Fig  2B, left, dashed line), which resemble a scaled-down version of the drift-field for the unfacilitated case. We confirmed this theoretical prediction by simulations including facilitation (Fig  2A, right): the resulting drift fields show significant reduction of speeds (Fig 2B, right) while zero crossings remained similar to the unfacilitated network, similar to the results in [38]. Stability of continuous attractor networks under the control of short-term plasticity Theoretical predictions of the drift fields with bump shapes extracted from these simulations again show an accurate prediction of the dynamics (Fig 2B, right). Thus, as before, forward integrating the simple 1-dimensional Langevin-dynamics yields trajectories (Fig 2C, right) highly similar to those of the full spiking network, with closely matching distributions of final positions (Fig 2D, right), indicative of a matching strength of diffusion. In summary, our theory predicts the effects of STP on the joint dynamics of diffusion and drift due to network heterogeneities, which we will show in detail in the next sections.

Short-term plasticity controls diffusion
To isolate the effects of STP on diffusion, we simulated networks without frozen noise for various STP parameters. For each combination of parameters, we simulated 1000 repetitions of 13.5s delay activity (after cue offset) distributed across 20 uniformly spaced initial cue positions (see Fig 3A for an example). From these simulations, the strength of diffusion was estimated by measuring the growth of variance (over repetitions) of the distance of the center position from its initial position as a function of time (see Spiking simulations in Materials and methods for details). For all parameters considered, this growth was well fit by a linear function (e.g. Fig  3A, inset), the slope of which we compared to the theoretical prediction obtained from the diffusion strength B (Eq (5)).
We find that facilitation and depression control the amount of diffusion along the attractor manifold in an antagonistic fashion (Fig 3B and 3C). First, increasing facilitation by lowering the facilitation parameter U from its baseline U = 1 (no facilitation) towards U = 0, while keeping the depression time constant τ x = 150ms fixed, decreases the measured diffusion strength over an order of magnitude (Fig 3B, dots). On the other hand, increasing the facilitation time constant τ u from τ u = 650ms to τ u = 1000ms (Fig 3B, orange and blue dots, respectively) only slightly reduces diffusion. Our theory further predicts that increasing the facilitation time constants above τ u = 1s will not lead to large reductions in the magnitude of diffusion (see S2 Fig). Second, we find that increasing the depression time constant τ x for fixed U, thereby slowing down recovery from depression, leads to an increase of the measured diffusion ( Fig 3C). More precisely, increasing the depression time constant from τ x = 120ms to τ x = 200ms leads only to slight increases in diffusion for strong facilitation (U = 0.1), but to a much larger increase for weak facilitation (U = 0.8).
For a comparison of these simulations with our theory, we used two different approaches. First, we estimated the diffusion strength by using the precise shape of the stable firing rate profile extracted separately for each network with different sets of parameters. This first comparison with simulations confirms that the theory closely describes the dependence of diffusion on short-term plasticity for each parameter set (Fig 3B, crosses). The observed effects could arise directly from changes in STP parameters for a fixed bump shape, or indirectly since STP parameters also influence the shape of the bump. To separate such direct and indirect effects, we used for a second comparison a theory with fixed bump shape, i.e. the bump shape measured in a "reference network" (U = 1, τ x = 150ms) and extrapolated curves by changing only STP parameters in Eq (5). This leads to very similar predictions (Fig 3B, dashed lines) and supports the following conclusions: a) the diffusion to be expected in attractor networks with similar observable quantities (mainly, the bump shape) depends only on the shortterm plasticity parameters; b) the bump shapes in the family of networks we have investigated are sufficiently similar to be approximated by measurement in a single reference network. It should be noted that the theory tends to slightly over-estimate the amount of diffusion, especially for small facilitation U (see Fig 3B and 3C left). This may be because slower bump movement decreases the firing irregularity of flank neurons, which deviates from the Poisson firing  (5)). B,C Diffusion strengths estimated from simulations (dots, error bars show 95% confidence interval, estimated by bootstrapping) compared to theory. Dashed lines show theoretical prediction using firing rates measured from the assumption of our theory (see Discussion). However, given the simplifying assumptions needed to derive the theory, the match to the spiking network is surprisingly accurate.

Short-term plasticity controls drift
Having established that our theory is able to predict the effect of STP on diffusion, as well as drift for a single instantiation of random connectivity, we wondered how different sources of heterogeneity (frozen noise) would influence the drift of the bump. We considered two sources of heterogeneity: First, random connectivity as introduced above, and second, heterogeneity of the leak reversal potential parameters of excitatory neurons: leak reversal potentials of excitatory neurons are given by V L + Δ L , where Δ L is normally distributed with zero mean and standard deviation σ L [36]. The resulting fields can be calculated by calculating the resulting perturbations to the firing rates of neurons by Eq (8) (see Frozen noise in Materials and methods for details).
The theory developed so far allowed us to predict drift-fields for a given realization of frozen noise, controlled by the noise parameters p (for random connectivity) and σ L (for heterogeneous leak reversal-potentials) (see S3 Fig for a comparison of predicted drift fields to those measured in simulations for varying STP parameters and varying strengths of frozen noises). We wondered, whether we could take the level of abstraction of our theory one step further, by predicting the magnitude of drift fields from the frozen noise parameters only, independently of a specific realization. First, the expectation of drift fields under the distributions of the frozen noises vanishes for any given position: hA(φ)i frozen = 0, where the expectation h.i frozen is taken over both noise parameters. We thus turned to the expected squared magnitude of drift fields under the distributions of these parameters (see Squared field magnitude in Materials and methods for the derivation): where s 0,j is the steady-state synaptic activation. Here, we introduced the derivatives of the input-output relation with respect to the noise sources that appear in Eq (8): is the derivative with respect to the steady state synaptic input, and is the derivative with respect to the perturbation in the leak potential. In Squared field magnitude in Materials and Methods, we show that Eq (9) is independent of the center position φ, and can be estimated from simulations as the variance of the drift field across positions, averaged over an ensemble of network instantiations.
We defined the root of the expected squared magnitude of Eq (9) as the expected field magnitude: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi This quantity predicts the magnitude of the deviations of drift-fields from zero that are expected from the parameters that control the frozen noise-in analogy to the standard deviation for random variables, it predicts the standard deviation of the fields. To compare this quantity to simulations, we varied both heterogeneity parameters. First, the connectivity parameter p was varied between 0.25 and 1. Second, for heterogeneities in leak reversal-potentials, we chose values for the standard deviation σ L of leak-reversal potentials between 0mV and 1.5mV, which lead to a similar range of drift magnitudes as those of randomly connected networks. For each combination of heterogeneities and STP parameters (networks had either random connections or heterogeneous leaks) we then realized 18-20 networks, for which we simulated 400 repetitions of 6.5s of delay activity each (20 uniformly spaced positions of the initial cue). We then estimated the drift-field numerically by recording displacements of bump centers along their trajectories (as in Fig 2A and 2B) and measured the standard deviation of the resulting fields across all positions. Similar to the analysis of diffusion above, we find that facilitation and depression elicit antagonistic control over the magnitude of drift fields. In both simulations and theory, we find ( Fig 4A and 4B) that the expected field magnitude decreases as the effect of facilitation is increased from unfacilitated networks (U = 1) through intermediate levels of facilitation (U = 0.4) to strongly facilitating networks (U = 0.1). Our theory predicts this effect surprisingly well, which we validated twofold (as for the diffusion magnitude). First, we used Eq (10) with all parameters and coefficients estimated from each spiking simulation separately (Fig 4A and  4B, plus-signs and crosses). Second, we extrapolated the theoretical prediction by using coefficients in Eq (9) from the unfacilitated reference network only (U = 1, τ x = 150ms) but changed the facilitation and heterogeneity parameters (Fig 4A and 4B, dashed lines). The largest differences between the extrapolated and full theory are seen for U < 1 and randomly connected networks (p < 1), which we found to result from the fact that bump shapes for these networks tended to be slightly reduced under random and sparse connectivity (e.g. the top firing rate is reduced to � 35Hz for U = 0.1, p = 0.25). Generally, as noise levels increase, our theory tends to over-estimate the squared magnitude of fields, since we rely on a linear expansion of perturbations to the firing rates to calculate fields (Eq (8)). Such deviations are expected as the magnitude of firing rate perturbations increases, and could be counter-acted by including higher- order terms. Since in the theory facilitation (and depression) only scales the firing rate perturbations (Eq (7)), these deviations can also be observed across facilitation parameters. Finally, we performed a similar analysis to investigate the effect of short-term depression on drift fields. Here, we varied the depression time constant τ x for randomly connected networks with p = 0.6, by simulating networks with combinations of short-term plasticity parameters from U 2 {0.1, 0.4, 0.8} and τ x 2 {120ms, 160ms, 200ms} ( Fig 4C). We find that an increase of the depression time constant leads to increased magnitude of drift fields, which again is well predicted by our theory.

Short-term plasticity controls memory retention
The theory developed in previous sections shows that diffusion and drift of the bump center φ are controlled antagonistically by short-term depression and facilitation. In a working memory setup, we can view the attractor dynamics as a noisy communication channel [56] that maps a set of initial positions φ(t = 0s) (time of the cue offset in the attractor network) to associated final positions φ(t = 6.5s), after a memory retention delay of 6.5s. We used the distributions of initial and (associated) final positions to investigate the combined impact of diffusion and drift on the retention of memories ( Fig 5A). Because of diffusion, distributions of positions will widen over time, which degrades the ability to distinguish different initial positions of the bump center (Fig 5A, top). Additionally, directed drift of the dynamics will contract distributions of different initial positions around the same fixed points, making them essentially indistinguishable when read out (Fig 5A, bottom).
As a numerical measure of this ability of such systems to retain memories over the delay period, we turned to mutual information (MI), which provides a measure of the amount of information contained in the readout position about the initially encoded position [57,58]. To measure MI from simulations (see Mutual information measure in Materials and methods), we analyzed network simulations for varying short-term facilitation parameters (U) and magnitudes of frozen noises (p and σ L ) (same data set as Fig 4A and 4B). We recorded the center positions encoded in the network at the time of cue-offset (t = 0) and after 6.5s of delay activity, and used binned histograms (100 bins) to calculate discrete probability distributions of initial (t = 0) and final positions (t = 6.5). For each trajectory simulated in networks of spiking integrate-and-fire neurons, we then generated a trajectory starting at the same initial position by using the Langevin equation Eq (4) that describes the drift and diffusion dynamics of center positions. The MI calculated from the resulting distributions of final positions (again at t = 6.5) for each network serve as the theoretical prediction for each network. As a reference, we used the spiking network without facilitation (U = 1, τ u = 650ms, τ x = 150ms) and no frozen noises (p = 1, σ L = 0mV) and normalized the MI of all other networks (both for spiking simulations and theoretical predictions) with respect to the reference, yielding the measure of relative MI presented in Fig 5B-5E.
We found that the relative MI decreased compared to the reference network as network heterogeneities were introduced (Fig 5B, green). This was expected, since directed drift caused by heterogeneities leads to a loss of information about initial positions. There were two effects of increased short-term facilitation (by decreasing the parameter U). First, diffusion was reduced, which was visible in a vertical shift of the relative MI for facilitated networks (Fig 5A, orange and blue, at 0 heterogeneity). Second, the effects of frozen noise decreased with increasing facilitation, which was visible in the slopes of the MI decrease (see also S4 Fig). The MI obtained by integration of the Langevin equations (see above) matched those of the simulations well (Fig 5A, lines). From earlier results, we expected the drift-fields to be slightly overestimated by the theory as the heterogeneity parameters increase (Fig 4), which would lead to an under-estimation of MI. We did observe this here, although for U = 1 the effect was slightly counter-balanced by the under-estimated level of diffusion (cf. Fig 3A, right), which we expected to increase the MI. For networks with stronger facilitation (U = 0.1), we systematically over-estimated diffusion (cf. Fig 3, left), and therefore under-estimated MI.
Using our theory, we were able to simplify the functional dependence between MI, shortterm plasticity, and frozen noise. Combining the effects of both diffusion and drift into a single quantity for each network, we replaced the field A(φ) by our theoretical prediction ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi hA 2 i frozen p in Eq (4) and forward integrated the differential equation for a time interval Δt = 1s, to arrive at the expected displacement in 1s: jDφjð1sÞ ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi This quantity describes the expected absolute value of displacement of center positions during 1s: it increases as a function of the frozen noise distribution parameters (Fig 5C), but even in Stability of continuous attractor networks under the control of short-term plasticity the absence of frozen noise it is nonzero due to diffusion. Plotting the MI data in dependence of the first term only ( ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi hA 2 i frozen p ), shows that the MI curves collapse onto a single curve for each facilitation parameter ( Fig 5D). Finally, plotting the MI data against |Δφ|(1s) we find that all data collapse on to nearly a single curve (Fig 5E). Thus, the effects of the two sources of frozen noise (corresponding to hA 2 i frozen ) and diffusion (corresponding to B) are unified into a single quantity |Δφ|(1s).
We performed the same analyses on a large set of network simulations with fixed random connectivity (p = 0.6) and varying STP parameters for both depression (τ x ) and facilitation (U) (same data set as in Fig 4C). Increasing the short-term depression time constant τ x leads to decreased relative MI with a positive offset induced through stronger facilitation (Fig 6A, blue line). Calculating the expected displacement for these network configurations collapsed the data points mostly onto the same curve as earlier (Fig 6B). For strong depression combined with weak facilitation (τ x = 200ms, U = 0.8), the drop-off of the relative MI saturates earlier, indicating that for these strongly diffusive networks the effect on MI may not be sufficiently captured by its relationship to |Δφ|(1s).

Linking theory to experiments: Distractors and network size
The abstraction of our theory condenses the complex dynamics of bump attractors in spiking integrate-and-fire networks into a high-level description of a few macroscopic features, which in turn allows matching the theory to behavioral experiments. Here, we demonstrate how such quantitative links could be established using two different features: 1) the sensitivity of the working memory circuit to distractors, and 2) the stability of working memory expressed by the expected displacement. We stress that our model is a simplified description of biological circuits, in which several further sources of variability and also dynamical processes influencing displacement should be expected (see Discussion). Thus, at the current level of simplification, the results presented in this section should be seen as proofs of principle rather than quantitative predictions for a cortical setting.
Predicting the sensitivity to distractor inputs. In a biological setting, drifts introduced by network heterogeneities (frozen noise) could be significantly reduced by (long-term) plasticity [36]. To measure the intrinsic stability of continuous attractor models, earlier studies Stability of continuous attractor networks under the control of short-term plasticity [11,47,59] have proposed to use distractor inputs (Fig 7A): providing a short external input centered around a position φ D to the network, the center position of an existing bump state will be biased towards the distracting input, with stronger biases appearing for closer distractors. In the context of our theory, we consider a weak distractor as an additional heterogeneity that induces drift. Therefore the time scale of bump drift caused by distractor-induced heterogeneity enables us link our theory to behavioral experiments [59].
Our theory can readily yield quantitative predictions for the distractor paradigm. To accommodate distractor inputs in the theory, we assume that they cause some units i to fire at elevated rates ϕ 0,i + Δϕ i , which will introduce a drift field according to Eq (7) (Fig 7A, purple dashed line). The resulting dynamics (Eq (4)) of diffusion and drift during the presentation of the distractor input then allow us to calculate the expected shift of center positions as a function of all network parameters, including those of short-term plasticity. Repeating this paradigm for varying positions of the distractor inputs (see Distractor analysis in Materials and methods for details), our theory predicts that strong facilitation will strongly decrease both the effect and radial reach of distractor inputs (Fig 7B, blue), when compared to the unfacilitated system (Fig 7B, green)-in qualitative agreement with simulation results involving a related  Stability of continuous attractor networks under the control of short-term plasticity (cell-intrinsic) stabilization mechanism [47]. Conversely, we predict that longer recovery from short-term depression tends to increase the sensitivity to distractors (Fig 7C). The total displacement caused by a distractor input is found by integrating the resulting dynamics of Eq (4) over the stimulus duration. As such, the magnitude of the displacement will increase both with the amplitude and the duration of the distractor input. Finally, our theory demonstrates that the bump shape, in particular the width of the bump, influence the radial reach of distractor inputs (Fig 7D).
Relating displacement to network size in working memory networks. The simple theoretical measure of expected displacement |Δφ|(1s) introduced in the last section can be related to behavioral experiments: a value of |Δφ|(1s) = 1.0 deg lies in the upper range of experimentally reported deviations due to diffusive and systematic errors in behavioral studies [60,61]. What are the microscopic circuit compositions that can attain such a (high) level of working memory stability? In particular, since an increase in network size can reduce diffusion [11] and the effects of random heterogeneities [16,36,38,46], we turned to the question: which networks size would be needed to yield this level of stability in a one-dimensional continuous memory system?
To address the question of network size, we extended our theory to include the size N of the excitatory population as an explicit parameter (see System size scaling in Materials and methods for details). Using numerical coefficients in Eq (4) extracted from the spiking simulation of a reference network (U = 1, τ x = 150 and N E = 800), we extrapolated the theory by changing the system size N and short-term plasticity parameters. We then constrained parameters of our theory by published data (Table 1). Short-term plasticity parameters were based on two groups of strongly facilitating synapses found in a study of mammalian (ferret) prefrontal cortex [62]. The same study reported a general probability p = 0.12 of pyramidal cells to be connected. However, for pairs of pyramidal cells that were connected by facilitating synapses, the study found a high probability of reciprocal connections (p rec = 0.44): thus if neuron A was connected to neuron B (with probability p), neuron B was connected to neuron A with high probability (p rec ), resulting in a non-random connectivity. To approximate this in the random connectivities supported by our theory, we evaluated a second, slightly elevated, level of random connectivity, that has the same mean connection probability as the non-random connectivity with these additional reciprocal connections: p + p � p rec = 0.1728. For the standard deviation of leak reversal-potentials σ L , we used values measured in two studies [63,64]. The resulting theory makes quantitative predictions for combinations of network size N and all other parameters that yield the desired levels of working memory stability (Table 1, see also S5 Fig). Network sizes were all smaller than 10 6 neurons, with values depending most strongly on the value of the facilitation parameter U and the magnitude of the leak reversal-potential heterogeneities σ L . Since the expected field magnitude scales weakly (1= ffi ffi ffi p p ) with the recurrent connectivity p, increasing p lead only to comparatively small decreases in the predicted network sizes. Finally, we see that the increasing the reliability of networks comes at a high cost: decreasing the expected displacement to |Δφ|(1s) = 0.5 deg [60] increases the required number of neurons by nearly a number of 4 for both facilitation settings we investigated. Nevertheless, these network sizes still lie within anatomically reasonable ranges [65].
In summary, we have provided a proof of principle, that the high-level description of our theory can be used to predict network sizes, by exposing features that can be constrained by experimental measurements. Given the simplifying assumptions of our models and the sources of variability that we could include at this stage, continuous attractor networks with realistic values for the strength of facilitation and depression of recurrent connections could achieve sufficient stability, even in the presence of biological variability.

Discussion
We presented a theory of drift and diffusion in continuous working memory models, exemplified on a one-dimensional ring attractor model. Our framework generalizes earlier approaches calculating the effects of fast noise by projection onto the attractor manifold [37,39,40] by including the effects of short-term plasticity (see [45] for a similar analysis for facilitation only). Our approach further extends earlier work on drift in continuous attractors with shortterm plasticity [38] to include diffusion and the dynamics of short-term depression. Our theory predicts that facilitation makes continuous attractors robust against the influences of both dynamic noise (introduced by spiking variability) and frozen noise (introduced by biological variability) whereas depression has the opposite effect. We use this theory to provide, together with simulations, a novel quantitative analysis of the interaction of facilitation and depression with dynamic and frozen noise. We have confirmed the quantitative predictions of our theory in simulations of a ring-attractor implemented in a network model of spiking integrate-andfire neurons with synaptic facilitation and depression, and found theory and simulation to be in good quantitative agreement.
In Section Short-term plasticity controls memory retention, we demonstrated the effects of STP on the information retained in continuous working memory. Using our theoretical predictions of drift and diffusion we were able to derive the expected displacement |Δφ| as a function of STP parameters and the frozen noise parameters, which provides a simple link between the resulting Langevin dynamics of bump centers and mutual information (MI) as a measure of working memory retention. Our results can be generalized in several directions. First, the choice of 1s of forward integrated time for |Δφ| (Eq (11)) was arbitrary. While a choice of � 2s lets the curves in Fig 5E collapse slightly better, we chose 1s to avoid further heuristics. Second, we expect values of MI to decrease as the length of the delay period is increased. Our choice of 6.5s is comparable to delay periods often considered in behavioral experiments (usually 3-6s) [61,66,67]. However, a more rigorous link between the MI measure and the underlying attractor dynamics would be desirable. Indeed, for noisy channels governed by Fokker-Planck equations, this might be feasible [68], but goes beyond the scope of this work.
In Section Linking theory to experiments: Distractors and network size, we demonstrated that the high-level description of the microscopic dynamics obtained by our theory allows its parameters to be constrained by experiments. Considering that our model is a simplified description of its biological counterparts (see next paragraph), these demonstrations are to be seen as a proof of principle as opposed to quantitative predictions. However, since distractor inputs can be implemented in silico as well as in behavioral experiments (see e.g. [59]), they could eventually provide a quantitative link between continuous attractor models and working memory systems, by matching the resulting distraction curves. Our theory goes beyond previous models in which these distraction curves had to be extracted through repeated microscopic simulations for single parameter settings [47]. We further used our theory to derive bounds on network parameters, in particular the size of networks, that lead to "tolerable" levels of drift and diffusion in the simplified model. For large magnitudes of frozen noise our theory tends to over-estimate the expected magnitude of drift-fields slightly (cf. Fig 4). Thus, we expect the predictions made here to be upper bounds on network parameters needed to achieve a certain expected displacement. Finally, while the predictions of our theory might deviate from biological networks, they could be applied to accurately characterize the stability of, and the effects of inputs to, bump attractor networks implemented in neuromorphic hardware for robotics applications [69].
Our results show, that strong facilitation (small values of U) does not only slow down directed drift [38], but also efficiently suppresses diffusion in spiking continuous attractor models. However, in delayed response tasks involving saccades, that presumably involve continuous attractors in the prefrontal cortex [11,22], one does observe an increase of variability in time [66]: quickly accumulating systematic errors (alike drift) [61] as well as more slowly increasing variable errors (with variability growing linear in time, alike diffusion) have been reported [60]. Indeed, there are several other possible sources of variability in cortical working memory circuits, which we did not consider here. In particular, we expect that heterogeneous STP parameters [62], noisy synaptic transmission and STP [70] or noisy recurrent weights [38] (see Random and heterogeneous connectivity in Materials and methods), for example, will induce further drift and diffusion beyond the effects discussed in this paper. Additionally, variable errors might be introduced elsewhere in the pathway between visual input and motor output (but see [71]) or by input from other noisy local circuits during the delay period [72]. Note that we excluded AMPA currents from the recurrent excitatory interactions [11]. However, since STP acts by presynaptic scaling of neurotransmitter release, it will act symmetrically on both AMPA and NMDA receptors so that an analytical approach similar to the one presented here is expected to work.
Several additional dynamical mechanisms might also influence the stability of continuous attractor working memory circuits. For example, intrinsic neuronal currents that modulate the neuronal excitability [47] or firing-rate adaptation [73] affect bump stability. These and other effects could be accommodated in our theoretical approach by including their linearized dynamics in the calculation of the projection vector (cf. Projection of dynamics onto the attractor manifold in Materials and methods). Fast corrective inhibitory feedback has also been shown to stabilize spatial working memory systems in balanced networks [74]. On the timescale of hours to days, homeostatic processes counteract the drift introduced by frozen noise [36]. Finally, inhibitory connections that are distance-dependent [11] and show shortterm plasticity [75] could also influence bump dynamics.
We have focused here on ring-attractor models that obtain their stable firing-rate profile due to perfectly symmetric connectivity. Our approach can also be employed to analyze ringattractor networks with short-term plasticity in which weights show (deterministic or stochastic) deviations from symmetry (see Frozen noise in Materials and methods for stochastic deviations). Although not investigated here, continuous line-attractors arising through a different weight-symmetry should be amenable to similar analyses [39]. Finally, it should be noted that adequate structuring of the recurrent connectivity can also positively affect the stability of continuous attractors [14]. For example, translational asymmetries included in the structured heterogeneity can break the continuous attractor into several isolated fixed points, which can lead to decreased diffusion along the attractor manifold [58].
We provided evidence that short-term synaptic plasticity controls the sensitivity of attractor networks to both fast diffusive and frozen noise. Control of short-term plasticity via neuromodulation [76] would thus represent an efficient "crank" for adapting the time scales of computations in such networks. For example, while cortical areas might be specialized to operate in certain temporal domains [7,77], we show that increasing the strength of facilitation in a taskdependent fashion could yield slower and more stable dynamics, without changing the network connectivity. On the other hand, modulating the time scales of STP could provide higher flexibility in resetting facilitation-stabilized working memory systems to prepare them for new inputs [47], although there might be evidence for residual effects of facilitation between trials [45,78]. By changing the properties of presynaptic calcium entry [79], inhibitory modulation mediated via GABA B and adenosine A 1 receptors can lead to increased facilitatory components in rodent cerebellar [80] and avian auditory synapses [81]. Dopamine, serotonin and noradrenaline have all been shown to differentially modulate short-term depression (and facilitation when blocking GABA receptors) at sensorimotor synapses [82]. Interestingly, next to short-term facilitation on the timescale of seconds, other dynamic processes up-regulate recurrent excitatory synaptic connections in prefrontal cortex [62]: synaptic augmentation and post-tetanic potentiation operate on longer time scales (up to tens of seconds), and might be able to support working memory function [83]. While the long time scales of these processes might again render putative short-term memory networks inflexible, there is evidence that they might also be under tight neuromodulatory control [84]. Finally, any changes in recurrent STP properties of continuous attractors (without retuning networks as done here) will also lead to changes in the stable firing rate profiles, with further effects on their dynamical stability (see final section of the Discussion). This interplay of effects remains to be investigated in more detail.

Comparison to earlier work
Similar to an earlier theoretical approach using a simplified rate model [38], we find that the slowing of drift by facilitation depends mainly on the facilitation parameter U, while the time constant τ u has a less pronounced effect. While the approach of [38] relied on the projection of frozen noise onto the derivative of the first spatial Fourier mode of the bump shape along the ring, here we reproduce and extend this result (1) for arbitrary neuronal input-output relations and (2) a more detailed spatial projection that involves the full synaptic dynamics and the bump shape. While, our theory can also accommodate noisy recurrent connection weights as frozen noise, as used in in [38] (see Frozen noise in Materials and methods for derivations), the drifts generated by these heterogeneities were generally small compared to diffusion and the other sources of heterogeneity.
A second study investigated short-term facilitation and showed that it reduces drift and diffusion in a spiking network, for a fixed setting of U (although the model of short-term facilitation differs slightly from the one employed here) [47]. Contrary to what we find here, these authors find that an increase in τ u leads to increased diffusion, while we find that an increase over the range they investigated (� 0.5s − 4s) would decrease the diffusion by a factor of nearly two. More precisely, for our shape of the bump state (which we keep fixed) we predict a reduction from � 26 to � 16 deg 2 /s for a similar setting of facilitation U. These differences might arise from an increasing width of the bump attractor profile for growing facilitation time constants in [47], which would then lead to increased diffusion in our model. Whether this effect persists under the two-equation model of saturating NMDA synapses used there remains to be investigated. Finally, increasing the time constant of recurrent NMDA conductances has been shown to also reduce diffusion [47], in agreement with our theory, according to which the normalization constant S increases with τ s [39].
A study performed in parallel to ours [45] used a similar theoretical approach to calculate diffusion with short-term facilitation in a rate-based model with external additive noise, but did not compare the results for varying facilitation parameters. The authors report a short initial transient of stronger diffusion as synapses facilitate, followed by weaker diffusion that is dictated by the fully facilitated synapses. Our theory, by assuming all synaptic variables to be at steady-state, disregards the initial strong phase of diffusion. We also disregarded such initial transients when comparing to simulations (see Numerical methods).
In a study that investigated only a single parameter value for depression (τ x = 160ms, no facilitation) in a network of spiking integrate-and-fire neurons similar to the one investigated here, the authors observed no apparent effect of short-term depression on the stability of the bump [44]. In contrast, we find that stronger short-term depression will indeed increase both diffusion and directed drift along the attractor. Our result agrees qualitatively with earlier studies in rate models, which showed that synaptic depression, similar to neuronal adaptation [10,85], can induce movement of bump attractors [42,43,86,87]. In particular, simple rate models exhibit a regime where the bump state moves with constant speed along the attractor manifold [42]. We did not find any such directed movement in our networks, which could be due to fast spiking noise which is able to cancel directed bump movement [85].

Extensions and shortcomings
The coefficients of Eq (4) give clear predictions as to how drift and diffusion will depend on the shape of the bump state and the neural transfer function F. The relation is not trivial, since the pre-factors C i and the normalization constant S also depend on the bump shape. For the diffusion strength Eq (5), we explored this relation numerically, by artificially varying the shape of the firing rate profile (while extrapolating other quantities). Although a more thorough analysis remains to be performed, a preliminary analysis shows (see S6 Fig) that diffusion increases both with bump width and top firing rate, consistent with earlier findings [11,32].
Our theory can be used to predict the shape and effect of drift fields that are generated by localized external inputs due to distractor inputs; see Section Linking theory to experiments: Distractors and network size. Any localized external input (excitatory or inhibitory) will cause a deviation Δϕ i from the steady-state firing rates, which, in turn, generates a drift field by Eq (7). This could predict the strength and location of external inputs that are needed to induce continuous shifts of the bump center at given speeds, for example when these attractor networks are designed to track external inputs (see e.g. [10,88]). It should be noted that in our simple approximation of this distractor scheme, we assume the system to remain at approximately steady-state, i.e. that the bump shape is unaffected by the additional external input, except for a shift of the center position. For example, we expect additional feedback inhibition (through the increased firing of excitatory neurons caused by the distractor input) to decrease bump firing rates. A more in depth study and comparison to simulations will be left for further work.
Our networks of spiking integrate-and-fire neurons are tuned to display balanced inhibition and excitation in the inhibition dominated uniform state [53,89], while the bump state relies on positive currents, mediated through strong recurrent excitatory connections (cf. [44] for an analysis). Similar to other spiking network models of this class, this mean-driven bump state shows relatively low variability of neuronal inter-spike-intervals of neurons in the bump center [90,91] (see also next paragraph). Nevertheless, neurons at the flanks of the bump still display variable firing, with statistics close to that expected of spike trains with Poisson statistics (see S7 Fig), which may be because the flank's position slightly jitters. Since the non-zero contributions to the diffusion strength are constrained to these flanks (cf. Fig 1D), the simple theoretical assumption of Poisson statistics of neuronal firing still matches the spiking network quite well. As discussed in Short-term plasticity controls diffusion, we find that our theory overestimates the diffusion as bump movement slows down for small values of U-this may be due to a decrease in firing irregularity in stable bumps in particular in the flank neurons, at which the Poisson assumption becomes inaccurate.
More recent bump attractor approaches allow networks to perform working memory function with a high firing variability also during the delay period [3], in better agreement with experimental evidence [92]. These networks show bi-stability, where both stable states show balanced excitation and inhibition [90] and the higher self-sustained activity in the delay activity is evoked by an increase in fluctuations of the input currents (noise-driven) rather than an increase in the mean input [93]. This was also reported for a ring-attractor network (with distance-dependent connections between all populations), where facilitation and depression are crucial for irregularity of neuronal activity in the self-sustained state [46]. Application of our approach to these setups is left for future work.

Analysis of drift and diffusion with STP
For the following, we define a concatenated 3 � N dimensional column vector of state variables y = (s T , u T , x T ) T of the system Eq (3). Given a (numerical) solution of the stable firing rate pro-file� 0 we can calculate the stable fixed point of this system by setting the l.h.s. of Eq (3) to zero. This yields steady-state solutions for the synaptic activations, facilitation and depression variables y 0 = (s 0 , u 0 , x 0 ): We then linearize the system Eq (3) at the fixed point y 0 , introducing a change of variables consisting of perturbations around the fixed point: y = y 0 + δ y = y 0 + (δ s T , δ u T , δ x T ) and ϕ i = ϕ 0,i + δϕ i . To reach a self-consistent linear system, we further assume a separation of time scales between the neuronal dynamics and the synaptic variables, in that the neuronal firing rate changes as an immediate function of the (slow) input. This allows replacing Finally, keeping only linear orders in all perturbations, we arrive at the linearized system equivalent of Eq (3): Here, dots between vectors indicate element-wise multiplication, the operator D : R n ! R n�n creates diagonal matrices from vectors, and W = (w ij ) is the synaptic weight matrix of the network. Projection of dynamics onto the attractor manifold. To project the dynamical system Eq (13) onto movement of the center position φ of the firing rate profile, we assume that N is large enough to treat the center position φ as a continuous variable. We also assume that the network implements a ring-attractor: the system dynamics are such that the firing rate profilẽ � 0 can be freely shifted to different positions along the ring, changing the center position φ, while retaining the same shape. All other possible directions of change in this system are assumed to be constrained by the system dynamics. In the system at hand, this implies that the matrix K of Eq (13), which captures the linearized dynamics around any of these fixed points, will have a zero eigenvalue corresponding to the eigenvector of a change of the dynamical variables under a change of position φ, while all other eigenvalues are negative [39].
Formally, the column eigenvector to the eigenvalue 0 is given by changes in the state variables as the bump center position φ is translated along the manifold: Let e l be the associated row left-eigenvector (also to eigenvalue 0) of K, normalized such that: e l � e r ¼ 1: In Section 1 of S1 Text, we show that the eigenvector e l projects the system Eq (13) onto dynamics of of the center position: Under the linearized ring-attractor dynamics K, the center position is thus not subject to any dynamics, making it susceptible to any displacements by noise.
Calculation of the left eigenvector e l . If the matrix K is symmetric, the left and right eigenvectors e l and e r for the same eigenvalue 0 are the transpose of each other. Unfortunately, here this is not the case (see Eq (13)), and we need to compute the unknown vector e l , which will depend on the coefficients of the known vector e r . In particular, we look for a parametrized vector y 0 (y) = (t T (y), v T (y), z T (y)) T that for y = e r fulfills the transposed eigenvalue equation of the left eigenvector: In Section 2 of S1 Text, we derive variables y 0 that fulfill the transposed dynamics _ y 0 ¼ K T y 0 and for which it holds that _ y 0 ðe r Þ ¼ 0, thus fulfilling the condition Eq (17). In this case we know that (due to uniqueness of the 1-dimensional eigenspace associated to the 0 eigenvalue) the vector y 0T is proportional to e l : where S is a proportionality constant and dφ is the change of the steady-state input arriving at neuron i under shifts of the center position φ.
Finally, the proportionality constant S can be calculated by using Eq (18) in Eq (15) (see Section 3 of S1 Text for details): i is the linear change of the firing rate of neuron i at its steady-state input J 0,i .

Diffusion.
To be able to describe diffusion on the continuous attractor, we need to extend the model by a treatment of the noise induced into the system through the variable process of neuronal spike emission. Starting from Eq (3), we assume that neurons i fire according to independent Poisson processes ξ i (t) = ∑ k δ(t − t i,k ), where t i,k is a Poisson point process with timedependent rate ϕ i . The variability of the point process ξ i (t) introduces noise in the synaptic variables. We assume that the shot-noise (jump-like) nature of this process is negligible, given that we average all individual contributions over the network (see below), allowing us to capture the neurally induced variability simply as white noise with variance proportional to the incoming firing rates [48,53] where η i are white Gaussian noise processes with mean hη i i = 0, and correlation function hη i (t)η j (t 0 )i = δ(t − t 0 )δ ij . This model of ξ i (t) preserves the mean and the auto-correlation function of the original Poisson processes. Here, we introduce diffusive noise for each synaptic variable separately, but later average their linear contributions over the large population, when projecting onto movement along the continuous manifold (see below, and also [39], Supplementary Material] for a discussion).
Substituting the noisy processes ξ i (t) for ϕ i (t) in Eq (3) results in the following system of 3�N coupled Ito-SDEs: Stability of continuous attractor networks under the control of short-term plasticity Note that the noise inputs η i to the synaptic variables for neuron i are all identical, since they result from the same presynaptic spike train.
Linearizing this system around the noise-free steady-state Eq (12) and considering only the unperturbed noise (we neglect multiplicative noise terms by replacing the terms ffi ffi ffi ffi � i p ! ffi ffi ffi ffi ffi ffiffi � 0;i p ), we arrive at the linearized system equivalent of Eq (20): Note that the same vector of white noisesZ � ðZ 1 ; . . . ; Z n Þ T appears three times.
Left-multiplying this system with the eigenvector e l yields a stochastic differential equation for the center position (cf. Eq (16)): Through the normalization by S (Eq (18)), which sums over all neurons, the individual contributions e l,k become small as the number of neurons N increases (this scaling is made explicit in System size scaling). Thus, for large networks we average the small contributions of many single noise sources, which validates the diffusion approximation above. In Section 4 of S1 Text, we show that we can rewrite Eq (22) by introducing a single Gaussian white noise process with intensity B (Eq (5) of the main text), that matches the correlation function of the summed noises: where η is a white Gaussian noise process with hηi = 0 and hη(t)η(t 0 )i = δ(t − t 0 ). Note, that the value of B is the same under changes of the center position φ: these correspond to index-shifting (mod N) all vectors in Eq (5), which leaves the sum invariant. Drift. While the diffusion coefficient calculated above is invariant with respect to shifts of the bump center, the directed drift introduced by frozen variability depends on the momentary bump center position φ. In the following we compare the heterogeneous network with bump centered at φ to a homogeneous network (without frozen noise) with the bump also centered at φ. The unperturbed firing rate profile in the homogeneous network with bump at φ will be denoted by� 0 ðφÞ, which is the standard profile� 0 but centered at φ. Since we choose the standard profile to be centered at −π, we have� 0 ¼� 0 ðÀ pÞ.
We want to derive a compact expression for the directed drift of the bump in the heterogeneous network with frozen noise. Given a bump center position φ, we first shift the origin of the coordinate system that describes the angular position on the ring of neurons such that the firing rate profile is centered at the standard position φ 0 = −π. In a system with frozen variability, the actual firing rate profile of the bump is where D�ðφÞ summarize the linear firing rate perturbations caused by a small amount of heterogeneities. These firing rate perturbations stem from any deviation of the neural system from the "baseline" case and change with the center position φ of the bump. The resulting drift field derived from a linearization of the dynamics will thus depend on the center position.
In subsection Frozen noise we calculate the perturbations induced by random network connectivity, as well as heterogeneous leak reversal-potentials in excitatory neurons of the spiking network. The firing rate perturbations Eq (24) add an additional term in the linearized equations Eq (21): As before, we left-multiply by the left eigenvector e l , thereby projecting the dynamics onto changes of the center position. This eliminates the linear response kernel K and yields a driftterm in the SDE Eq (23) (see Section 5 of S1 Text for details): Here, ϕ 0,i is the firing rate of the ith neuron in a homogeneous network with the bump centered at −π and Δϕ i (φ) is the firing rate change of this neuron caused by heterogeneities where the heterogeneities are calculated under the assumption that (before shift of the coordinate system) the bump is at φ.
In the above equation, we have assumed that the number of neurons N is large enough to treat the center position as a continuous variable φ 2 [−π, π) with the associated drift-field A (φ) in Eq (7). In practice, we calculate this drift field according to the first term in Eq (26) for each realizable center position φ k ¼ k 2p N À p (for 0 � k < N), which yields a discretized field. It is important to note that this field will vary nearly continuously with changes in these discretized center positions. Intuitively, the sum weighs the vector D�ðφ k Þ of firing-rate perturbations with a smooth function of the smoothly varying firing-rate profile� 0 (the coefficients in the sum). Shifts in the center position φ k yield (to first order) index-shifts in the vector of firing-rate perturbations (see Frozen noise), equivalent to index-shifts of the vector of firing rates � 0 . Thus, small changes in center positions will lead to small changes in the summands of Eq (26). While our results validate the approach, a more rigorous proof of these arguments will be left for future work.

Spiking network model
Spiking simulations are based on a variation of a popular ring-attractor model of visuospatial working memory of [11] (and used with variations in [27,29,32,36,47]). The recurrent excitatory connections of the original network model have been simplified, to allow for faster simulation as well as analytical derivations of the recurrent synaptic activation. The implementation details are given below, however the major changes are: 1) all recurrent excitatory conductances are voltage independent; 2) a model of synaptic short-term plasticity via facilitation and depression [49,94,95] is used to dynamically regulate the weights of the incoming spike-trains 3) recurrent excitatory conductances are computed as linear filters of the weighted incoming spike trains instead of the second-order kinetics for NMDA saturation used in [11].
Neuron model. Neurons are modeled by leaky integrate-and-fire dynamics with conductance based synaptic transmission [11,50]. The network consists of recurrently connected populations of N E excitatory and N I inhibitory neurons, both additionally receiving external spiking input with spike times generated by N ext independent, homogeneous Poisson processes, with rates ν ext . We assume that external excitatory inputs are mediated by fast AMPA receptors, while, for simplicity, recurrent excitatory currents are mediated only by slower NMDA channels (as in [11]).
The dynamics of neurons in both excitatory and inhibitory populations are governed by the following system of differential equations indexed by i 2 {0, . . ., N E/I − 1}: where P 2 {L,Ext,I,E}, V denotes voltages (membrane potential) and I denotes currents. Here, C m is the membrane capacitance and V L , V E , V I are the reversal potentials for leak, excitatory currents, and inhibitory currents, respectively. The parameters g P for P 2 {L,Ext,I,E} are fixed scales for leak (L), external input (Ext) and recurrent excitatory (E) and inhibitory (I) synaptic conductances, which are dynamically gated by the unit-less gating variables s P i ðtÞ. These gating variables are described in detail below, however we set the leak conductance gating variable to s L i ¼ 1. For excitatory neurons, we refer to the excitatory and inhibitory conductance scales by g EE � g E and g EI � g I , respectively. Similarly, for inhibitory neurons, we refer to the excitatory and inhibitory conductance scales by g IE � g E and g II � g I , respectively. The model neuron dynamics (Eq 27) are integrated until their voltage reaches a threshold V thr . At any such time, the respective neuron emits a spike and its membrane potential is reset to the value V res . After each spike, voltages are clamped to V res for a refractory period of τ ref .
See the Tables in S1 and S2 Tables. for parameter values used in simulations.
Synaptic gating variables and short-term plasticity. The unit-less synaptic gating variables s P i ðtÞ for P 2 {Ext,I} (external and inhibitory currents) are exponential traces of the spike trains of all presynaptic neurons j with firing times t j : where pre(P) indicates all neurons presynaptic to the neuron i for the the connection type P. The factors w P ij are unit-less synaptic efficacies for the connection from neuron j to neuron i. For the excitatory gating variables of inhibitory neurons s IE i (IE denotes connections from E to I neurons) we also use the linear model of Eq (28) with time constant τ IE = τ E .
For excitatory to excitatory conductances, we use a well established model of synaptic short-term plasticity (STP) [49,94,95] which provides dynamic scaling of synaptic efficacies depending on presynaptic firing. This yields two additional dynamical variables, the facilitating synaptic efficacy u j (t), as well as the fraction of available synaptic resources x j (t) of the outgoing connections of a presynaptic neuron j, which are implemented according to the following differential equation: Here, the indices u À j and x À j indicate that for the incremental update of the variables upon spike arrival, we use the values of the respective variables immediately before the spike arrival [95]. Note that the variable U appears in the equation for u(t) both as the steady-state value in the absence of spikes and as a scale for the update per spike.
The dynamics of recurrent excitatory-to-excitatory transmission with STP are then given by gating variables that linearly filter the incoming spikes scaled by facilitation and depression: Here, pre(EE) indicates all excitatory neurons that make synaptic connections to the neuron i. See 'S2 Table' for synaptic parameters used in simulations. Note that a synapse j that has been inactive for a long time is described by variables x À j ¼ 1 and u À j ¼ U and s j = 0 so that the initial strength of the synaptic connection is Uw EE ij [49]. The system of Eqs (29)-(31) is a spiking variant of the rate-based dynamics of Eq (3), with s EE i a variable related to the input J i (cf. Eq (2)). In Subsection Firing rate approximation we will make this link explicit. Network connectivity. All connections except for the recurrent excitatory connections are all-to-all and uniform, with unit-less connection strengths set to w I ij ¼ w ext ij ¼ 1 and for inhibitory neurons additionally w E ij ¼ 1. The recurrent excitatory connections are distance-dependent and symmetric. Each neuron of the excitatory population with index i 2 {0, . . ., N E − 1} is assigned an angular position y i ¼ i � 2p N E 2 ½0; 2pÞ. Recurrent excitatory connection weights w EE ij from neuron j to neuron i are then given by the Gaussian function w EE (θ) as (see the Table in  S2 Table for parameters used in simulations): Additionally, for each neuron we keep the integral over all recurrent connection weights normalized, resulting in the normalization condition 1 2p R p À p dφw EE ðφÞ ¼ 1: This normalization ensures that varying the maximum weight w + will not change the total recurrent excitatory input if all excitatory neurons fire at the same rate. Here, we choose w + as a free parameter constraining the baseline connection weight to: Firing rate approximation. We first replace the synaptic activation variables s P (V, t) for P 2 {I, ext} by their expectation values under input with Poisson statistics. We assume that the inhibitory population fires at rates ν I . For the linear synapses this yields For the recurrent excitatory-to-excitatory synapses with short-term plasticity, we set the differential Eq (29) to zero, and also average them over the Poisson statistics. Akin to the "meanfield" model of [49], we average the steady-state values of facilitation and depression separately over the Poisson statistics. This implicitly assumes that facilitation and depression are statistically independent, with respect to the distributions of spike times-while this is not strictly true, the approximations work well, as has been previously reported [49]. This allows a fairly straightforward evaluation of the mean steady-state value of the combined facilitation and depression variables hu j x j i, under the assumption that the neuron j fires at a mean rate ν j with Poisson statistics, and yields rate approximations of the steady-state values similar to Eq (12): We now assume that the excitatory population of N E neurons fires at the steady-state rates ϕ j (0 � j < N). To calculate the synaptic activation of excitatory-to-excitatory connections hs EE i i, we set Eq (30) to zero, and average over Poisson statistics (again neglecting correlations), which yieldshs j i = τ E hu j x j iϕ j and hs EE i i ¼ Let the the normalized steady-state input J i be: The steady-state input Eq (36) links the general framework of Eq (2) to the spiking network. The additional factor 1/N E is introduced to make the scaling of the excitatory-to-excitatory conductance with the size of the excitatory population N E explicit, which will be used in System size scaling. To see this, we assume that the excitatory conductance scale of excitatory neurons g EE is scaled such that the total conductance is invariant under changes of N E [96]: g EE ¼g EE =N E , for some fixed valueg EE . This yields the total excitatory-to-excitatory conductance g EE s EE i ¼g EE J i with J i as introduced above, where the scaling with N E is now shifted to the input variable J i .
For the synaptic activation of excitatory to inhibitory connections, we get the mean activations: We then follow [50] to reduce the differential equations of Eq (27) to a dimensionless form. The main difference consists in the absence of the voltage dependent NMDA conductance, which is achieved by setting the two associated parameters β ! 0, γ ! 0 in [50], to arrive at: Þt ext ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi t i N ext n ext p : ð40Þ where T ext ¼ N ext t ext g ext g L ; T I ¼ N I t I g I g L are effective timescales of external and inhibitory inputs, and T E ¼ N E g E g L is a dimensionless scale for the excitatory conductance. Here, μ i is the bias of the membrane potential due to synaptic inputs, and σ i measures the scale of fluctuations in the membrane potential due to random spike arrival approximated by the Gaussian process η i .
The mean firing rates F and mean voltages hV i i of populations of neurons governed by this type of differential equation can then be approximated by: Derivatives of the rate prediction. Here we calculate derivatives of the input-output relation (Eq (42)) that will be used below in Frozen noise.
The expressions for drift and diffusion (see Analysis of drift and diffusion with STP) contain the derivative � 0 i ¼ dF dJ jJ i of the input-output relation F (Eq (42)) with respect to the recurrent excitatory input J i . Note, that F depends on J i through all three arguments μ i , σ i and τ i . First, we define X(u) � exp(u 2 )[1 + erf(u)], and the shorthand F i = F[μ i , σ i , τ i ]. The derivative can then be readily evaluated as (to shorten the notation in the following, we skip noting the evaluation points for derivatives in the following): where α/β stands as a placeholder for either function, and the expressions for α and β are given in Eqs (43) and (44).
A second expression involving the derivative of Eq (42) is which appears in the theory when estimating firing rate perturbations caused by frozen heterogeneities in the leak potentials of excitatory neurons (see Eq (54)). The resulting derivatives are almost similar, which can be seen by the fact that replacing V L ! V L þ D L i in Eq (27) only leads to an additional term D L i in Eq (39). Thus, for neuron i the derivative can be evaluated to In practice, given a vector ϕ i,0 of firing rates in the attractor state, as well as the mean firing rate of inhibitory neurons ν I , we evaluate the right hand side of Eqs (46) and (47) by replacing F i ! ϕ i,0 . This allows efficiently calculating the derivatives without having to perform any numerical integration. The two terms will be exactly equal if ϕ 0,i is a self-consistent solution of Eq (42) for firing rates of the excitatory neurons across the network. We used numerical estimates of ϕ i,0 and ν I that were measured from simulations and were very close to firing-rate predictions for all networks we investigated.
Optimization of network parameters. We used an optimization procedure [97] to retune network parameters to produce approximately similar bump shapes as the parameters of short-term plasticity are varied. Briefly, we replace the network activity ϕ j in the total input J i of Eq (36) by a parametrization Approximating sums 1 where J i (g) indicates that the total input depends on the parameters g 0 , g 1 , g σ , g r of the parametrization g.
We then substitute this relation in Eq (42) to arrive at a self-consistency relation between the parametrized network activity g(θ i ) at the position of neuron i and the firing-rate F predicted by the theory: The argument g indicates the dependence of quantities upon the parameters of the bump parametrization Eq (48). The explicit dependence of the voltage hV i i on g is obtained by substituting ϕ i ! g(θ i ) in Eq (45). We then optimized networks to fulfill Eq (49). First, we imposed the following targets for the parameters of g: g 0 = 0.1Hz, g 1 = 40Hz, ν E,basal = 0.5Hz, ν I,basal = 3Hz. For all networks we chose w + = 4.0, g r = 2.5. The following parameters were then optimized: ν I , g σ , g EE (excitatory conductance g E on excitatory neurons); g IE (excitatory conductance g E on inhibitory neurons); g EI (inhibitory conductance g I on excitatory neurons); g II (inhibitory conductance g I on inhibitory neurons). The basal firing rates (firing rates in the uniform state of the network, prior to being cued) yielded two equations from Eq (49) by setting w + = 1. This left 4 free parameters, which were constrained by evaluating Eq (49) at 4 points as described in [97]. The basal firing rates were chosen to be fairly low to make the uniform state more stable (as in [44]). This procedure does not yield a fixed value for g σ , since g σ is optimized for and is not set as a target value. We thus iterated the following until a solution was found with g σ � 0.5: a) change the width of the recurrent weights w σ ; b) optimize network parameters as described here; c) optimize the expected bump shape for the new network parameters to predict g σ . The resulting parameter values are given in Table in S2 Table. Frozen noise Random and heterogeneous connectivity. Introducing random connectivity, we replace the recurrent weights in Eq (36) by: Here, p ij 2 {0, 1} are Bernoulli variables, with P(p ij = 1) = p, where the connectivity parameter p 2 (0, 1] controls the overall sparsity of recurrent excitatory connections. For p = 1 the entire network is all-to-all connected. Additionally, we provide derivations for additive synaptic heterogeneities D w ij ¼ Z ij s w (as in [38]), where {η ij |1 � i, j � N E } are independent, normally distributed random variables with zero mean and unit variance. We did not investigate this type of heterogeneity in the main text, since increasing σ w lead to a loss of the attractor state before creating large enough directed drifts to be comparable to the other sources of frozen noise considered here-most of the small effects were "hidden" behind diffusive displacement [85]. Nevertheless, we included this case in the analysis here for completeness.
Let the center position of the bump be φ k ¼ k 2p N À p (for 0 � k < N). Subject to the perturbed weights, the recurrent steady-state excitatory input J i (φ k ) Eq (36) to any excitatory neuron can be written as the unperturbed input J 0,i (φ k ) plus an additional input J struct i ðφ k Þ arising from the perturbed connectivity. Note that the synaptic steady-state activations s 0,j (φ k ) change with varying bump centers-in the following, we denote s k 0;j � s 0;j ðφ k Þ: |ffl ffl ffl ffl ffl ffl ffl ffl ffl {zffl ffl ffl ffl ffl ffl ffl ffl ffl } We calculate the firing rate perturbations δϕ i (φ k ) resulting from the additional input by a linear expansion around the steady-state firing rates ϕ 0,i (φ k ) ! ϕ 0,i (φ k ) + δϕ i (φ k ). These evaluate to: See Derivatives of the rate prediction for the derivation of the function dF dJ ðJ 0;i Þ for the spiking network used in the main text.
In the sum of Eq (7), we keep the firing rate profile� 0 centered at φ 0 while calculating the drift for varying center positions. To accommodate the shifted indices resulting from moving center positions, we re-index the summands to yields the perturbations ϕ 0,i ! ϕ 0,i + Δϕ i (φ k ) used there: Heterogeneous leak reversal potentials. We further investigated random distributions of the leak reversal potential V L . These are implemented by the substitution where the D L i are independent normally distributed variables with zero mean, i.e.
The parameter σ L controls the standard deviation of these random variables, and thus the noise level of the leak heterogeneities.
Let φ k ¼ k 2p N À p for 0 � k < N be the center position of the bump. First, note that the heterogeneities D L i do not depend on the center position φ k , since they are single neuron properties. As in the last section, we calculate the firing rate perturbations δϕ i (φ k ) resulting from the additional input by a linear expansion around the steady-state firing Here, dF i dD L i ðJ i ðφ k ÞÞ is the derivative of the input-output relation of neuron i in a bump centered at φ k , with respect to the leak perturbation. We introduced d� 0;i dD L ðφ k Þ as a shorthand notation for this derivative, since it is evaluated at the steady-state input J i,0 (φ k ). For the spiking network of the main text, this is derived in Derivatives of the rate prediction.
In the sum of Eq (7), we keep the firing rate profile� 0 centered at φ 0 while calculating the drift for varying center positions. As in the last section, we re-index the sum to yield the perturbations ϕ 0,i ! ϕ 0,i + Δϕ i (φ k ) used there: Squared field magnitude. Using the equation of the drift field in Eq (7), and the firing rate perturbations Eqs (51)- (54), it is straight forward to see that for any center position φ the expected drift field averaged over the noise parameters is 0, since all single firing rate perturbations vanish in expectation. In the following we calculate the variance of the drift field averaged over noise realizations, which turns out to be additive with respect to the two noise sources.
We begin by calculating the correlations between frozen noises caused by random connectivity and leak heterogeneities. For the Bernoulli distributed variables p ij it holds that hp ij i = p, hp ij p lk i = δ il δ jk p + (1 − δ il δ jk )p 2 . For the other independent random variables it holds that Again, the weight heterogeneities D w ij are only included for completeness-all analyses of the main text assume that σ w = 0.
For the correlations between the perturbations we then know that (for brevity, we omit the dependence on the center position φ): Starting from Eq (7), we use as a firing rate perturbation the sum of firing rate perturbations from both Eqs (51) and (54). With the pre-factor C i ¼ dJ 0;i dφ 1þt u � 0;i ðUt u � 0;i þ2Þ ðU� 0;i ðt u t x � 0;i þt u þt x Þþ1Þ 2 , the expected squared field averaged over ensemble of frozen noises is then: One can see directly that the two last terms are invariant under shifts of the bump center φ, since these introduce symmetric shifts of the indexes i. Similarly, it is easy to see that the first term is also invariant. Let φ 0 be shifted to the right by one index from φ. It then holds that: The final equation holds since, in ring-attractor networks, w EE ij consists of index-shifted rows of the same vector (see e.g. Network connectivity for the spiking network weights).
In summary, hA(φ) 2 i frozen will evaluate to the same quantity hA 2 i frozen for all center positions φ. In the main text, we use this fact to estimate hA 2 i frozen from simulations, by additionally averaging over the all center positions and interchanging the ensemble and positional averages: Thus, we can compare the value of hA 2 i frozen to the mean squared drift field over all center positions, averaged over instantiations of noises. System size scaling. Generally, sums over the discretized intervals [−π, π) as they appear in Eqs (5) and (7) will scale with the number N chosen for the discretization of the positions on the continuous ring φðiÞ ¼ i N 2p À p. Consider two discretizations of the ring, partitioned into N 1 and N 2 uniformly spaced bins of width 2p N 1 and 2p N 2 . We can then approximate integrals over any continuous (Riemann integrable) function f on the ring by the two Riemann sums: where, i 2p N 1 � φ 1;i < i þ 1 ð Þ 2p N 1 (for N 2 and φ 2,i analogously) are points in the bins [98]. Numerical quantities for the results of the main text have been calculated for N E = 800. In the following we denote all of these quantities with an asterisk (�). To generalize these results to arbitrary system size N, we replace sums over N bins by scaled sums over N E bins using the relation Eq (57): First, we find that the normalization constant scales as S ¼ N N E S � , and thus (dots indicate the summands, which are omitted for clarity) for the diffusion strength B (cf. Eq (5)): For the drift magnitude we turn to the expected squared drift magnitude calculated earlier (cf. Eq (56)), for which we find that (setting σ w ! 0 for simplicity, as throughout the main text): Note, that we could not resolve this scaling in dependence of hA 2 i � frozen , since the two sources of frozen noise (connectivity and leak heterogeneity) show different scaling with N.

Numerical methods
Spiking simulations. All network simulations and models were implemented in the NEST simulator [99]. Neuronal dynamics are integrated by the Runge-Kutta-Fehlberg method as implemented in the GSL library [100] (gsl_odeiv_step_rkf45)-this forward integration scheme is used in the NEST simulator for all conductance-based models (at the time of writing). The short-term plasticity model is integrated exactly, based on inter-spike intervals. Code for network simulations is available at https://github.com/EPFL-LCN/pub-seeholzer2018.
Simulation protocol. In all experiments (except those involving bi-stability, see below) spiking networks were simulated for a transient initial period of t initial = 500ms. To center the network in an attractor state at a given angle −π � φ < π, we gave an initial cue signal by stimulating neurons (0.2 � N E neurons for networks with facilitation parameter U > 0.1 and 0.18 � N E neurons for U � 0.1) centered at φ by strong excitatory input mediated by additional Poisson firing onto AMPA receptors (0.5s at 3kHz followed by 0.5s at 1.5kHz) with connections scaled down by a factor of g signal = 0.5. The external input ceased at t = t off = 1.5s. For simulations to estimate the diffusion we simulated until t max = 15s, yielding 13.5s of delay activity after the cue offset. For simulations to estimate drift we set t max = 8s, yielding 6.5s of delay activity after the cue offset.
For simulations exploring the bi-stability between the uniform state and a bump state ( Fig  1B1), we added an additional input prior to the spontaneous state. We stimulated simultaneously 20 excitatory neurons around 4 equally spaced cue points each (80 neurons in total, 500ms, 1.5kHz, AMPA connections scaled by a factor g signal = 2). This was applied to settle networks into the uniform state more stably-without this perturbation, networks sometimes approached the bump state after being uniformly initialized. In both figures, we show population activity only after this initial stimulus was applied.
Estimation of centers and mean bump shapes. To estimate centers of bump states, simulations were run until t = t max and spikes were recorded from the excitatory population and converted to firing rates by convolving them with an exponential kernel (τ = 100ms) [101] and then sampled at resolution 1ms. This results in vectors of firing rates ν j (t), 0 � j � N E − 1 for every time t. We calculated the population center φ(t) for time t by measuring the phase of the first spatial Fourier coefficient of the firing rates. This is given by φðtÞ ¼ arg For all analyses below, we identify t = 0 to be the time t = t off of the initial cue.
To measure the mean bump shapes, we first rectified the vectors ν j (t) for every t by rotating the vector until φ(t) = 0. We then sampled the rectified firing rates starting from 1s after cue offset at intervals of 20ms, which were used to calculate the mean firing rates. S1 Fig shows mean rates for each simulation averaged over the � 1000 repetitions performed in the diffusion estimation (below).
Exclusion of bump trajectories. Sometimes bump trajectories would leave the attractor state and return to the uniform state. We identified these trajectories in all experiments by identifying maximal firing rates across the population that dropped below 10Hz during the delay period. The such identified repetitions were excluded from the analyses, which occurred mostly in networks with no facilitation for τ x = 150ms, τ u = 650ms: at U = 1, we excluded 222/ 1000 repetitions from the diffusion estimation, while for all other U � 0.8 at most 15/1000 were excluded. Increasing the depression time constant also lead to less stable attractor states: for τ x = 200ms, τ u = 650ms and U = 0.8, we had to exclude 250/1000 repetitions. During the simulations for drift estimation, we observed that frozen noise also leads to less stable bumps under weak facilitation for random and sparse connectivity (p � 1) and high leak variability (σ L � 0). Diffusion estimation. Diffusion was estimated for each combination of network parameters by simulating 1000 repetitions (10 initial cue positions, 100 repetitions each) of 13.5s of delay activity. Center positions φ k (t) were estimated for each repetition k as described above. We then calculated for each repetition the offset relative to the position at 500ms by Δφ k (t) = φ k (t − 500ms) − φ k (500ms), effectively discarding the first 500ms after cue-offset. The timedependent variance of K repetitions (excluding those repetitions in which the bump state was lost, see above) was then calculated as VðtÞ ¼ 1 K P k Dφ 2 k ðtÞ. The diffusion strength can then be estimated from the slope of a linear least-squares regression (using the Scipy method scipy. stats.linregress [102]) to the variance as a function of time: V(t) � D 0 + D � t, where the intercept D 0 is included to account for initial transients. We estimated confidence intervals by bootstrapping [103]: sampling K elements out of the K repetitions with replacement (5000 samples) and estimating the confidence level of 0.95 by the bias corrected and accelerated bootstrap implemented in scikits-bootstrap [104]. As a control, we calculated confidence intervals for D additionally by Jackknifing: after building a distribution of estimates of D on K oneleft-out samples of all repetitions, the standard error of the mean can be calculated and is multiplied by 1.96 to obtain the 95% confidence interval [105]-confidence intervals obtained by this method were almost indistinguishable from confidence intervals obtained by bootstrapping.
Drift estimation. Drift was estimated numerically for each combination of network and frozen noise parameters by simulating 400 repetitions (20 initial cue positions, 20 repetitions each) of 6.5s of delay activity. Centers positions φ k (t) were estimated for all K repetitions (excluding those repetitions in which the bump state was lost, see above) as explained above.
We then computed displacements in time by computing a set of discrete differences where we chose dt = 1.5s and t 0 2 {500ms, 700ms, 900ms, . . ., 1900ms}. All differences are calculated with periodic boundary conditions on the circle [−π, π), i.e. the maximal difference was π/dt. We then calculated a binned mean (100 bins on the ring, unless mentioned otherwise) of differences calculated for all K trajectories, to approximate the drift-fields as a function of positions on the ring.
Mutual information measure. We are estimating the mutual information between a set of initial positions x 2 [0, 2π) and associated final positions y(x) 2 [0, 2π) of the trajectories of a continuous attractor network over a fixed delay period of T. For our results, we take T = 6.5s. We constructed binned and normalized histograms (with bin size n = 100, but see below) as approximate probability distributions of initial positions p i ¼ p i À 1 ½ � 2p n � x < i 2p n À � and all final positions q i ¼ p i À 1 ½ � 2p n � y < i 2p n À � (with bins indexed by 1 � i � n), as well as the bivariate probability distribution r ij ¼ p i À 1 ½ � 2p n � x < i 2p n ; j À 1 ½ � 2p n � yðxÞ < j 2p n À � . Using these, we can calculate the mutual information as [56,57] MI ¼ P n i¼1 P n j¼1 r ij log 2 r ij p i q j � � . Note, that the sum effectively counts only nonzero entries of r ij (trajectories that started in bin i and ended in bin j): these imply that p i 6 ¼ 0 (a trajectory started in bin i) and q j 6 ¼ 0 (a trajectory ended in bin j), which makes the sum well defined. Although the value of MI depends on the number of bins n, in Figs 5 and 6 we normalize MI to that of the reference network (U = 1, no frozen noise, see Short-term plasticity controls memory retention), which leaves the resulting plot nearly invariant under a change of bin numbers. Numerical integration of Langevin equations. Numerically integration of the homogeneous Langevin equations (Eq (4)) describing drift and diffusion of bump positions φ 2 [−π, π) (with circular boundary conditions) has been implemented as a C extension in Cython [106] to the Python language [107]. Since the drift fields A(φ) are estimated on a discretization of the interval [−π, π) into N bins, we first interpolate drift fields A given as N discretized values to obtain continuous fields-interpolations are obtained using cubic splines on periodic boundary conditions using the class gsl_interp_cspline_periodic of the Gnu Scientific Library [100].
For forward integration of the Langevin equation Eq (4) from time t = 0, we start from an initial position φ 0 = φ(t = 0). Given a time resolution dt (unless otherwise stated we use dt = 0.1s) and a maximal time t max we repeat the following operations until we reach t = t max : ffi ffi ffi ffi ffi ffi ffi dtB p � r; φ ! ððφ þ pÞ mod 2pÞ À p: Here, for each iteration r is a random number drawn from a normal distribution with zero mean and unit variance (hri = 0 and hr 2 i = 1). The last step is performed to implement the circular boundary conditions on [−π, π). Code implementing this numerical integration scheme is available at https://github.com/ EPFL-LCN/pub-seeholzer2018-langevin.
Distractor analysis. For the distractor analysis in Fig 7, we let 40 neurons centered at the distractor position φ D ¼ 360 � N j À 180 � fire at rates increased by 20Hz, yielding a vector of firing rate perturbations Δϕ 0,i = 20Hz if |i − j| � 20 and Δϕ 0,i = 0Hz otherwise. The vectors Δϕ 0,i for each distractor position φ D are then used in Eq (7) to calculate the corresponding drift fields.
To calculate the final position φ 1 after 250ms of presenting the distractor, we generate 1000 trajectories starting from φ 0 = 0 by integrating the Langevin equation Eq (4) for 250ms (dt = 0.01), the final positions of which are used to measure mean and standard deviation of φ 1 . For the broader bump in Fig 7D, we stretched (and interpolated) the firing rates ϕ 0 as well as the associated vectors J 0 and � 0 0 along the x-axis to obtain vectors for bumps of the desired width, and then re-calculated the values of dJ 0 dφ .
Supporting information S1 Fig. Spiking networks produce similar stable firing rate profiles across parameters. For each choice of short-term plasticity parameters U, τ u , and τ x , we tuned the recurrent conductances (g EE , g EI , g IE , g II ) and the width σ w of the distance-dependent weights (cf. Eq (32)) such that the "bump" shape of the stable firing rate profile is close to a generalized Gaussian n y ð Þ ¼ g 0 þ g 1 exp À � jyj g s � g r � � with parameters g 0 = 0.1Hz, g 1 = 40.0Hz, g σ = 0.5, g r = 2.5. See Optimization of network parameters in Materials and methods for details, S2 Table. for parameter values after tuning, and S1 Table. for parameters that stay constant. A After tuning, the resulting firing rate profiles for different parameter values of U and τ u are very similar. Averaged mean firing rates in bump state, measured from � 1000 spiking simulations. A1-A3 Remaining slight parameter-dependent changes of bump shapes, measured by fitting the generalized Gaussian ν(θ) to the measured firing rate profiles displayed in A. A1 Top firing rate  (7)) and fields extracted from simulations (mean over 18-20 networks, error bars show 95% confidence of the mean). Both frozen noise parameters (σ L and 1 − p) are plotted on the same x-axis. B Normalized RMSE: each RMSE is normalized by the range (max − min) of the joint data of simulated and predicted fields it is calculated on. Colors as in A. C Average RMSE (same data as in A) plotted as a function of the mean expected field magnitude (estimated separately for each network, then averaged). Colors as in A. D Worst (top) and best (bottom) match between predicted field (blue line) and field extracted from simulations (black line) of the group with the largest mean RMSE in panels A, C (U = 1, 1 − p = 0.75). Shaded areas show 1 standard deviation of points included in the binned mean estimate (100 bins) of the extracted field.  (5) with bump solutions � 0 ¼ g 1 expðÀ j x g s j g r Þ. The values of dJ 0 dφ and � 0 0 were calculated by fitting and extrapolating (linearly, for ϕ 0 > 40.31Hz) curves � 0 ! � 0 0 and ϕ 0 ! J 0 that were obtained from the numerical values extracted for g 1 = 40.31Hz, g σ = 0.51 by theory (see Firing rate approximation in Materials and methods). Thus, any nonlinearity or saturation of the inputs and input-output relation for ϕ 0 > 40.31Hz was not included. This approximate analysis shows that the major dependence of the diffusion expected in the system is on the bump width g σ , although a minor dependence on g 1 is seen.  [92]) for two attractor networks with different STP parameters. All measures were computed on spike-trains measured over a period of 4s, recorded 500ms after offset of the external input which was centered at angle 0. Across STP parameters, networks display similarly reduced CVs for increased mean firing rates, leading to large CVs for neurons located in the flanks of the firing rate profile and low CVs for neurons located near the center. A Networks with large diffusion coefficient (U = 0.8, τ u = 650ms, τ x = 200ms) that underwent non-stationary diffusion during the recording of spikes: the measured mean firing rates (gray line) differ visibly from the firing rates estimated after centering the firing rate distribution at each point in time. Due to this non-stationarity, CVs at intermediate firing rates appear elevated, while the local CV (CV 2 ) shows values close to stationary networks (see B). B The same network as in A, with strong facilitation (U = 0.1). Reduced diffusion leads to a nearly stationary firing rate profile, and coincident CV and  (7) shows zero crossings as τ x is increased beyond facilitation-dependent critical values. B Diffusion strength B of Eq (5) without the normalization constant (equal to B � S 2 ). C The full diffusion strength B of Eq (5) shows diverging values at the same critical points of tau x . Color legend on the right hand side shows values of U. (TIF) S1 Text. Detailed mathematical derivations. (PDF) S1 Table. Parameters for spiking simulations. Parameter values are modified from [11] and [50]. For recurrent conductances see the table in S2 Table. (PDF) S2 Table. Conductance and connectivity parameters for spiking simulations. For all networks we set w + = 4.0. Recurrent conductance parameters are given for combinations of shortterm plasticity parameters according to the following notation. g EE : excitatory conductance g E on excitatory neurons; g IE : excitatory conductance g E on inhibitory neurons; g EI : inhibitory conductance g I on excitatory neurons; g II : inhibitory conductance g I on inhibitory neurons. (PDF)