Optimal information loading into working memory explains dynamic coding in the prefrontal cortex

Significance. The prefrontal cortex (PFC) is known to play a key role in working memory maintenance. However, the PFC has been shown to exhibit unexpectedly rich and complex dynamics during even the simplest working memory tasks—a puzzling phenomenon known as “dynamic coding.” Using mathematical analyses and simulations of task-optimized neural networks, we develop a theory of optimal loading of stimulus information for working memory maintenance and show that dynamic coding in fact naturally arises from this principle. We develop a direct neural measure of optimal information loading, with which we confirm a key prediction of the theory in neural recordings from monkey PFC. Our results show that dynamic coding is a fundamental and functionally useful feature of working memory maintenance.


S1 Supplementary Figures
Supplementary Fig. 1 | Dynamics of network models of working memory. a, Neural network dynamics in a bump attractor network (1) performing the task shown in Fig. 1a. Left: trajectory in neural state space in a single cue condition during the cue period (pale purple line, ending in pale purple circle) and delay period (dark purple line). Purple arrowheads indicate direction of travel along the trajectory, black cross shows attractor state. Center: time course of relative (i.e. mean-centered) firing rates of one neuron for two cue conditions (purple vs. blue, see also inset). Yellow lines indicate cue onset and offset times. Right: cross-temporal decoding of neural activity produced by the network across all 6 cue conditions. Pink rectangles indicate generalized decoding between the cue/early delay period and the late delay period and cyan square indicates generalized decoding between time points in the late delay period. The gray tick on the color bar indicates chance-level decoding. b, Same as a but for a discrete attractors model (1-3). c, Same as a but for a linear integrator model with transient dynamics that are orthogonal to the attractor subspace (4). d, Same as a but for a feedforward network model (5,6). e, Same as a but for a network whose parameters (including recurrent, input, and readout weights) were optimized to perform the task shown in Fig. 1a.

Supplementary Fig. 2 | Attractor network dynamics with or without constraints on the initial condition of the dynamics. a, Illustration of an attractor network with symmetric connections. b-d, Analysis of neural responses in symmetric attractor networks (such as shown in a). b, Sub-threshold activity (colored trajectories) for all 6 cue conditions (color coded as in Fig. 5e) with initial conditions optimized within the full state space (Methods S2.3.1). Open circles show the optimized initial conditions and crosses show stable fixed points. We show neural activity projected onto the top two principal components of the persistent subspace. c, Analysis of neural responses when initial conditions are constrained to lie within the 5-dimensional persistent subspace. Top left: distribution of Pearson correlations between initial and final mean-centered neural firing rates across all 6 cue conditions and 10 networks (same as Fig. 2c, but for persistent subspace-constrained inputs, corresponding to green line in Fig. 2f). Top right: sub-threshold activity for 2 cue conditions in an example network (colored trajectories; same as Fig. 2d, but for persistent subspace-constrained inputs, corresponding to green line in Fig. 2f). Open circles (with arrows pointing to them from the origin) show the optimized initial conditions, black crosses show stable fixed points, dashed gray line is the identity line. Horizontal axis (persistent PC1) shows neural activity projected on to the 1st principal component (PC1) of network activities at the end of the delay period (across the 2 conditions shown), vertical axis (initial PC1 (orthogonalized)) shows projection to PC1 of initial neural activities orthogonalized to persistent PC1. Bottom: same as b, but for persistent subspace-constrained inputs, corresponding to green line in Fig. 2f. d, Same as c, but for persistent nullspace-constrained inputs. Note that the distribution of Pearson correlations of neural firing rates (top left) is distinct from a delta function at 0 because we constrained the initial conditions in the space of sub-threshold activities (rather than firing rates). In the bottom panel, which shows sub-threshold activity, we see that indeed all the colored circles overlap at the origin, indicating near-orthogonality of the initial conditions to the persistent subspace. e-h, Same as a-d but for attractor networks without a symmetric connection constraint (i.e. panels f, g, and h, respectively correspond to the networks shown by the black, green, and red lines in Fig. 2). Note initial conditions being near the origin in f mean that they are near-orthogonal to the persistent subspace (as in d, but without constraining them explicitly to be in the persistent nullspace).

Supplementary Fig. 3 | a, … for 6 cue conditions (color coded as in Fig. 5e). Open circles show the optimized initial conditions and black crosses show fixed points. We show neural activity projected onto the top two principal components of the persistent subspace. Thus, all circles being near the origin means that initial conditions are near-orthogonal to this subspace (cf. Supplementary Fig. 2f). b, Tuning curves at t = 1 s for 6 example neurons (colored curves) whose preferred angles (colored crosses) correspond to the 6 cue conditions shown in a. c, Distribution of Pearson correlations between initial and final mean-centered neural firing rates across the 6 cue conditions and 10 networks (cf. Fig. 2i). d, Cross-temporal decoding of neural firing rate activity (cf. Fig. 2k). Note that only the first second of the delay period is shown on both axes because the dynamics of these networks, using a tanh nonlinearity, are faster than those shown in other figures (e.g. Fig. 2), using a ReLU nonlinearity (but the same time constant; Methods S2.2, and Table S1). e, Overlap (mean±1 s.d. across 10 networks) of the 2 locally most persistent (green), most amplifying (red), or random directions (black), obtained using a local linearization around the origin, with the 'persistent subspace' and 'persistent nullspace' of the original nonlinear dynamics, obtained without linearization, and the subspace spanned by the 'optimal' initial conditions of the original nonlinear dynamics (cf. Supplementary Fig. 7a, bottom; see Methods S2.4.2 and S2.7.3). We used 2-dimensional subspaces from the local linearization because we found empirically that the ring attractor lay in a 2-dimensional subspace (see also a).

Supplementary Fig. 4 | Analysis of the total energy produced by different initial conditions in linear networks. a, The norm of neural activity integrated over time (i.e. a measure of total energy used by the network) for each of 1000 random initial conditions (10 initial conditions for each of 100 networks of 100 neurons) relative to the energy produced by the most amplifying initial condition, plotted as a function of their overlap with the persistent mode for symmetric (top) and unconstrained (bottom) linear integrator networks. A positive value on the y-axis means that the total energy produced by the given random initial condition is greater than that produced by the most amplifying initial condition. Initial conditions are scaled so that they all produce the same level of persistent activity (i.e. the same level of performance) after 2 s of simulation. b, Same as a, but initial conditions are plotted as a function of their overlap with the most amplifying mode. Note that overlap with the most amplifying mode (but not in general with the most persistent mode) is strongly predictive of total energy (with an inverse relationship between the two). c, Overlap (mean±1 s.d. across the 100 networks from a and b) of optimal initial conditions (Eq. S74), producing an overlap of 1 with the persistent mode after a given delay length (x-axis) while using the minimal total energy over time (Eq. S73), with either persistent (green), most amplifying (red), or random (black) directions, for symmetric (top) and unconstrained (bottom) networks. In unconstrained networks, for very short delay lengths, initial conditions must align exactly with the persistent mode, by necessity (green lines at 0 s). For longer delay lengths, initial conditions make greater use of the most amplifying direction (red lines). d, Total energy (mean across the 100 networks from a and b; we do not show error bars for visual clarity) for dynamics starting from initial conditions that produce an overlap of 1 with the persistent mode after a given delay length (x-axis) in symmetric (top) and unconstrained (bottom) networks. Initial conditions were chosen to be optimal (blue; i.e. using the least energy, cf. panel c), or aligned with the most persistent (green), most amplifying (red), or a random direction (black). In unconstrained networks, for very short delay lengths, initialising along the most persistent mode achieves near-optimal energy-efficiency (green is close to blue), but for longer delay lengths, initialising along the most amplifying mode becomes more energy efficient (red is closer to blue). (Note that for symmetric networks, top, we have offset the curves for the most amplifying, persistent, and optimal directions because these 3 directions are the same and therefore produce the same total energy.)

Supplementary Fig. 5 | Analysis of linear networks of different sizes. a, Distributions of absolute overlap with the persistent mode for persistent (pale green), most amplifying (pale red), or random initial conditions (gray) across 100 randomly sampled linear symmetric (top) and unconstrained networks (bottom) consisting of either 10 (solid), 100 (dashed), or 1000 (dotted) neurons (cf. Fig. 3d). The persistent initial conditions produced delta functions at 1 (arrows).
Results for persistent and most amplifying initial conditions are identical in symmetric networks (top). b, Time course of mean (across the 100 networks from a) absolute overlap with the persistent mode when starting network dynamics from persistent (green), most amplifying (red), or random initial conditions (black) in symmetric (top) and unconstrained networks (bottom) consisting of either 10 (solid), 100 (dashed), or 1000 (dotted) neurons (cf. Fig. 3e). Results for persistent and most amplifying initial conditions are identical in symmetric networks (top). c, Mean (across 100 networks) overlap of initial conditions that were optimized so as to generate maximal persistent activity in 100-neuron noisy symmetric (top) and unconstrained (bottom) networks with 100 orthogonal modes ordered by their persistence (green) or amplification (red) (i.e. corresponding to the rank ordered eigenvectors of the weight matrix, green, or of the observability Gramian of the dynamics, red; Methods S2.7.1). In symmetric networks (top), the optimized initial conditions overlap only with the most amplifying mode and no other mode (note that the most persistent mode is identical to the most amplifying mode in this case). In unconstrained networks (bottom), optimized initial conditions overlap strongly with the most amplifying mode and only weakly with other modes. (The non-zero overlap with the most persistent mode is simply due to the fact that there is a non-zero overlap between the most persistent and amplifying mode in random networks, and it is at the level that would be expected based on this overlap.) d, Time course of mean (across 100 networks) absolute overlap with the persistent mode for the same 100-neuron noisy symmetric (top) and unconstrained networks (bottom) as those shown in c when the network is started from optimized initial conditions (blue), and for comparison for the most amplifying (red dashed) initial conditions (cf. Fig. 3e). Note the close agreement between the two indicating that the most amplifying mode is indeed optimal in these networks. Horizontal black bar on x-axis shows the time period in which we applied the cost function to optimize the initial conditions (Methods S2.7.3).

Supplementary Fig. 7 | Linear analyses of the nonlinear attractor networks of Fig. 2. a, … of Fig. 2, obtained using a local linearization around the origin, with the 'persistent subspace' and 'persistent nullspace' of the original nonlinear dynamics, obtained without linearization (as used in Fig. 2f and l, red and green), and the 5-dimensional subspace spanned by the 6 'optimal' initial conditions of the original nonlinear dynamics (used in Fig. 2b-e, h-k, and f and l, black). For comparison, we also show the overlap (mean±1 s.d. across 100 networks) of the single most persistent (pale green), most amplifying (pale red), and random (gray) direction with the optimal initial condition of the linear networks from Supplementary Fig. 5c,d ('optimal (lin. model)'). b, Time course of the overlap (mean±1 s.d. across 10 networks, s.d. only shown in top panel for visual clarity) of the linearized dynamics of symmetric (top) and unconstrained networks (bottom) with the subspace spanned by their most persistent modes when started from initial conditions that were optimized for the decoding accuracy of the nonlinear dynamics while constrained to be within the locally most persistent (green), most amplifying (red), or a random subspace (black). The linear dynamics, the persistent subspace wrt. which overlap is measured, and the subspaces within which initial conditions were constrained while being optimized, were all based on a local linearization of the nonlinear dynamics around the origin. Compare with Fig. 3e for the analogous plots for linear networks. For reference, blue line shows overlap of the same linearized dynamics when started from the initial conditions directly optimized for the decoding accuracy of the nonlinear dynamics without subspace constraints (used in Fig. 2b-e, h-k, and f and l, black). For consistency with Fig. 3b-e (where initial conditions were constrained to have unit norm), we scaled activity by the norm of the initial condition (which was constrained to be 3 here; Methods S2).

Supplementary Fig. 7 | Linear analyses of the nonlinear attractor networks of Fig. 2 (cont'd). d, Percent variance of responses explained (mean±1 s.d. across 10 networks) by the subspace spanned by either the 25% (i.e. 5) most persistent (green) or 25% (i.e. 5) most amplifying (red) modes as a function of time for 20-dimensional linear neural networks fitted to the neural responses generated by the symmetric (top) and unconstrained (bottom) nonlinear networks when started from the same (optimized) initial conditions analyzed in b-c: constrained to be within the locally most persistent (far left), most amplifying (center left), or a random subspace (center right), as determined by the local linearization of the dynamics, or without subspace constraints (far right). Gray lines show chance level overlap defined as the expected overlap with a randomly chosen subspace occupying 25% of the full space (i.e. 5 dimensions). Compare with Fig. 4c for the analogous plots for linear networks (though with non-instantaneous inputs, and performance-matched levels of noise, see also Section S7) and with Fig. 5f and Supplementary Fig. 10f,g for analogous plots of linear neural networks fitted to experimental data.
[Figure panel titles: integrator model (Murray et al., 2017); feedforward model (Ganguli et al., 2008)]

Supplementary Fig. 10 | Supplemental analysis of experimental data and comparison to models. a, Cross-temporal decoding analysis for monkey K (cf. Fig. 5c for the same analysis for monkey T and for explanation of plotting scheme and annotations). b, Subspace overlap between different task epochs, measured as the percent variance explained … 5f) and the control analyses shown in panels f and g of this figure (rows 5-6). Top 3 rows show predictions for the sign of each comparison under different information loading strategies in unconstrained linear networks (Fig. 4c): using inputs aligned with random directions (1st row), persistent directions (2nd row), or the most amplifying directions (3rd row). In the column headings, pers., amp., and ch. respectively refer to overlap with most persistent, most amplifying and random subspaces (chance), t_0 refers to the beginning of the analysis time window, i.e. cue onset (rows 1-5) or 1 s before the timing of the go cue (row 6), and t_1 = t_0 + 1 s refers to the end of the analysis time window. The colored numbers above each column correspond to the comparisons shown in the inset above the table. Gray indicates no significant difference between data points, red and blue indicate a significant difference for both monkeys where the first data point is respectively greater or smaller than the second data point, and pale red indicates a significant difference for one of the two monkeys (see Methods S2.8). … 12d, Supplementary Fig. 13d, and Supplementary Fig. 10b). Diagonal elements show the PVE within each task epoch. We show results for cue-delay (left two panels) and just-in-time trained networks (right two panels) trained with either a regularisation strength of α^(2)_nonlin = 0.00005 (left panel for each model, as in Fig. 6) or α^(2)_nonlin = 0.0005 (right panel for each model, as in panels a-b). d, Neural activity plotted in the top two PCs of delay-epoch activity for all 6 initial conditions for cue-delay and just-in-time trained networks for each of the network-regularization combinations shown in c (cf. Supplementary Fig. 2b-d and f-h).

S2.1 Experimental materials and methods
Experimental methods have been described before (8) and largely followed those used in our previous publications (9-11). We briefly summarize the methods below.

S2.1.1 Subjects and apparatus
We used two female macaques (monkey T, Macaca mulatta, 5 kg; monkey K, Macaca fuscata, 8 kg). Both monkeys were housed individually. Prior to this study, monkey K participated in another prefrontal neural recording study with different objectives using different tasks from the present study (12). Monkey K was labelled as monkey C in that prior study. The light/dark cycle was 12/12 hr (light, from 8:30 a.m. to 8:30 p.m.). The monkeys sat quietly in a primate chair in a dark, sound-attenuated shielded room. During both training and neural recording sessions, we restrained the monkeys' head movement non-invasively using a thermoplastic head cap as described in (13). This head cap is made of a standard thermoplastic splint material (MT-APU, 3.2 mm thick, CIVCO Radiotherapy, IA, USA), and was molded so that it conformed to the contours of the animals' scalp, cheek bone, and occipital ridge. Visual stimuli were presented on a 17-inch TFT monitor placed 50 cm from the monkeys' eyes. Eye movements were sampled at 120 Hz using an infrared eye tracking system (ETL-200, ISCAN, MA). Eye fixation was controlled within a 6.5° imaginary square window. TEMPO software (Reflective Computing, WA) was used to control behavioral tasks. All experimental procedures were approved by the Animal Research Committee at the Graduate School of Frontier Biosciences, Osaka University, Japan and were in full compliance with the guidelines of the National BioResource Project 'Japanese Macaques'. Experimental work performed in non-human primates that was not funded by Wellcome may not adhere to the principles outlined in the NC3Rs guidance on Non-human Primate Accommodation, Care and Use.

S2.1.2 Behavioral task
The monkeys were trained on a memory-guided saccade task requiring them to remember the location of a visual stimulus cue on a screen and to make a correct eye movement after a delay period (Fig. 1a). Specifically, this task required monkeys to fixate on a central ring for a period of 2.6-7.4 s, followed by a stimulus cue (a white square) appearing in one of six predetermined locations for 0.25 s. After a variable delay period of 1.4-7.5 s, the fixation ring was replaced by placeholders at all six possible stimulus cue locations (go cue). Monkeys were required to make a saccade within 0.5 s to the placeholder where the original stimulus cue was presented and maintain their gaze for 0.25 s for monkey T and either 0.25 s or 0.6 s for monkey K (these two gaze maintenance times were switched in different blocks for monkey K) to receive a juice reward. The monkeys were extensively trained, with close to perfect performance (monkey T, 96.1%; monkey K, 96.3%; mean across sessions). Fixation break errors were excluded from the calculation of the percent correct rate.

S2.1.3 Recordings
After training was completed, we conducted an aseptic surgery under general anesthesia. We stereotaxically implanted a plastic recording chamber on the lateral surface of the prefrontal cortex, under the guidance of structural MRI images (Supplementary Fig. 9). In monkey T, we implanted a cylindrical chamber (RC-T-S-P, internal diameter 12.7 mm, Gray Matter Research, MT) in the right hemisphere (AP = 33, ML = 14.5; AP, anterior-posterior; ML, medio-lateral). A 32-channel semi-chronic microdrive system (SC-32, Gray Matter Research) was mounted inside this chamber. In monkey K, we implanted a cuboid chamber (width 12 mm, depth 16 mm, height 15 mm, S-company ltd., Tokyo, Japan) over the principal sulcus in the left hemisphere.
We collected neural data in a total of 48 daily sessions (21 in monkey T; 27 in monkey K). In monkey T, we used the 32-ch microdrive (SC-32) that housed 32 single-contact tungsten electrodes with inter-electrode spacing of 1.5 mm. In monkey K, we used a 32-ch linear microelectrode array (Plexon U-Probe, Plexon, TX) with an inter-electrode spacing of 150 µm along a single shaft. We positioned the U-Probe by using a custom-made grid (width 12 mm, depth 16 mm, height 10 mm) which had a total of 165 holes with 1 mm spacing. We advanced the U-Probe by a custom-made hydraulic microdrive (S-company ltd.).
Raw extracellular neural signals were amplified and recorded in reference to a titanium bone screw at the vertex (in monkey T) or the shaft of the linear array (monkey K) using a neural signal amplifier (RZ2 BioAmp Processor, Tucker-Davis Technologies, FL). Behavioral data (task-event information and eye-movement information) were also sent to the RZ2 BioAmp. Neural data acquisition was performed at a sampling frequency of 24,414.08 Hz, and behavioral data acquisition at 1,017.25 Hz. For analysis of spiking activity, the raw neural signal was filtered (300 Hz to 6 kHz) for offline sorting (Offline Sorter, Plexon). In monkey T, approximately three hours before each recording session, we took the monkey to the testing room and advanced each electrode in the SC-32 by a minimum of 62.5 µm in order to ensure recording of new neurons. We then put the monkey back in the home cage until we brought it out again for the recording session. In monkey K, we adopted the method of U-Probe insertion reported in (14). We first punctured the dura using a guide tube (a shortened 23-gauge needle), and inserted the U-Probe array slowly, usually in steps of 500 µm. We continuously monitored the electrocardiogram (pulsatile fluctuation) on superficial electrodes to identify the point of cortical entry. We usually left 3-5 superficial channels outside the cortex. After array insertion, we waited 1-1.5 hours until the recorded single-unit and multi-unit activities indicated that the electrode array was stably positioned in the cortex. While waiting, the monkey watched nature and animal video clips and received a small snack in a monkey chair.
In monkey T only, to determine location of the frontal eye field (FEF), and confirm that our recording area was outside it, intracortical microstimulations (22 biphasic pulses, 0.2 ms duration at 333 Hz, ≤ 150 µA) were applied through microelectrodes.When eye movements were elicited below 50 µA, the site was considered to be in the low-threshold FEF.In monkey T, our recording area did not include the low-threshold FEF.

S2.1.4 Pre-processing
We excluded neurons that were recorded in fewer than 10 trials for any cue condition. For each monkey, we pooled neurons from all recording sessions to create pseudopopulations of 438 neurons for monkey K (after we removed 1 neuron from monkey K's dataset due to an insufficient number of trials) and 625 neurons for monkey T (no neurons were removed from monkey T's dataset). To compute neural firing rates, we convolved spike trains with a Gaussian kernel with a standard deviation of 25 ms. Trial-averaged trajectories of time-varying mean firing rates were computed separately for each neuron and each cue condition. For analysis methods that used cross-validation (see below), we split trials into separate train and test sets with a 1:1 train:test ratio, and computed trial-averaged trajectories separately for the training and test sets. For non-cross-validated analyses, we either computed trial averages based on all the data, or on a subset of the data (see below). We aligned neural activity to either stimulus or go cue onset (see also below in Methods S2.7) and shifted activity by -50 ms to allow for the delay in time for information about these cues to enter PFC. For consistency with our simulations (see below), we subsampled neural firing rates at a 1-ms time resolution.
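The smoothing step above can be sketched as follows (a minimal illustration, not our analysis code; we assume 1-ms bins and truncate the Gaussian kernel at ±4 standard deviations, and all names are ours):

```python
import numpy as np

def smooth_spikes(spikes, sigma_ms=25.0, dt_ms=1.0):
    """Convolve a spike train (counts per 1-ms bin) with a Gaussian kernel.

    The kernel (SD = sigma_ms) is truncated at +/-4 SD and normalized so
    that the result is a firing-rate estimate in spikes/s.
    """
    half = int(4 * sigma_ms / dt_ms)
    t = np.arange(-half, half + 1) * dt_ms
    kernel = np.exp(-t**2 / (2 * sigma_ms**2))
    kernel /= kernel.sum() * (dt_ms / 1000.0)  # unit area in units of spikes/s
    return np.convolve(spikes, kernel, mode="same")

# example: a single spike at 500 ms in a 1-s spike train
spikes = np.zeros(1000)
spikes[500] = 1
rate = smooth_spikes(spikes)  # peaks at 500 ms; integrates to ~1 spike
```

Trial-averaged trajectories then follow by averaging such smoothed rates over trials within each cue condition.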

S2.2 Neural network models: overview
All our simulated networks (Figs. 2-4 and 6 and Supplementary Figs. 2-5, 7 and 11-13) evolved according to a canonical model of stochastic recurrent neural circuit dynamics (15,16):

τ dx^(c)(t)/dt = -x^(c)(t) + W r^(c)(t) + m_h(t) h^(c) + m_g(t) g + b + σ η^(c)(t)    (S1)
r^(c)(t) = f(x^(c)(t))    (S2)

where x^(c)(t) corresponds to the vector of (unitless) trial-averaged raw somatic membrane potentials of the N neurons of the network (15) in cue condition c = 1, …, C (initialised at x^(c)(t_0) at the beginning of the simulation t_0, which could be at or before stimulus onset at t = 0), r^(c)(t) is the vector of their momentary firing rates, with f(x) being the activation function that converts membrane potentials to firing rates, τ is the effective time constant of the cells, W is the recurrent weight matrix (shown e.g. in Fig. 2a and g), h^(c) is the input given to the network depending on the stimulus cue, g is the stimulus-cue-independent go cue input that occurs at the go time t_go, m_h(t) and m_g(t) are boxcar 'masking' kernels such that the stimulus and go cues are only effective within a limited period at the beginning and end of the trial, respectively, b is a cue-independent bias, σ is the standard deviation of the noise process, and η^(c)(t) is a sample from a standard (mean 0 and variance 1) Gaussian white (temporally and spatially) noise process.
Networks shown in different figures corresponded to different special cases of Eqs. S1 and S2 (see Table S1). Specifically, for linear networks f(x) = x was the identity function. For nonlinear networks f_i(x) = [x_i]_+ was the rectified linear (ReLU) activation function applied element-wise (except for the ring attractor networks where we used f_i(x) = tanh(x_i); Supplementary Fig. 3). Given that the focus of our study was optimal information loading, stimulus inputs were either optimized numerically (Fig. 2, Fig. 6, Supplementary Figs. 2 and 3, Supplementary Fig. 5c,d, and Supplementary Figs. 11-13), or set to analytically computed values as dictated by our mathematical analysis (Figs. 3 and 4, Supplementary Fig. 4c,d, and Supplementary Fig. 5a,b), or, as a baseline, set to random values (Figs. 3 and 4, Supplementary Fig. 4a,b, and Supplementary Fig. 5a,b). For networks used to study the effects of instantaneous initial conditions (Figs. 2 and 3 and Supplementary Figs. 2-5 and 7), the stimulus masking kernel was zero and instead the initial condition was set to the stimulus input; for other networks (Fig. 4, Fig. 5e-f, Fig. 6 and Supplementary Figs. 11-13) the stimulus masking kernel was a boxcar between 0 and 0.25 s. For task-optimized networks (Fig. 6 and Supplementary Figs. 11-13), the go cue masking kernel m_g(t) was a boxcar starting at go cue onset, t_go, and lasting for 0.5 s; for all other networks it was set to 0 everywhere. The networks used to analyse the dynamics of information loading (Fig. 3, Supplementary Fig. 4, and Supplementary Fig. 5a,b) were deterministic by setting σ = 0; all other networks used noisy dynamics (see Table S1).

Table S1 | Parameters used in the simulations of our models. a For nonlinear networks, f_i(x) = [x_i]_+ was the rectified linear (ReLU) activation function. For linear networks f_i(x) = x_i. The only exception to this was when we created ring attractor networks (Supplementary Fig. 3) in which we used a tanh nonlinearity f_i(x) = tanh(x_i). See also text.
b See text for details. c Inputs were optimized either with both a norm constraint and an overall energy constraint (c_1); only an overall energy constraint (c_2); only a norm constraint (c_3); or so that the dynamics produced the mathematically minimal overall energy (c_4, see Section S5). See text for more details.
d For the symmetric network, we used W = [0.375, 0.625; 0.625, 0.375]; for the unconstrained network, we used a weight matrix with θ = 0.1. e For the symmetric networks, we enforced W ← (1/2)(W + Wᵀ). For all networks we also shifted the obtained weight matrix by the identity matrix multiplied by a constant so that the largest real part in the eigenvalues of W is exactly 1 (i.e., the largest eigenvalue of the associated Jacobian would therefore be 0 due to the leak term), and we rejected any W's for which the top eigenvalue (the eigenvalue with largest real part) had an imaginary component. For Fig. 4, to provide a slightly better agreement between the model dynamics and the experimental recordings, we rejected any W's for which the inner product between the most amplifying mode and persistent mode was greater than 0.2 (i.e. we only kept W's that were relatively mathematically non-normal). f We used 3 possible input directions (which all had a Euclidean norm of 1): inputs either aligned with the most persistent mode (x_p), the most amplifying mode (x_a), or a random direction (x_r). For the symmetric model, x_p = x_a = [0.707, 0.707]ᵀ and we used x_r = [0.98, 0.18]ᵀ. For the unconstrained model, x_p = [0.995, 0.0998]ᵀ, x_a = [0.1537, 0.9881]ᵀ, and we used a random direction x_r. g In the table, t_go refers to the timing of the go cue (see text).
h For training, we used C = 36 cue conditions. For our subsequent analyses (Supplementary Fig. 3), we used C = 6 cue conditions to be consistent with the other models.
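The weight-matrix preparation described in footnote e can be sketched as follows (our own illustration with illustrative function names; the N(0, 1/N) i.i.d. sampling is an assumption based on the initialization described in Methods S2.3.3):

```python
import numpy as np

def make_weight_matrix(n, symmetric=False, seed=0):
    """Sample W with i.i.d. N(0, 1/n) entries, optionally symmetrize, then
    shift by a multiple of the identity so the largest real part among the
    eigenvalues of W is exactly 1 (so the Jacobian of the leaky dynamics,
    W - I, has a top eigenvalue of 0). Returns None ("reject") if the top
    eigenvalue has an imaginary component."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    if symmetric:
        W = 0.5 * (W + W.T)  # enforce W <- (1/2)(W + W^T)
    eig = np.linalg.eigvals(W)
    top = eig[np.argmax(eig.real)]
    if abs(top.imag) > 1e-10:
        return None  # reject W: top eigenvalue is complex
    return W + (1.0 - top.real) * np.eye(n)

W = make_weight_matrix(100, symmetric=True)  # symmetric => real spectrum
```

In practice one would resample until a W passes the rejection test; symmetric matrices always pass since their spectrum is real.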
We solved the dynamics of Eqs. S1 and S2 using a first-order Euler-Maruyama approximation between t_0 and the simulation end time, t_max, with a discretization time step of 1 ms.
For analysis methods that used cross-validation (see below), for each cue condition, we simulated network dynamics twice with independent realizations of η^(c)(t), to serve as train and test data. For other analyses, we used a single set of simulated trajectories. All analyses involving networks with randomly generated (or initialized) connectivities that also did not require re-fitting their responses with other networks (Fig. 2, Fig. 4, Fig. 6c,d, Supplementary Figs. 1-3, Supplementary Fig. 7a-c, and Supplementary Figs. 11-13) were repeated a total of n = 100 times, consisting of 10 different networks and 10 different simulations. For those analyses that did require the re-fitting of nonlinear networks' responses with linear deterministic networks (Supplementary Fig. 7d and Fig. 6e), we used one simulation of the original (stochastic nonlinear) network, so n = 10 simulations in total.
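The integration scheme above (Eqs. S1-S2 with a boxcar stimulus mask, solved by first-order Euler-Maruyama at 1-ms steps) might be sketched as follows; this is our illustration (the go-cue term is omitted, and the exact noise-scaling convention is our assumption):

```python
import numpy as np

def simulate(W, h, b, tau=0.1, dt=1e-3, t_max=2.0, sigma=0.1,
             cue_on=(0.0, 0.25), f=lambda x: np.maximum(x, 0.0), seed=0):
    """Euler-Maruyama integration of
        tau dx/dt = -x + W f(x) + m_h(t) h + b + sigma * noise,
    where m_h is a boxcar gating the stimulus input h during `cue_on`.
    f defaults to the ReLU used for our nonlinear networks.
    Returns the (n_steps, N) array of membrane potentials over time."""
    rng = np.random.default_rng(seed)
    n_steps = int(t_max / dt)
    x = np.zeros(W.shape[0])
    xs = np.empty((n_steps, x.size))
    for k in range(n_steps):
        t = k * dt
        m_h = 1.0 if cue_on[0] <= t < cue_on[1] else 0.0
        drift = -x + W @ f(x) + m_h * h + b
        x = x + (dt / tau) * drift \
              + (sigma / tau) * np.sqrt(dt) * rng.standard_normal(x.size)
        xs[k] = x
    return xs

# example: 2-neuron leaky network driven by a 250-ms stimulus, no noise
xs = simulate(np.zeros((2, 2)), h=np.ones(2), b=np.zeros(2),
              sigma=0.0, t_max=1.0)
```

With σ = 0 this reduces to plain forward Euler, matching the deterministic networks of Fig. 3.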

S2.3.1 Nonlinear networks with instantaneous inputs
Following classical theoretical approaches to attractor network dynamics, we first used nonlinear neural networks in which stimulus inputs acted instantaneously to determine the initial conditions of the dynamics (Fig. 2 and Supplementary Figs. 2, 3 and 7). These networks were optimized using a 'just-in-time' cost function (Methods S2.3.3) under one or two constraints. First, for all these networks, we constrained stimulus inputs to have a Euclidean norm of 3 so that we could compare information loading strategies fairly when inputs were constrained to lie in certain subspaces (see also below): either the persistent subspace, persistent nullspace, locally persistent subspace, locally most amplifying subspace, or a random subspace (Fig. 2f,l, Supplementary Fig. 2, and Supplementary Fig. 7). We also obtained qualitatively very similar results without this norm constraint, with only a more general energy-based penalty (16-19) (Fig. 6 and Supplementary Figs. 3 and 11-13, see also Methods S2.3.3). Second, for symmetric networks, we enforced $\mathbf{W} \leftarrow \frac{1}{2}(\mathbf{W} + \mathbf{W}^{\top})$. These networks were trained in two epochs. For the first 1000 training iterations, we optimized all free parameters. After this, we confirmed that our trained networks did indeed have attractors (i.e. that they were attractor networks) and determined where these attractors were in state space by finding the stable fixed points of the networks' dynamics (Fig. 2d,j, Supplementary Fig. 2b,f, and Supplementary Fig. 3a)-see below. We then continued for another 1000 training iterations (without any firing rate regularization, α^(2)_nonlin = 0 in Eq. S3) with only optimizing the initial conditions, h^(c), while keeping the other parameters, W (Fig. 2a,g) and b, fixed at the values obtained at the end of the first 1000 iterations. We did this so that we could fairly compare different initial conditions that are constrained to lie in different subspaces but which otherwise rely on the same underlying network dynamics. We considered three possible scenarios for introducing additional constraints on the initial conditions (besides the one on their norm, see above): they were either projected and then restricted to the persistent subspace or to the persistent nullspace (see Methods S2.7.1 for how these subspaces were computed), or there was no such constraint applied so that they could utilize any direction in the full state space of the network. In addition, to understand the link between the linearized (Methods S2.4.2) and original (above) forms of the dynamics of these networks, we also considered three more constraints on the initial conditions: constraining them to the most persistent, most amplifying, or a random subspace of the linearized dynamics (Supplementary Fig. 7a-c).
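As an illustration, the norm and subspace constraints on initial conditions described above can be sketched as follows in numpy (a minimal sketch: the function name and the project-then-rescale order are our own, and the subspace basis is assumed orthonormal):

```python
import numpy as np

def constrain_initial_condition(h, subspace_basis=None, norm=3.0):
    """Project a candidate initial condition onto a subspace (if given)
    and rescale it to a fixed Euclidean norm, so that different
    information-loading strategies can be compared fairly.
    `subspace_basis` holds an orthonormal basis in its columns."""
    if subspace_basis is not None:
        Q = subspace_basis
        h = Q @ (Q.T @ h)          # orthogonal projection onto the subspace
    return norm * h / np.linalg.norm(h)

rng = np.random.default_rng(0)
N = 100
h = rng.standard_normal(N)
Q, _ = np.linalg.qr(rng.standard_normal((N, 5)))  # a random 5-dim subspace
h_sub = constrain_initial_condition(h, Q)          # constrained to subspace
h_full = constrain_initial_condition(h)            # full state space
```

The same helper covers the persistent-subspace, nullspace, and random-subspace cases by swapping in the appropriate basis.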

S2.3.2 Nonlinear networks with temporally extended inputs
To more closely follow the experimental paradigms which we modelled, we also used nonlinear networks in which stimuli provided temporally extended inputs (Fig. 6 and Supplementary Figs. 11-13). To construct these networks, stimulus inputs and the weight matrix were freely optimized (Methods S2.3.3), without any constraints, and optimization proceeded for a full 2000 iterations, without dividing training into different epochs.

S2.3.3 Cost functions and training for nonlinear networks
To investigate how different cost functions impact network dynamics (Fig. 6), we trained networks using one of four cost functions: a 'cue-delay' cost, a 'full-delay' cost, a 'just-in-time' cost, and an 'after-go' cost. These costs only differed in terms of the time period in which we applied the cost function. The general form of the cost function we used was a cross-entropy loss plus a regularisation term:

$$\mathcal{L} = \alpha^{(1)}_{\mathrm{nonlin}} \left\langle \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} -\sum_{i=1}^{6} y^{(c)}_i \log \operatorname{softmax}_i\!\left( \mathbf{W}_{\mathrm{out}}\, \mathbf{r}^{(c)}(t) + \mathbf{b}_{\mathrm{out}} \right) \mathrm{d}t \right\rangle + \alpha^{(2)}_{\mathrm{nonlin}} \left\langle \frac{1}{t_{\mathrm{max}} - t_0} \int_{t_0}^{t_{\mathrm{max}}} \left\| \mathbf{r}^{(c)}(t) \right\|^2 \mathrm{d}t \right\rangle \tag{S3}$$

where $T_1$ and $T_2$ determine the time period in which we applied the cost, $\alpha^{(1)}_{\mathrm{nonlin}}$ and $\alpha^{(2)}_{\mathrm{nonlin}}$ control the relative contributions of the cross-entropy loss and firing rate regularisation, $\mathbf{W}_{\mathrm{out}} \in \mathbb{R}^{6 \times N}$ and $\mathbf{b}_{\mathrm{out}} \in \mathbb{R}^{6}$ include the 6 sets of 'readout' weights and biases, respectively, and $\mathbf{y}^{(c)} \in \mathbb{R}^{6}$ is a one-hot vector, with $y^{(c)}_i = 1$ if $i = c$ and $0$ otherwise, defining the 'target' output for each cue condition. We initialized elements of the network parameters W, b, and h^(c), as well as the readout parameters W_out and b_out, from a Gaussian distribution with mean 0 and variance 1/N, and then optimized using gradient descent with Adam optimization (20), where gradients were obtained from back-propagation through time. The angle brackets, ⟨•⟩, denote averaging over batch sizes of 50 random realisations of r^(c). We used a learning rate of 0.0005.
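The structure of this cost (a cross-entropy readout loss applied in a time window, plus a firing-rate penalty) can be sketched in numpy as follows; the exact normalization conventions are assumptions of this illustration, not necessarily those of Eq. S3:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def wm_cost(r, y, W_out, b_out, T1, T2, dt, a1, a2):
    """Cross-entropy readout loss applied between times T1 and T2, plus a
    firing-rate regularizer. r: (T, N) firing rates for one cue condition,
    y: one-hot target vector of length 6."""
    t = np.arange(r.shape[0]) * dt
    window = (t >= T1) & (t <= T2)
    ce = 0.0
    for rt in r[window]:
        p = softmax(W_out @ rt + b_out)
        ce -= np.log(p[np.argmax(y)])     # cross-entropy for the true cue
    ce /= window.sum()                     # average over the cost window
    reg = np.mean(np.sum(r**2, axis=1))    # firing-rate regularization
    return a1 * ce + a2 * reg
```

In a training loop this scalar would be averaged over a batch of noisy trials before taking gradients.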
See Table S2 for how we set the parameters of Eq. S3 (T_1, T_2, t_0, α^(1)_nonlin and α^(2)_nonlin) depending on the cost function and the level of regularization. Briefly, the cue-delay cost included both the cue (between stimulus cue onset and offset) and the delay period (between stimulus cue offset and go cue onset), the full-delay cost included the delay period but not the cue period, the just-in-time cost started between stimulus onset and the earliest go time and ended at the onset of the go cue, and the after-go cost started at go cue onset and lasted for the duration of the go cue (0.5 s). For simulating the random delay task (Fig. 6 and Supplementary Figs. 11-13), analogous to what animals need to solve (see below), we sampled the go time uniformly between t_go = 0.75 s and t_go = 2 s. For just-in-time (Fig. 2 and Supplementary Fig. 2) and after-go trained networks (Supplementary Fig. 12), we also used a fixed delay task with a simulation end time of t_max = 2 s or a go time of t_go = 2 s, respectively. For the other cost functions, networks trained on the fixed delay task yielded very similar dynamics to their counterparts trained on the variable delay task. We set α^(1)_nonlin and α^(2)_nonlin so that networks could reliably learn the task (at performance levels comparable across different settings) while also exhibiting relatively stable dynamics (i.e. if α^(1)_nonlin/α^(2)_nonlin is too large, the network dynamics can explode, whereas if α^(1)_nonlin/α^(2)_nonlin is too small, the network is not able to learn the task). Note that vanishing gradients during training also impacted the value of α^(1)_nonlin that was required for different networks to exhibit similar performance (Fig. 6c). Nevertheless, α^(2)_nonlin was varied by an order of magnitude between Fig. 6 and Supplementary Figs. 11-13 to specifically test the robustness of our results to this parameter.

S2.3.4 Optimized ring attractor networks
When training to create ring attractor networks (Supplementary Fig. 3), we made three modifications to the nonlinear networks described above. First, in line with other approaches for optimizing recurrent neural networks (19,21-23), we used a hyperbolic tangent nonlinearity because the saturation of this nonlinearity greatly encouraged a continuous attractor to form compared with a ReLU nonlinearity. Second, we trained networks with 36 cue conditions, and then subsequently restricted our analyses to 6 evenly spaced cue conditions to keep consistency with our other analyses. Third, we used a cost function that measured estimation (or fine, rather than coarse, discrimination) performance across those 36 conditions, thus encouraging a ring attractor to form:

$$\hat{\theta}^{(c)}(t) = \operatorname{atan2}\!\left( \mathbf{W}^{y}_{\mathrm{out}}\, \mathbf{r}^{(c)}(t),\; \mathbf{W}^{x}_{\mathrm{out}}\, \mathbf{r}^{(c)}(t) \right), \qquad \mathcal{L} = \alpha^{(1)}_{\mathrm{nonlin}} \left\langle \frac{1}{T_2 - T_1} \int_{T_1}^{T_2} \left[ 1 - \cos\!\left( \hat{\theta}^{(c)}(t) - \theta^{(c)} \right) \right] \mathrm{d}t \right\rangle + \alpha^{(2)}_{\mathrm{nonlin}} \left\langle \frac{1}{t_{\mathrm{max}} - t_0} \int_{t_0}^{t_{\mathrm{max}}} \left\| \mathbf{r}^{(c)}(t) \right\|^2 \mathrm{d}t \right\rangle \tag{S4}$$

where $\hat{\theta}$ is the population vector-decoded stimulus angle, such that $\operatorname{atan2}(y, x)$ gives the angle that the vector $[x, y]$ makes with the x-axis, $\mathbf{W}^{x}_{\mathrm{out}}, \mathbf{W}^{y}_{\mathrm{out}} \in \mathbb{R}^{1 \times N}$ are 2 sets of 'readout' weights defining the plane in which decoded angles are defined, and $\theta^{(c)} = 2\pi c / 36$ is the target angle for cue condition c. All other terms were the same as those defined in Methods S2.3.3. See also Section S8 for a derivation showing how the cost function Eq. S4 relates to Fisher information.
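A minimal sketch of the population-vector decoding and a smooth circular estimation error of the kind this cost requires (the specific 1 − cos form is an assumption of this sketch):

```python
import numpy as np

def decoded_angle(r, Wx, Wy):
    """Population-vector decode: angle of the 2D readout [Wx r, Wy r],
    where Wx and Wy are the two rows of readout weights."""
    return np.arctan2(Wy @ r, Wx @ r)

def angular_error(theta_hat, theta_target):
    """Smooth estimation loss on the circle: 0 when the decoded angle
    matches the target, maximal (= 2) when it is opposite."""
    return 1.0 - np.cos(theta_hat - theta_target)
```

Unlike a squared difference of raw angles, this loss is periodic, so errors across the 0/2π boundary are penalized correctly.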

S2.4 Linear networks
For the dynamical equations of linear networks, see Methods S2.2. Linear networks were either constructed 'de novo' (Figs. 3 and 4 and Supplementary Figs. 4 and 5), obtained by a local linearization of canonical nonlinear dynamical systems (Supplementary Fig. 6) or of nonlinear neural network dynamics (Supplementary Fig. 3e and Supplementary Fig. 7a-c), or they were fitted to neural responses obtained from experiments (Fig. 5f and Supplementary Fig. 10f,g) or the simulation of nonlinear networks (Fig. 6e and Supplementary Fig. 7d).

S2.4.1 De novo linear networks
We used de novo linear networks to develop an analytical understanding of the dynamics of optimal information loading. These networks included small 2-neuron networks with hand-picked parameters (see Table S1) chosen to illustrate the differences between normal (symmetric) and non-normal (unconstrained) dynamics and the effects of different initial conditions (Fig. 3a-c), as well as large networks (with 10, 100, or 1000 neurons) with randomly generated parameters (Fig. 3d,e, Fig. 4, Supplementary Fig. 4, and Supplementary Fig. 5a,b; see Table S1). We always set the largest eigenvalue of the weight matrix to be exactly 1 (thus setting the largest eigenvalue of the associated Jacobian to 0 due to the leak term) so that these networks had an integrating or 'persistent' mode (4, 24-26) (see Table S1).
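One simple way to set the largest (real part of any) eigenvalue of a random weight matrix to exactly 1 is to shift the whole spectrum by a multiple of the identity, which preserves all eigenvectors; this is a sketch of one possible construction, not necessarily the paper's exact one:

```python
import numpy as np

def set_largest_eigenvalue_to_one(W):
    """Shift the spectrum of W so that its largest real eigenvalue part
    equals 1. Adding c*I shifts every eigenvalue by c while leaving the
    eigenvectors unchanged, so the network gains a 'persistent' mode
    (Jacobian eigenvalue 0 after the leak term) without changing its
    mode structure."""
    lam_max = np.linalg.eigvals(W).real.max()
    return W + (1.0 - lam_max) * np.eye(W.shape[0])
```

An alternative would be to rescale W multiplicatively, but that also rescales the gaps between eigenvalues, whereas the identity shift moves the whole spectrum rigidly.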
Initial conditions (Fig. 3, Supplementary Fig. 4, and Supplementary Fig. 5a,b) or temporally extended inputs (Fig. 4) were determined by computing the most persistent and amplifying direction(s) based on the Jacobian of the dynamics (Figs. 3 and 4, and Supplementary Fig. 5a,b, see Methods S2.7.1; for how initial conditions were determined in Supplementary Fig. 4c,d see Section S5.8). For the networks in Fig. 4, we also added a small amount of noise to the input to allow for some transient dynamics for all input directions (see Fig. 4c at 0 s). Alternatively, we optimized initial conditions for maximal asymptotic overlap with the most persistent mode (Supplementary Fig. 5c,d; see below). For setting the noise level, σ, in these networks, we considered two scenarios: noise matched (Fig. 4a, light green and gray) and performance matched (Fig. 4a, dark green and black). For noise matched simulations, we first determined the highest value of σ that still allowed us to obtain 100% decodability (using a delay-trained decoder) for all networks when receiving inputs aligned with the most amplifying mode (Fig. 4a, red). This resulted in σ = 0.1 for symmetric models, and σ = 0.17 for unconstrained models. We then used the same σ for simulations using inputs aligned with the most persistent and random directions.
For performance matched simulations, we used a different value of σ for each possible input direction so that all models achieved 100% decodability using a delay-trained decoder. For symmetric models, this required σ = 0.1 for inputs aligned with either the persistent or most amplifying modes, and σ = 0.005 for random inputs. For unconstrained models, this required σ = 0.17 for inputs aligned with the most amplifying mode, σ = 0.02 for inputs aligned with the persistent mode, and σ = 0.005 for random inputs. (Note that, consistent with our theory, smaller noise levels were necessary to achieve the same desired level of performance for input directions that were predicted to be increasingly suboptimal by our analysis.) To demonstrate that the initial conditions along the most amplifying directions, obtained by control theoretic analyses, were indeed optimal for maximising the overlap with the most persistent mode (the measure of optimality we used for these networks, Fig. 3c,e), we also used a direct numerical optimization approach, analogous to that used to optimize initial conditions in our nonlinear networks (Figs. 2 and 6, see also Methods S2.3.3). Specifically, we optimized h^(c) (constrained to have unit Euclidean norm) with gradient descent using Adam optimization (20), with gradients obtained from backpropagation through time, using the following cost function:

$$\mathcal{L} = -\left\langle \left( \mathbf{v}_1^{\top}\, \mathbf{x}^{(c)}(t_{\mathrm{max}}) \right)^2 \right\rangle \tag{S5}$$

where $\mathbf{v}_1$ is the eigenvector associated with eigenvalue 0 of the Jacobian (i.e. the most persistent mode). We used a learning rate of 0.0001. We performed the above training procedure independently for 100 random noisy networks (either symmetric or unconstrained) and we show averaged results in Supplementary Fig. 5c,d. We also used random initial conditions as controls. These had elements that were either sampled from a standard normal distribution (re-scaled to have unit Euclidean norm) in large networks (Fig. 3d,e, Fig. 4, and Supplementary Fig. 5a,b), or in the case of 2-neuron networks, quasi-randomly chosen (with unit Euclidean norm) for illustrative purposes (Fig. 3a-c).
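This numerical optimization can be illustrated for deterministic linear dynamics, where backpropagation through time reduces to gradient ascent on a squared overlap (the noise-free simplification, the Euler state-transition matrix, and all hyperparameter values below are our own, standing in for the Adam/BPTT procedure in the text):

```python
import numpy as np

def optimize_initial_condition(J, v1, T=2.0, dt=0.01, iters=2000, lr=0.05,
                               seed=0):
    """Projected gradient ascent on the overlap of the state at time T with
    the most persistent mode v1, for deterministic linear dynamics
    dx/dt = J x. Returns a unit-norm initial condition h."""
    N = J.shape[0]
    # State-transition matrix over [0, T] via repeated Euler steps.
    M = np.linalg.matrix_power(np.eye(N) + dt * J, int(round(T / dt)))
    a = M.T @ v1                      # overlap(h) = (v1^T M h)^2 = (a^T h)^2
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(N)
    h /= np.linalg.norm(h)
    for _ in range(iters):
        h += lr * 2.0 * (a @ h) * a   # gradient of the squared overlap
        h /= np.linalg.norm(h)        # project back to unit Euclidean norm
    return h
```

For a non-normal J, the optimum found this way is generally not v1 itself but the direction that is most strongly mapped onto v1 by the dynamics, i.e. an amplifying direction.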

S2.4.2 Local linearization of nonlinear dynamics
To better understand how the dynamics of optimal information loading that we identified in linear networks apply to nonlinear attractor dynamics, we performed a local linearization of our simulated nonlinear networks (Supplementary Fig. 3e, Supplementary Fig. 6, and Supplementary Fig. 7a-c). This approach required access to the 'true' dynamical equations of the nonlinear networks-which we had by construction.
We performed local linearizations of the original nonlinear network dynamics in x-space (the space of variables in which the dynamics was defined, Eq. S1) around the origin (we found empirically that initial conditions were distributed close to the origin)-which served as the reference point with respect to which the norm of optimized initial conditions was constrained in the networks we linearized (Methods S2.3; analogous to our analysis of information loading in linear networks, Fig. 3a-e, and see also Methods S2.7.2). As the ReLU firing rate nonlinearity of these networks is non-differentiable at exactly the origin, we computed the 'average' Jacobian of the system in the immediate vicinity of the origin instead (this allowed us to use the same linearization and the same set of amplifying modes for all initial conditions; we obtained similar results by linearizing separately for each initial condition). Because the derivative of each ReLU is 0 or 1 in half of the activity space of the network, this resulted in the Jacobian $\mathbf{J} = \frac{1}{2}\mathbf{W}^{*} - \mathbf{I}$, where $\mathbf{W}^{*}$ is the weight matrix of the original nonlinear network. Note that one obtains the same result even without averaging, by regarding the ReLU nonlinearity as the limiting case of the soft-ReLU nonlinearity, $[x]_{+} = \lim_{\beta \to \infty} \frac{1}{\beta} \ln\left(1 + e^{\beta x}\right)$, whose derivative at x = 0 is $\frac{1}{2}$ (at any value of the inverse temperature, β) and thus results in the same Jacobian as above. We confirmed that the resulting dynamics were always stable (largest real eigenvalue of J was less than 0). We then used this system to identify the locally (around the origin) most amplifying or most persistent modes (Supplementary Fig. 7a).
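The averaged Jacobian and the soft-ReLU limit argument above can be checked numerically (a minimal sketch; units of the time constant are absorbed into J):

```python
import numpy as np

def averaged_jacobian(W):
    """'Average' Jacobian near the origin for ReLU rate dynamics: the ReLU
    derivative is 0 in half of state space and 1 in the other half, giving
    J = W/2 - I (the -I being the leak term)."""
    return 0.5 * W - np.eye(W.shape[0])

def soft_relu_deriv(x, beta):
    """Derivative of the soft-ReLU (1/beta) * ln(1 + exp(beta * x)),
    i.e. the logistic sigmoid of beta * x."""
    return 1.0 / (1.0 + np.exp(-beta * x))
```

The sigmoid evaluates to exactly 1/2 at x = 0 for every β, which is why the soft-ReLU limit reproduces the same averaged Jacobian.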
For simulating these linearized networks (Supplementary Fig. 7b), we then used the Jacobian we thus obtained to map the resulting linearized dynamics to a deterministic integrator with the effective weight matrix $\mathbf{W} = (\mathbf{J} + \mathbf{I}) - \lambda_{\max}\mathbf{I}$, where $\lambda_{\max}$ is the largest real eigenvalue of J. Thus, the resulting dynamics were always marginally stable (the largest real eigenvalue of the stabilized Jacobian was exactly 0). (Note that for subsequent analyses involving most persistent and amplifying modes, we used the original weight matrix, see more in Methods S2.7.1. Nevertheless, the most persistent modes of the weight matrices we used for simulation and those we used in subsequent analyses were identical, as they only relied on the eigenvectors of the weight matrix, or the Jacobian, and the rank order of their associated eigenvalues, which this stabilization did not affect. We also checked numerically that making the system marginally stable only had very minor effects on the most amplifying modes, with correlations between the most amplifying modes of the original and simulated dynamics being above 0.9. Thus, in these respects, our simulations were representative of the dynamics of the original systems.) The bias parameters, b, were the same as in the original nonlinear networks. The initial conditions, h^(c), were either the ones we originally optimized for the nonlinear dynamics without any constraints (beside a constraint on their norm), or they were optimized while constraining them to the most persistent, most amplifying, or a randomly chosen subspace of these linearized dynamics (all were of the same dimensionality for a fair comparison, Supplementary Fig. 7).
For ring attractor networks (Supplementary Fig. 3), which used a tanh nonlinearity (Methods S2.3.4), the associated linearized system around the origin was given by the Jacobian $\mathbf{J} = \mathbf{W} - \mathbf{I}$, which we then used to identify the locally most amplifying and persistent modes (Supplementary Fig. 3e).
We used the same approach to linearize the dynamics of the canonical minimal nonlinear attractor dynamics that we used to gain insights into information loading in nonlinear systems (Methods S2.6, see also Section S6 and Supplementary Fig. 6). In this case, the Jacobian was well defined at the origin, so there was no need to average it. For consistency with the notation and terminology we use in the rest of this paper, and without loss of generality (as linear dynamical systems and linear neural networks are isomorphic), we refer to the resulting linear dynamical system as a 'linear neural network' and define it by its 'effective' weight matrix (defined via the Jacobian as above). Initial conditions were magnitude-matched and chosen to align with the most persistent or the most amplifying direction extracted from the Jacobian (Methods S2.7.1), or chosen randomly, or varied systematically to cover the whole range of possible directions. There were no other parameters for these linearized 'networks'.

S2.4.3 Fitting linear neural networks to neural responses
In order to be able to apply our theoretically derived measures of optimal information loading without having access to the true dynamics of the system, we also created linear neural networks whose parameters were fitted to experimental data (see below). As a control, we repeated the same fitting procedure with simulated nonlinear networks to validate that our approach provides meaningful results when 1. we do not have access to the true dynamics but only to samples of activities generated by those dynamics, and 2. we also cannot assume that the true dynamics are linear.
We fitted deterministic linear neural networks to 1 s of trial-averaged neural activity (experimentally recorded, or simulated by a nonlinear neural network model). For the main analyses (Fig. 5d-f, Fig. 6e, and Supplementary Fig. 7d), we used data starting from the onset of the stimulus cue. For the control analysis of late delay experimental recordings (Supplementary Fig. 10g), we used the final 1 s of neural activity just prior to the go cue. For the shuffle control (Fig. 5d; dark gray, and Supplementary Fig. 10f), we again used data starting from stimulus onset but randomly shuffled neural activity across time and proceeded by fitting this shuffled data instead.
For fitting high dimensional neural data, we first performed principal components analysis on neural activity (dimensions: neurons, data points: time points, indexed by t, and cue conditions, indexed by c), and projected it through the principal components (PCs): $\mathbf{x}^{(c)}_{*}(t) = \mathbf{P}\, \mathbf{r}^{(c)}_{*}(t)$, where the rows of P contain the top 20 principal components of the data, and $\mathbf{r}^{(c)}_{*}(t)$ is the trial-averaged neural response (mean-centered, see above) at time t in condition c. These top 20 PCs captured approximately 75% and 76% of variance for monkeys K and T, respectively, during the cue and early delay period (Fig. 5d-f), 70% and 60% of variance for monkeys K and T, respectively, during the late delay period (Supplementary Fig. 10g), and over 95% of the variance for all simulated neural activities (Fig. 6e). The projected neural activity time courses of the neural data, $\mathbf{x}^{(c)}_{*}(t)$, served as the targets that needed to be matched (after a suitable linear transformation with 'read-out' matrix $\mathbf{C} \in \mathbb{R}^{20 \times 20}$) by the neural activity time courses generated by the fitted neural network's dynamics in the corresponding cue conditions, $\mathbf{x}^{(c)}(t)$ (Eqs. S1 and S2). For fitting the parameters of the network (W, h^(c), b) and the readout matrix (C), we used the following cost function:

$$\mathcal{L}_{\mathrm{fit}} = \sum_{c} \sum_{t} \left\| \mathbf{D} \left( \mathbf{x}^{(c)}_{*}(t) - \mathbf{C}\, \mathbf{x}^{(c)}(t) \right) \right\|^2 + \alpha_{\mathrm{lin}} \left\| \mathbf{W} \right\|_{F}^{2}$$

where D is a diagonal matrix with the variances explained by the corresponding PCs in P on the diagonal (encouraging the optimization procedure to prioritize fitting the top PCs), and $\|\bullet\|_F$ is the Frobenius norm of a matrix.
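The PCA projection step can be sketched as follows, using the standard convention that PCs are obtained from the SVD of the mean-centered data matrix (variable names are our own):

```python
import numpy as np

def project_to_top_pcs(R, n_pcs=20):
    """Project mean-centered responses onto their top principal components.
    R: (n_neurons, n_samples) array, where samples run over time points and
    cue conditions. Returns the projected data X (n_pcs, n_samples) and the
    fraction of variance captured by the retained PCs."""
    R = R - R.mean(axis=1, keepdims=True)       # mean-center each neuron
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    P = U[:, :n_pcs]                            # PCs as columns of P
    X = P.T @ R                                 # projected activity
    var_captured = (s[:n_pcs] ** 2).sum() / (s ** 2).sum()
    return X, var_captured
```

The squared singular values give the per-PC variances that would populate the diagonal of the weighting matrix D in the fitting cost.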
Also note that we had no constraints on W to define stable dynamics. Nevertheless, when fitting experimental recordings, and responses generated by nonlinear attractor networks, we found that the largest real eigenvalue of the fitted W was typically within the 0.95 ≤ λ_max ≤ 1.05 range, i.e. the dynamics were near marginal stability, in line with the dynamics of our de novo linear neural networks (Methods S2.4.1), as well as of those that we obtained by local linearization (Methods S2.4.2). The only exception was when fitting the responses of nonlinear networks trained on an after-go-time cost (Methods S2.3.3), which resulted in dynamics without attractors and, consequently, the fitted linear dynamics typically had λ_max > 1.05.
We used Adam (20) to perform gradient descent optimization of W, h^(c), b, and C, with gradients obtained from backpropagation through time, and a learning rate of 0.0001. We initialized elements of all of these parameters from a Gaussian distribution with mean 0 and variance 1/20, and we set the regularisation parameter to α_lin = 1/12.
The stimulus-masking kernel (m_h(t), Table S1) was matched to how the responses being fitted were obtained: with temporally extended or instantaneous inputs. Specifically, when fitting responses to temporally extended inputs (experimentally measured, Fig. 5f and Supplementary Fig. 10f, or simulated, Fig. 6e), the masking kernel of the fitted linear network matched the cue period. When fitting responses generated by networks driven by instantaneous inputs (Supplementary Fig. 7d), or when fitting the late delay period of experimental recordings (during which no stimulus is present, Supplementary Fig. 10g), the stimulus masking kernel was set to zero, and instead the initial condition of the fitted linear network was tuned to match the responses (see below).
In most cases (Fig. 5f, Fig. 6e, and Supplementary Fig. 10f), we set the initial condition $\mathbf{x}^{(c)}(t_0) = \mathbf{0}$. There were two exceptions to this. First, when fitting the late delay dynamics in the experimental recordings (Supplementary Fig. 10g), we set $\mathbf{x}^{(c)}(t_0) = \mathbf{C}^{-1}\mathbf{x}^{(c)}_{*}(t_0)$ (i.e. we fixed the initial condition of the latent dynamics to the data, as no stimulus is present during the late delay period; we also observed qualitatively similar results when we included $\mathbf{x}^{(c)}(t_0)$ as a separate optimizable parameter in this case). Second, when fitting simulated data from models that used instantaneous stimulus inputs (Supplementary Fig. 7d), we set $\mathbf{x}^{(c)}(t_0) = \mathbf{h}^{(c)}$.

S2.5 Previous working memory models
We used the following dynamics for implementing all previous neural network models of working memory:

$$\boldsymbol{\tau}_r\, \mathrm{d}\mathbf{r}^{(c)}(t) = \left( -\mathbf{r}^{(c)}(t) + \mathbf{f}\!\left( \mathbf{x}^{(c)}(t) \right) \right) \mathrm{d}t + \boldsymbol{\sigma}_r\, \mathrm{d}\boldsymbol{\eta}^{(c)}_r(t) \tag{S10}$$

$$\mathbf{x}^{(c)}(t) = \mathbf{W}\, \mathbf{r}^{(c)}(t) + \mathbf{b}_r + m_h(t)\, \mathbf{h}^{(c)} + m_g(t)\, \mathbf{g} \tag{S11}$$

where all symbols refer to the same (or a closely analogous, see below) quantity as in Eqs. S1 and S2. Note that we use this notation to best expose the similarities with and differences from the dynamics of our networks (Eqs. S1 and S2), rather than the original notation used for describing these models (1,4,6), but the dynamics are nevertheless identical to those previously published. Overall, these dynamics are closely analogous to those that we used earlier for our networks, with the following differences. First, for us, dynamics were defined in x-space, with r being an instantaneous function of x. Here, the dynamics are defined instead in r-space (Supplementary Fig. 1a-d and Supplementary Fig. 8), with x being an instantaneous function of r. (There are slightly different assumptions underlying these rate-based formulations of neural network dynamics when deriving them as approximations of the dynamics of spiking neural networks (27), and the two become identical in the case of linear dynamics.) As a result, time constants, τ_r, biases, b_r, and the variance of noise, σ_r (as well as noise itself, η^(c)_r(t)), are defined for r rather than x. For nonlinear variants of these networks, there are also differences in the choice of single neuron nonlinearities, f(•). Furthermore, some of these networks distinguish between excitatory and inhibitory cells, with different time constants and noise standard deviations. Thus, each of these parameters is represented as a diagonal matrix, τ_r and σ_r, respectively, with each element on the diagonal storing one of two possible values of that parameter depending on the type (excitatory or inhibitory) of the corresponding neuron (τ^E_r and σ^E_r, or τ^I_r and σ^I_r, respectively). Most importantly, all of these networks used a set of parameters which were handcrafted to produce the required type of dynamics, rather than optimized for a function (or to fit data) as in the case of our networks. In line with our analyses of experimental data and task-optimized networks (Figs. 5 and 6), simulations started at t_0 = -0.5 s, i.e. 0.5 s before stimulus cue onset (defined as t = 0), the stimulus cue lasted for 0.25 s, and the go cue appeared at t_go = 2 s and lasted for 0.5 s. (Note that for these networks we considered the fixed-delay variant of the task as that is what these networks were originally constructed to solve.) As with our networks (Methods S2.2), we solved the dynamics of Eqs. S10 and S11 using a first-order Euler-Maruyama approximation between t_0 and the simulation end time with a discretization time step of 1 ms.
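A minimal Euler-Maruyama sketch of r-space rate dynamics of this general form (the noise scaling convention and the hard-coded input masking are assumptions of this illustration, and the function and parameter names are our own):

```python
import numpy as np

def simulate_rate_network(W, b, h, f, T=2.5, dt=0.001, tau=0.02, sigma=0.1,
                          t0=-0.5, seed=0):
    """First-order Euler-Maruyama integration of rate dynamics of the form
    tau * dr/dt = -r + f(W r + b + input(t)) + noise, with a 0.25 s
    stimulus cue starting at t = 0. Returns the (n_steps, N) trajectory."""
    rng = np.random.default_rng(seed)
    n_steps = int(round((T - t0) / dt))
    N = W.shape[0]
    r = np.zeros(N)
    traj = np.empty((n_steps, N))
    for k in range(n_steps):
        t = t0 + k * dt
        inp = h if 0.0 <= t < 0.25 else np.zeros(N)   # stimulus masking
        drift = (-r + f(W @ r + b + inp)) / tau
        # sqrt(dt) noise scaling per Euler-Maruyama (convention assumed)
        r = r + dt * drift + sigma * np.sqrt(dt / tau) * rng.standard_normal(N)
        traj[k] = r
    return traj
```

Cross-validated analyses would call this twice per cue condition with different seeds to produce independent train and test trajectories.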
For analysis methods that used cross-validation (see below), we simulated network dynamics twice (for each cue condition) with independent realizations of η^(c)_r(t), to serve as (trial-averaged) train and test data. For other analyses, we used a single set of simulated trajectories. All analyses involving these networks were repeated n = 10 times, using 10 different simulations (non-cross-validated) or simulation-pairs (cross-validated), each time with independent samples of η^(c)_r(t). Table S3 provides the values of most network and other parameters used for simulating each model. In the following we provide the additional details for each of these models that are not included in Table S3.

S2.5.1 Classical bump attractor model
The bump attractor model that we used (Supplementary Fig. 1a) has been described previously (see Ref. 1). The model contained separate excitatory and inhibitory populations. As in the discrete attractors model, the weight matrix was of the form

$$\mathbf{W} = \begin{pmatrix} \mathbf{W}^{EE} & -\mathbf{W}^{EI} \\ \mathbf{W}^{IE} & -\mathbf{W}^{II} \end{pmatrix}$$

where the elements of W^IE, W^EI, and W^II were set to 6.8/N, 8/N, and 1.7/N, respectively. The excitatory sub-matrix W^EE had a circulant form, with the strength of the connection between two excitatory neurons depending only on the circular distance between their preferred cues (see Ref. 1 for the exact profile). Stimulus cue inputs were also analogous to those used in the discrete attractors model and were delivered only to the excitatory neurons (cells i = 1, …, N/2) for cues c = 1, …, 6 (see Ref. 1 for the exact input values).

S2.5.2 Discrete attractors model
The discrete attractors model that we used (Supplementary Fig. 1b) has been described previously (see the methods of Ref. 1). The model contained separate excitatory and inhibitory populations.
[Table S3 | Parameters used in previous models (Supplementary Figs. 1a-d and 8). Recoverable entries: for all models, the stimulus masking kernel was m_h(t) = K(0, 0.25) and the go masking kernel was m_g(t) = K(t_go, t_go + 0.5); for linear networks, f_i(x) = x_i; see text for remaining details.]

The weight matrix was of the form

$$\mathbf{W} = \begin{pmatrix} \mathbf{W}^{EE} & -\mathbf{W}^{EI} \\ \mathbf{W}^{IE} & -\mathbf{W}^{II} \end{pmatrix}$$
where the elements of W^IE, W^EI, and W^II were set to 2.4/N, 8/N, and 2.6/N, respectively. The excitatory sub-matrix W^EE was constructed by dividing the population of excitatory cells into six clusters (of 9 neurons each), with each cluster corresponding to one of the stimulus cue conditions. Connections within each cluster were strong, with a value of 30/N. Connections between neurons belonging to clusters that corresponded to adjacent stimulus cues were weaker, with a value of 2.5/N. All other connections were very weak, with a value of 0.02/N. This resulted in a block circulant structure for W^EE.
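Since the cluster sizes and weight values are fully specified above, the construction of W^EE can be sketched directly (the total neuron count N = 108, i.e. 54 excitatory plus 54 inhibitory cells, is an assumption of this sketch):

```python
import numpy as np

def discrete_attractor_wee(n_clusters=6, cluster_size=9, N=108,
                           w_within=30.0, w_adjacent=2.5, w_other=0.02):
    """Excitatory weight matrix of the discrete attractors model: strong
    weights within each cue cluster, weaker weights between clusters coding
    adjacent cues, and very weak weights otherwise (weight values from the
    text; the total N is assumed)."""
    nE = n_clusters * cluster_size
    W = np.full((nE, nE), w_other / N)
    for a in range(n_clusters):
        for b in range(n_clusters):
            # circular distance between clusters a and b
            d = min((a - b) % n_clusters, (b - a) % n_clusters)
            if d == 0:
                w = w_within / N
            elif d == 1:
                w = w_adjacent / N
            else:
                continue                    # keep the weak background value
            W[a*cluster_size:(a+1)*cluster_size,
              b*cluster_size:(b+1)*cluster_size] = w
    return W
```

Because the block values depend only on the circular distance between cluster indices, the resulting matrix is block circulant, as stated in the text.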
Stimulus cue inputs were delivered only to the excitatory neurons (cells i = 1, …, N/2) for cues c = 1, …, 6 (see the methods of Ref. 1 for the exact input values).

S2.5.3 Linear integrator model
The linear integrator model that we used (Supplementary Fig. 1c and Supplementary Fig. 8a,d) has been described previously (see Ref. 4). There were no separate excitatory and inhibitory populations in this model, and the weight matrix was constructed such that network dynamics were non-normal, non-oscillatory, and stable with a single two-dimensional neutrally stable subspace (i.e. a plane attractor). We achieved this by defining W via its eigen-decomposition:

$$\mathbf{W} = \mathbf{V}\, \mathbf{D}\, \mathbf{V}^{-1}$$

where the eigenvectors (the columns of V, denoted as v_j, for j = 1, …, N, with elements v_ij, for i, j = 1, …, N) were generated by the following process:

1. generating a random vector for each eigenmode, with elements drawn i.i.d. from N(0, 1) (S18);
2. making the first 10% of vectors mutually overlapping, so that the resulting matrix is non-normal;
3. making the other 90% of vectors orthogonal; and
4. unit-normalizing each vector (S22);

and the eigenvalues (λ_i, for i = 1, …, N, the diagonal elements of the diagonal matrix D) were generated by the following process:

1. generating random (real) values from Uniform(0, 0.8) (S23); and
2. creating a pair of neutrally stable eigenmodes by setting a pair of eigenvalues to exactly 1 (recall that an eigenvalue of 1 for W corresponds to an eigenvalue of 0 for the associated Jacobian, i.e. a neutrally stable mode).

The stimulus cue inputs for cues c = 1, …, 6 were constructed from the columns of a matrix K, and we considered two forms for K: one whose columns included the two eigenvectors spanning the persistent subspace together with randomly chosen eigenvectors (Supplementary Fig. 1c and Supplementary Fig. 8a,d; as in the original formulation (4)), or $\mathbf{K} = (\mathbf{v}_{r_1}, \mathbf{v}_{r_2}, \mathbf{v}_{r_3})$ (Supplementary Fig. 8b,e), where r_1, r_2, r_3, r_4 were randomly drawn integers over the range 1 to N - 2. The first formulation of K ensured that stimulus cue inputs partially align with the persistent subspace, whereas the second formulation of K ensured that stimulus cue inputs align only with random directions.

S2.5.4 Feedforward network model
The linear feedforward network model that we used (Supplementary Fig. 1d and Supplementary Fig. 8c,f) has been described previously (see Refs. 5,6). (For pedagogical purposes, we used the simplest set up consisting of a feedforward chain of neurons, see below. However, using a more general network model that contained 'hidden' feedforward chains (5) did not affect our analyses except for Supplementary Fig. 10e which, in contrast to the simple feedforward chain, could display overlap values greater than 0.5.) There were no separate excitatory and inhibitory populations in this model, and the weight matrix included a single chain running from neuron 1 to neuron N, such that W_ij was nonzero only when i = j + 1, for cell-pairs i, j = 1, …, N (see Table S3 for the connection strength).
The stimulus cues provided random inputs delivered to only the first 10 neurons, so that each input could pass through the feedforward network: $h^{(c)}_i \sim \mathcal{N}(0, 1)$ (S27) for cues c = 1, …, 6 and cells i = 1, …, 10.
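The feedforward chain and its cue inputs can be sketched as follows (the chain weight value w is an assumption of this sketch; the actual value would come from Table S3):

```python
import numpy as np

def feedforward_chain(N=100, w=1.0, n_input=10, n_cues=6, seed=0):
    """Weight matrix of a single feedforward chain (neuron j drives neuron
    j + 1 only), together with random Gaussian cue inputs delivered to the
    first `n_input` neurons. Returns (W, H) with H of shape (n_cues, N)."""
    W = np.zeros((N, N))
    for j in range(N - 1):
        W[j + 1, j] = w               # chain link: neuron j -> neuron j+1
    rng = np.random.default_rng(seed)
    H = np.zeros((n_cues, N))
    H[:, :n_input] = rng.standard_normal((n_cues, n_input))
    return W, H
```

Because the chain matrix is strictly subdiagonal, it is nilpotent: activity injected at the start of the chain propagates through at most N synaptic steps and then dies out, which is the mechanism by which such networks hold information transiently.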

S2.6 Canonical nonlinear systems with two stable fixed points
In order to illustrate the applicability of our analysis of optimal information loading in linear dynamical systems to the behaviour of nonlinear dynamical systems, we first studied two variants (either symmetric or non-symmetric) of a canonical nonlinear system that can exhibit two stable fixed points. (These systems are closely related to the damped, unforced Duffing oscillator, which is a classic example of a [non-symmetric] system that can exhibit two stable fixed points. Additionally, the analysis of these systems also holds for the Duffing oscillator.) The dynamics of the first system (which has a symmetric Jacobian matrix) are governed by

$$\tau_c\, \dot{x}_1 = x_1 - x_1^3, \qquad \tau_c\, \dot{x}_2 = -x_2 \tag{S28}$$

and the dynamics of the second system (which has a non-symmetric Jacobian matrix) are governed by:

$$\tau_c\, \dot{x}_1 = x_2, \qquad \tau_c\, \dot{x}_2 = x_1 - x_1^3 - x_2 \tag{S29}$$

We used a cubic polynomial in Eqs. S28 and S29 because it is the lowest order polynomial that allows a system to exhibit 2 stable fixed points. Both systems exhibit 3 fixed points: both have a saddle point at the origin and both have 2 asymptotically stable fixed points at (±1, 0) (see Supplementary Fig. 6 for the state space dynamics of these two systems).
We solved the dynamics of Eqs. S28 and S29 using a first-order Euler approximation starting from t = 0, with a discretization time step of 0.02 and a time constant of τ_c = 0.4 (note that time was unitless for this model).
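The Euler integration of such a bistable system can be sketched as follows; the specific right-hand side used here (a symmetric, gradient-like system with a saddle at the origin and stable fixed points at (±1, 0)) is an assumed minimal instance of the class of systems described above, not necessarily the exact Eq. S28:

```python
import numpy as np

def simulate_bistable(x0, T=20.0, dt=0.02, tau=0.4):
    """First-order Euler integration of a canonical bistable system:
    tau * dx/dt = x - x^3, tau * dy/dt = -y, which has a saddle at the
    origin and asymptotically stable fixed points at (+1, 0) and (-1, 0).
    Time is unitless; dt and tau follow the values given in the text."""
    x, y = x0
    for _ in range(int(round(T / dt))):
        dx = (x - x**3) / tau
        dy = -y / tau
        x, y = x + dt * dx, y + dt * dy
    return x, y
```

Initial conditions with x > 0 flow to (+1, 0) and those with x < 0 flow to (-1, 0); the y-axis is the stable manifold of the saddle separating the two basins.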

S2.7 Analysis methods
Here we describe methods that we used to analyse neural data. Whenever applicable, the same processing and analysis steps were applied to both experimentally recorded and model-simulated data. As a first step in all our analyses, in line with previous work analysing neural population dynamics (28), we removed the stimulus cue-independent time-varying mean activity from each neuron's firing rate time series (see Fig. 5a for an example). (This was done separately for training and test data for cross-validated analyses, see below.) In most of our analyses, neural activities were aligned to stimulus cue onset, defined to be at t = 0. However, due to the variable delay duration of the task (Fig. 1a), experimentally recorded neural activities were also aligned to go cue onset for analyses that required incorporating the late delay and go epochs (i.e. beyond the first 1.65 s after the stimulus cue onset; Fig. 5b-c, Supplementary Fig. 10a-c,g). For simulated neural activities, this was not necessary, as we always simulated our networks in a fixed-delay task for ease of analysis, even if they were optimized for a variable-delay task in accordance with how our experimental monkey subjects were trained.
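Removing the cue-independent, time-varying mean amounts to subtracting the across-condition mean time course from each neuron (a one-line sketch; array layout is our own convention):

```python
import numpy as np

def remove_condition_independent_mean(R):
    """Subtract the across-condition mean time course from each neuron,
    leaving only cue-dependent (relative) activity.
    R: (n_conditions, n_neurons, n_timepoints) firing rates."""
    return R - R.mean(axis=0, keepdims=True)
```

After this step, the average over cue conditions is exactly zero at every time point, so subsequent decoding analyses reflect only stimulus-dependent structure.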

S2.7.1 Identifying amplifying, persistent, and other subspaces in network dynamics
In order to understand neural networks with potentially complex and high-dimensional dynamics, and the way these dynamics depend on initial conditions, we identified specific subspaces within the full state space of these networks that were of particular relevance for our analyses. These subspaces served dual roles. First, as 'intervention tools', to ascertain their causal roles in high-dimensional network dynamics, we used them to constrain the initial conditions of the dynamics of our networks (see also Methods S2.7.2). Second, as 'measurement tools', to reveal key aspects of the high-dimensional dynamics of neural networks, we used them to project high-dimensional neural trajectories into these lower-dimensional subspaces (see also Methods S2.7.3).
Our main analyses relied on identifying the most persistent and most amplifying modes of a network. This required dynamics that were linear, either by construction, or by (locally) linearizing or linearly fitting dynamics that were originally nonlinear (see Table S1). We computed the most persistent mode(s) in one of two different ways. First, for networks that were either guaranteed to have stable dynamics by construction (i.e. those constructed de novo; Figs. 3 and 4 and Supplementary Figs. 4, 5 and 8), or were confirmed to be always stable in practice (i.e. those constructed by local linearization; Supplementary Fig. 3e, Supplementary Fig. 6, and Supplementary Fig. 7), we simply used the eigenvector(s) of the weight matrix W associated with the eigenvalue(s) that had the largest real part(s). Second, for networks that were fitted to nonlinear dynamics or recorded data, and whose dynamics could thus not be guaranteed to be stable (Fig. 5f, Fig. 6e, Supplementary Fig. 7d, and Supplementary Fig. 10f-h), we used the eigenvectors of W associated with the largest real eigenvalues that were less than or equal to 1 + δ (with δ = 0.05), i.e. we found the slowest, or most persistent, modes of the network (the δ was mostly relevant only for the after-go-time networks of Fig. 6 and Supplementary Fig. 12, which exhibited eigenvalues substantially greater than 1; setting δ to less than 0.05 did not substantially change our results). (Note that an eigenvalue of 1 for W corresponds to an eigenvalue of 0 for the associated Jacobian of the dynamics due to the leak term.)
For computing the most amplifying modes, we performed an eigen-decomposition of the associated Observability Gramian Q (29,30). Specifically, we obtained Q by solving the following Lyapunov equation:

(W̃ − I)ᵀ Q + Q (W̃ − I) = −Cᵀ C

where W̃ is the 'stabilized' weight matrix of the dynamics (and the −I terms represent the effect of the leak on the Jacobian of the dynamics, Eq. S1) and C is the read-out matrix of the network. The most amplifying mode(s) of the network are given as the eigenvector(s) of Q associated with the largest eigenvalue(s). Again, for networks that were guaranteed to have stable dynamics by construction (Figs. 3 and 4, Supplementary Fig. 7a-c, Supplementary Fig. 5, and Supplementary Fig. 8), W̃ = W − εI, where W is the original weight matrix of the dynamics and ε = 0.01 (to ensure dynamical stability). For other networks, i.e. either linear networks fitted to experimental data (Fig. 5f and Supplementary Fig. 10f-h), linear networks fitted to simulated nonlinear dynamics (Fig. 6e and Supplementary Fig. 7d), or local linearizations of nonlinear dynamics (Supplementary Fig. 6, Supplementary Fig. 3e, and Supplementary Fig. 7a-c), we used W̃ = W unless the largest eigenvalue λ_max of W was greater than or equal to 1, in which case we used W̃ = W − (λ_max − 1 + ε) I, to ensure that the linear dynamics with W̃ were stable (which is required for calculating Q). For networks obtained by fitting neural responses (experimentally recorded or simulated; Fig. 5f, Fig. 6e, Supplementary Fig. 7d, and Supplementary Fig. 10f-h), C was obtained by fitting those responses (Methods S2.4.3), as we wanted to understand how the fitted dynamics, taking place in a latent space, can generate the most discriminable fluctuations in (the principal components of) the neural responses to which they are related by this read-out matrix (although using C = I did not change our results substantially). For all other networks (Figs. 3 and 4, Supplementary Fig. 7a-c, Supplementary Fig. 3e, Supplementary Fig. 5, and Supplementary Fig. 8), we simply used C = I, as the activity of these networks was supposed to be read out in the same space within which their dynamics took place.
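The Gramian computation can be sketched as follows, for a hypothetical random connectivity matrix and the C = I case. Note that scipy.linalg.solve_continuous_lyapunov(a, q) solves a·x + x·aᵀ = q, so passing the transposed Jacobian yields the observability (rather than controllability) Gramian:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n = 50
W = rng.standard_normal((n, n)) / np.sqrt(n)   # illustrative connectivity

# Stabilize as described in the text if the largest real eigenvalue >= 1
eps = 0.01
lam_max = np.max(np.real(np.linalg.eigvals(W)))
W_stab = W - (lam_max - 1 + eps) * np.eye(n) if lam_max >= 1 else W

A = W_stab - np.eye(n)   # Jacobian of the dynamics, including the leak
C = np.eye(n)            # read-out matrix (the C = I case)

# Solve (W_stab - I)^T Q + Q (W_stab - I) = -C^T C for the Gramian Q
Q = solve_continuous_lyapunov(A.T, -C.T @ C)

# Most amplifying mode: eigenvector of Q with the largest eigenvalue
evals, evecs = np.linalg.eigh(Q)
most_amplifying = evecs[:, -1]
```

The eigenvalues of Q quantify how much initial-condition energy along each eigenvector is converted into read-out variance over time, so the top eigenvector is the most amplifying direction.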
We also applied methods that did not rely on the linearization (or linear fitting) of network dynamics. Our goal was to develop basic intuitions for how much the dynamics of the different simulated nonlinear networks of Fig. 2 and Supplementary Fig. 2 used the persistent subspace of their dynamics. For this, we determined the 'persistent subspace' as the subspace spanned by the 5 principal components of the final 500 ms of neural activities (x) across all 6 cue conditions, corresponding to 6 distinct attractors, and the 'persistent nullspace' of the network as the 45-dimensional subspace orthogonal to the persistent subspace. For plots showing the projection of network activities within the persistent subspace (Supplementary Fig. 2b,f and Supplementary Fig. 2c-d and g-h, bottom) we used the first two principal components of the full, five-dimensional persistent subspace of the network, as determined above. For plots showing the projection of network activities onto persistent vs. cue-aligned directions (Fig. 2d,j, and Supplementary Fig. 2c-d and g-h, top right), 'persistent PC1' was determined as the direction spanning the two persistent states corresponding to the two cue conditions being illustrated (i.e., as above, spanning the final 500 ms of neural activities across the two cue conditions), and 'initial PC1 (orthogonalized)' was determined as the direction spanning the two initial conditions corresponding to the two cue conditions being illustrated, orthogonalized with respect to the corresponding persistent PC1.

S2.7.2 Subspace-constrained initial conditions
When using the subspaces identified above as 'intervention tools', to constrain the initial conditions of our networks, we either used the single most persistent or most amplifying mode for linear networks with low-dimensional coding spaces (including the linearized canonical nonlinear attractor dynamical system; Figs. 3 and 4 and Supplementary Figs. 5 and 6), or numerically optimized initial conditions within the corresponding higher-dimensional subspaces (Fig. 2f,l, Supplementary Fig. 2c,d,g,h, Supplementary Fig. 7; see also Methods S2.3 and Methods S2.4). When the persistent subspace was extracted from neural responses (rather than from the dynamical equations of the network, Methods S2.7.1; Fig. 2f,l, Supplementary Fig. 2c,d,g,h, Supplementary Fig. 7a) we used different sets of simulations to generate data from which we could estimate the persistent subspace (as explained above), and to analyse network dynamics when initialized within these subspaces. In all cases, for a fair comparison, the magnitude of initial conditions was fixed (Methods S2.3.1, Methods S2.4.1), and only their direction was affected by constraining them to one of these subspaces.

S2.7.3 Measures of subspace overlap
In order to measure the overlap of high-dimensional neural dynamics with the subspaces we identified, we used one of two methods. First, for analysing network dynamics across two conditions chosen to correspond to 'opposite' stimulus cues (Fig. 2d,j, Fig. 3c,d,e, Supplementary Fig. 2c,d,g,h, Supplementary Fig. 6c,d, and Supplementary Fig. 5), such that the coding part of the persistent subspace was one-dimensional, we simply measured the projection of neural dynamics onto the first eigenvector (i.e. the eigenvector associated with the largest real eigenvalue) of the corresponding subspace using a dot product, u · x(t), where u may correspond to the most persistent mode, the most amplifying mode, or the first PC of the persistent-orthogonalized cue subspace (as defined above). We also used the same measure for visualising the quality of fit of linear neural network dynamics to experimental data (Methods S2.4.3), with u being the first PC of the full state space of neural firing rates (Fig. 5e). In those cases when u had to be estimated from neural responses (Fig. 2d,j, Fig. 5e, Supplementary Fig. 2c,d,g,h), we used a cross-validated approach, using different subsets of the data to determine u and x(t) (from a single split of the data). In other cases, u was determined from the truly deterministic dynamics of the system and thus there was no need for cross-validation.
Second, to measure subspace overlaps for d-dimensional neural activities across multiple conditions and time points within coarser time bins (Fig. 4c, Fig. 5f, Fig. 6e, Supplementary Fig. 7d, Supplementary Fig. 8d-e, Supplementary Fig. 11c, Supplementary Fig. 12d, Supplementary Fig. 13d, and Supplementary Fig. 10b,c,f,g), thus corresponding to high-dimensional coding subspaces, we used the following properly normalized measure:

across-condition variance explained(t, t′) = Tr[U_t′ᵀ Σ(t) U_t′] / Tr[P(t)ᵀ Σ(t) P(t)]

where Σ(t) is the covariance matrix of neural activities across conditions and raw (1-ms) time points within time bin t, the columns of P(t) are the first principal components of neural activities within time bin t (i.e. the eigenvectors of Σ(t) associated with the largest eigenvalues), and U_t′ is the subspace of interest with respect to which overlaps are computed (which itself may or may not depend on time, see below). The time resolution of t and t′ (i.e. the duration of time bins within which data was used to compute the corresponding terms at a given t or t′), the choice of U_t′, and the number of vectors used for constructing U_t′ and P(t) depended on the analysis (see below).
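This normalized overlap measure can be sketched as follows (the data here are synthetic and the function signature is illustrative):

```python
import numpy as np

def subspace_overlap(X, U):
    """Fraction of across-condition variance of X captured by subspace U.

    X: (n_samples, d) activity within one time bin (conditions x raw
       time points stacked as rows).
    U: (d, k) orthonormal basis of the subspace of interest.
    Normalized by the variance in the top-k principal subspace, so the
    measure is 1 when U spans the k directions of greatest variability.
    """
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / Xc.shape[0]      # covariance across conditions
    k = U.shape[1]
    evals, evecs = np.linalg.eigh(Sigma)
    P = evecs[:, -k:]                    # top-k principal components
    return np.trace(U.T @ Sigma @ U) / np.trace(P.T @ Sigma @ P)

# Example on anisotropic synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20)) * np.linspace(1, 5, 20)
Sigma = (X - X.mean(0)).T @ (X - X.mean(0)) / 200
evecs = np.linalg.eigh(Sigma)[1]
top5, bottom5 = evecs[:, -5:], evecs[:, :5]
```

By construction the overlap equals 1 when U is the top-k principal subspace itself, and is strictly smaller for any other dimensionality-matched subspace.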
Specifically, for measuring subspace overlap between neural activity and persistent vs. amplifying modes (Fig. 4c, Fig. 5f, Fig. 6e, Supplementary Fig. 7d, Supplementary Fig. 8d-e, and Supplementary Fig. 10f,g), we set U_t′ = U, where the columns of U are the first d/4 eigenvectors of the most persistent or most amplifying subspace (orthogonalized using a QR decomposition for the most persistent modes; this was not necessary for the most amplifying modes, which are orthogonal by construction), or d/4 randomly chosen orthonormal vectors as a control (shown as 'chance'; computed analytically as 1/4 for 'de novo' linear networks (Fig. 4c and Supplementary Fig. 8d-e), and numerically for fitted linear networks, also yielding values of approximately 1/4; Fig. 5f, Fig. 6e, Supplementary Fig. 7d, and Supplementary Fig. 10f). P(t) contained the first d/4 principal components. In this case, a value of 1 for this metric implies that the d/4 directions of greatest variability in neural activity overlap exactly with the d/4-dimensional subspace spanned by U. The time resolution of t was 20 ms (for clarity, bins to be plotted were subsampled in the corresponding figures). Note that when this analysis was performed on linear networks fitted to neural data (experimentally recorded or simulated), U, P(t), and Σ(t) were all obtained from the same fitted linear network (i.e. no cross-validation). Specifically, the parameters of the network were used to determine U (see Methods S2.4.3), and the neural responses these fitted linear dynamics generated (rather than the original neural responses that were fit by the linear model) were used to determine Σ(t) and thus P(t). See Methods S2.8 for computing the significance of these overlaps (and their differences). When analysing optimized ring attractor networks (Supplementary Fig. 3e), we used 2-dimensional subspaces (rather than d/4-dimensional subspaces) because we found empirically that the obtained ring attractors lay in a 2-dimensional subspace.
For analyzing subspace sharing between different task epochs (Supplementary Fig. 11c, Supplementary Fig. 12d, Supplementary Fig. 13d, and Supplementary Fig. 10b), U_t′ contained the top k principal components (PCs) of neural activity within the time bin indexed by t′ (we used k = 10 for the monkey data and k = 4 for our models, because the models typically exhibited lower-dimensional dynamics), while P(t) included all PCs within the time bin indexed by t. For these, we performed principal components analysis with dimensions corresponding to neurons and data points corresponding to time points and cue conditions.

S2.7.4 Linear decoding
We fitted decoders using linear discriminant analysis to decode the stimulus cue identity from neural firing rates (Fig. 2e,f,k,l, Fig. 4a,b, Fig. 5b,c, Fig. 6c,d, Supplementary Fig. 7c, Supplementary Fig. 3d, Supplementary Fig. 8a-c, Supplementary Fig. 10a,h, Supplementary Fig. 11a,b, Supplementary Fig. 12b,c, and Supplementary Fig. 13b,c). We constrained the decoders to be 2-dimensional (in line with previous studies (4)) because this was a sufficient dimensionality to decode responses. We primarily considered two types of decoding analyses: we either trained decoders on late delay activity and tested on all time points ('delay-trained decoder', e.g. Fig. 4a), or we trained decoders separately at every time point and tested on all times ('full cross-temporal decoding', e.g. Fig. 4b). In all cases, we measured decoding performance in a cross-validated way, using separate sets of neural trajectories to train and test the decoder, and we show results averaged over 10 random 1:1 train:test splits of the data. For delay-trained decoders, training data consisted of neural activity pooled over a 500 ms time interval (the time interval is shown by a horizontal black bar in all relevant figures), and we tested the thus-trained decoder with data in each 1-ms time bin across the trial (for clarity, test bins to be plotted were subsampled every 10 ms in the corresponding figures). For full cross-temporal decoding, we binned neural responses into 10 ms time bins and trained and tested on all pairs of time bins (specifically, we plotted mean decoding performance across the 10 1-ms raw time bins corresponding to each 10-ms testing bin). We used a shrinkage (inverse regularisation parameter on the Euclidean norm of decoding coefficients) of 0.5. Chance-level decoding was defined as 1/C, where C = 2 or 6 is the number of cue conditions that need to be decoded (Tables S1 and S3).
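A minimal sketch of the delay-trained decoding analysis, using scikit-learn's LDA with shrinkage of 0.5. The trial counts, bin counts, and signal model below are illustrative inventions, not the paper's data, and the two-class case is shown for simplicity:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_trials, n_neurons, n_bins = 60, 30, 20
labels = rng.integers(0, 2, n_trials)            # two cue conditions

# Synthetic firing rates: cue-dependent mean that grows over time bins
means = rng.standard_normal((2, n_neurons))
X = rng.standard_normal((n_trials, n_bins, n_neurons))
for t in range(n_bins):
    X[:, t, :] += means[labels] * (0.5 + 0.05 * t)

train = np.arange(0, n_trials, 2)                # 1:1 train:test split
test = np.arange(1, n_trials, 2)

# 'Delay-trained' decoder: pool activity over late bins for training,
# then test the same decoder at every time point of the trial
lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.5)
late = X[train][:, -5:, :].reshape(-1, n_neurons)
lda.fit(late, np.repeat(labels[train], 5))
acc = [lda.score(X[test][:, t, :], labels[test]) for t in range(n_bins)]
```

Plotting `acc` against the time bin index gives one row of the cross-temporal decoding matrix; training separate decoders at every bin fills in the full matrix.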

S2.7.5 Quality of fit for linear models fitted to neural responses
When fitting linear models to neural data (experimentally recorded or simulated; Methods S2.4.3) we used a cross-validated approach for measuring the quality of our fits, with a random 1:1 train:test split of the data (Fig. 5d). For this, we first fitted the model on training data (x^(c)_* = x^(c)_train in Eq. S9). The quality of fit was then computed on the test data, x^(c)_test, as the fraction of variance of x^(c)_test(t) explained by the simulated responses (after the appropriate projection, i.e. C x^(c)(t)), across all 20 dimensions weighted by D (all parameters, including P, C, and D, were set to their values obtained by fitting the training data). In other words, we computed the Pearson R² with respect to the identity line using the mean squared error, ε², in Eq. S8, with the momentary error in Eq. S9 computed using x^(c)_* = x^(c)_test. Once the quality of fit for this split was thus established, we conducted all further analysis involving fitted linear models with the model that was fit to the training half of this split.
As a meaningful lower bound on our quality-of-fit measure, we also computed the same measure (i.e. fitting a linear neural network to training data and calculating the quality of fit using test data) for 100 different time-shuffled controls of the original train:test split of the data (Methods S2.4.3), such that we shuffled time bins coherently between the training and the test data, across neurons and conditions (Fig. 5d, dark gray).
To calibrate how much our fits were limited by the noisiness of the data, we also computed the quality of fit directly between x^(c)_train(t) and x^(c)_test(t) (i.e. using the mean squared error, ε², in Eq. S8, with the momentary error redefined as e^(c)(t) = x^(c)_train(t) − x^(c)_test(t)) for 100 random 1:1 train:test splits of the data (Fig. 5d, light gray). The extent to which the R² computed with this control was below 1 reflected the inherent (sampling) noise of the experimental data that limited the quality of fit obtainable with any parametric model, including ours that was based on linear dynamics. Moreover, a cross-validated R² computed with our fits that was higher than the R² obtained with this control (Fig. 5d, dark and light blue vs. light gray) meant that the inherent assumption of linear dynamics in our model acted as a useful regularizer, preventing the overfitting that this overly flexible control inevitably suffered from. See Methods S2.8 for more on statistical testing for our quality-of-fit measure.
When fitting to simulated neural data, we obtained high-quality fits using the same measure (R² > 0.95).
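The quality-of-fit measure and the noise-ceiling control described above can be sketched as follows (the signal and noise levels are illustrative):

```python
import numpy as np

def identity_r2(pred, target):
    """R^2 with respect to the identity line: 1 - MSE(pred, target) / Var(target)."""
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - np.mean(target)) ** 2)
    return 1.0 - mse / var

# Noise-ceiling control: compare two noisy observations of the same
# underlying signal directly against each other (no model in between)
rng = np.random.default_rng(0)
signal = rng.standard_normal(500)
train_half = signal + 0.3 * rng.standard_normal(500)
test_half = signal + 0.3 * rng.standard_normal(500)
ceiling = identity_r2(train_half, test_half)
```

Even a perfect model of the underlying signal cannot exceed this ceiling on held-out data, which is why the light-gray control in Fig. 5d falls below 1.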

S2.7.6 Overlap between the coding populations during the cue and delay epochs
To test whether separate neural populations encode stimulus information during the cue and delay epochs (Supplementary Fig. 10e), we trained (non-cross-validated) decoders to decode cue identity using logistic regression on either cue-epoch activity ('cue-trained'; the first 250 ms of activity after cue onset) or delay-epoch activity ('delay-trained'; 1250-1500 ms after cue onset). We used an L2 regularisation penalty of 0.5 (we also tested other regularisation strengths and observed no substantial changes in our results). We took the absolute value of decoder weights as a measure of how strongly neurons contributed to decodability (either positively or negatively). We then binarized the absolute 'cue-trained' and 'delay-trained' weights using their respective median values as the binarization threshold. (This binarization reduces a potential bias effect from large or small weight values in our analysis.) Our measure of overlap between the coding populations during the cue and delay epochs was then simply the inverse normalized Hamming distance between these two sets of binarized weights:

overlap = ⟨1 − |w^cue_{n,c} − w^delay_{n,c}|⟩_{n,c}

where w^cue_{n,c} ('cue-trained') and w^delay_{n,c} ('delay-trained') are the binarized weights of neuron n in cue condition c during the cue and delay epochs, respectively, and ⟨•⟩_{n,c} denotes taking the mean across neurons and cue conditions. For completely overlapping populations, this measure takes a value of 1; for completely non-overlapping populations, it takes a value of 0; and for random overlap (shown as 'chance') it takes a value of 0.5.
For the shuffle controls, we randomly permuted the neuron indices of the delay-trained weights (such that using the median as a threshold thus resulted in values close to 0.5, i.e. chance level; Supplementary Fig. 10e). We show results (for both the original analysis and the shuffle control) for 10 random halves of the data (equivalent to the training halves of 10 different 1:1 train:test splits). We also tested a variety of percentile values other than the median, and our results did not change substantially (choosing a threshold other than the median causes both the data and shuffle controls to have overlap values lower than those obtained with the median as the threshold, but it does not substantially affect the difference between them).
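The binarization and overlap measure can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def coding_overlap(w_cue, w_delay):
    """Inverse normalized Hamming distance between binarized decoder weights.

    w_cue, w_delay: (n_neurons, n_conditions) arrays of decoder weights.
    Absolute weights are binarized at their respective medians; the overlap
    is 1 for identical binary patterns, 0 for complementary patterns, and
    close to 0.5 for random (chance-level) overlap.
    """
    b_cue = (np.abs(w_cue) > np.median(np.abs(w_cue))).astype(float)
    b_delay = (np.abs(w_delay) > np.median(np.abs(w_delay))).astype(float)
    return 1.0 - np.mean(np.abs(b_cue - b_delay))

rng = np.random.default_rng(0)
w = rng.standard_normal((10, 6))
# A weight matrix whose binarization is exactly complementary to w's
w_comp = np.max(np.abs(w)) - np.abs(w) + 0.1
```

The shuffle control in the text corresponds to calling this function with the neuron rows of the second argument randomly permuted.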

S2.7.7 Finding fixed / slow points
For finding the fixed / slow points of nonlinear network dynamics (Fig. 2d,j, Supplementary Fig. 2b,f, Supplementary Fig. 3a, and Supplementary Fig. 11d), we used a slow-point analysis method (21) that searches for an x for which the squared L2 norm of the velocity determined by the autonomous dynamics of the network is below a threshold. Note that this was only possible in model neural networks, as the method requires access to the equations (and parameters) defining the true (nonlinear) dynamics of a system. Specifically, for network dynamics governed by some function ψ (cf. Eqs. S1 and S2), we sought to find points x* such that ‖ψ(x*)‖² is small. To achieve this, we drew 1000 x's from a spherical Gaussian distribution with mean 0 and variance 10 (the large variance helps to ensure that we cover a large part of state space) and we optimized each x to minimize ‖ψ(x)‖² using gradient descent, with gradients obtained by back-propagation with an Adam optimizer (20). We used an adaptive learning rate (which we found worked substantially better than a fixed learning rate in this scenario) that started at 0.1 and halved every 1000 training iterations (we used 5000 training iterations in total). Finally, we identified the x's obtained at the end of optimization as asymptotically stable fixed points, x*, if ‖ψ(x*)‖² < 0.001 and if the largest real part of the eigenvalues of the linearization of ψ around x* was less than 0.
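The slow-point search can be sketched on an illustrative two-dimensional Duffing-like system (stand-in for the networks in the text). For simplicity this sketch uses plain gradient descent with gradient-norm clipping and an analytic Jacobian, rather than Adam with back-propagation as in the text; the learning-rate schedule and acceptance criteria follow the description above:

```python
import numpy as np

# Illustrative dynamics: psi(x) = (x2, x1 - x1^3 - x2), with a saddle at
# the origin and asymptotically stable fixed points at (+-1, 0).
def psi(x):
    return np.array([x[1], x[0] - x[0] ** 3 - x[1]])

def jac(x):
    return np.array([[0.0, 1.0], [1.0 - 3.0 * x[0] ** 2, -1.0]])

rng = np.random.default_rng(0)
found = []
for _ in range(20):                               # 20 restarts (1000 in the text)
    x = rng.normal(0.0, np.sqrt(10.0), size=2)    # broad Gaussian initialization
    lr = 0.1
    for it in range(5000):
        if it > 0 and it % 1000 == 0:
            lr *= 0.5                             # halve the learning rate
        g = 2.0 * jac(x).T @ psi(x)               # gradient of ||psi(x)||^2
        g = g / max(1.0, np.linalg.norm(g))       # clipping keeps plain GD stable
        x = x - lr * g
    if psi(x) @ psi(x) < 1e-3:
        # keep only asymptotically stable fixed points
        if np.max(np.real(np.linalg.eigvals(jac(x)))) < 0.0:
            found.append(x)
```

The stability check discards the saddle at the origin even though it also has a vanishing velocity, so only the two stable fixed points survive.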

S2.7.8 Correlations between initial and final neural firing rates
To measure correlations between initial and final simulated activities, we used the Pearson correlation coefficient (with respect to the identity line) between initial and final mean-centered firing rates across neurons within the same simulation (i.e. no cross-validation; Fig. 2b,h, insets). Histograms show the distribution of this correlation across 6 cue conditions (and the 10 different networks, each simulated 10 times, see above) using a kernel-density estimate (Fig. 2c,i, Supplementary Fig. 2c,d,g,h, and Supplementary Fig. 3c).

S2.7.9 Measuring non-normality using Henrici's index
When measuring the non-normality of our linear models fitted to our neural recordings (Fig. 5d-f), we used Henrici's index (31):

H(J) = sqrt(‖J‖²_F − Σᵢ |λᵢ|²) / ‖J‖_F

where J = W − I is the Jacobian of the dynamics, λᵢ are the eigenvalues of J, and ‖•‖_F is the Frobenius norm of a matrix. An index of 0 implies that the system is normal, whereas an index of 1 implies that the system is maximally non-normal.
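The index can be computed in a few lines; by Schur's inequality the Frobenius norm squared is at least the sum of squared eigenvalue magnitudes, with equality exactly for normal matrices:

```python
import numpy as np

def henrici_index(J):
    """Normalized Henrici departure from normality of a square matrix J.

    Returns 0 for normal matrices (e.g. symmetric J) and approaches 1
    for maximally non-normal ones (e.g. a nilpotent shift matrix).
    """
    fro2 = np.sum(np.abs(J) ** 2)                    # ||J||_F^2
    eig2 = np.sum(np.abs(np.linalg.eigvals(J)) ** 2)  # sum of |lambda_i|^2
    return np.sqrt(max(fro2 - eig2, 0.0)) / np.sqrt(fro2)
```

For example, the nilpotent matrix [[0, 1], [0, 0]] has all eigenvalues equal to zero but a nonzero Frobenius norm, giving an index of exactly 1.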

S2.8 Statistics
We performed statistical hypothesis testing in two cases.
First, we tested whether the quality of fit of linear models to experimental data was sufficiently high using permutation tests. To construct the distribution of our test statistic (cross-validated R², see also Methods S2.7.5) under the null hypothesis, we used n = 200 different random time shuffles of the data (Fig. 5d, dark gray), such that we shuffled time bins coherently between the training and the test data, across neurons and conditions, and for each shuffle used the same random 1:1 train:test split as for the original (unshuffled) data. For additional calibration, we also constructed the distribution of our test statistic under the alternative hypothesis that all cross-validated errors were due to sampling noise differences between the train and test data. For this, we used n = 200 random 1:1 train:test splits of the (original, unshuffled) data, and measured the quality of fit directly between the test data and the training data (rather than a model fitted to the training data; see also Methods S2.7.5; Fig. 5d, light gray). In both cases, we computed the two-tailed p-value of the test statistic as computed on the real data (Fig. 5d, blue lines) with respect to the corresponding reference distribution.
Second, we also used a permutation-test-based approach to test whether the experimentally observed overlaps with persistent and amplifying modes (or their differences) were significantly different from those expected by chance. For testing the significance of overlaps in a given time step, we constructed the distribution of our test statistics (the overlap measures; Methods S2.7.3) under the null hypothesis by generating n = 200 random subspaces within the space spanned by the 20 PCs we extracted from the data (Methods S2.4.3), dimensionality-matched to the persistent and amplifying subspaces (i.e. 5 orthogonal dimensions), and computed the same subspace overlap measures for the data in the given time step with respect to these random subspaces (Fig. 5f and Supplementary Fig. 10f-g; gray line and shading). For testing the significance of differences between overlaps (amplifying vs. persistent at a given time step, or amplifying or persistent between two different time steps), our test statistic was this difference (i.e. a paired test), and our null distribution was constructed by measuring it for n = 200 pairs of random subspace overlaps at the appropriate time step(s). Once again, in all these cases we computed the two-tailed p-value of the test statistic as computed on the real data (Fig. 5f and Supplementary Fig. 10f-g, green and red lines) with respect to the corresponding reference distribution.
Note that we did not compute p-values across multiple splits of the data, because this led to p-value inflation as we increased the number of splits. Instead, we repeated all relevant analyses on 10 different random 1:1 train:test splits to check that our results were robust to the choice of data split. Indeed, we obtained qualitatively and quantitatively (in terms of p-values for quality of fits, and overlaps) similar results for all these splits. Permutation tests do not assume that the data follow any pre-defined distribution. No statistical methods were used to predetermine experimental sample sizes. Sample sizes for permutation tests (n above) were chosen so as to be able to determine p-values to a precision of 0.01 (for both quality of fits and subspace overlaps).
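The two-tailed permutation p-value used in both tests can be sketched as follows; the +1 correction in numerator and denominator is a common convention (not stated explicitly in the text) that prevents p-values of exactly zero:

```python
import numpy as np

def two_tailed_p(stat, null_samples):
    """Two-tailed permutation p-value of `stat` w.r.t. a null distribution."""
    null = np.asarray(null_samples)
    p_hi = (np.sum(null >= stat) + 1) / (null.size + 1)
    p_lo = (np.sum(null <= stat) + 1) / (null.size + 1)
    return min(1.0, 2.0 * min(p_hi, p_lo))
```

With n = 200 null samples, the smallest attainable p-value is 2/201 ≈ 0.01, which matches the stated precision.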

S3 Supplementary text
The main insight of our work is that optimal information loading relies on inputs aligned with the most amplifying mode of network dynamics, which in turn accounts for the combination of stable and dynamic coding seen in experiments. In Section S4, we first show that this combination does not arise in previously proposed models of working memory with hand-crafted connectivities, but does in task-optimized neural networks. To understand why network optimization leads to this behavior, in Section S5 we provide mathematically rigorous derivations of optimal information loading for linear network dynamics. However, these derivations, and indeed the very concept of the most amplifying mode, do not readily apply to nonlinear networks, which are likely to be more relevant for understanding cortical circuits. Therefore, in Section S6 we generalize the concept of the most amplifying mode to nonlinear dynamical systems (through a local linearization around the ground state) and show that in a simple canonical nonlinear attractor system it offers the same qualitative advantage for information loading as in linear networks. In Section S7, we demonstrate the same advantage in high-dimensional multi-attractor nonlinear networks, show that the abstract notion of information loading performance we used in simple systems predicts a practically more relevant measure of performance in these networks, and show that the dynamical signatures we established for identifying optimal information loading in linear networks can still be used even when the real dynamics of a circuit are nonlinear and only indirectly accessible to us through the neural responses they generate. As there are no readily available analytical techniques that could be directly applied to the problem of information loading in nonlinear dynamical systems, our analyses in Sections S6 and S7 are mostly based on numerical simulations of specific nonlinear dynamical systems where we have ground-truth knowledge of the dynamics and the
information loading strategy that is optimal for them. Finally, in Section S8, we show with analytical derivations that optimizing a ring attractor with a negative cosine loss on the difference between true and decoded angles (as we do in the main text) is approximately equivalent to optimizing the population Fisher information about angle in the network.

S4 Comparison of previous models of working memory with task-optimized networks
We simulated four previously proposed representative models of working memory, as well as one of the task-optimized networks developed here for comparison (the just-in-time network in Fig. 6d, right).
First, we simulated the 'bump' attractor network of Ref. 1: a recurrent network in which neurons are indexed by their preferred directions and are connected recurrently by a circulant weight matrix, such that the connection strength between two excitatory neurons only depended (as a circular Gaussian) on the difference of their preferred directions, and there was a single inhibitory neuron providing global inhibition (Methods S2.5). (Note that this network thus represents a paradigmatic case of quasi-symmetric connectivity.) In the absence of inputs, this network is able to maintain the memory of a saccade direction via a 'bump' of activities: its intrinsic dynamics (mediated by the recurrent interactions between its neurons) converge to a steady-state pattern in which the response of neurons is described by a Gaussian-like ('bump') function of the difference between their preferred direction and the maintained direction. To understand information loading and static vs. dynamic coding in this network, it is useful to adopt a state space view of its dynamics (cf. Fig. 3). Information about the to-be-memorised direction (the stimulus cue) is loaded into this network by a transient external input. During the cue period, this transient input drives the network (Supplementary Fig. 1a, left, pale purple line and arrow) into a suitable state (Supplementary Fig. 1a, left, pale purple circle). Then, during the delay period, in the absence of the cue, its intrinsic dynamics (Supplementary Fig. 1a, left, dark purple line) are 'attracted' into a cue-specific steady state (corresponding to a bump of activities; Supplementary Fig. 1a, left, black cross). This naturally accounts for selective persistent activity (Supplementary Fig. 1a, center). However, the state into which the external input drives network activities, i.e. the initial state during the delay period, already has large overlap with the steady state to which the dynamics eventually converge (Supplementary Fig.
1a, left, pale purple line points in the direction of the black cross). In other words, the network performs simple 'pattern completion' (32), whereby this overlap is only slightly improved until it becomes perfect (Supplementary Fig. 1a, left), with the response of each neuron quickly reaching its steady state and maintaining it throughout most of the delay period (Supplementary Fig. 1a, center). As a result, neurons show limited transient activity during the delay period (Supplementary Fig. 1a, center), and cross-temporal decoding reveals stable coding throughout the whole trial, lacking the characteristic dynamic coding seen in experimental data (compare Fig. 1b to Supplementary Fig. 1a, right).
We also simulated the closely related 'discrete' attractor network of Ref. 1. In this network, neurons are assigned to a finite number of discrete clusters (one cluster for each cue condition), with strong recurrent connections between neurons belonging to the same cluster, and weaker connections between neurons belonging to different clusters (Methods S2.5). Thus, instead of maintaining bumps of activity, the steady states of this network correspond to one of the clusters being active while all others exhibit weak background activity. Other than that, the same principles of simple pattern completion apply to information loading that we saw in the bump attractor network (Supplementary Fig. 1a). Therefore, this network is also characterized by the absence of dynamic coding (Supplementary Fig. 1b,c).
Recent work has suggested that linear integrator networks, which are based on similar principles as nonlinear attractor networks but use linear dynamics, can exhibit activity transients similar to those seen in experiments (4) (see also (25)). In line with earlier work, when simulating such models, we also found activity transients to emerge when cue-dependent inputs included a random component in addition to the 'usual' steady-state-aligned component that ensured pattern completion (Methods S2.5). Because the random component in these networks is designed to be orthogonal to the steady-state-aligned component, the dynamics of the network take time to relax to the subspace spanned by the cue-dependent steady states (the coding subspace; Supplementary Fig. 1c, left), and neurons exhibit transient activities during this relaxation (Supplementary Fig. 1c, middle). However, precisely because this relaxation thus takes place in a subspace orthogonal to the persistent subspace, it does not affect stimulus coding during the delay. Thus a delay-trained decoder will always generalize perfectly to the cue period. Indeed, we found that cross-temporal decoding demonstrated strongly stable coding (Supplementary Fig. 1c, right; see also Supplementary Fig. 8a), just as in the classical attractor networks seen above. We were able to achieve more dynamic coding in a modified version of this model, in which the random component of the inputs completely dominated (Methods S2.5.3, Supplementary Fig. 8b). In other words, this modified model consisted of unconstrained (non-normal) connectivity and purely random inputs (similar to Fig. 4, 'random'). Nevertheless, our neural measure of optimal information loading, based on the time course of the overlap of activities with the most amplifying and most persistent modes (Supplementary Fig. 8d,e), still revealed a fundamental difference from experimental data (cf. Fig. 5f) and task-optimized neural networks (cf. Fig. 6e).
A major class of alternative models of working memory uses essentially feedforward dynamics (5,6). We thus simulated a simple, purely feedforward network that captured the essence of such dynamics (Methods S2.5). As expected, this network did not exhibit any steady state, and inputs that stimulated the beginning of the feedforward chain resulted in long transients that rapidly transitioned into activity patterns that were orthogonal to these inputs (because they involved the firing of an altogether different group of neurons at the end of the chain), after which, ultimately, all activities decayed to zero (Supplementary Fig. 1d, left and middle). (Note that the orthogonality of successively expressed activity patterns, and the eventual decay of activities, would remain the same even in the more general case when the nodes of the feedforward chain are distributed patterns of neural activities, rather than individual neurons.) As a consequence, this network showed strong dynamic coding, with cross-temporal decoding being high only between neighbouring time points, but no stable coding during the delay period (Supplementary Fig. 1d, right).
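A minimal sketch of such a chain (made-up size and weights, standing in for the models of Refs. 5,6) shows both signatures: propagation into orthogonal activity patterns, then decay.

```python
import numpy as np

# Purely feedforward chain: node i drives node i+1, every node leaks,
# and there is no recurrence.
n = 5
W = 2.0 * np.diag(np.ones(n - 1), k=-1)       # feedforward coupling
J = W - np.eye(n)                             # leak on every node

def simulate(x0, T, dt=0.01):
    # forward-Euler integration of dx/dt = J x
    x = np.array(x0, float)
    for _ in range(int(T / dt)):
        x = x + dt * (J @ x)
    return x

x0 = np.zeros(n)
x0[0] = 1.0                                   # stimulate the start of the chain

# activity propagates to later nodes, orthogonal to the input pattern ...
assert np.argmax(np.abs(simulate(x0, 2.0))) > 0
# ... and ultimately decays to zero: dynamic coding, but no stable coding
assert np.linalg.norm(simulate(x0, 30.0)) < 1e-3
```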
Finally, for comparison, we show the same plots for a representative example of one of our task-optimized networks (Supplementary Fig. 1e; see also Fig. 6, 'just-in-time'). In this network, inputs are also largely orthogonal to the late delay activities, as in the feed-forward networks, but eventually converge to an attractor maintaining information during the late delay period, as in attractor networks (Supplementary Fig. 1e, left). As a consequence, neurons show rich transient dynamics (Supplementary Fig. 1e, middle), and the network exhibits the experimentally described combination of dynamic and stable coding (Supplementary Fig. 1e, right).

S5 Analysis of linear integrator networks
In this section we provide analytical derivations showing that optimal information loading is achieved by the most amplifying mode in linear integrator networks. We first define linear integrator networks and introduce the basic mathematical concepts for understanding their dynamics (Section S5.1). We then define the persistent mode, a direction in the state space of linear integrator networks that is critical for characterising their performance (Section S5.2). We also define the energy of the system as a quantity useful for both constraining optimisation and defining optimal information loading (Section S5.3). We then define optimal information loading in five seemingly different ways (via five different constrained optimization problems), but show that all definitions lead to the same solution (at least approximately):
1. The initial condition with a fixed norm that achieves the highest asymptotic overlap with the persistent mode (Section S5.4).
2. The initial condition with minimal norm that achieves a fixed asymptotic overlap with the persistent mode (Section S5.5).
3. The most amplifying mode, defined as the initial condition with a fixed norm that evokes the largest asymptotic energy (Section S5.6).
extended inputs). In the previous subsection we saw that, asymptotically, the system's state is always aligned with the persistent mode, independent of its initial condition. However, the precise point along that mode, i.e. the system's overlap with the persistent mode, still depends on the initial condition.
We define the momentary overlap with the persistent mode as

p(t) = v(1)ᵀ x(t) (S56)

Using the eigen-coefficients (Eq. S39), we once again rewrite this in closed form as an explicit function of the initial condition, and take the asymptotic limit (Eq. S43) to obtain (see also Ref. 21 for an analogous derivation)

p∞ = lim t→∞ p(t) ∝ u(1)ᵀ x(0)

In order to study optimal information loading, we want to identify the initial condition, x(0), that achieves the maximal asymptotic overlap with the persistent mode. As the system has linear dynamics, p(t) simply scales linearly with the norm of x(0). Therefore, to make our question meaningful, we have to introduce an additional constraint that prevents the trivial solution of an input with an infinitely large norm. We first consider a norm constraint directly on x(0). This leads to solving the following optimization problem

x*(0) = argmax‖x(0)‖=1 p∞ (S59)

of which the solution is trivially

x*(0) = ±u(1) (S60)

Thus, optimal information loading is achieved by the first left eigenvector of J (which, as we saw above, will not be identical to the first right eigenvector, v(1), i.e. the persistent mode itself, in the general case, when W, and thus J, is non-normal).
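This result is easy to verify numerically. The sketch below uses a made-up 2-neuron non-normal J (eigenvalue 0 for the persistent mode, -1 for a decaying mode) and computes the top left eigenvector as an eigenvector of Jᵀ.

```python
import numpy as np

# Non-normal toy "integrator": the feedforward weight 5 makes J non-normal,
# so left and right eigenvectors of the persistent eigenvalue differ.
J = np.array([[0.0, 5.0],
              [0.0, -1.0]])

evals, V = np.linalg.eig(J)
v1 = V[:, np.argmax(evals.real)].real          # persistent mode (right eigvec)

evalsT, U = np.linalg.eig(J.T)
u1 = U[:, np.argmax(evalsT.real)].real         # top LEFT eigenvector of J
u1 = u1 / np.linalg.norm(u1)

def p_inf(x0, T=30.0, dt=0.01):
    # asymptotic overlap with the persistent mode, by direct Euler simulation
    x = np.array(x0, float)
    for _ in range(int(T / dt)):
        x = x + dt * (J @ x)
    return abs(v1 @ x)

# a unit-norm input along u1 achieves a larger asymptotic persistent overlap
# than a unit-norm input along the persistent mode v1 itself
assert p_inf(u1) > p_inf(v1 / np.linalg.norm(v1))
```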
S5.5 Optimal information loading: achieving fixed asymptotic persistent overlap with a minimal initial norm
Defining optimal information loading instead as the input that achieves a fixed asymptotic overlap with the minimal norm, i.e. as the solution to the optimization problem

x*(0) = argmin ‖x(0)‖ subject to a fixed p∞

trivially leads to the same solution (up to multiplicative scaling) as that which we saw in Section S5.4, as it amounts to optimising the same Lagrangian:

x*(0) = ±u(1) (S62)

We will also consider below a third way to constrain the optimization problem, via its 'energy' (time-integrated norm), rather than just the norm of its initial condition (Section S5.8).

S5.6 Optimal information loading: maximising asymptotic energy with fixed initial norm
The asymptotic energy we introduced above (Eq. S55) allows us to relate optimal information loading to a well-known concept in control theory: the most amplifying mode, which is the eigenvector of the observability Gramian associated with the largest eigenvalue (30,33,34). Crucially, the most amplifying mode can also be defined as the initial condition with a unit norm that evokes the largest asymptotic energy, i.e. as the solution to the optimization problem

x*(0) = argmax‖x(0)‖=1 ε∞ (S63)

which is straightforwardly given by the top eigenvector of the observability Gramian Q (Eq. S54):

x*(0) = ±a(1) (S64)

Comparing Eq. S64 with Eq. S60 reveals that, in the case of 1-dimensional linear integrators, the most amplifying mode achieves optimal information loading, as it is equivalent to the top left eigenvector. The intuition behind this result is simple:
• Optimal information loading is defined (Section S5.4) as using an initial condition that (asymptotically) generates maximal activity along the persistent mode.
• The most amplifying mode is formally defined as the direction for initial conditions that results in the largest overall fluctuations (on an infinite time horizon; Eq.S63).
• In linear integrators, all fluctuations are eventually quenched, except those that are expressed along the persistent mode of the system (Eq.S43).
• Therefore, if an input direction generates large overall fluctuations, it must be because it achieves a large asymptotic overlap with the persistent mode.
• Therefore, the most amplifying mode must achieve the largest overlap with the persistent mode, i.e. it is the optimal information loading direction.
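The steps above can be checked numerically. The sketch below uses a made-up leaky (hence stable) 2D integrator, so that the Gramian Q = ∫₀^∞ e^{Jᵀt} e^{Jt} dt exists and can be obtained from the Lyapunov equation Jᵀ Q + Q J = −I rather than by explicit integration.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Toy leaky integrator: slow ("persistent") mode plus fast mode, non-normal.
J = np.array([[-0.05, 5.0],
              [0.0,  -1.0]])

# observability Gramian: solves J^T Q + Q J = -I for stable J
Q = solve_continuous_lyapunov(J.T, -np.eye(2))
w, V = np.linalg.eigh(Q)
a = V[:, np.argmax(w)]                    # most amplifying mode

def energy(x0, T=200.0, dt=0.01):
    # time-integrated squared norm of the trajectory evoked by x0 (Euler)
    x, E = np.array(x0, float), 0.0
    for _ in range(int(T / dt)):
        E += dt * float(x @ x)
        x = x + dt * (J @ x)
    return E

rng = np.random.default_rng(0)
r = rng.standard_normal(2)
r = r / np.linalg.norm(r)

# the unit-norm most amplifying input evokes more energy than a random one,
# and the simulated energy matches the quadratic form x0^T Q x0
assert energy(a) > energy(r)
```

The quadratic-form identity (energy of x(0) equals x(0)ᵀ Q x(0) for stable dynamics) is what makes the top eigenvector of Q the energy-maximising unit input.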
More importantly, as we show in Section S5.7, the most amplifying mode generalises to cases (multiple integration directions in non-perfect integrators) in which the left eigenvectors are not relevant, thus motivating our use of the set of most amplifying directions, rather than left eigenvectors, to define optimal information loading throughout our analyses.

S5.7 Optimal information loading: noise-robust decoding
We now study the more general problem of optimal inputs for general, stable linear dynamics, which are not necessarily exactly marginally stable, and may in fact have multiple dimensions along which they (approximately) integrate. To be closer to the actual (linear) networks we study in the main text (Fig. 4, Supplementary Fig. 5c,d, and Supplementary Fig. 8a,b,d,e), we consider stochastic dynamics of the form (cf. Eq. S35, but also see Eq. S2)

ẋ(t) = J x(t) + σ η(t) (S65)

where η(t) is zero-mean, unit-variance white Gaussian noise, so that the total noise injected into the system has variance σ². We assume that the network has attained its (input-free) steady-state distribution before the stimulus arrives at t = 0, and the input x(0) is an instantaneous additive 'push' of the network state from whatever stochastic state it takes at t = 0 (rather than directly defining the network's state at t = 0 as before).
Considering stochastic network dynamics allows us to meaningfully define optimality in terms of decoding performance (rather than the somewhat arbitrary objective of maximising overlap with a particular direction in state space), in closer analogy with the main text. Following Ref. 6, we quantify decoding performance using the Fisher information. For this, we first note that, as the dynamics is stochastic, the state of the system at any time is described by a probability distribution. Given that the dynamics is linear and the noise is Gaussian (Eq. S65), this distribution is also Gaussian:

x(t) ∼ N( e^{Jt} x(0), C ) (S66)

where C = σ² ∫₀^∞ e^{Jt} e^{Jᵀt} dt is the stationary noise covariance of neural responses. (Note that the mean of the state distribution is the same as that for deterministic dynamics, Eq. S41, so that noise only contributes to the covariance, and conversely, the covariance does not depend on the input x(0).) The ability to discriminate two inputs in the x(0) direction, x(0) and x(0) (1 + ∆), at a later time t, is upper bounded by the Kullback-Leibler (KL) divergence between the two corresponding state distributions at that time. For normal distributions that only differ in their means (Eq. S66), the KL-divergence can be written simply as

DKL(t) = (∆²/2) x(0)ᵀ I(t) x(0) (S67)

where I(t) is the Fisher information matrix of the system at time t:

I(t) = e^{Jᵀt} C⁻¹ e^{Jt} (S68)

Indeed, this upper bound on decoding performance is saturated by a simple linear decoder, x̂(t) = W_out x(t), with readout weights W_out that depend on the mean trajectory e^{Jt} x(0) and on C⁻¹ (Eq. S69). Note that the optimal decoder is thus time-dependent.
Finally, to create a single scalar objective that can be optimised, we measure the 'total' decodability of the system over a suitably long (technically, infinite) time horizon:

D̄KL = ∫₀^∞ DKL(t) dt = (∆²/2) x(0)ᵀ Ī x(0) (S70)

where Ī is the total Fisher information matrix of the system (also called the 'Spatial Fisher memory matrix' in Ref. 6):

Ī = ∫₀^∞ e^{Jᵀt} C⁻¹ e^{Jt} dt (S71)

which is symmetric and positive definite by construction. Note that Eq. S70 measures (an upper bound on) the diagonal of what we call the 'cross-temporal decoding matrix' in the main text, i.e. the performance of a time-dependent optimal linear decoder, whereas we (optimise and) use time-independent decoders for measuring decoding performance in stochastic integrators (roughly corresponding to a horizontal slice of the cross-temporal decoding matrix, Fig. 4a-b). As, in general, the optimal decoder is indeed time-dependent (Eq. S69), a single time-independent linear decoder will not be able to attain optimal performance. Nevertheless, for the integrators we study, D̄KL is dominated by persistent activity, which in turn can be decoded by a fixed decoder, and so a time-independent linear decoder is able to achieve near-optimal performance. We can now define optimal information loading as the x(0) that maximizes this total decodability:

x*(0) = argmax‖x(0)‖=1 x(0)ᵀ Ī x(0) (S72)

This is trivially solved by the top eigenvector of Ī, i.e. the eigenvector with the largest eigenvalue. Moreover, because Ī is symmetric and positive definite (see above), this can be generalized to the first k optimal information loading directions being the top k (orthogonal) eigenvectors of Ī.
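Both the stationary covariance C and the total Fisher information matrix Ī satisfy Lyapunov equations, so neither integral needs to be evaluated explicitly. A hedged sketch (made-up stable J standing in for a leaky stochastic integrator):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Toy leaky stochastic integrator (our own parameters).
J = np.array([[-0.05, 4.0],
              [0.0,  -1.0]])
sigma2 = 0.1

# C = sigma^2 * int_0^inf e^{Jt} e^{J^T t} dt  solves  J C + C J^T = -sigma^2 I
C = solve_continuous_lyapunov(J, -sigma2 * np.eye(2))

# Ibar = int_0^inf e^{J^T t} C^{-1} e^{Jt} dt  solves  J^T Ibar + Ibar J = -C^{-1}
Ibar = solve_continuous_lyapunov(J.T, -np.linalg.inv(C))

# Ibar is symmetric positive definite, so the first k optimal loading
# directions are its top k orthogonal eigenvectors
w, V = np.linalg.eigh(Ibar)
assert np.allclose(Ibar, Ibar.T, rtol=1e-6, atol=1e-8) and np.all(w > 0)
```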
Importantly, we observe that Ī, as defined in Eq. S71, is the so-called observability Gramian of the system (33), and thus its top k eigenvectors are equivalent to the k most amplifying directions of the system used in our analyses. (Indeed, note the close analogy between Eq. S71 and Eq. S54.) As a technical complication, Eq. S71 defines the observability Gramian with (the square root of) C⁻¹ as the so-called 'readout matrix', whereas for consistency across our analyses (see below), we defined the most amplifying modes as the top eigenvectors of the observability Gramian with simply an identity readout matrix. We confirmed numerically that, for the stochastic networks shown in Fig. 4, Supplementary Fig. 5c,d, and Supplementary Fig. 8a,b,d,e, the top eigenvectors of the observability Gramians defined with these two different readout matrices were highly correlated (average correlation 0.95 for the top k = 5 eigenvectors used in our analyses). For all other analyses of most amplifying modes (Fig. 3, Fig. 5f, Fig. 6e, Supplementary Fig. 3e, Supplementary Fig. 7, and Supplementary Fig. 10f-h) we used deterministic networks (constructed de novo, or fitted to stochastic nonlinear networks, or to experimental recordings; Methods S2.4.1 and S2.4.3), and so the choice of an identity readout matrix was justified by the lack of a meaningful definition of C.

S5.8 Optimal information loading: minimising energy with fixed persistent overlap
In our analyses of optimal information loading so far (Eqs. S59, S63 and S72), we have been considering a constraint on the norm of initial conditions in order to make optimization meaningful (as without some constraint, scaling the initial condition will always scale our chosen measures of performance, due to the linearity of the dynamics). We now consider a biologically more relevant constraint that is better aligned with the optimization objectives used for most of our numerically optimized neural networks (Fig. 6 and Supplementary Figs. 3 and 11-13) and is also standard in previous studies optimising recurrent neural networks (16)(17)(18)(19): the energy of the system, as defined in Eq. S48. Using the duality of objective functions and constraints in constrained optimization, we seek the initial condition that minimizes the energy, ε_t, while maintaining a fixed amount of overlap with the persistent mode, p_t (Eq. S56), at the end of a time period of length t. This means we want to solve the following optimization problem:

x*(0) = argmin ε_t subject to a fixed p_t (S73)

the solution to which (Eq. S74) can be written in terms of ρ(t) and Ω(t), where ρ_i(t) = Ξ_i1 e^{λ_i t} (S75), and Ω(t) is defined in Eq. S52. In the limit when the delay period is very long, t → ∞, we obtain an approximate solution (Eq. S76) expressed in terms of V_Q and Λ_Q, which are respectively the matrix of eigenvectors (as columns) and the diagonal matrix of associated eigenvalues of Q, both ordered in decreasing order of the eigenvalues (which are all real, as Q is positive definite, Sections S5.3 and S5.6). This rests on the approximate proportionality

ρ∞ = lim t→∞ ρ(t) ∝ (1, 0, …, 0) (S77)

(in analogy with Eq. S43), on u(1) being identical to the top eigenvector of Q (Section S5.6), and on the eigenvectors of Q being orthonormal, because Q is symmetric positive definite (Eq. S54). Were this proportionality exact, Eq. S76 could be simply solved as

x*(0) ∝ ±u(1) (S78)

In other words, the optimal information loading direction of an integrator network that achieves a fixed level of (asymptotic) persistent overlap while minimising the energy (or, conversely, that maximises the asymptotic persistent overlap given a fixed energy budget) is again approximately the top left eigenvector of the dynamics, i.e. (Section S5.6) the most amplifying mode.
In practice, Eq. S77 only holds approximately, with the ratio of the subsequent elements of ρ to the first element being very small, but not quite zero. However, the final weighting of the eigenvectors of Q in expressing x*(0) depends on the product of ρ with Λ_Q⁻¹ (Eq. S76), and the (diagonal) elements of Λ_Q⁻¹ (which are the inverses of the diagonal elements of Λ_Q) are ordered exactly oppositely to the elements of ρ. This partially cancels the preferential weighting of the top eigenvector of Q by ρ, and also makes Eq. S78 approximate. Nevertheless, in numerical simulations we found that Eq. S78 held to good precision in random linear integrator networks (such as those in Fig. 3d,e, Fig. 4 and Supplementary Figs. 4 and 5). In Supplementary Fig. 4a,b, we show the relationship between the energy produced by random initial conditions, scaled so that they all result in the same asymptotic persistent overlap, and their overlap with the most persistent versus most amplifying mode. We find that, in general, the overlap with the most amplifying mode (but not with the most persistent mode) is strongly and inversely related to the energy.
While our analytical derivations above concerned the asymptotic limit of an infinitely long delay period, Supplementary Fig. 4c-d show numerical simulations with systematically varied finite delay lengths. In particular, Supplementary Fig. 4c shows that the initial conditions that are optimal for different finite delays (Eq. S74), again in the sense of achieving a fixed persistent overlap by the end of the delay with minimal energy measured over the delay (Eq. S73), show an intuitive dependence on the length of the delay (Supplementary Fig. 4c, bottom). For very short delays, it is best to use inputs that are already aligned with the persistent mode, as there is no time for the dynamics of the system to contribute and gradually align the state with the persistent mode. However, for longer delays, the dynamics can make an increasing contribution, thus increasingly favouring the most amplifying mode over the persistent mode, in qualitative agreement with our analytical results. (We find that even for long delays, the overlap of optimal inputs with the most amplifying mode is not complete, and their overlap with the persistent mode does not vanish entirely, in line with our analysis of the approximate nature of Eq. S78.) In line with these results, the persistent mode achieves optimal to near-optimal performance (i.e. minimal energy) at short delays, but the most amplifying mode performs better (uses less energy) at longer delays (Supplementary Fig. 4d, bottom). (For symmetric networks, Supplementary Fig. 4c-d, top, which we only show for completeness of analogy with Fig. 3d-e, no such trade-off is observed, for the trivial reason that the most persistent and amplifying modes are identical.)

S6 Analysis of canonical nonlinear systems with two stable fixed points
For linear systems that can exhibit persistent activity, we showed that inputs that align with the most amplifying mode generate the most noise-robust (and thus, when using multiple inputs, most easily decodable) persistent activity (Section S5, see also Figs. 3 and 4 and Supplementary Fig. 5). To develop insights as to how this result might generalise to nonlinear attractor systems, here we study two variants (either symmetric, Supplementary Fig. 6, top, or non-symmetric, Supplementary Fig. 6, bottom) of a canonical minimal nonlinear attractor system, based on the 'unforced Duffing oscillator' (35) (Eqs. S28 and S29; see Supplementary Fig. 6a for flow fields). Specifically, both variants exhibit 3 fixed points: a saddle point at the origin and 2 asymptotically stable fixed points (i.e. attractors) at (±1, 0) (Supplementary Fig. 6a, black crosses). Thus, the (one-dimensional) 'coding subspace' of the system, along which the two attractors can be best distinguished, is the line connecting the two attractors, c = (1, 0) (Supplementary Fig. 6a, x1 axes). These two attractors divide the state space of the system into two halves, their respective basins of attraction, which are separated by a manifold, the so-called 'separatrix' (which in this case is one-dimensional; Supplementary Fig. 6a, dashed black lines). This means that the attractor to which the system ultimately converges from a particular initial state simply depends on which side of the separatrix that initial state lies.

S6.1 A local linearization approach based on linearizing around the ground state instead of an attractor
Standard linearization-based approaches to understanding the behaviour of nonlinear attractor systems linearize the system's nonlinear dynamics around an attractor (21,36). These approaches are valuable for understanding the asymptotic behaviour of the dynamics, once the state of the system is in the sufficiently close vicinity of the attractor, e.g. to understand its stability (21,35,36). However, to study information loading, we need to understand the initial behaviour of the system. Thus, our approach is based on a linearization around the fixed point at the origin (see also Methods S2.4.2), which serves as the 'ground state': the state which the system is thought to occupy before the 'stimulus' arrives, and thus with respect to which the magnitude of initial conditions is constrained. Indeed, the origin is an ideal ground state in this case (Supplementary Fig. 6a), not only because it is equidistant from the two attractors, but also because it sits on the separatrix. Thus, as long as the system is at the origin, or, the origin being a saddle point, even if it is perturbed away from the origin along the separatrix, it does not 'choose' between the two attractors.

S6.2 Most persistent and amplifying directions for nonlinear dynamics
Linearizing Eqs. S28 and S29 around the origin yields the Jacobian matrix J_s = [1, 0; 0, −1] for the symmetric system, and J_n = [1, 3; 0, −1] for the non-symmetric system. Performing the same analyses on these linearized dynamics as those that we established for genuinely linear dynamics (Fig. 3; see also Methods S2.7.1) reveals that the (locally) 'most persistent' mode of the dynamics, v (i.e. the direction associated with the top, i.e. largest real, eigenvalue of the Jacobian), in this case represents an unstable direction of the system around the origin (Supplementary Fig. 6a, thin green lines). Although at first it may seem counterintuitive that an unstable direction is called 'persistent', this is actually consistent with our naming convention for linear dynamics, not only because of its mathematical definition (see above), but also because, just like for the linear networks we studied, the persistent direction defined in this way corresponds to the coding direction of the system, i.e. v = c (in both variants). In addition, these analyses also identify the 'most amplifying' mode, a (Supplementary Fig. 6a, thin red lines). Intuitively, the (locally) most amplifying mode is orthogonal to the separatrix in both systems (note the different scales of the x1 and x2 axes in Supplementary Fig. 6a, causing an apparent lack of orthogonality in the bottom panel), i.e. it is the direction that allows the system to start moving away from the separatrix, and thus to choose between the two attractors, the fastest. For the symmetric system, this direction is actually identical to the most persistent direction, a_s = v. However, for the non-symmetric system, the most amplifying mode is at an (approximately 57°) angle to the most persistent mode, a_n ≈ (0.55, 0.83) (unit normalized).
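These quantities can be recomputed from the Jacobian alone. The sketch below assumes the non-symmetric Jacobian is J_n = [1, 3; 0, −1] and uses the fact, stated above, that the locally most amplifying direction is orthogonal to the separatrix, whose tangent at the origin is the stable eigenvector.

```python
import numpy as np

Jn = np.array([[1.0, 3.0],
               [0.0, -1.0]])

evals, V = np.linalg.eig(Jn)
v = V[:, np.argmax(evals.real)].real           # most persistent (unstable) mode
v = v / np.linalg.norm(v)
v = v if v[0] > 0 else -v

s = V[:, np.argmin(evals.real)].real           # stable mode = separatrix tangent
s = s / np.linalg.norm(s)

a = np.array([-s[1], s[0]])                    # orthogonal to the separatrix
a = a if a[0] > 0 else -a

assert np.allclose(v, [1.0, 0.0])              # v equals the coding direction c
assert np.allclose(a, np.array([2.0, 3.0]) / np.sqrt(13.0))   # ~ (0.55, 0.83)
angle = np.degrees(np.arccos(np.clip(a @ v, -1.0, 1.0)))
assert 56.0 < angle < 57.5                     # the ~57 degrees quoted above
```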

S6.3 Locally most amplifying modes predict optimal inputs, analogous to linear networks
Numerical simulations of the original nonlinear dynamics confirmed that the particular information loading strategies we identified and compared for genuinely linear dynamics (Fig. 3a-c) lead to qualitatively similar behaviours in these nonlinear systems. Specifically, in analogy with our analysis of linear dynamics (Fig. 3b), given a fixed magnitude with respect to the origin (Supplementary Fig. 6a, pale blue ellipses), we compared three different initial conditions, aligned with either the locally most persistent direction (Supplementary Fig. 6a, pale green arrows and circles), the locally most amplifying direction at the origin (Supplementary Fig. 6a, pale red arrows and circles), or a randomly chosen direction (Supplementary Fig. 6a, gray arrows and circles). We then simulated dynamics starting from each of these initial conditions (Supplementary Fig. 6a, dark green, red, and black arrows) and measured how the overlap of the system's state with the most persistent direction evolved over time (Supplementary Fig. 6b). As expected, there were some unavoidable differences from the behaviour of the linear system: with these nonlinear attractor dynamics, eventually all trajectories converge to one of the attractors, and thus the overlap with the most persistent mode ultimately reaches the same asymptote (at unity) in all cases. As a consequence, the differences in trajectories corresponding to different initial conditions, and their overlap with the most persistent mode, can appear smaller than in the case of linear dynamics (this is especially the case for the non-symmetric system; Supplementary Fig. 6a, bottom). Nevertheless, by measuring the average overlap achieved between t = 100 and 250 while we systematically varied the direction of the initial conditions over the full 360° range, we found a monotonic relationship between the initial overlap of the state of the system with the most amplifying mode and this measure of 'late' overlap of the system with the most persistent mode (Supplementary Fig. 6c, solid lines). This was qualitatively similar to the behavior of the simple linear networks of Fig. 3, where we also found a monotonic (in that case, linear) relationship between initial amplifying and late persistent overlap (Supplementary Fig. 6c, dashed lines; with late overlap measured between t = 0.8 and 2 s). This means that the most amplifying direction was indeed the optimal direction for 'information loading' in both the nonlinear and linear dynamical systems (Supplementary Fig. 6c, pale red circles and squares). In comparison, with both nonlinear and linear dynamics, random directions, or the most persistent direction when it differed from the most amplifying direction (i.e. in the non-symmetric system; Supplementary Fig. 6a-c, bottom), achieved lower information loading performance by this measure (Supplementary Fig. 6c, gray and pale green circles and squares).
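A minimal version of this simulation, assuming the non-symmetric variant takes the form dx1/dt = x1 + 3·x2 − x1³, dx2/dt = −x2 (consistent with the Jacobian quoted above); the input magnitude and time horizon here are our own choices, not those of Supplementary Fig. 6.

```python
import numpy as np

def simulate(x0, T=8.0, dt=0.001):
    # forward-Euler integration of the assumed non-symmetric variant
    x = np.array(x0, float)
    traj = np.empty((int(T / dt), 2))
    for i in range(traj.shape[0]):
        x = x + dt * np.array([x[0] + 3.0 * x[1] - x[0] ** 3, -x[1]])
        traj[i] = x
    return traj

r = 0.1                                       # fixed input magnitude
v = np.array([1.0, 0.0])                      # most persistent direction
a = np.array([2.0, 3.0]) / np.sqrt(13.0)      # locally most amplifying direction

def late_overlap(x0):
    # mean |x1|: average overlap with the coding direction c = (1, 0)
    return np.abs(simulate(r * x0)[:, 0]).mean()

# the amplifying input escapes the separatrix faster, so its trajectory
# spends more time near the attractor at (1, 0)
assert late_overlap(a) > late_overlap(v)
```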
In summary, we have shown that, just like in the case of linear persistent dynamics, the locally most amplifying direction around the ground state of these nonlinear systems represents the optimal information loading strategy, one that is better than 'pushing' the system directly in the direction of the desired attractor (the most persistent mode), or in random directions. The intuition is that the locally most amplifying mode is the direction that allows the system to leave the vicinity of the separatrix and travel towards the desired attractor the fastest; this gain in speed quickly offsets the initial disadvantage in closeness to the attractor compared to inputs using the most persistent mode. While the sizes of the effects we obtained were small in some cases, these depended on our choice of parameters and were in part due to the simplicity and low dimensionality of these systems. Thus, as the next step, we studied these effects in high-dimensional neural networks with nonlinear dynamics.

S7 Analysis of the nonlinear attractor networks of Fig. 2
To further understand how our theoretical insights gained from analysing linear networks (Section S5, see also Fig. 3 and Supplementary Fig. 5) may generalize to nonlinear attractor networks, we performed the same local (around the origin) linearization of the nonlinear attractor networks of Fig. 2 that we introduced in the previous section (Section S6; see also Methods S2.4.2).

S7.1 The relationship between subspaces extracted from nonlinear dynamics and by local linearization
First, we linearized the dynamics (Eqs. S1 and S2) of both the symmetric and unconstrained networks of Fig. 2 (Methods S2.4.2) and identified the associated most persistent and most amplifying modes around the origin based on this local linearization. We then measured the overlap of the (5-dimensional) subspaces spanned by these modes with the subspaces that we extracted from the activities of the original nonlinear networks, without linearizing their dynamics (Methods S2.7.1), as analysed in Fig. 2f and l. We found that the locally most persistent modes overlapped strongly with the persistent subspace of the nonlinear dynamics in which the attractors lie (Supplementary Fig. 7a, green, 'persistent subspace') and overlapped poorly with the persistent nullspace (Supplementary Fig. 7a, green, 'persistent nullspace'). Subspaces (dimensionally matched) spanned by random modes showed the opposite pattern: they had small overlaps with the persistent subspace and large overlaps with the persistent nullspace, simply due to the higher dimensionality of the latter (Supplementary Fig. 7a, black, 'persistent subspace' and 'persistent nullspace'). Critically, for both locally most persistent and random modes, these overlaps were essentially identical between symmetric and unconstrained networks (Supplementary Fig. 7a, compare top and bottom panels). In contrast, the overlap of the locally most amplifying modes with the persistent subspace and nullspace of the nonlinear dynamics clearly differentiated between symmetric and unconstrained networks. Specifically, in symmetric networks, the most amplifying modes overlapped strongly with the persistent subspace and poorly with the persistent nullspace, as they were identical to the most persistent modes in this case (Supplementary Fig. 7a, top, red, 'persistent subspace' and 'persistent nullspace'). In unconstrained networks, however, they showed a similarly strong overlap with the persistent nullspace as with the persistent subspace (Supplementary Fig. 7a, bottom, red, 'persistent subspace' and 'persistent nullspace').
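Such subspace overlaps can be quantified with principal angles; a minimal sketch (synthetic 50-dimensional example, not the networks of Fig. 2):

```python
import numpy as np

# The overlap between two subspaces can be read off from the singular values
# of U^T V for orthonormal bases U and V: these are the cosines of the
# principal angles between the subspaces.
rng = np.random.default_rng(1)

def orthonormal(M):
    # orthonormal basis of the column space of M (reduced QR)
    return np.linalg.qr(M)[0]

U = orthonormal(rng.standard_normal((50, 5)))   # e.g. locally most persistent modes
# target subspace sharing exactly 3 directions with U, plus 2 random ones
V = orthonormal(np.hstack([U[:, :3], orthonormal(rng.standard_normal((50, 2)))]))

cosines = np.linalg.svd(U.T @ V, compute_uv=False)
# the 3 shared directions give (near-)unit cosines; unrelated random
# directions in 50 dimensions give small ones
assert np.sum(cosines > 0.9) >= 3
```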

S7.2 Locally most amplifying modes predict optimal inputs, analogous to linear networks
The results above suggested that our original findings, showing a double dissociation in the efficiency of persistent subspace and nullspace inputs (initial conditions) between symmetric and unconstrained networks (Fig. 2f and l, red and green), are likely explained by how much these inputs align with the locally most amplifying inputs. Indeed, we found that the subspace of numerically optimized inputs (studied in Fig. 2b-e, h-k, and f and l, black) had a substantially larger overlap with the locally most amplifying modes in both classes of networks than with random subspaces, or, in unconstrained networks, in which persistent and amplifying modes are distinguishable, than with the locally most persistent modes (Supplementary Fig. 7a, 'optimal', compare red to black and green). This pattern of overlaps of optimal inputs with the locally most amplifying, persistent and random subspaces was closely analogous to those that we found for the optimized initial conditions in the purely linear networks of Supplementary Fig. 5c,d,g,h (Supplementary Fig. 7a, 'optimal (lin. model)'). We also found a similar pattern of results when performing local linearizations in optimized ring attractor networks (Supplementary Fig. 3e). In sum, these results generalise the results of Section S6 to high-dimensional nonlinear neural networks with multiple attractors. They indicate that the locally most amplifying and persistent directions that can be extracted from nonlinear attractor networks around the ground state play functionally similar roles in information loading to the corresponding modes of linear networks, of which we have a firm analytical understanding.

S7.3 Overlap with most persistent mode in linear networks predicts decoding accuracy in nonlinear networks
In deterministic linear networks, to allow analytical insights, we used the 'overlap with persistent modes' as a measure of performance (Fig. 3c-e, Supplementary Fig. 5a,b,e,f). In contrast, the measure of performance in stochastic nonlinear networks that we regarded as ultimately relevant, and directly applicable to experimental data, was based on the accuracy of a linear decoder (Fig. 2f,l, Fig. 5b,c, Fig. 6c,d). To study the relationship between these two measures, we simulated both the original stochastic nonlinear networks of Fig. 2 and the deterministic linear dynamics obtained from their local linearization (introduced above). We measured the time evolution of each of these metrics in the corresponding networks when their dynamics were started from initial conditions that were optimized for the decoding accuracy of the nonlinear dynamics (Methods S2.3.1 and S2.7.2), while constrained to lie within the locally most persistent, most amplifying, or a random subspace (as determined by the local linearization), or without any subspace constraints (i.e. as originally done in Fig. 2b-e, h-k, and f and l, black). For a fair comparison, each of the subspaces used for constraining initial conditions was 5-dimensional.
In general, the overlap with the most persistent modes in the locally linearized networks showed somewhat richer dynamics than in the de novo linear networks, in which it remained exactly constant or at most decayed exponentially (Fig. 3c,e). Specifically, the overlap measure we used here showed a brief initial decrease for symmetric networks (Supplementary Fig. 7b, top) and mildly non-monotonic time courses for unconstrained networks (Supplementary Fig. 7b, bottom). These were due to measuring overlap here with a 5-dimensional most persistent subspace (appropriate for networks distinguishing between 6 cue conditions), including some directions that were associated with eigenvalues that could have slightly negative real parts and non-zero complex parts (besides the one that was defined to have a zero eigenvalue, see Methods S2.4.2). In contrast, in Fig. 3, we measured overlap with the single most persistent direction of the dynamics (distinguishing between only two cue conditions), which was constructed to be exactly persistent, i.e. with a zero eigenvalue. Despite these differences, we found that the overlap measure for the locally linearized networks showed qualitatively similar patterns as in the de novo large linear networks of Fig. 3e. For symmetric networks (Supplementary Fig. 7b, top), the overlap was largely constant in time for each initial condition, with most persistent and most amplifying inputs achieving identical performances (Supplementary Fig. 7b, top, red and green), substantially above that of random inputs (Supplementary Fig. 7b, top, black), and on par with, and by this measure even somewhat better than, that of optimal inputs (Supplementary Fig. 7b, top, blue). For unconstrained networks (Supplementary Fig. 7b, bottom), the overlap for most persistent inputs again remained roughly constant in time (Supplementary Fig. 7b, bottom, green). For most amplifying inputs, the overlap started from a lower level initially, but eventually overtook the level achieved by most persistent inputs (Supplementary Fig. 7b, bottom, red). Random inputs performed worst (Supplementary Fig. 7b, bottom, black), and optimal inputs showed a similar trajectory to that of most amplifying inputs (initially low overlap which then increased above that achieved by most persistent inputs), though at a somewhat lower performance by this measure (Supplementary Fig. 7b, bottom, blue vs. red).
Not only did these overlap measures of linearized networks retain some important qualitative characteristics of the overlap measures used in de novo linear networks, they also showed a clear relationship with our ultimate measure of interest: decoding accuracy in the corresponding stochastic nonlinear networks (using a 'noise-matched' scenario, see Fig. 4a, in which all networks had the same level of noise in their dynamics; Methods S2.4.1). In symmetric networks, most persistent and most amplifying inputs achieved maximal performance (as did the optimal inputs), with random inputs performing substantially more poorly (Supplementary Fig. 7c, top). In unconstrained networks, most amplifying (and optimal) inputs outperformed most persistent inputs, with random inputs again performing worst (Supplementary Fig. 7c, bottom). Thus, the overall rank ordering of different inputs for information loading, as measured by eventual decoding accuracy, is well predicted by their rank ordering on the overlap measure.
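The overlap measure itself is simple to compute from sampled activity. As a minimal sketch (the function and variable names are ours, for illustration only, and not taken from the paper's code), the fraction of mean-centered activity variance captured by a k-dimensional subspace with orthonormal basis Q is:

```python
import numpy as np

def subspace_overlap(X, Q):
    """Fraction of total activity variance of X (time x neurons)
    captured by the subspace with orthonormal basis Q (neurons x k)."""
    X = X - X.mean(axis=0)        # mean-center across samples
    proj = X @ Q                  # coordinates of activity in the subspace
    return np.sum(proj ** 2) / np.sum(X ** 2)

# toy example: activity confined to the first 2 of 10 dimensions
rng = np.random.default_rng(0)
X = np.zeros((50, 10))
X[:, :2] = rng.standard_normal((50, 2))
Q = np.eye(10)[:, :2]             # orthonormal basis for those 2 dimensions
print(subspace_overlap(X, Q))     # 1.0: all variance lies in the subspace
```

A value near 1 indicates that activity is essentially confined to the subspace, while for a random k-dimensional subspace in an N-dimensional state space the expected overlap is only about k/N.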

S7.4 Fitting linear dynamics reveals information loading strategies in nonlinear networks
We have shown so far that the same principles (and measures) determining the efficiency of different information loading strategies that we analytically identified in linear networks also apply to nonlinear networks. However, our approach for this was based on a local linearization of the original nonlinear dynamics, which required knowledge of the true equations governing the dynamics of these networks. This is obviously not available for experimental data. Thus, in order to be able to apply our theoretically derived measures of optimal information loading without having access to the true dynamics of the system, we analyzed experimental data by fitting neural responses with linear neural networks (Fig. 5d-f; see also Methods S2.4.3). Consequently, we wanted to validate that this fitting-based approach provides meaningful results, and has the capacity to reveal information loading strategies in nonlinear networks even when (1) we do not have access to the true dynamics but only to samples of activities generated by those dynamics; and (2) we also cannot assume that the true dynamics are linear.
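As an illustration of what such a fitting procedure might look like (a minimal sketch under our own assumptions: discrete-time dynamics and a ridge penalty; this is not the authors' actual pipeline, which is described in Methods S2.4.3), a linear dynamical system can be fitted to sampled activity by regularized least squares, after which candidate persistent and amplifying modes follow from an eigendecomposition and an SVD, respectively:

```python
import numpy as np

def fit_linear_dynamics(X, lam=1e-3):
    """Fit discrete-time linear dynamics x[t+1] ~= A @ x[t] by ridge regression.
    X: sampled population activity, shape (T, N)."""
    X0, X1 = X[:-1], X[1:]  # pairs of successive time steps
    # closed-form ridge solution: A = X1^T X0 (X0^T X0 + lam I)^(-1)
    return X1.T @ X0 @ np.linalg.inv(X0.T @ X0 + lam * np.eye(X.shape[1]))

# sanity check on a known (non-normal) linear system with small process noise
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.3, 0.0],
                   [0.0, 0.8, 0.3],
                   [0.0, 0.0, 0.7]])
x, traj = rng.standard_normal(3), []
for _ in range(500):
    traj.append(x)
    x = A_true @ x + 0.01 * rng.standard_normal(3)
A_fit = fit_linear_dynamics(np.array(traj))

# most persistent modes: eigenvectors with largest-magnitude eigenvalues
eigvals, eigvecs = np.linalg.eig(A_fit)
# most amplifying directions: top right singular vectors of A_fit
U, s, Vt = np.linalg.svd(A_fit)
```

Note that in this discrete-time parameterization the most persistent modes correspond to eigenvalues with magnitude closest to 1, rather than to eigenvalues with real part closest to 0 as in the continuous-time formulation used elsewhere in the text.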
For this, we repeated the same analyses with our simulated nonlinear networks while using 'ground truth' information loading strategies, i.e. inputs confined to the most persistent, most amplifying, random, or optimal subspaces as defined above (Section S7.3).
As in the previous section (Section S7.3), we used a 'noise-matched' scenario (see Fig. 4a) in which all networks had the same level of noise in their dynamics (Methods S2.4.1). A more realistic comparison might require 'performance-matched' simulations, as in Fig. 3c, in which different networks are allowed to have different amounts of noise in their dynamics such that it is instead the asymptotic level of (decoding) performance that is matched between them. However, we found that our main metric of subspace overlap was robust to changes in noise levels, so we expect that all our results hold to high precision in the performance-matched regime as well.
For symmetric networks, we found that linear dynamical systems fitted to responses that resulted from using optimal inputs yielded similar dynamics to those following either most persistent or most amplifying inputs (Supplementary Fig. 7d; top, compare 'persistent', 'amplifying', and 'optimal'). Initial overlap was well above chance with both the most amplifying and most persistent modes (though higher for the former), such that the overlap with most amplifying modes remained constant, while the overlap with most persistent modes increased over time. In contrast, the initial overlap resulting from random inputs with either persistent or amplifying modes was close to chance (with the overlap with amplifying modes once again remaining constant, while the overlap with persistent modes increased over time). Note that all of these were unlike the patterns we saw in experimental data (Fig. 5f).
For unconstrained networks, most amplifying and optimal inputs again resulted in similar dynamics (Supplementary Fig. 7d; bottom, compare 'amplifying' and 'optimal'). Initial overlap was high with most amplifying modes but low (at chance) with most persistent modes, before the dynamics ultimately overlapped strongly with the persistent modes and weakly with the amplifying modes. This was in contrast to dynamics following persistent or random inputs, in which there was no clear difference in initial overlaps with most amplifying vs. most persistent modes, nor was there a clear decrease in overlap with the most amplifying modes over time (Supplementary Fig. 7d, compare 'persistent' and 'random'). Note that the experimental data were only consistent with the former but not the latter pattern of results (Fig. 5f).

S8 Fisher information in task-optimized ring attractor networks
Here, we note that the cosine term in the cost function (Eq. S4) for the ring attractor networks that we optimize in Methods S2.3.4, quantifying the precision of the decoded angle, is closely related to a cost measuring the population Fisher information about angle. Intuitively, both cost functions quantify the ability of the network to perform fine discrimination of stimulus angles. To see this more formally, recall that the Fisher information in this case is

I(θ^(c)) = −∫ P(r|θ^(c)) ∂²/∂θ² ln P(r|θ) |_{θ=θ^(c)} dr   (S79)

where, as in Methods S2.3.4, θ^(c) is the true stimulus angle on a trial (the target for the decoder), r is the population response, and P(r|θ) is the distribution of responses the network generates to some stimulus angle θ. In the limit of a sufficiently large population, the maximum likelihood estimator, θ̂_ML, achieves the same Fisher information as the full population vector, and is distributed as a (circular) Gaussian (von Mises) distribution centered on the true orientation, θ^(c), with some constant (circular) concentration, κ, so we can write

P(θ̂_ML|θ^(c)) ∝ exp[κ cos(θ̂_ML − θ^(c))]

Substituting this distribution into the definition of the Fisher information yields Eqs. S80-S81. Assuming that the population vector-based decoder we use for training networks, θ̂ (Eq. S5 in Methods S2.3.4), is an efficient estimator that approximates the maximum likelihood decoder, θ̂ ≈ θ̂_ML, and replacing the integral over the distribution of the estimate with an empirical average over its stochastic realizations, we can further rewrite the Fisher information as

I(θ^(c)) ≈ κ ⟨cos(θ̂ − θ^(c))⟩   (S82)

Eq. S82 is thus identical to the integrand of the first term in Eq. S4 up to a multiplicative constant (which can be incorporated into α^(1)_nonlin in Eq. S4), an additive constant (the 1 inside the square bracket in Eq. S4), which does not matter for optimization, and a sign flip, because we are maximizing Fisher information in Eq. S82 but minimizing the cost in Eq. S4. Therefore, minimizing (the first term in) Eq. S4 also (approximately) maximizes Eq. S79 (averaged over time points at which the network is decoded and over true stimulus angles), and vice versa.
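The empirical-average form of the Fisher information in Eq. S82 can be checked numerically. In this sketch (our own illustration; the values of κ and the sample size are arbitrary), we compare κ⟨cos(θ̂ − θ^(c))⟩ against the analytical Fisher information of a von Mises estimator distribution, κ I₁(κ)/I₀(κ):

```python
import numpy as np
from scipy.special import i0, i1  # modified Bessel functions of order 0 and 1

kappa, theta_c = 2.0, 0.5  # arbitrary concentration and true stimulus angle
rng = np.random.default_rng(2)
# stochastic realizations of the (assumed von Mises) estimator
theta_hat = rng.vonmises(theta_c, kappa, size=200_000)

# empirical Fisher information, following the form of Eq. S82
I_emp = kappa * np.mean(np.cos(theta_hat - theta_c))
# analytical value: -E[d^2/dtheta^2 ln P] = kappa * I1(kappa) / I0(kappa)
I_true = kappa * i1(kappa) / i0(kappa)
```

With this many samples the two values agree to within about one percent, illustrating that maximizing the empirical cosine average is (approximately) equivalent to maximizing Fisher information.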

Supplementary Fig. 3 | Dynamics of optimized ring attractor networks. a, Neural activity (colored trajectories) in a ring attractor network with unconstrained connectivity and optimized initial conditions (see Methods S2.3.1 and S2.3.4).

Supplementary Fig. 6 | Analysis of canonical nonlinear attractor systems. a, State space of a canonical nonlinear system with two attractors and a symmetric (top) or non-symmetric Jacobian (bottom; see also Methods S2.6, Section S6; cf. Fig. 3b). Pale blue arrows show flow field dynamics (direction and magnitude of movement in the state space as a function of the momentary state). Black crosses indicate asymptotically stable fixed points (i.e. attractor states), dashed black line shows the separatrix (the manifold separating the basins of attraction of the two attractors). Thin green and red lines indicate the locally most persistent and most amplifying modes around the origin, respectively (lines are offset slightly in the top panel to aid visualisation). Pale green, red, and gray arrows with open circles at the end indicate most persistent, most amplifying, and random initial conditions, respectively. Blue ellipses show the fixed initial condition norm around the origin to highlight the different axis scales. Dark green, red, and black arrows show neural dynamics starting from the corresponding initial condition. b, Time course of dynamics of the system along the persistent mode (i.e. the projection onto the green line in a) when started from the persistent (green), most amplifying (red), or random (black) initial conditions for the symmetric (top) and the unconstrained system (bottom). c, Late overlap with the locally persistent mode as a function of initial overlap with the locally most amplifying mode in the canonical nonlinear systems shown in panels a-b (solid gray line) and, for comparison, in the linear networks of Fig. 3a-c (dashed gray line) for symmetric (top) and unconstrained systems (bottom). Late overlap is measured as the mean overlap of activity along the persistent mode (panel b, from t = 0.8 to t = 2 for the canonical nonlinear system; Fig. 3c, from t = 0.8 s to t = 2 s for the linear networks). Open circles and squares indicate the random (gray), persistent (pale green), and most amplifying (pale red) initial conditions used respectively in panels a and b for the canonical nonlinear system, and in Fig. 3b-c for the linear networks.
Supplementary Fig. 7 | Linear analyses of the nonlinear attractor networks of Fig. 2. a, Overlap (mean ± 1 s.d. across 10 networks) of the 5 locally most persistent (green), most amplifying (red), or random directions (black) of the symmetric (top) and unconstrained (bottom) networks from Fig. 2 (Methods S2.4.2). c, Performance (mean ± 1 s.d. across 10 networks) of a delay-trained decoder (black bar indicates decoder training time period; Methods S2.7.4) on neural activity in stochastic nonlinear symmetric (top) and unconstrained networks (bottom) over time. Colors indicate initial conditions as in b. (Blue line shows the same data as the black line in Fig. 2f and l.) Gray dotted line shows chance-level decoding. Green, red, and blue lines are vertically offset slightly in the top panel to aid visualization. Compare with Fig. 4a (noise matched) for the analogous plots for linear networks (though with non-instantaneous inputs). (Caption continued on next page.)

Supplementary Fig. 8 | Analysis of two variants of an integrator model and a feedforward model. a, Cross-temporal decoding of model neural activity (cf. Fig. 2e,k, Fig. 4b, and Fig. 5c) for a linear integrator model (4) (cf. Supplementary Fig. 1c and Methods S2.5). Yellow lines indicate cue onset, offset, and go times. b, Same as a for the same model but with inputs aligned with purely random directions (as opposed to inputs aligned with both persistent and random directions as in the original formulation of Ref. (4)). c, Same as a but for a linear feedforward network model (5,6) (cf. Supplementary Fig. 1d). d, Percent variance of responses explained by the subspace spanned by either the 25% most persistent (green) or 25% most amplifying (red) modes as a function of time for the linear integrator model from a (cf. Fig. 4c,b, Fig. 5f, and Fig. 6e). Yellow lines indicate cue onset, offset, and go times. Gray dotted line shows chance-level overlap with a subspace spanned by 25 random orthogonal directions. e, Same as d for the same model but with inputs aligned with purely random directions. f, Same as d but for a linear feedforward network model (5,6).
Supplementary Fig. 9 | Recording locations for the two monkeys. Left: recording locations in monkey K (T1-weighted image). In order to image the interior of the chamber, we filled the chamber with cut cottons soaked in iodine. In the upper picture, the yellow arrow indicates the principal sulcus. In the bottom picture, the locations of the 11-by-15 grid holes were superimposed over the MR picture. Right: recording locations in monkey T (T2-weighted image). The bottom picture shows the location of the grid of the 32 semi-chronic electrodes. Yellow dots indicate electrode penetrations and recording sites, red dots indicate non-visited sites.

Supplementary Fig. 11 | Cue-delay and just-in-time trained networks. a-b, Same as Fig. 6c green and red, and Fig. 6d left and right, but with a regularisation strength of α^(2)_nonlin = 0.0005 used during training (Methods S2.3.2). c, Subspace overlap between different task epochs, measured as the percent variance explained (PVE) by projecting neural activity from one task epoch (tested) through the top 4 PCs of another task epoch (fitted; cf. Supplementary Fig. ). Purple traces show state-space trajectories, squares indicate cue onset, open circles indicate cue offset, and crosses indicate asymptotically stable fixed points; colors indicate cue condition as in Fig. 5e.

Supplementary Fig. 12 | After-go-time trained networks. a, Cost function for after-go-time training on the fixed delay task (Methods S2.3.3). Cue onset, cue offset, and go cue times are indicated by the yellow vertical lines. The boxcar shows the interval over which stable decoding performance was required (i.e. the cost was only applied after the go cue). b-c, Same as Fig. 6c orange and Fig. 6d center, but with a regularisation strength of α^(2)_nonlin = 0.0005 used during training and when either a random (b orange, c left) or a fixed delay task is used (b blue, c right; Methods S2.7.4). d, Subspace overlap between different task epochs, measured as the percent variance explained (PVE) by projecting neural activity from one task epoch (tested; cf. Supplementary Fig. 11c, Supplementary Fig. 13d, and Supplementary Fig. 10b) through the top 4 PCs of another task epoch (fitted) for the networks shown in b-c. Diagonal elements show the PVE within each task epoch. e, Neural activity plotted in the top two PCs of delay-epoch activity for all 6 initial conditions for random delay (left) and fixed delay (right) trained networks (cf. Supplementary Fig. 2b-d and f-h; and Supplementary Fig. 11d). Purple traces show state-space trajectories, squares indicate cue onset, open circles indicate cue offset, and colors indicate cue conditions as in Fig. 5e. (Note that the absence of crosses indicates the absence of asymptotically stable fixed points.)
Supplementary Fig. 13 | Full-delay trained networks. a, Cost function for full-delay training on the random delay task (Methods S2.3.3). Yellow ticks indicate cue onset and offset times, the yellow bar indicates the range of go times in the variable delay task. Boxcars show intervals over which stable decoding performance was required in three example trials with different delays (Methods S2.3.3). b-c, Same as Fig. 6c-d, but when training with the full-delay cost with a regularisation strength of α^(2)_nonlin = 0.00005 (b solid, c left) or α^(2)_nonlin = 0.0005 (b dashed, c right; Methods S2.7.4). d, Subspace overlap between different task epochs, measured as the percent variance explained (PVE) by projecting neural activity from one task epoch (tested; cf. Supplementary Fig. 11c, Supplementary Fig. 12d, and Supplementary Fig. 10b) through the top 4 PCs of another task epoch (fitted) for the networks shown in b-c. Diagonal elements show the PVE within each task epoch.

I(θ^(c)) = −∫ P(θ̂_ML|θ^(c)) ∂²/∂θ² ln P(θ̂_ML|θ) |_{θ=θ^(c)} dθ̂_ML   (S80)
         = κ ∫ P(θ̂_ML|θ^(c)) cos(θ̂_ML − θ^(c)) dθ̂_ML   (S81)

Schematic of 3 different hypothetical scenarios for the relationship between cue and late delay activities (panels), illustrated as neural dynamics for 2 neurons and 2 cue conditions. Colored traces show neural trajectories, black squares indicate cue onset, open circles indicate cue offset, and filled circles show late delay activity. Left vs. right: populations encoding the cue during cue and late delay periods are overlapping vs. non-overlapping, respectively. Top vs. bottom: cue and delay activities are non-orthogonal vs. orthogonal, respectively. (Note that we are not showing dynamics for non-overlapping, non-orthogonal dynamics because no overlap necessarily implies orthogonality.)
Table S1 | Continued on next page.

Table S2 | Parameters for nonlinear network optimization. Times T1 and T2 are relative to stimulus onset at t = 0. Units are shown in parentheses after the name of the corresponding parameter.
The time resolution of both t and t′ was 250 ms, such that the time periods (relative to cue onset) that we used were −500 to −250 ms (spontaneous epoch), 0 to 250 ms (cue epoch), 1250 to 1500 ms (delay epoch), and the first 250 ms after the go cue, i.e. t_go to t_go + 250 ms (go epoch). In this case, U_t, P(t), and Σ(t) were obtained by fitting all the available neural data (i.e. no cross-validation). See also Ref. 7 for an 'alignment index' metric that is closely analogous to this use of the metric.
For showing how much variance the top 2 delay-epoch PCs capture over time (Supplementary Fig. 10c), in line with Ref. 4, we set U_t = U, where the columns of U are the first 2 principal components of neural activities over the time period 750 to 250 ms before the go cue, i.e. t_go − 0.75 s to t_go − 0.25 s, and P(t) also includes the top 2 principal components. The resolution for t was 10 ms (for clarity, bins to be plotted were subsampled in the corresponding figure). In this case, we estimated U and P(t) in a cross-validated way (as in Ref. 4): we estimated U using training data and P(t) and Σ(t) using test data, and we show results averaged over 10 random 1:1 train:test splits of the data. See also Ref. 4 for a measure that is closely related to this use of the metric, but which uses the number of neurons in the denominator instead of the total variance.
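As a concrete illustration of this PVE metric (a hypothetical sketch with our own variable names; the actual analysis used the cross-validated procedure described above), projecting tested-epoch activity through the top PCs of a fitted epoch amounts to:

```python
import numpy as np

def pve(X_test, U):
    """Percent variance of X_test (samples x neurons) explained by
    projecting through the orthonormal PC basis U (neurons x k)."""
    Xc = X_test - X_test.mean(axis=0)   # mean-center the tested activity
    C = Xc.T @ Xc                       # unnormalized covariance
    return 100.0 * np.trace(U.T @ C @ U) / np.trace(C)

# PCs estimated from a 'fitted' epoch with a decaying variance spectrum
rng = np.random.default_rng(3)
X_fit = rng.standard_normal((100, 20)) * np.linspace(3.0, 0.1, 20)
Xc = X_fit - X_fit.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt[:2].T                            # top 2 PCs of the fitted epoch
```

When the tested and fitted epochs coincide, this reduces to the fraction of total variance captured by the top principal components; for a random k-dimensional subspace it falls to roughly 100·k/N, matching the notion of chance-level overlap used in the figure captions.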