The roles of online and offline replay in planning

Animals and humans replay neural patterns encoding trajectories through their environment, both whilst they solve decision-making tasks and during rest. Both on-task and off-task replay are believed to contribute to flexible decision making, though how their relative contributions differ remains unclear. We investigated this question by using magnetoencephalography (MEG) to study human subjects while they performed a decision-making task that was designed to reveal the decision algorithms employed. We characterised subjects in terms of how flexibly each adjusted their choices to changes in temporal, spatial and reward structure. The more flexible a subject, the more they replayed trajectories during task performance, and this replay was coupled with re-planning of the encoded trajectories. The less flexible a subject, the more they replayed previously preferred trajectories during rest periods between task epochs. The data suggest that online and offline replay both participate in planning but support distinct decision strategies.


Introduction
Figure 1. Subjects differed in decision flexibility. (a) Experimental task space. Before performing the main task, subjects learned state-reward associations (numbers in black circles) and were then gradually introduced to the state space in a training session. After performing the main task for two blocks of trials, subjects learned new state-reward associations (numbers in dark gray circles) and then returned to the main task. Before a final block of trials, subjects were informed of a structural task change such that 'House' switched position with 'Tomato', and 'Traffic sign' switched position with 'Frog'. The bird's eye view shown in the figure was never seen by subjects. They only saw where they started from on each trial and, after completing a move, the state to which their move led. The map was connected as a torus (e.g., starting from 'Tomato', moving right led to 'Traffic sign', and moving up or down from the tomato led to 'Pond'). (b) Each trial started from a pseudorandom location from which subjects were allowed either one ('1-move trial') or two ('2-move trial') consecutive moves (signalled at the start of each set of six trials), before continuing to the next trial. Outcomes were presented as images alone, and the associated reward points were not shown. A key design feature of the map was that in 5 out of 6 trials the optimal (first) move differed depending on whether the trial allowed one or two moves. For instance, given the initial image-reward associations (black) and image positions, the best single move from 'Face' is LEFT (9 points), but when two moves are allowed it is best to move RIGHT and then DOWN (5 + 9, i.e., 14 total points). Note that the optimal moves also differed given the second set of image-reward associations. On 'no-feedback' trials (which started all but the first block), outcome images were also not shown (i.e., in the depicted trials, the 'Wrench', 'Tomato' and 'Pond' would appear as empty circles). (c) The proportion of obtainable reward points collected by the experimental subjects, and by three simulated learning algorithms. Each data point corresponds to 18 trials (six 1-move and twelve 2-move trials), with 54 trials per block. The images to which subjects moved were not shown to subjects for the first 12 trials of Blocks II to V (the corresponding 'Without feedback' data points also include data from 6 initial trials with feedback wherein starting locations had not yet repeated, and thus, subjects' choices still reflected little new information). All algorithms were allowed to forget information so as to account for post-change performance drops as best fitted subjects' choices (see Materials and methods for details). Black dashed line: chance performance. Shaded area: SEM. (d) Proportion of first choices that would have allowed collecting maximal reward where one ('1-optimal') or two ('2-optimal') consecutive moves were allowed. Choices are shown separately for what were in actuality 1-move and 2-move trials. Subjects are colour coded from lowest (gold) to highest (red) degree of flexibility in adjusting to one vs. two moves (see text). Dashed line: chance performance (33%, since up and down choices always lead to the same outcome). (e,f) Decrease in collected reward following a reward-contingency (e) and spatial (f) change, as a function of the index of flexibility (IF) computed from panel d. Measures are corrected for the impact of pre-change performance level using linear regression. p value derived using a permutation test.
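To make the map's mechanics concrete, the following is a minimal Python sketch of a 2x4 toroidal state space with the task's move rules (no backtracking on the second move). The grid layout and all reward values are hypothetical stand-ins, chosen only so that the 'Face' example from the caption works out; nothing here reproduces the actual maps used in the experiment.

```python
import itertools

# A 2x4 grid connected as a torus: moving off one edge wraps to the other.
# With only 2 rows, 'up' and 'down' from any cell reach the same state.
ROWS, COLS = 2, 4

def step(state, move):
    """Apply a move to a (row, col) state with toroidal wraparound."""
    r, c = state
    dr, dc = {'up': (-1, 0), 'down': (1, 0),
              'left': (0, -1), 'right': (0, 1)}[move]
    return ((r + dr) % ROWS, (c + dc) % COLS)

OPPOSITE = {'up': 'down', 'down': 'up', 'left': 'right', 'right': 'left'}

def best_moves(start, reward, n_moves):
    """Exhaustively find the best 1- or 2-move trajectory.
    Per the task rules, the second move may not backtrack the first."""
    moves = ['up', 'down', 'left', 'right']
    best, best_val = None, float('-inf')
    for seq in itertools.product(moves, repeat=n_moves):
        if n_moves == 2 and seq[1] == OPPOSITE[seq[0]]:
            continue  # backtracking disallowed
        s, total = start, 0
        for m in seq:
            s = step(s, m)
            total += reward[s]
        if total > best_val:
            best, best_val = seq, total
    return best, best_val

# Hypothetical rewards chosen to reproduce the caption's 'Face' example:
# the best single move LEFT earns 9, but RIGHT then DOWN earns 5 + 9 = 14.
reward = {(0, 0): 9, (0, 1): 0, (0, 2): 5, (0, 3): 2,
          (1, 0): 1, (1, 1): 3, (1, 2): 9, (1, 3): 0}
face = (0, 1)  # hypothetical position of the 'Face' image
print(best_moves(face, reward, 1))  # -> (('left',), 9)
print(best_moves(face, reward, 2))  # -> a 14-point path: RIGHT then UP/DOWN
```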
The results indicate subjects adjusted their choices advantageously to the number of allotted moves (+0.21, SEM 0.05, p < 0.001, Bootstrap test), though there was again evidence of substantial individual differences (Fig. 1d). Importantly, IF correlated with how well a subject coped with the reward-contingency (Fig. 1e) and position (Fig. 1f) changes, as well as with how accurately they could sketch maps of the state space at the end of the experiment (ρ = 0.51, p < 0.001, Permutation test; Supplementary Fig. 1). Moreover, examining a subset of 2-move trials in which subjects made their second moves without seeing the consequence of their first moves indicated that subjects with high IF planned two steps into the future (Supplementary Note 1), as would be expected from MB planning.

Individual flexibility reflected MF-MB balance

These convergent results suggest that IF reflected deployment of a MB planning strategy. To test this formally, we compared how well different model-free and model-based decision algorithms, as well as a combination of both, explained subjects' choices. Importantly, we enhanced these algorithms to maximize their ability to mimic one another (see Materials and methods for details). Thus, for instance, the MF algorithm included separate 1-move and 2-move policies.

We found that a hybrid of MF and MB algorithms substantially outperformed either of them alone (Bayesian Information Criterion 26: MF = 40821, MB = 43249, MF-MB hybrid = 39908), suggesting that subjects employed a mix of MF and MB planning strategies. Simulating task performance using the hybrid algorithm showed it adequately captured differences that were evident between subjects (correlation between real and simulated IF: ρ = 0.92, p < 0.001, Permutation test; Supplementary Fig. 2a). When we examined each subject's best-fitting parameter values, to determine which of these covaried with IF, we found 84% of inter-individual variance was explained by three parameters that control a balance between flexible, model-based, and inflexible, model-free, planning (Supplementary Fig. 2b). Importantly, less flexible subjects had comparable learning rates and a higher model-free inverse temperature parameter (in 2-move trials), indicating that lower flexibility did not reflect a non-specific impairment, but rather was associated with enhanced deployment of a model-free algorithm. Thus, our index of flexibility specifically reflected the influence of model-based, as compared to model-free, planning.

On-task replay is induced by prediction errors and associated with high flexibility

In rodents, reinstatement of past states, potentially in the service of planning, is evident both prior to choices 27 and following observation of outcomes 2. Thus, we first determined at what point states were neurally reinstated during our task. For this purpose, we trained MEG decoders to identify the images subjects were processing (Fig. 2a). Such decoders robustly reveal stimulus representations that are reinstated from memory and contribute to decision processes 25,28. Crucially, image decoders were trained on MEG data collected prior to subjects having any knowledge about the task, ensuring that the decoding was free of confounds related to other task variables (see Materials and methods).
Applying these decoders to MEG signals from the main task, we found no evidence of prospective representation of the outcome states (images) to which subjects would transition at choice (Supplementary Fig. 4a). Instead, we found strong evidence that following outcomes (corresponding to new states to which subjects transitioned), subjects represented the states from which they had just moved (mean = 3.4, p = 0.001, Permutation test; Supplementary Fig. 4b). Consequently, we examined in detail the MEG data recorded following each outcome for evidence of replay of state sequences that subjects had just traversed.

To test for evidence of replay, we applied a measure of "sequenceness" to the decoded MEG time series, a metric we have previously shown is sensitive in detecting replay of experienced and decision-related sequences of states 10,12,25. Importantly, sequenceness is not sensitive to simultaneous covariation, and thus it is only found if stimulus representations follow one another in time 25 (as in previous work, we allowed for inter-stimulus lags of up to 200 ms). Thus, following each outcome, we computed sequenceness between the decoded representations of the preceding and the outcome state (Fig. 2b). Additionally, MEG signals recorded following the second outcome in 2-move trials were also tested for sequenceness reflecting the trial's first transition (i.e., between the starting state and first outcome; Fig. 2c).

Using a hierarchical Bayesian Gaussian Process approach (see Methods for details), we tested for timepoints at which sequenceness was evident and correlated with individual flexibility. This method directly corrects for comparison across multiple timepoints by accounting for the dependency between them 29. Since replay is thought to be induced by surprising observations 16,17,30,31, we also included surprise about the outcome (i.e., the state prediction error inferred by the hybrid algorithm) as a predictor of sequenceness. We found significant sequenceness encoding the last experienced state transition (from 50 to 330 ms and from 820 to 950 ms following outcome onset; Fig. 2b; note that the median split is only for display purposes; analyses depended on the continuous flexibility index) and, at the conclusion of 2-move trials, also the penultimate transition (from 130 ms before to 350 ms following outcome onset; Fig. 2c). These sequences were accelerated in time, with an estimated lag of 130 ms between the images, and were encoded in a 'forward' direction corresponding to the order actually visited. Moreover, later in the post-outcome epoch, the penultimate transition was also replayed backwards (from 440 to 940 ms following outcome onset).

Importantly, we found this evidence of replay, across all timepoints, was correlated with IF (mean β = 0.17, 95% Credible Interval = 0.13 to 0.20), with surprise about the outcome (mean β = 0.06, CI = 0.03 to 0.10), and with the interaction of these two factors (mean β = 0.19, CI = 0.15 to 0.22). Thus, sequenceness was predominantly evident following surprising outcomes in subjects with a high index of flexibility, consistent with online replay contributing to model-based planning.
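Sequenceness is described above only at a high level. As a rough illustration, here is a minimal sketch of a cross-correlation-style forward-minus-backward measure, assuming decoded per-state probability time series sampled at 100 Hz; the function and variable names are hypothetical, and the authors' actual implementation (including the 600 ms moving windows and statistical treatment) is more involved.

```python
import numpy as np

def sequenceness(x_a, x_b, max_lag=20):
    """Cross-correlation sequenceness between two decoded time series.

    x_a, x_b : 1-D arrays of decoded probabilities for states A and B
               (one sample per 10 ms, matching the 100 Hz data).
    max_lag  : maximum inter-state lag in samples (20 samples = 200 ms).

    Positive values indicate A->B (forward) ordering predominates;
    negative values indicate B->A (backward) ordering.
    """
    vals = []
    for lag in range(1, max_lag + 1):
        fwd = np.mean(x_a[:-lag] * x_b[lag:])  # A precedes B by `lag`
        bwd = np.mean(x_b[:-lag] * x_a[lag:])  # B precedes A by `lag`
        vals.append(fwd - bwd)
    # As in the paper, average over all lags from 10 ms to 200 ms.
    return float(np.mean(vals))

# Toy demonstration: state-B evidence follows state-A evidence by 130 ms.
rng = np.random.default_rng(0)
a = rng.random(600) * 0.1
a[50::60] = 1.0               # bursts of state-A evidence
b = np.roll(a, 13)            # state-B evidence 13 samples (130 ms) later
print(sequenceness(a, b) > 0) # -> True (forward replay predominates)
```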
Figure 2. (a) Decodability of starting images from MEG data recorded during the main task at trial onset. Decodability was computed as the probability assigned to the starting image by an 8-way classifier based on each timepoint's spatial MEG pattern, minus chance probability (0.125). (b) Sequenceness corresponding to a transition from the image the subject had just left ('Start image'; in the cartoon at the bottom, the face) to the image at which they arrived ('Outcome image'; the tomato) following highly surprising outcomes (i.e., above-mean state prediction error). In the cartoon, the white arrow indicates the actual action taken on the trial; the blue arrow indicates the sequence that is being decoded. For display purposes only, mean time series are shown separately for subjects with high (above-median) and low (below-median) IF. Positive sequenceness values indicate forward replay and negative values indicate backward replay. As in previous work 25, sequenceness was averaged over all inter-image time lags from 10 ms to 200 ms, and each timepoint reflects a moving time window of 600 ms centred at the given time (e.g., the 1 s timepoint reflects MEG data from 0.7 s to 1.3 s following outcome). Dashed lines show mean data generated by a Bayesian Gaussian Process analysis, and the dark gray bars indicate timepoints where the 95% Credible Interval excludes zero and Cohen's d > 0.1. The top plot shows IF as a function of sequenceness for the timepoint where the average over all subjects was maximal. p value derived using a permutation test. (c) Sequenceness following the conclusion of 2-move trials corresponding to a transition from the starting image to the first outcome image. (d) Difference in the probability of subsequently choosing a different transition as a function of sequenceness recorded at the transition's conclusion. For display purposes only, sequenceness is divided into high (i.e., above mean) and low (i.e., below mean). A correlation analysis between sequenceness and probability of policy change showed a similar relationship (Spearman correlation: ρ = −0.04, SEM = 0.02, p = 0.04, Bootstrap test). Sequenceness was averaged over the first cluster of significant timepoints from panels b and c, in subjects with non-negligible inferred sequenceness (more than the standard deviation divided by 10; n = 25), for the first time the subject chose each trajectory. Probability of changing policy was computed as the frequency of choosing a different move when occupying precisely the same state again. 0 corresponds to the average probability of change (51%).

Recent theorising regarding the role of replay in planning argues that replay should be preferentially induced when there is a benefit to changing one's policy 17. This perspective predicts that, at least in our experiment, subjects should be more disposed to replay trajectories that they might not want to choose again, rather than trajectories whose choice reflects a firm policy. To determine whether decodable on-task replay was associated with policy changes, we tested the relationship between sequenceness corresponding to each move that subjects chose, and the probability of making a different choice when occupying the same state later on. We found that moves after which high forward sequenceness was evident corresponded to moves that were less likely to be re-chosen subsequently (Fig. 2d), and these policy changes increased the proportion of obtained reward (mean = +11.1%, SEM = 1.5%, p = 0.001). Thus, evidence of online replay was coupled with advantageous re-planning in relation to the same trajectories.
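The policy-change measure described above can be made concrete with a short sketch. Assuming trial data are available as (state, move) pairs, the helper below (names hypothetical) computes the frequency of choosing a different move on the next visit to the same state, as defined in the Figure 2d caption.

```python
import numpy as np

def policy_change_probability(trials):
    """For each (state, move) visit, check whether a different move was
    chosen on the next visit to the same state.

    trials : list of (state, move) tuples in chronological order.
    Returns the proportion of revisits on which the policy changed.
    """
    changed, revisits = 0, 0
    last_move = {}
    for state, move in trials:
        if state in last_move:
            revisits += 1
            changed += (move != last_move[state])
        last_move[state] = move
    return changed / revisits if revisits else np.nan

# Toy usage: the subject repeats its move at 'A' but switches at 'B'.
print(policy_change_probability([('A', 'left'), ('B', 'up'),
                                 ('A', 'left'), ('B', 'right')]))  # -> 0.5
```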

Off-task replay is induced by prediction errors and associated with low flexibility

We next studied off-task replay, examining MEG data recorded during the 2-minute rest period that preceded each experimental block. Since each block included five frequently repeating starting states, we computed sequenceness for the five most frequent image-to-image transitions.

Figure 3. Individual flexibility as a function of sequenceness in rest MEG data for the five most frequently experienced image-to-image transitions. For each rest period, sequenceness was averaged over transitions from both the preceding and following blocks of trials. p value derived using a permutation test.
Off-task replay can predict subsequently chosen sequences

If offline replay is involved in planning, then its content should predict subjects' subsequent choices. To test this, we dissociated the replay of experienced trajectories from that of planned trajectories, focusing on the third rest period, after which the optimal image-to-image transitions changed entirely (due to a change in state-reward associations). As subjects had been taught about the reward change before this rest period, it afforded an opportunity to re-plan their choices accordingly during this rest epoch.

We first examined the behavioural effect of the state-reward change in more detail. The most frequently chosen transitions in the block that followed the third rest period differed from the transitions most frequently chosen in the preceding block (overlap: mean = 14%, SEM = 3%), and this policy change was substantially greater than for the other rest periods (overlap: mean = 53%, SEM = 2%). As expected, the newly chosen transitions from the following block were advantageous given the new state-reward associations (reward collected: mean = 71%, SEM = 2%; chance = 60%) and disadvantageous given the state-reward associations that had so far applied (mean = 52%, SEM = 2%).

Given the behavioural change, we focused our examination of the MEG data on evidence for sequenceness during this crucial third rest period. We found that subjects indeed replayed the transitions they subsequently chose (mean = 0.004, SEM = 0.002, p = 0.02, Bootstrap test). This replay of subsequently chosen moves indicates subjects utilized a model of the task to re-plan their moves offline 16,17,19. Our reasoning here is that re-planning in light of the new reward associations, before subjects experienced them in practice, requires a model that specifies how to navigate from one state to another. Indeed, multiple regression analysis showed that low IF was only associated with sequenceness encoding previously chosen transitions. The lack of a flexibility enhancement associated with prospective offline replay might indicate that, as might be expected, offline planning is ill-suited for enhancing trial-to-trial flexibility.
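The overlap statistic used above can be illustrated with a small sketch; the function below is a hypothetical reconstruction (not the authors' code) that compares the k most frequently chosen state-to-state transitions across two blocks.

```python
from collections import Counter

def top_transition_overlap(block_a, block_b, k=5):
    """Overlap between the k most frequently chosen transitions in two
    blocks of trials.

    block_a, block_b : lists of (from_state, to_state) transitions.
    Returns the fraction of block_a's top-k transitions that also appear
    in block_b's top-k.
    """
    top_a = {t for t, _ in Counter(block_a).most_common(k)}
    top_b = {t for t, _ in Counter(block_b).most_common(k)}
    return len(top_a & top_b) / k

# Toy usage: only one of the three top transitions is shared.
pre  = [('A', 'B')] * 5 + [('B', 'C')] * 4 + [('C', 'D')] * 3
post = [('A', 'B')] * 5 + [('D', 'E')] * 4 + [('E', 'F')] * 3
print(top_transition_overlap(pre, post, k=3))  # -> 1/3
```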

Discussion

We find substantial differences in the behaviour of individual subjects in a simple state-based sequential decision-making task that correspond also to a distinction in the nature, and apparent effects, of MEG-recorded on- and off-task replay of state trajectories. These results bolster important behavioural dissociations, as well as provide substantial new insights into the control algorithms that subjects employ. The findings fit comfortably with an evolving literature that addresses human replay and preplay 10-12,25,28.

There is an intuitive appeal to the distinction between model-based and model-free reasoning, confirmed by its close association with many well-established psychological distinctions 32. Here, this distinction was manifest in flexible adjustment to the number of allowed moves (one-step versus two-step control), preserved performance in the face of changes in the location of rewards or structure, and an ability to reproduce explicitly, after the fact, the transition structure. Furthermore, the task effectively incentivizes flexible model-based reasoning, as this type of reasoning alone allows collection of substantial additional reward (93%) compared to our most successful MF algorithm (80%). These convergent observations suggest that the model-based and model-free distinction we infer from our task rests on solid behavioural grounds.

In human subjects, there is a growing number of observations of replay and/or preplay of potential trajectories of states that are associated with the structure of tasks that subjects are performing 10,11,28. However, it has been relatively hard to relate these replay events to ongoing performance. By contrast, there is evidence that rodent preplay has at least some immediate behavioural function 8,27, and there are elegant theories for how replay should be optimally sequenced and structured in the service of planning 17. In particular, it has been suggested that replay should prioritize trajectories that can soon be re-encountered, and for which one's policy can be improved. Our results are broadly consistent with this theoretical perspective, showing that new surprising observations precede evidence of corresponding replay, which in turn predicts appropriate changes in policy. However, rather than preplay immediately prior to choice, we found evidence of on-task replay following feedback alone, suggesting a third potential factor impacting on the timing and content of replay: the need to minimize memory load by embedding new information in one's policies as soon as it is received.

Critically, the timing and content of replay differed across individuals in a manner that links with their dominant mode of planning. More model-based subjects tended to replay trajectories during learning, predominantly reflecting choices they were likely to reconsider. There have been reports of preferential replay of deprecated trajectories in rodents 8,41. However, those studies are consistent with a more general function for replay (e.g., maintaining the integrity of a map given biased experience), whereas in our case, replay was closely related to future behaviour.

By contrast, the decodable replay of more model-free subjects centred on rest periods, during which DYNA-like mechanisms are hypothesized to compile information about the environment to create an effective model-free policy 17. This replay of state-to-state transitions suggests that despite a general inability at the end of the task to draw a map accurately, model-free subjects do have implicit access to some form of model, though likely an incomplete one. In any case, generating a policy offline might not be a good strategy for a task that requires trial-to-trial flexibility, consistent with the lack of association here between offline replay and ultimate winnings.

Our work has a number of limitations.
First, our experiment was not ideally suited to inducing compound representations that link states with those that succeed them, since succession here frequently changed both within and between blocks. However, algorithms that utilize such representations mimic both model-free and model-based behaviour, and future work could utilize our methods to investigate whether and how these algorithms are aided by online and offline forms of replay 41. Second, the sequenceness measure that we use to determine replay suffers from the restriction of comparing forwards to backwards sequences. There is every reason to expect that forwards and backwards sequences co-exist, so focusing on a relative predominance of one or the other is likely to provide an incomplete picture. The problem with measuring forwards and backwards replay against an absolute standard is the large autocorrelation in the neural decoding, and better ways of correcting for this are desirable in future studies. Nevertheless, despite these shortcomings, the work we report here is a further step towards revealing the rich and divergent structure of human choice in sequential decision-making tasks.

Supplementary Note 1: Planning two steps into the future

Having a cognitive model that specifies how states are spatially organized makes it possible to plan several steps into the future. To test whether subjects were able to do that, we challenged subjects with 12 'without-feedback' trials at the beginning of each of the last 4 blocks, during which outcome images were not shown. This meant that in 2-move trials subjects had to choose their second move 'blindly', without having seen the image to which their previous move had led (e.g., the tomato in Fig. 1b). We found that subjects performed above chance on these blind second moves (proportion of optimal choices: 0.56, SEM 0.03; chance = 0.45; p < 0.001, Bootstrap test), and this was the case even immediately following […] (Supplementary Fig. 3a). This result indicates that more flexible subjects were better able to plan two steps into the future when required. Examining response times suggested flexibility was associated with advance planning also when it was not required. Thus, we found that IF correlated with […] (see Materials and methods for details). Validating the decoder on MEG data from the main task showed that chosen moves became gradually more evident over the course of the trial, their decodability peaking 140 ms before a choice was made (Supplementary Fig. 3b). Thus, we used the move decoder to test whether second-move choices began to materialize in the MEG signal even before subjects observed the outcomes of their first moves. We found […] Thus, neural and behavioural evidence concur with the notion that flexibility was associated with planning second moves prospectively.

Materials and methods
Subjects. 40 human subjects, aged 18-33 years, 25 female, were recruited from a subject pool at University College London. Exclusion criteria included age (younger than 18 or older than 35), neurological or psychiatric illness, and current psychoactive drug use. To allow sufficient statistical power for comparisons between subjects, we set the sample size to roughly double that used in recent magnetoencephalography (MEG) studies on dynamics of neural representations 28,42, and in line with our previous study of individual differences using similar measurements (including 'sequenceness') 25. Subjects received monetary compensation for their time (£20) in addition to a bonus (between £10 and £20) reflecting how many reward points they earned in the experimental task. The experimental protocol was approved by the University College London local research ethics committee, and informed consent was obtained from all subjects.

Experimental design. To study flexibility in decision making, we designed a 2x4 state space where each location was identified by a unique image. Each image was associated with a known number of reward points, ranging between 0 and 10. Subjects' goal was to collect as much reward as possible by moving to images associated with a high number of points. Subjects were never shown the whole structure of the state space, and thus had to learn by trial and error which moves led to higher reward.

Subjects were first told explicitly how many reward points were associated with each of the eight images. Subjects were then trained on these image-reward associations until they reliably chose the more rewarding image of any presented pair (see Image-reward training). Next, the rules of the state-space task were explained (see State-space task), and multiple-choice questions were used to ensure that subjects understood these instructions. To facilitate learning, subjects were then gradually introduced to the state space, and were allowed one move at a time from a limited set of starting locations (see State-space training). Following this initial exposure, the rules governing two-move trials were explained and subjects completed a series of exercises testing their understanding of the distinction between one-move and two-move trials (see State-space exercise). Once these exercises were successfully completed, subjects played two full blocks of trials in the state space, which included both one-move and two-move trials.

We next tested how subjects adapted to a change in the rewards associated with images. For this purpose, we instructed and trained subjects on new image-reward associations (see State-space design). Subjects then played two additional state-space blocks with these modified rewards.

Finally, we tested how subjects adapted to changes in the spatial structure of the state space. For this purpose, we told subjects that two pairs of images would switch locations, informing them precisely which images these were (see State-space design). Multiple-choice questions were used to ensure that subjects understood these instructions. Subjects then played a final state-space block with this modified spatial map.

At the end of the experiment, we also tested subjects' explicit knowledge, asking them to sketch maps of the state spaces and indicate how many points each image was associated with before, and after, the reward contingency changed.

Stimuli. To ensure robust decoding from MEG, we used 8 images that differed in colour, shape, texture and semantic category 43-45. These included: a frog, a face, a traffic sign, a tomato, a hand, a house, a pond, and a wrench.
State-space task. Subjects started each trial in a pseudorandom state, identified only by its associated image. Subjects then chose whether to move right, left, up, or down, and the chosen move was implemented on the screen, revealing the new state (i.e., its associated image) to which the move led. In 'one-move' trials, this marked the end of the trial, and was followed by a short inter-trial interval. The next trial then started from another pseudorandom location. In 'two-move' trials, subjects made an additional move from the location to which their first move had led. This second move was not allowed to backtrack the first move (e.g., moving right and then left). Subjects were informed they would be awarded the points associated with any image to which they moved. Thus, subjects won the points associated with a single image on one-move trials, and the combined value of the two images on two-move trials. The numbers of points awarded were never displayed during the main task. Every 6 trials, short text messages informed subjects what proportion of obtainable reward they had collected in the last 6 trials (message duration 2500 ms).

Each state-space block consisted of 54 trials: 18 one-move and 36 two-move trials. The first 6 trials were one-move trials, the next 12 were two-move trials, then the next 6 were again one-move trials, the next 12 two-move, and so on (see the schedule sketch below). Every 6 trials, short text messages informed subjects whether the next 6 trials were going to be one-move or two-move trials (message duration 2000 ms). Every six consecutive trials featured 6 different starting locations. The one exception was the first 24 two-move trials of the experiment, where, in order to facilitate learning, each starting location repeated for two consecutive trials (a similar measure was also implemented for one-move trials during training; see State-space training). Subjects' performance improved substantially in the second of such pairs of trials (Δ proportion of optimal first choices = +0.15, 95% CI = +0.11 to +0.18, p < 0.001, Bootstrap test).

At the beginning of every block (except the first one), we tested how well subjects could do the task without additional information, based solely on the identity of the starting locations. For this purpose, images to which subjects' moves led were not shown for the first 12 trials. In two-move trials, this meant subjects implemented a second move from an unrevealed image (i.e., state).
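For concreteness, here is a minimal sketch of the block schedule just described (54 trials alternating between sets of six one-move and twelve two-move trials); this illustrates the stated design, and is not the experiment's actual code.

```python
def block_schedule(n_trials=54):
    """Return 'one-move' / 'two-move' labels following the described
    cycle: 6 one-move trials, then 12 two-move trials, repeating."""
    labels = []
    while len(labels) < n_trials:
        labels += ['one-move'] * 6 + ['two-move'] * 12
    return labels[:n_trials]

schedule = block_schedule()
assert len(schedule) == 54
assert schedule.count('one-move') == 18   # 18 one-move trials per block
assert schedule.count('two-move') == 36   # 36 two-move trials per block
```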
State-space design. The mapping of individual images to locations and rewards was randomly determined for each subject, but rewards were spatially organized in a similar manner for all subjects. To test whether subjects could flexibly adjust their choices, the state space was constructed such that there were five locations from which the optimal initial move was different depending on whether one or two moves were allowed. We tested subjects predominantly on these starting locations, using all five of them in every six consecutive trials. Following two blocks, the rewards associated with each image were changed, such that the optimal first moves in both 1-move and 2-move trials, given the new reward associations, were different from the optimal moves under the initial reward associations. The initial and modified reward associations were weakly anti-correlated across images (ρ = −0.37). Finally, before the last block, we switched the locations of two pairs of images, such that […].

State-space exercise. Following the state-space training, which only included one-move trials, we ensured subjects understood how choices should differ in one- and two-move trials by asking them to choose the optimal moves in a series of random, fully visible state spaces. Subjects were given a bird's eye view of each state space, with each location showing the number of reward points with which it was associated. The starting location was indicated, in addition to whether one, or two, moves were available from which to collect reward. In all exercises, the optimal initial move was different depending on whether one or two moves were allowed. Every 10 consecutive exercises consisted of 5 one-move trials and 5 two-move trials. To illustrate the continuity of the state space, the exercise included one-move and two-move trials wherein the optimal move required the subject to move off the map and arrive at the other end (e.g., moving left from a leftmost location to arrive at the rightmost location). […]

Modelling.
To test what decision algorithm subjects employed, and in particular, whether they chose moves that had previously been most rewarding from the same starting location (model-free planning), or whether they learned how the state space is structured and used this […] moves is completed based on the total reward obtained by the two moves. This learning proceeds as described by Eqs. 1 and 2, but with a different learning rate (α_MF2).

All expected values are initialized to a common initial value, and decay back to this initial value before every update: […] where a retention parameter governs the model-free values. This allows learned expectations to be gradually forgotten. Following instructed changes to the number of points associated with each image, or to the […], wherein the latter are integrated over possible second moves, each weighted by its probability: […] where the tilde denotes the opposite move and the prime denotes the corresponding opposite-transition prediction error. Self-transitions are impossible and thus their probability is initialized to 0. All other transitions are initialized with uniform probabilities, and these probabilities decay back to their initial values before every update. Since some subjects may simply reset their transition matrix following instructed changes, the algorithm also 'forgets' after such instruction, as in Eq. 13, but only for a single time point and with a different memory parameter for the model-based component.
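Since the model equations were lost from this version of the text, the following sketch illustrates the kind of forgetting-augmented model-free update and softmax choice rule that the surrounding prose describes. All names (alpha, retention, inverse_temp) and values are placeholders, and the exact functional forms are assumptions rather than reproductions of the authors' Eqs. 1, 2, 13 and 15.

```python
import numpy as np

def mf_update(q, state, move, reward, alpha):
    """Model-free delta-rule update of the expected value of `move` in
    `state` (the text's Eqs. 1-2 describe an update of this kind)."""
    q[state, move] += alpha * (reward - q[state, move])

def forget(q, q_init, retention):
    """Decay all expected values back toward their initial value before
    every update, so learned expectations are gradually forgotten (the
    role the text assigns to Eq. 13; the exact form here is assumed)."""
    q *= retention
    q += (1.0 - retention) * q_init

def softmax_policy(q_row, inverse_temp):
    """Softmax choice probabilities; the paper fits separate inverse
    temperatures for its model-free and model-based components."""
    z = inverse_temp * (q_row - q_row.max())
    p = np.exp(z)
    return p / p.sum()

# Toy usage: 8 states x 4 moves, hypothetical parameter values.
q_init = 5.0
q = np.full((8, 4), q_init)
forget(q, q_init, retention=0.9)
mf_update(q, state=0, move=2, reward=9.0, alpha=0.3)
print(softmax_policy(q[0], inverse_temp=2.0))
```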
Finally, the probability that the algorithm will choose a given move when encountering a given image depends on its model-based estimate of the move's expected outcome:

[…]
where a fractional parameter determines the degree to which reward obtained by the second move is taken into account.

Following the first move, Eq. 15 is used to choose a second move based on the observed new location. However, if the next location is not shown (i.e., in trials without feedback), the agent chooses its second move by integrating Eq. 15 over the expected second location, as determined by the learned transition probabilities.

Parameter fitting. To fit the free parameters of the different algorithms to subjects' choices, we used an iterative hierarchical expectation-maximization procedure 26. We first sampled 10000 random settings of the parameters from predefined group-level prior distributions. […]

MEG preprocessing. […] cutoff frequency using a sixth-order Butterworth IIR filter, and we baseline-corrected each trial's data by subtracting the mean signal recorded during the 400 ms preceding trial onset. Trials in which the average standard deviation of the signal across channels was at least 3 times greater than the median were excluded from analysis (0.4% of trials, SEM 0.2%). Finally, the data were resampled from 600 Hz to 100 Hz to conserve processing time and improve signal-to-noise ratio. Therefore, data samples used for analysis were vectors of length 273, spaced every 10 ms.
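As an illustration of the preprocessing pipeline described above, here is a sketch using SciPy. The filter's cutoff value and pass type were lost from this version of the text, so both are placeholders, as are all names.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

fs = 600.0    # original sampling rate (Hz)
cutoff = 0.5  # placeholder: the actual cutoff is not recoverable here

# Sixth-order Butterworth IIR filter (high-pass is an assumption; the
# text specifies only the order and filter family).
sos = butter(N=6, Wn=cutoff, btype='highpass', fs=fs, output='sos')

def preprocess(trial, pre_samples=240):
    """trial : channels x time array including 400 ms (240 samples at
    600 Hz) of pre-trial baseline. Returns filtered, baseline-corrected
    data resampled to 100 Hz (one sample every 10 ms)."""
    x = sosfiltfilt(sos, trial, axis=-1)
    baseline = x[:, :pre_samples].mean(axis=-1, keepdims=True)
    x = x - baseline
    return resample_poly(x, up=1, down=6, axis=-1)  # 600 Hz -> 100 Hz

def is_noisy(trial, all_trial_sds):
    """Exclude trials whose channel-averaged SD is >= 3x the median SD
    across trials, per the exclusion rule described above."""
    sd = trial.std(axis=-1).mean()
    return sd >= 3 * np.median(all_trial_sds)

# Toy usage: 273 channels, 400 ms baseline + 1 s of signal at 600 Hz.
trial = np.random.randn(273, 840)
print(preprocess(trial).shape)  # -> (273, 140)
```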

Pre-task stimulus exposure. To allow decoding of images from MEG, we instructed subjects to identify each of the images in turn (Supplementary Fig. 6). On each trial, the target image was indicated textually (e.g., 'FACE') and then an image appeared on the screen. […] The images differed in colour, shape, texture and semantic category 43,44 (Fig. 1a). Importantly, at this point subjects had no knowledge as to what the main task would involve, nor that the images would be associated with state-space locations and rewards. This ensured that no task information could be represented in the MEG data at this stage.
MEG decoding. We used support vector machines (SVMs) to decode images and moves from MEG. All decoders were trained on MEG data recorded outside of the main state-space task and validated within the task. As in previous work 25, we trained a separate decoder for […]

To enable completion of the MCMC sampling within a reasonable timeframe, we reduced the trial-to-trial sequenceness data to four mean time series per subject: sequenceness encoding the last or penultimate transition following highly or weakly surprising outcomes. High and low surprise were determined based on the state prediction error generated by the hybrid algorithm, whose parameters were fitted to the individual subject's choices (i.e., high: above-mean prediction error; low: below-mean prediction error). Since we assumed last and penultimate transitions could be replayed at different timepoints, these two types of time series […] deviation that matches the standard deviation of the predicted variable. Length-scales were drawn from log-normal distributions whose mean is the geometric mean of two extremes: the distance in time between two successive timepoints, and the distance in time between the first and last timepoints. Half of the difference between these two values was used as the standard deviation of the priors. The interaction coefficient was limited to positive values for the sake of identifiability, since the group-level Gaussian Processes were multiplied by the coefficients.

We ran six MCMC chains, each for 1400 iterations, with the initial 400 samples used for warmup. STAN's default settings were used for all other settings. Examining the results showed there were no divergent transitions, and all parameters were estimated with effective sample sizes larger than 1000 and shrink factors smaller than 1.1. Posterior predictive checks showed good correspondence between the real and generated data (Fig. 2b,c).

Decodability time series analyses. Decodability was tested for difference from zero and covariance with individual flexibility using the Bayesian Gaussian Process approach outlined above, with the exclusion of the surprise predictor, which is inapplicable to timepoints that precede outcome onset.
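Returning to the decoding step described under 'MEG decoding' above, here is a minimal sketch of per-timepoint image classification using scikit-learn. The kernel, hyperparameters, and all names are assumptions; the decodability measure follows the definition given in the Figure 2 caption (probability assigned to the true image minus chance, 0.125).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_decoders(X, y):
    """Train one 8-way image classifier per timepoint.

    X : trials x channels x time array of pre-task MEG data.
    y : image label (0-7) for each trial.
    Returns a list of fitted classifiers, one per timepoint.
    """
    decoders = []
    for t in range(X.shape[2]):
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel='linear', probability=True))
        clf.fit(X[:, :, t], y)  # spatial pattern at timepoint t
        decoders.append(clf)
    return decoders

def decodability(decoder, patterns, true_image):
    """Probability assigned to the true image minus chance (1/8)."""
    p = decoder.predict_proba(patterns)              # trials x 8
    idx = list(decoder.classes_).index(true_image)
    return p[:, idx] - 1.0 / 8

# Toy usage: 80 trials, 273 channels, 5 timepoints, 8 image classes.
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 273, 5))
y = np.repeat(np.arange(8), 10)
decs = train_decoders(X, y)
print(decodability(decs[0], X[:5, :, 0], true_image=3).shape)  # -> (5,)
```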

Other statistical methods. Significance tests were conducted using nonparametric methods that do not assume specific distributions. Differences from zero were tested using 10000 samples of bias-corrected and accelerated Bootstrap with default MATLAB settings. Correlations and differences between groups were tested by comparison to null distributions generated by 10000 permutations of the pairing between the two variables of interest. All tests are two-tailed.
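For illustration, here is a sketch of the permutation procedure described above for testing a correlation; names are hypothetical, and the original analyses were run in MATLAB rather than Python.

```python
import numpy as np

def permutation_test_correlation(x, y, n_perm=10000, seed=0):
    """Two-tailed permutation test for a correlation: compare the
    observed correlation to a null distribution obtained by shuffling
    the pairing between the two variables (as described in the text)."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = np.corrcoef(x, rng.permutation(y))[0, 1]
    p = np.mean(np.abs(null) >= np.abs(observed))
    return observed, p

# Toy usage with weakly correlated variables.
rng = np.random.default_rng(2)
x = rng.standard_normal(40)
y = 0.5 * x + rng.standard_normal(40)
print(permutation_test_correlation(x, y))
```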
Data and Code availability

The data and custom code used in this study have been deposited on the Open Science […]

Acknowledgements

[…] College London. The Wellcome Centre for Human Neuroimaging is supported by core funding from the Wellcome Trust (091593/Z/10/Z).

Competing interests

The authors declare that they have no conflict of interest.