Value Shapes Abstraction During Learning

Abstractions are critical for flexible behaviour and efficient learning. However, how the brain forgoes the sensory dimension to forge abstract entities remains elusive. Here, in two fMRI experiments, we demonstrate a mechanism of abstraction built upon the valuation of task-relevant sensory features. Human volunteers learned hidden association rules between visual features. Computational modelling of participants' choice data with mixture-of-experts reinforcement learning algorithms revealed that, with learning, emerging high-value abstract representations increasingly guided behaviour. Moreover, the brain area encoding value signals, the ventromedial prefrontal cortex, also prioritized and selected latent task elements, both locally and through its connection to the visual cortex. In a second experiment, we used multivoxel neural reinforcement to show how reward-tagging the neural sensory representation of a task feature evoked abstraction-based decisions. Our findings redefine valuation as a goal-dependent, key factor in constructing the abstract representations that govern intelligent behaviour.


Introduction
Value representations have traditionally been linked with neuronal activity in the ventromedial prefrontal cortex (vmPFC) in the context of economic choices 16,18; more recently, the vmPFC's role has been extended to the computation of confidence [19][20][21]. While this line of work has been extremely fruitful, it has mostly focused on the hedonic and rewarding aspect of value while neglecting its broader functional role. At the same time, in the field of memory, a large corpus of work has shown that the vmPFC is crucial for the formation of schemas or conceptual knowledge 4,5,[22][23][24], as well as generalizations 25. Considering its connectivity pattern 26, the vmPFC is well suited to play a pivotal role in the circuit, involving the hippocampal formation (HPC) and orbitofrontal cortex (OFC), dedicated to extracting latent task information and regularities important for navigating behavioural goals 3,[27][28][29][30]. The vmPFC also collates goal-relevant information from elsewhere in the brain 31. Thus, the aim of this study is twofold: (1) to test whether abstraction naturally emerges during the course of learning, and (2) to investigate how the brain, and specifically the vmPFC, uses the valuation of low-level sensory features to inform an abstract process or strategy and to construct task-states.
To achieve this, we leveraged a task in which human participants repeatedly learned novel association rules while their brain activity was recorded with fMRI. Reinforcement learning (RL) and mixture-of-experts modelling 32,33 allowed us to track participants' valuation processes and to dissociate their learning strategies (both at the behavioural and neural levels) based on the degree of abstraction. In a second experiment, we studied the causal role of value in promoting the formation of abstractions for rapid learning. To anticipate our results, we show that the vmPFC and its connection to the visual cortex construct abstract representations through a goal-dependent valuation process, implemented as top-down control of sensory cortices.

Figure 1 (caption fragment): ... combinations arising from 2 features with 2 levels were divided into symmetric (2x2) and asymmetric (3x1) cases. f1-3: features 1 to 3; fruit:rule refers to the fruit as being the association rule. C, Trial-by-trial ratio correct as a measure of within-block learning. Dots represent the mean across participants, error bars the SEM, and the shaded area the 95% CI (N = 33). Participant-level ratio correct was computed for each trial across all completed blocks. D, Correlation between learning speed and time, across participants. Learning speed was computed as the inverse of the max-normalized number of trials taken to complete a block. Thin grey lines represent individual participants' least-squares fits; the black line the group-average fit. The correlation was computed with group-averaged data points (N = 11). E, Correlation between confidence judgements and learning speed, across participants. Each dot represents data from one participant, and the thick line the regression fit (N = 31 [2 missing data]). ** p<0.01

Mixture-of-experts reinforcement learning for the discovery of abstract representations
We first sought to establish how participants' behaviour was guided by the selection of accurate RL representations. To this end, we built upon a classic RL algorithm (Q-learning) 34 in which state-action value functions (beliefs), used to make predictions about future rewards, are updated according to a trial's task state and the action's outcome. In this study, task states were defined by the number of feature combinations that the agent may track; hence, we devised algorithms that differed in their state-space dimensionality. The first algorithm, called Feature RL, implements an agent that explicitly tracks all combinations of the 3 features, 2³ = 8 states (Figure 2A, top left). This algorithm is anchored at a low feature level because each combination of the 3 features results in a unique fingerprint: one simply learns direct pairings between visual patterns and fruits (actions). Conversely, a second algorithm, called Abstract RL, implements an agent that learns to use a more compact or abstract state representation in which only two features are tracked. These compressed representations reduce the explored state-space by half, 2² = 4 states (Figure 2A, top right). Importantly, in this task environment there can be as many as 3 Abstract RL experts in parallel, one for each combination of two features.
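To make these state spaces concrete, the following minimal Python sketch (our illustration; the identifiers are not from the original analysis code) enumerates the Feature RL state space and the three possible Abstract RL state spaces:

```python
from itertools import product

# Two levels per feature, as in the task.
FEATURES = {"colour": ["green", "red"],
            "orientation": ["horizontal", "vertical"],
            "direction": ["left", "right"]}

# Feature RL: one state per full feature combination -> 2^3 = 8 states.
feature_rl_states = list(product(*FEATURES.values()))
assert len(feature_rl_states) == 8

# Abstract RL: one expert per pair of features -> 2^2 = 4 states each, 3 experts.
names = list(FEATURES)
abstract_rl_experts = {
    (names[i], names[j]): list(product(FEATURES[names[i]], FEATURES[names[j]]))
    for i in range(len(names)) for j in range(i + 1, len(names))
}
assert all(len(s) == 4 for s in abstract_rl_experts.values())
print(len(feature_rl_states), len(abstract_rl_experts))  # 8 states, 3 abstract experts
```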
The above four RL algorithms were combined in a mixture-of-experts architecture 32,33,35: that is, a mixture of Feature RL and the three variants of Abstract RL (Figure 2B, see Methods).
The key intuition behind this approach is that, at the beginning of a new block, the agent does not know which abstract representation is correct (i.e. which features are relevant). Thus, the agent should learn which representations (and associated actions) are most predictive of reward, and thereby exploit the best representation for action selection. While all experts participated in action selection, their learning uncertainty (RPE: reward prediction error) determined their strength in doing so 33,36,37. This architecture allowed us to track the value function of each RL expert separately, while using a unique, global action on each trial. Estimated hyperparameters were then used to compute the value functions of participants' data, as well as to generate new, artificial choice data and compute simulated value functions (Figure 2C, see Methods). Simulations indicated that expected value was highest for the appropriate Abstract RL, followed by Feature RL, with the two Abstract RLs based on irrelevant features lowest (Figure 2D). Participants' empirical data displayed the same pattern, whereby the value function of the appropriate Abstract RL was higher than that of the other RL algorithms (Figure 2D, right side). Note the large difference between the appropriate Abstract RL and Feature RL: this arises because the appropriate Abstract RL is an 'oracle', as it has access to the correct low-dimensional state from the beginning. The RPE variance v adjusted the sharpness with which each RL's (un)certainty was considered for the expert weighting. Crucially, the variance v was associated with participants' learning speed, such that participants who learned the blocks' rules quickly were sharper in expert selection (Figure 2E, N = 29, robust regression slope = -1.02, t27 = -2.59, p = 0.015). These modelling results provided a first layer of computational support for the hypothesis that valuation is related to abstractions.

Figure 2 (caption fragment): ... description of the model. All experts had the same number of hyperparameters: the learning rate α (how much the latest outcome affects the agent's beliefs), the forgetting factor γ (how much prior RPEs influence current decisions), and the RPE variance v, which modulates the sharpness with which the mixture-of-experts RL model favours the best-performing algorithm on the current trial. C, Approach used for data analysis and model simulation. The model was first fit to participants' data with Hierarchical Bayesian Inference (Piray et al., 2019). Estimated hyperparameters were used to compute the value functions of participants' data, as well as to generate new, artificial choice data and compute simulated value functions. D, Averaged expected value across all states for the chosen action in each RL expert. Left: simulated data; right: participants' empirical data. Dots represent individual agents (left) or participants (right), bars the mean, and error bars the SEM. Statistical comparisons were performed with two-sided Wilcoxon signed-rank tests. Model: AbRLw1 vs AbRLw2, z = 1.00, p = 0.32; all remaining comparisons, z = 5.84, ***p<0.001. Participants: AbRLw1 vs AbRLw2, z = -0.08, p = 0.94; all remaining comparisons, z ≥ 4.92, ***p<0.001. AbRL: Abstract RL; FeRL: Feature RL; AbRLw1: wrong-1 Abstract RL; AbRLw2: wrong-2 Abstract RL. E, Correlation between RPE variance and learning speed (outliers removed, N = 29). Dots represent individual participants' data, the thick line the regression fit. * p<0.05

Behaviour shifts from Feature- to Abstract-based reinforcement learning
The mixture-of-experts RL model uncovered that participants who learned faster relied more on Abstract RL. Across blocks, the overall proportion of Abstract RL use was significantly lower than 0.5, but close to it (Figure 3B; two-sided t-test of participant-level proportions: t32 = -2.87, p = 0.007, Figure 3B inset).
As suggested by the simulations (Figure 3A), the strategy that best explained participants' block data correlated with learning speed. Where learning proceeded slowly, Feature RL was consistently predominant (Figure 3B), while the reverse happened in blocks where participants displayed fast learning (Abstract RL predominant, Figure 3B). Moreover, across participants, the degree of abstraction (propensity to use Abstract RL) correlated with the empirical learning speed (N = 33, robust regression, slope = 0.52, t31 = 4.56, p = 7.64×10⁻⁵, Figure 3C, top).
Participants' sense of confidence in having performed the task well also significantly correlated with the degree of abstraction (N = 31, robust regression, slope = 0.026, t29 = 2.69, p = 0.012, Figure 3C, bottom). Together with confidence self-reports being predictive of learning speed ( Figure 1E), these results raise intriguing questions on the function of metacognition, as participants appeared to grasp their own ability to construct and use abstractions 39 .
The two RL algorithms revealed a second aspect of learning. Feature RL had consistently higher learning rates than Abstract RL (two-sided Wilcoxon rank sum test against median 0, z = 14.33, p < 10⁻³⁰, Figure 3D). This difference can be explained intuitively as follows: in Abstract RL, information had to be integrated over a longer time horizon, because a single trial was by itself uninformative with respect to the rule. Conversely, a higher learning rate allows the agent to make larger updates on states that are visited less frequently, as happens in Feature RL. A similar asymmetry in hyperparameter values was found for greediness (Figure 3E, two-sided Wilcoxon rank sum test against median 0, z = 7.14, p < 10⁻¹⁰), suggesting that action selection tended to be more deterministic in Feature RL (i.e. large β).
We initially predicted that abstraction use would increase with learning, because the brain has to construct abstractions in the first place and should thus initially rely on Feature RL. To test this hypothesis, we quantified the number of participants using a Feature RL or an Abstract RL strategy in their first and last blocks. The first block was dominated by Feature RL, while the pattern reversed in the last block (two-sided sign test, z = -2.77, p = 0.006, Figure 3F).
Computing the abstraction level separately for early and late blocks (median split of the session) likewise revealed higher abstraction in the late blocks (two-sided sign test, z = -2.94, p = 0.003, Figure 3G). This general effect was complemented by a linear increase in abstraction from early to late blocks (Figure S2).
Supporting the current modelling framework, the mean expected value of the chosen action was higher for Abstract RL ( Figure S3), and model hyperparameters could be recovered in the presence of noise ( Figure S4, see Methods) 40 .

The role of vmPFC in constructing goal- and task-dependent value from sensory features
Both computational approaches indicated that participants relied on both a low-level, feature-based strategy and a more sophisticated abstract strategy (i.e. Feature RL and Abstract RL; Figures 2D, 3B). Besides showing that abstract representations were generally associated with higher expected value, the modelling approach allowed us to explicitly classify trials as belonging to either strategy. Here, our goal was to dissociate the neural signatures of these distinct learning strategies in order to elucidate how abstract representations are constructed by the human brain.
Since the association between pacman features and fruits was fixed throughout a block and reward was deterministic, we reasoned that an anticipatory value signal might emerge in the vmPFC already at stimulus presentation 41. We performed a general linear model (GLM) analysis of the neuroimaging data with regressors for 'High value' and 'Low value' trials, selected by the block-level best-fitting algorithm (Feature RL or Abstract RL; see Methods for the full GLM and regressor specification). As predicted, activity in the vmPFC strongly correlated with value magnitude (Figure 4A). That is, the vmPFC indexed the anticipated value constructed from the pacman features at stimulus presentation time. We used this signal to functionally define, for ensuing analyses, the subregion of the vmPFC maximally related to task computations about value when the pacman's visual features were integrated.
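For illustration, the sketch below fits a toy version of such a value GLM with ordinary least squares; the HRF approximation, onsets, and BOLD data are simulated stand-ins and not the study's actual design:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
TR, n_scans = 2.0, 300
frame_times = np.arange(n_scans) * TR

def hrf(t):
    # Crude double-gamma HRF approximation (our assumption, for illustration only).
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)

def regressor(onsets, duration=2.0):
    box = np.zeros(n_scans)
    for onset in onsets:
        box[(frame_times >= onset) & (frame_times < onset + duration)] = 1.0
    return np.convolve(box, hrf(np.arange(0, 32, TR)))[:n_scans]

# Toy onsets; in the real analysis these come from the model-based trial labels.
high_onsets, low_onsets = np.arange(10, 580, 40.0), np.arange(30, 580, 40.0)
voxel_bold = (2.0 * regressor(high_onsets) + 0.5 * regressor(low_onsets)
              + rng.normal(0, 0.5, n_scans))

X = np.column_stack([regressor(high_onsets), regressor(low_onsets), np.ones(n_scans)])
betas, *_ = np.linalg.lstsq(X, voxel_bold, rcond=None)  # per-voxel OLS fit
print(round(betas[0] - betas[1], 2))  # 'High value' > 'Low value' contrast
```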
Concurrently, activity in the insular and dorsal prefrontal cortices increased on trials classified as 'Low value'. This pattern of activity is consistent with previous studies on error processing 42,43 (Figure S5).
In order for the vmPFC to construct goal-dependent value signals, it should receive relevant feature information from other brain areas (specifically, visual cortices, given the nature of our task). Thus, we computed a psychophysiological interaction (PPI) analysis 44 to isolate regions whose functional coupling with the vmPFC at the time of stimulus presentation depended on the magnitude of expected value. Supporting the idea that the vmPFC based its predictions on the integration of visual features, only the connectivity between the visual cortex (VC) and vmPFC was higher on trials carrying large expected value compared to low-value trials (Figure 4B). Strikingly, the strength of this VC-vmPFC interaction was predictive of overall learning speed across participants (N = 31, robust regression, slope = 0.016, t29 = 2.53, p = 0.016, Figure 4C): participants with a stronger modulation of the coupling between the vmPFC and VC also learned the blocks' rules faster.
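In a PPI, the interaction regressor is the product of the (standardized) seed time-course and the psychological variable, entered into a GLM together with the main effects of both. A minimal sketch with simulated data (all variables are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
seed = rng.normal(size=n)               # vmPFC seed time-course (toy)
psych = np.sign(rng.normal(size=n))     # +1 high-value / -1 low-value trials (toy)
ppi = (seed - seed.mean()) / seed.std() * psych

target = 0.8 * ppi + 0.5 * seed + rng.normal(0, 1, n)  # e.g. a VC voxel (toy)
X = np.column_stack([ppi, seed, psych, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print(round(beta[0], 2))  # PPI effect: value-dependent vmPFC-VC coupling
```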

Value-sensitive vmPFC subregion prioritizes abstract elements
Having established that the vmPFC computes a goal-dependent value signal, we evaluated whether this region's activity level was sensitive to the strategy participants used: vmPFC activity was higher under Abstract RL, whereas HPC activity did not differ between strategies (Figure 5B). We also verified that reaction times did not differ between Feature RL and Abstract RL blocks, precluding an alternative explanation based on differences in difficulty (Figure S6B).
The next question we asked was: can we retrieve feature information from HPC and vmPFC activity patterns? In order to abstract and operate in the latent space, an agent is still bound to represent and use the features, because the rules are dictated by feature combinations.
One possibility is that feature information is represented solely in sensory areas. What would matter then is the connection with, and/or the read-out by, the vmPFC or HPC. Accordingly, neither HPC nor vmPFC should represent feature information (regardless of the strategy used).
Alternatively, feature-level information could also be represented in higher cortical regions under Abstract RL, to explicitly support (value-based) relational computations. To resolve this question, we applied multivoxel pattern analysis to classify basic feature information (e.g. colour: red vs green) in three regions of interest (VC, HPC, and vmPFC), separately for trials belonging to Feature RL and Abstract RL blocks. We found the classification accuracy to be significantly higher in Abstract RL than in Feature RL trials in both HPC and vmPFC (Figure 5C).

We expanded on this idea with two multivoxel pattern analysis searchlights. In short, we asked which brain regions were sensitive to feature relevance (i.e. whether a feature was relevant to the rule or not), and whether we could recover representations of the latent rule itself (fruit preference). Besides the occipital cortex, a significant reduction in decoding accuracy when a feature was irrelevant to the rule, compared to when it was relevant, was also detected in the OFC, ACC, vmPFC and dorsolateral PFC (Figure S7A). Multivoxel activity patterns in the dorsolateral PFC and lateral OFC also predicted the fruit class (Figure S7B).

Decoded neurofeedback is a form of neural reinforcement based on real-time fMRI and multivoxel pattern analysis. It is the closest approximation to a non-invasive causal manipulation, with high levels of specificity, and it is administered without participants' awareness [48][49][50]. Such reinforcement protocols can reliably lead to behavioural or physiological changes that last over time [51][52][53][54]. We used this procedure in a follow-up experiment (N = 22) to artificially add value (rewards) to the neural representation in VC of a target task-related feature (e.g. the orientation; Figure 6A, see Methods). Blocks were then divided into 'relevant' and 'irrelevant' conditions, depending on whether the targeted feature was relevant to a block's rule. Replicating the results reported above, behaviour in 'relevant' blocks had a higher probability of being driven by Abstract RL (Figure 6D, perm. test p < 0.001), while Feature RL tended to appear more in 'irrelevant' blocks. A strategy shift towards abstraction, specific to blocks in which the target feature was tagged with reward, indicates that the effect of value in facilitating abstraction is likely mediated by a change in the early processing stage of visual information. In this experiment, by means of neurofeedback, value (in the form of external rewards) was artificially added to a target feature's neural representation in VC.
Hence, the brain used these 'artificial' values to select relevant dimensions when constructing abstract representations, by tagging certain sensory channels. Critically, this manipulation indicates that value tagging of early sensory representations has a causal effect on abstraction and, consequently, on the learning strategy.
Figure 6: Artificially adding value to a feature's neural representation. A, Schematic diagram of the follow-up multivoxel neurofeedback experiment. During the neurofeedback procedure, participants were rewarded for increasing the size of a disc on the screen (max session reward 3,000 JPY). Unbeknownst to them, the disc size was changed by the computer program to reflect the likelihood of the target brain activity pattern (corresponding to one of the task features) measured in real time. B, Blocks were subdivided based on the feature targeted by multivoxel neurofeedback: 'relevant' or 'irrelevant' for the blocks' rules. The scatter plots replicate the finding from the main experiment, with a strong dependency between Feature RL / Abstract RL and learning speed. Each coloured dot represents a single block, from one participant. Data aggregated from all participants. C, Abstraction level computed for each participant from all blocks belonging to: 1) left, the latter half of the main experiment (as in Figure 3G, but only selecting participants who took part in the multivoxel neurofeedback experiment); 2) centre, post-neurofeedback, for the 'relevant' condition; 3) right, post-neurofeedback, for the 'irrelevant' condition. Coloured dots represent participants, shaded areas the density plot, central white dots the median, the dark central bar the interquartile range, and thin dark lines the lower and upper adjacent values. D, Bootstrapping the difference between model probabilities on each block, separately for 'relevant' and 'irrelevant' conditions. * p<0.05, *** p<0.001

Discussion
The ability to generate abstractions from simple sensory information has been suggested to be key to supporting flexible and adaptive behaviours 1,12,13. Here, using computational modelling based on a mixture-of-experts RL architecture, we revealed that value predictions drove participants' selection of the appropriate representation to solve the task. Participants explored and exploited the task's dimensionality through learning, shifting from a simple feature-based strategy to more sophisticated abstractions. The more participants used Abstract RL, the faster they solved the task. These results support the idea that decision-making processes in the brain depend on higher-order, summarized representations of task-states 3.
Further, abstraction likely requires a functional link between sensory regions and areas encoding value predictions about task states ( Figure 4C, the VC-vmPFC coupling predicted participants' learning speed). This is in line with previous work that has demonstrated how estimating reward value of individual features provides a reliable and adaptive mechanism in RL 11 . We extend this notion by showing that the mechanism may support the formation of abstract representations to be further used for learning computations, for example the selection of the appropriate abstract representation.
There is an important body of work considering how the HPC is involved in the formation and updating of conceptual information 5,23,25,55. Likely, the HPC's role is to store or index conceptual/schematic memories, and to update concepts 5,24,56. The 'creation' of new concepts or schemas may happen elsewhere; a good candidate is the mPFC or vmPFC in humans 56,57. Our results expose a potential mechanism for how the vmPFC interacts with the HPC in the construction of goal-relevant abstractions: vmPFC-driven valuation of low-level sensory information serves to channel specific features/components to higher-order areas (e.g. the HPC and vmPFC, but also the dorsal prefrontal cortex), where they are used to construct abstractions (e.g. concepts, categories or rules). In line with this interpretation, we found the vmPFC to be more engaged in Abstract RL, while the HPC was equally active under both abstract and feature-based strategies (Figure 5B). Our results indicate that when a feature is irrelevant to the rule, its decodability from activity patterns in OFC/DLPFC decreases (Figure S7A). This dovetails with the OFC/DLPFC function of constructing goal-based task states and abstract rules from relevant sensory information 27,58,59. Furthermore, the abstract rule itself was represented in multivoxel activity patterns within these regions implicated in abstract strategies, rules and contexts (Figure S7B) 58,60,61. How these representations are actually used remains an open question. This study also suggests that there is no single region in the brain that maintains a fixed task state. Rather, the configuration of elements that determines a state is continuously updated or reconstructed over time. At first glance this seems wasteful and inefficient, but it would provide high flexibility in noisy and stochastic environments, and wherever temporal dependencies exist (i.e. any real-world situation). By continuously recomputing task states (e.g. in the OFC), the agent can make more robust decisions because they are based on up-to-date information. This computational coding scheme shares strong analogies with the multiverse hypothesis of hippocampal representations, whereby HPC neurons continuously generate representations of alternative, possible future trajectories 62.
One significant aspect of discussion concerns the elements used to construct abstractions.
We leveraged simple visual features (colour, or stripe orientation), rather than more complex stimuli or objects that can be linked together conceptually 23,63. Abstractions happen at several levels, from features, to exemplars, to concepts/categories, all the way to rules and symbolic representations. In this work we effectively studied one of the lowest levels of abstraction. One may wonder whether the mechanism we identified here generalizes to more complex scenarios. While our work cannot decisively support this, we believe it is unlikely that the brain uses an entirely different strategy to generate new representations at different levels of abstraction.

Contact Benedetto De Martino (benedettodemartino@gmail.com) for correspondence or requests related to materials.

Participants
Forty-six participants with normal or corrected-to-normal vision were recruited for the main experiment (learning task). Based on pilot data and the available scanning time in one session (60 minutes), we set the following exclusion criteria: failure to learn the association in 3 or more blocks (i.e. reaching a block's limit of 80 trials without having learned the association), or failure to complete at least 11 blocks in the allocated time. Eleven participants were removed based on these predetermined criteria. Additionally, 2 more participants were removed due to head-motion artifacts. Thus, 33 participants were included in the analyses. All experiments and data analyses were conducted at the Advanced Telecommunications Research Institute International (ATR). The study was approved by the Institutional Review Board of ATR. All participants gave written informed consent.

Learning task
The task consisted of learning the fruit preference of pacman-like characters. These characters were made of 3 different features, each with two levels (colour: green vs red, stripe orientation: horizontal vs vertical, mouth direction: left vs right). On each trial, a character composed of a unique combination of the three features was presented. The experimental session was divided into blocks, in each of which a specific rule governed the association between features and preferred fruits. There were always 2 relevant features and 1 irrelevant feature, but these changed randomly at the beginning of each block. Blocks could thus be of three types: CO (colour-orientation), CD (colour-direction), and OD (orientation-direction).
Furthermore, to avoid a simple logical deduction of the rule after 1 trial, we introduced the following pairings. The 4 possible combinations of the 2 relevant features (2 levels each) were paired with the 2 fruits in either a symmetric or an asymmetric fashion (2x2 or 3x1). The appearance of the 2 fruits was also randomly changed at the beginning of each new block (see Figure 1B; e.g. green-vertical: fruit 1, green-horizontal: fruit 2, red-vertical: fruit 1, red-horizontal: fruit 2; or green-vertical: fruit 2, green-horizontal: fruit 2, red-vertical: fruit 1, red-horizontal: fruit 2).
Each trial started with a black screen for 1 sec, after which the character was presented for 2 sec. Then, while the character remained at the centre of the screen, the 2 fruit options appeared below, to the right and left sides (see Figure 1). Participants had 2 sec to indicate the preferred fruit by pressing a button (right button for the fruit on the right, and vice versa).
Upon registering a participant's choice, a coloured square appeared around the chosen fruit: green if the choice was correct, red otherwise. The square remained for 1 sec, following which the trial ended with a variable ITI -bringing the trial to a fixed 8 sec duration.
Participants were simply instructed that they had to learn the correct rule in each block; the rule itself (relevant features + association type) was hidden. Each block contained up to 80 trials, but a block could end earlier if participants learned the target rule. Learning was measured as a streak of correct trials (between 8 and 12, determined randomly for each block).
Participants were instructed that each correct choice added one point, while incorrect choices did not alter the balance. Importantly, they were told that the points obtained would be weighted by the speed of learning on that block: the faster the learning, the greater the net worth of the points. The monetary reward was computed at the end of each block according to the formula:

$$R = A \cdot \frac{\sum \text{correct}}{\sum \text{trials}} \cdot \left( \frac{mcs}{\sum \text{trials}} \right)^{a}$$

where R is the reward obtained on that block, A the maximum available reward (150 JPY), Σcorrect the sum of correct responses, Σtrials the number of trials, mcs the maximum length of a correct streak (12 trials), and a a scaling factor (a = 1.5). This formula ensures a time-dependent decay of the reward that approximately follows a quadratic fit. If participants completed a block in fewer than 12 trials and the amount exceeded 150 JPY, it was rounded down to 150 JPY.
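Since the formula itself did not survive in our source, the sketch below implements the reconstruction given above; the exact functional form should be treated as an assumption, though it reproduces the stated properties (time-dependent decay, capping at 150 JPY for very fast blocks):

```python
def block_reward(n_correct, n_trials, A=150.0, mcs=12, a=1.5):
    """Reconstructed block-reward rule (our reading of the garbled formula):
    reward scales with accuracy and decays with the number of trials taken."""
    R = A * (n_correct / n_trials) * (mcs / n_trials) ** a
    return min(R, A)  # rounded down to the 150 JPY cap when blocks end early

print(block_reward(10, 10))  # fast, perfect block -> capped at 150.0 JPY
print(block_reward(30, 40))  # slow block -> strongly discounted
```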
The maximum total monetary reward over the whole session was set at 3,000 JPY; participants earned on average 1,251 ± 46 JPY (blocks in which participants failed to learn the association within the 80-trial limit were not rewarded). For each experimental session, a sequence of 20 blocks was pre-generated pseudo-randomly. For the sessions done in the MR scanner, participants were instructed to use their dominant hand to press buttons on a dual-button response pad to provide their choices. The mapping between responses and buttons was indicated on the display and, importantly, changed randomly across trials to avoid motor-preparation confounds (i.e. associating a given preference choice with a specific button press).

Computational modelling part 1: mixture-of-experts RL model
We built on a standard RL model 6 and prior work in machine learning and robotics to derive the mixture-of-experts architecture 32,33,36. In this architecture, several 'expert' RL modules each track a different representational space, each with its own value function. On each trial, the action selected by the mixture-of-experts RL model is given by the weighted sum of the actions computed by the experts. The weight reflects the responsibility of each expert, computed from the SoftMax of the squared prediction error. In this section we define the general mixture-of-experts RL model; in the next section we define each specific expert, based on the different task-state representations being used.
Formally, the global action of the mixture-of-experts RL model is defined as:

$$a_t = \sum_{j=1}^{N} \lambda_{j,t}\, a_{j,t} \tag{1}$$

where $N$ is the number of experts, $\lambda_{j,t}$ the responsibility signal, and $a_{j,t}$ the action selected by the $j$-th expert. Each expert's weight is computed from its uncertainty as:

$$w_{j,t} = \exp(-u_{j,t} / v) \tag{2}$$

Thus, $\lambda_{j,t}$ is defined as:

$$\lambda_{j,t} = \frac{w_{j,t}}{\sum_{k=1}^{N} w_{k,t}} \tag{3}$$

where $N$ is the same as above, and $v$ is the RPE variance. The expert's uncertainty $u_{j,t}$ is defined as:

$$u_{j,t} = \gamma\, u_{j,t-1} + (1 - \gamma)\, \delta_{j,t}^{2} \tag{4}$$

where $\gamma$ is the forgetting factor, which controls the strength of the impact of prior experience on the current uncertainty estimate. The most recent RPE is computed as:

$$\delta_{j,t} = r_t - Q_j(S_t, a_t) \tag{5}$$

Each expert's value function is then updated as:

$$Q_j(S_t, a_t) \leftarrow Q_j(S_t, a_t) + \alpha\, \lambda_{j,t}\, \delta_{j,t} \tag{6}$$

where $\lambda_{j,t}$ is the responsibility signal computed with equation (3), $\alpha$ the learning rate (assumed equal for all experts), and $\delta_{j,t}$ the RPE computed with equation (5). Finally, for each expert, the action $a$ at trial $t$ is taken as the argmax of the value function:

$$a_{j,t} = \underset{a}{\arg\max}\; Q_j(S_t, a) \tag{7}$$

where $Q$ is the value function, $S_t$ the state at the current trial, and $a$ ranges over the two possible actions.
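The following Python sketch implements equations (1)-(7); the class and variable names are ours, and the uncertainty update follows our reading of the forgetting-factor description rather than the authors' exact code:

```python
import numpy as np

class MixtureOfExpertsRL:
    """Sketch of the mixture-of-experts RL agent (reconstruction of eqs. 1-7)."""
    def __init__(self, n_states_per_expert, n_actions=2, alpha=0.3, gamma=0.7, v=0.5):
        self.alpha, self.gamma, self.v = alpha, gamma, v
        self.Q = [np.zeros((s, n_actions)) for s in n_states_per_expert]
        self.u = np.zeros(len(n_states_per_expert))   # per-expert uncertainty

    def responsibilities(self):
        w = np.exp(-self.u / self.v)                  # eq. (2)
        return w / w.sum()                            # eq. (3)

    def act(self, states):
        lam = self.responsibilities()
        votes = np.zeros(self.Q[0].shape[1])
        for j, s in enumerate(states):
            votes[np.argmax(self.Q[j][s])] += lam[j]  # eqs. (1) and (7)
        return int(np.argmax(votes))

    def update(self, states, action, reward):
        lam = self.responsibilities()
        for j, s in enumerate(states):
            rpe = reward - self.Q[j][s, action]                             # eq. (5)
            self.u[j] = self.gamma * self.u[j] + (1 - self.gamma) * rpe**2  # eq. (4)
            self.Q[j][s, action] += self.alpha * lam[j] * rpe               # eq. (6)

# Four experts: Feature RL (8 states) and three Abstract RLs (4 states each).
agent = MixtureOfExpertsRL([8, 4, 4, 4])
states = [3, 1, 2, 0]   # the current pacman maps to one state index per expert
action = agent.act(states)
agent.update(states, action, reward=1.0)
```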

Computational modelling part 2: Feature RL and Abstract RLs
Each (expert) RL algorithm is built on a standard RL model 6, from which we derived variants that differ in the number and type of states visited. Here, a state is defined as a combination of features. Feature RL has 2³ = 8 states, where each state is given by the combination of all three features (e.g. colour, stripe orientation, mouth direction: green, vertical, left). Abstract RL is designed with 2² = 4 states, where each state is given by the combination of two features.
Note that the number of states does not change across blocks; only the features used to determine them do. These learning models create individual estimates of how participants' action selection depended on the features they attended and their past reward history. Both RL models share the same underlying structure and are formally described as:

$$Q_{t+1}(S_t, a_t) = Q_t(S_t, a_t) + \alpha\,\big(r_t - Q_t(S_t, a_t)\big) \tag{8}$$

where $Q(S, a)$ in (8) is the value function of selecting either fruit option $a$ for pacman-state $S$.
The value of the action selected on the current trial is updated based on the difference between the expected value and the actual outcome (reward or no reward). This difference is called the reward prediction error (RPE). The degree to which this update affects the expected value depends on the learning rate α. The larger α, the stronger the impact of recent outcomes; conversely, for small α, recent outcomes have little effect on expectations.
Only the value function of the selected action -which is state-contingent in (8) -is updated.
The expected values of the two actions are combined to compute the probability p of predicting each outcome using a SoftMax (logistic) choice rule:

$$p_{t}(a_1) = \frac{1}{1 + \exp\big(-\beta\,(Q(S_t, a_1) - Q(S_t, a_2))\big)} \tag{9}$$

The greediness hyperparameter β controls how much the difference between the two expected values for a1 and a2 actually influences choices.
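A minimal sketch of the update and choice rules in equations (8) and (9), with illustrative values of the greediness β:

```python
import numpy as np

def q_update(q, reward, alpha):
    # Eq. (8): update the chosen action's value from the reward prediction error.
    return q + alpha * (reward - q)

def choice_probability(q1, q2, beta):
    # Eq. (9): probability of choosing a1 given the expected values and greediness.
    return 1.0 / (1.0 + np.exp(-beta * (q1 - q2)))

print(round(choice_probability(0.8, 0.2, beta=5.0), 2))  # 0.95: near-deterministic
print(round(choice_probability(0.8, 0.2, beta=0.5), 2))  # 0.57: near-random
print(round(q_update(0.5, 1.0, alpha=0.3), 2))           # 0.65 after one reward
```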

Procedures for model fitting, simulations and hyperparameter recovery
Hierarchical Bayesian Inference (HBI) 38 was used to fit the mixture-of-experts RL model to participants' choice data. The HBI procedure was implemented on all participants' data, proceeding block-by-block.
We also simulated the models' action-selection behaviour to benchmark their similarity to human behaviour and, in the case of the Feature RL vs Abstract RL comparisons, to additionally compare their formal learning efficiency. For the mixture-of-experts RL architecture, we simply used the estimated hyperparameters to simulate 45 artificial agents, each completing 100 blocks. The simulation allowed us to compute, for each expert RL module, the mean responsibility signal and the mean expected value across all states for the chosen action. Additionally, we computed the learning speed (time to learn the rule of a block) and compared it with the learning speed of human participants.
In the case of the simple Feature RL and Abstract RL models, we added noise to the state information in order to produce more realistic behaviour (from the perspective of human participants). In the empirical data, the action (fruit selection) on the first trial of a new block was always chosen at random, because participants did not yet know which representations (states) were appropriate; in later trials participants may have followed specific strategies. For the model simulations we simply assumed that states were corrupted by a decaying noise function:

$$n_t = n_0 \, e^{-t / rte}$$

where $n_t$ is the noise level at trial $t$, $n_0$ the initial noise level (randomly drawn from a uniform distribution within the interval [0.3, 0.7]), and $rte$ the decay rate, which was set to 3. This meant that on early trials in a block there was a higher chance of basing the action on the wrong representation (at random), rather than following the appropriate value function; actions on later trials had a decreasing probability of being chosen at random. This approach is a combination of the classic ε-greedy policy and the standard SoftMax action-selection policy in RL.
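A sketch of the assumed noise schedule and its effect on simulated action selection (the exponential form above and all toy values here are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
Q = np.array([0.9, 0.1])      # toy value function over the two fruit options

def noise_level(t, n0, rte=3.0):
    # Reconstructed decaying noise: n_t = n0 * exp(-t / rte), with rte = 3.
    return n0 * np.exp(-t / rte)

n0 = rng.uniform(0.3, 0.7)    # initial noise, drawn from U(0.3, 0.7)
for t in range(8):
    if rng.random() < noise_level(t, n0):
        action = rng.integers(2)      # early trials: random action (corrupted state)
    else:
        action = int(np.argmax(Q))    # later trials: greedy on the value function
    print(t, action)
```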
The hyperparameter values were sampled from the set obtained from maximum-likelihood fits of participants' data. We simulated 45 artificial agents solving 20 blocks each. The procedure was repeated 100 times for each block with new random seeds. We used two metrics to compare the efficiency of the two models: learning speed (same as above, the time to learn the rule of a block), and the fraction of failed blocks (blocks in which the rule was not learned within the 80-trial limit).
We performed a parameter recovery analysis for the simple Feature RL and Abstract RL models based on the data from the main experiment. This analysis was done to confirm that the fitted hyperparameters had sensible values and that the models themselves were a sensible description of human choice behaviour 40. Using the same noisy procedure described above, we generated one more simulated dataset, using the exact blocks that were presented to the 33 participants. The blocks from the simulated data were then sorted according to their length, and the hyperparameters α and β were fitted by maximizing the likelihood of the estimated actions given the true model actions. Figure S4 reports the scatter plot and correlation between hyperparameters estimated from participants' data and recovered hyperparameter values, showing good agreement notwithstanding the noise in the estimation.
For the data from the behavioural session after multivoxel neurofeedback, blocks were first divided according to whether the targeted feature was relevant or irrelevant to the rule of a given block. We then applied the HBI procedure described above on all participants, with all blocks of the same type (e.g. targeted feature relevant) ordered by length. This allowed us to compute, based on whether the targeted feature was relevant or irrelevant, the difference in frequency between the models. We resampled with replacement to produce distributions of the mean population bias for each block type.

Regions of interest and BOLD signal extraction

The vmPFC ROI was defined based on the significant voxels from the GLM1 'High value' > 'Low value' contrast at p(fpr) < 0.0001, within the OFC. All subsequent analyses were performed using MATLAB v9.5.0.94 (r2018b). Once ROIs were individually identified, time-courses of BOLD signal intensities were extracted from each voxel in each ROI and shifted by 6 sec to account for the hemodynamic delay (assumed fixed). A linear trend was removed from the time-courses, which were further z-score normalized for each voxel in each block to minimize baseline differences across blocks. The data samples for computing the individual feature identity decoders were created by averaging the BOLD signal intensities of each voxel over 2 volumes, corresponding to the 2 sec from stimulus (character) onset to fruit options onset.
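A minimal sketch of this extraction pipeline (the TR, array shapes and variable names are our assumptions):

```python
import numpy as np
from scipy.signal import detrend

TR = 2.0                     # assumed repetition time (sec)
HEMO_SHIFT = int(6 / TR)     # shift by 6 sec for the (fixed) hemodynamic delay

def decoder_samples(bold_block, stim_onsets_tr):
    """bold_block: (n_TRs, n_voxels) BOLD time-course of one ROI in one block."""
    ts = detrend(bold_block, axis=0)         # remove linear trend
    ts = (ts - ts.mean(0)) / ts.std(0)       # z-score per voxel, per block
    samples = [ts[o + HEMO_SHIFT:o + HEMO_SHIFT + 2].mean(0)  # average 2 volumes
               for o in stim_onsets_tr]
    return np.vstack(samples)

# Toy block: 120 TRs x 400 voxels, one stimulus every 4th TR.
rng = np.random.default_rng(4)
X = decoder_samples(rng.normal(size=(120, 400)), np.arange(0, 100, 4))
print(X.shape)  # (25, 400): one sample per trial, ready for decoder training
```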

Decoding: multivoxel pattern analysis (MVPA)
All MVP analyses followed the same procedure. We used sparse logistic regression (SLR) 71 , to automatically select the most relevant voxels for the classification problem. This allowed us to construct several binary classifiers (e.g. feature id.: colour -red vs green, stripes orientation -horizontal vs vertical, mouth direction -right vs left).
Cross-validation was used for each MVPA by repeatedly subdividing the dataset into a "training set" and a "test set" in order to evaluate the predictive power of the trained (fitted) model. The number of folds was set at k=50. On each fold, 80% of the data was assigned to the training set, and the remaining to the test set. The samples assigned to either set were randomly chosen in each fold. Furthermore, SLR classification was optimized by using an iterative approach 72 : in each fold of the cross-validation, the feature-selection process was repeated 10 times. On each iteration, the selected features (voxels) were removed from the pattern vectors, and only features with unassigned weights were used for the next iteration. At the end of the cross-validation, the test accuracies were averaged for each iteration across folds, in order to evaluate the accuracy at each iteration. The number of iterations yielding the highest classification accuracy was then used for the final computation. Results ( Figure 5C) report the cross-validated average of the best yielding iteration.
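The sketch below approximates this procedure, using L1-regularized logistic regression as a stand-in for SLR (which relies on automatic relevance determination); fold counts and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_sparse_cv(X, y, n_iter=10, n_folds=50, train_frac=0.8, seed=0):
    """Per fold: repeatedly fit a sparse classifier, record test accuracy, then
    remove the selected (nonzero-weight) voxels before the next iteration."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n_iter)
    for _ in range(n_folds):
        idx = rng.permutation(len(y))
        n_tr = int(train_frac * len(y))
        tr, te = idx[:n_tr], idx[n_tr:]
        remaining = np.arange(X.shape[1])
        for it in range(n_iter):
            if remaining.size == 0:
                break
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
            clf.fit(X[np.ix_(tr, remaining)], y[tr])
            acc[it] += clf.score(X[np.ix_(te, remaining)], y[te]) / n_folds
            selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-12)
            remaining = np.delete(remaining, selected)  # drop used voxels
    return acc  # report the iteration with the highest average accuracy

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))              # toy: 100 samples x 200 voxels
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # label carried by 5 voxels
print(round(iterative_sparse_cv(X, y).max(), 2))
```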
For the multivoxel neurofeedback experiment, we used the entire dataset to train the classifier in VC. Thus, each classifier resulted in a set of weights assigned to the selected voxels; these weights could be used to classify any new data sample -and therefore, compute a likelihood of it belonging to the target class.

Real-time multivoxel neurofeedback and fMRI pre-processing
As in previous studies 51,52,73 , during the multivoxel neurofeedback manipulation, participants were instructed to modulate their brain activity, in order to enlarge a feedback disc and maximize their cumulative reward. Brain activity patterns measured through fMRI were used in real time to compute the feedback score. Unbeknownst to participants, the feedback score, ranging from 0 to 100 (empty to full disc), represented the likelihood of a target pattern occurring in their brain at measurement time. Each trial started with an induction period of 6 sec, during which participants viewed a cue (small grey circle) which instructed them to modulate their brain activity. This period was followed by a 6 sec rest interval, and then by a 2 sec feedback, during which the disc appeared on the screen. Lastly, each trial ended with a 6 sec inter-trial interval (ITI). Each block was composed of 12 trials, and one session could last up to 10 blocks (depending on time availability). Participants did 2 sessions on consecutive days. Within a session the maximum monetary bonus was 3,000 JPY.
The feedback was calculated through the following steps. In each block, the initial 10 sec of fMRI data were discarded to avoid unsaturated T1 effects. First, newly measured whole-brain functional images underwent 3D motion correction using Turbo BrainVoyager (Brain Innovation, Maastricht, Netherlands). Second, time-courses of BOLD signal intensities were extracted from each of the voxels identified in the decoder analysis for the target ROI (VC).
Third, the time-course was detrended (removal of linear trend), and z-score normalized for each voxel using BOLD signal intensities measured up to the last point. Fourth, the data sample to calculate the target likelihood was created by taking the average BOLD signal intensity of each voxel over the 6 sec (6 TRs) 'induction' period. Finally, the likelihood of each feature level (e.g. right vs left mouth direction) being represented in the multivoxel activity pattern was calculated from the data sample using the weights of the previously constructed classifier.
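A schematic of this per-trial computation (the variable names, shapes, and logistic read-out below are our assumptions standing in for the trained SLR decoder):

```python
import numpy as np
from scipy.signal import detrend

def feedback_score(bold_block, weights, bias, voxel_idx, induction_trs):
    """Detrend and z-score the decoder voxels, average over the induction
    period, then apply the trained classifier to obtain the target-pattern
    likelihood, scaled to the 0-100 disc size."""
    ts = bold_block[:, voxel_idx]            # time x selected voxels
    ts = detrend(ts, axis=0)                 # remove linear trend
    # z-score per voxel (the online version uses only samples up to 'now')
    ts = (ts - ts.mean(0)) / ts.std(0)
    sample = ts[induction_trs].mean(0)       # average over the 6-sec induction
    likelihood = 1.0 / (1.0 + np.exp(-(sample @ weights + bias)))
    return 100.0 * likelihood                # 0 (empty disc) to 100 (full disc)

# Toy example: 20 TRs x 500 voxels; decoder defined over 50 selected voxels.
rng = np.random.default_rng(3)
bold = rng.normal(size=(20, 500))
vox = rng.choice(500, 50, replace=False)
score = feedback_score(bold, rng.normal(size=50), 0.0, vox, np.arange(3, 9))
print(round(score, 1))
```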