Model-based value in midbrain dopamine signals

Midbrain dopamine activity is thought to represent reward prediction errors (RPEs) used to update the value of stimuli and/or actions. However, it remains unclear what sources of value information are available to dopamine neurons, and to what extent values derived from internal models inform dopaminergic RPEs. To assess how midbrain dopamine activity is influenced by internal models of task structure, we trained mice in a multi-step probabilistic decision-making task with changing reward contingencies, and performed photometry recordings from dopamine neurons in the ventral tegmental area (VTA) and dopamine axons in the nucleus accumbens (NAc) and dorsomedial striatum (DMS). Our results indicate that dopamine activity in VTA and NAc terminals is influenced by value information derived from models of task structure. By contrast, value information was absent from activity in DMS dopamine axons, which was instead strongly modulated when mice made choices towards the option contralateral to the recording site.


Introduction
Changing environments require animals to flexibly adapt their actions to changes in the world's contingencies. Such behavioural flexibility is thought to be aided by rich internal models of the rules and statistical relationships between external events and actions, which allow animals to predict the consequences of actions and update these predictions when actual outcomes differ from them (Tolman, 1948; Daw & Dayan, 2014; Doll, Duncan, Simon, Shohamy, & Daw, 2015).
However, a simpler, though less flexible, strategy is to repeat actions that were previously rewarded. This requires only storing (or 'caching') the value of each action and updating that value, via a reward prediction error (RPE), when the outcome differs from the predicted one. This strategy underlies model-free behaviour (Sutton & Barto, 1998).
Classically, activity in dopamine neurons has been reported to reflect a cached value of actions and to convey a signal consistent with a model-free RPE (Schultz, Dayan, & Montague, 1997; Eshel, Tian, Bukwich, & Uchida, 2016), informing and guiding behaviour (Steinberg et al., 2013; Hamid et al., 2016). However, some recent studies have suggested the presence of higher-dimensional signals in dopamine activity (Sadacca, Jones, & Schoenbaum, 2016; Takahashi et al., 2017; Engelhard et al., 2019), which can allow stimulus-stimulus associations to be learned (Sharpe et al., 2017, 2019). Previous work in humans introduced a sequential decision-making task in which model-free and model-based behaviour could be dissociated, the 'two-step' task (Gläscher, Daw, Dayan, & O'Doherty, 2010; Daw, Gershman, Seymour, Dayan, & Dolan, 2011). Using fMRI, Daw et al. (2011) showed that activation in the NAc could not be explained by a pure model-free computation, but instead reflected both model-free and model-based predictions weighted by their influence on choice behaviour. However, BOLD signal changes cannot be directly related to dopamine, so it remains unclear to what extent dopamine activity is itself directly influenced by such model-based predictions.
To investigate this issue, we used a version of this task adapted for behaving mice (Akam et al., 2017) and employed fibre photometry to determine whether bulk activity of genetically defined midbrain dopamine cells can also reflect model-based computations. In addition, given the evidence that the computations supported by dopamine cell firing and by release in terminal regions may differ (Berke, 2018), we compared the activity of VTA dopamine cells with that of their axons in two target regions, NAc and DMS.

Methods
We trained mice on a two-step decision task, adapted from that developed for humans by Daw et al. (2011). The apparatus comprised a central initiation port flanked left and right by 'choice ports' and above and below by 'second-step' ports in which the mice could receive rewards (Fig 1A).
Subjects initiated a trial in the central port, then chose between the left and right ports. Each choice port commonly (80% of trials) caused one of the second-step ports to light up, and rarely (20% of trials) caused the other second-step port to light up. Poking the illuminated second-step port delivered reward with probabilities that changed in blocks. In non-neutral blocks, one second-step port had 80% reward probability and the other 20%, while in neutral blocks both second-step ports had 50% reward probability. Mice therefore had to learn to select the choice port that commonly led to the second-step port with high reward probability. Once mice consistently selected the correct choice, a block transition was triggered following a random delay and the second-step reward contingencies changed.
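The trial structure described above can be sketched as a short simulation. This is only an illustrative sketch, not the task code used in the experiments: the function names, the port labels and, in particular, the assignment of which second-step port each choice commonly leads to are assumptions, and block transitions are omitted.

```python
import random

# Assumed mapping (illustrative only): the second-step port that each choice
# commonly leads to. The text does not specify the actual assignment.
COMMON_SECOND_STEP = {"left": "up", "right": "down"}
P_COMMON = 0.8  # probability of a common transition (from the text)


def run_trial(choice, reward_probs, rng):
    """Simulate one trial; returns (second_step, transition, rewarded)."""
    common = COMMON_SECOND_STEP[choice]
    rare = "down" if common == "up" else "up"
    if rng.random() < P_COMMON:
        second_step, transition = common, "common"
    else:
        second_step, transition = rare, "rare"
    rewarded = rng.random() < reward_probs[second_step]
    return second_step, transition, rewarded


def best_choice(reward_probs):
    """Choice that commonly leads to the better second-step port,
    or None in a neutral block (equal reward probabilities)."""
    if reward_probs["up"] == reward_probs["down"]:
        return None
    best_port = max(reward_probs, key=reward_probs.get)
    return next(c for c, s in COMMON_SECOND_STEP.items() if s == best_port)


if __name__ == "__main__":
    rng = random.Random(0)
    block = {"up": 0.8, "down": 0.2}  # a non-neutral block
    print(best_choice(block))
    print(run_trial("left", block, rng))
```

With the probabilities given in the text, the correct choice in a non-neutral block is rewarded on 0.8 × 0.8 + 0.2 × 0.2 = 68% of trials on average, and the other choice on 32%.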
We recorded bulk calcium activity in midbrain dopamine neurons and their projections to NAc and DMS using fibre photometry. DAT-cre mice were injected bilaterally in VTA with AAVs expressing GCaMP6f and TdTomato. Three optic fibres were implanted in each mouse, targeting VTA, NAc and DMS.

Behaviour
Subjects learned to track which option was currently best, performing ∼400 trials and >8 reversal blocks in each session (Fig 1B).
Choice behaviour was consistent with a model-based reinforcement learning strategy (Daw et al., 2011), with trial outcome (rewarded or not) and state transition (common or rare) interacting to determine subsequent choice; i.e. subjects tended to repeat choices following rewarded common transitions and non-rewarded rare transitions (Fig 1C). Logistic regression using the trial history to predict choice showed a strong effect of both the transition-outcome interaction and the state transition on choices over multiple subsequent trials, but minimal direct influence of the trial outcome (Fig 1D).
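One way the trial-history predictors for such a logistic regression could be coded is sketched below. The ±1 coding, the sign convention for choices and the function name are assumptions made for illustration; the text does not specify the exact coding used.

```python
def code_predictors(choice, transition, rewarded):
    """Code one past trial as signed predictors of repeating its choice.

    choice: +1 or -1 for the two choice ports (sign convention assumed).
    transition: "common" or "rare". rewarded: bool.
    Each predictor is signed by the choice made, so a positive regression
    weight means 'more likely to make that choice again'.
    """
    o = 1 if rewarded else -1
    t = 1 if transition == "common" else -1
    return {
        "outcome": choice * o,
        "transition": choice * t,
        "transition_x_outcome": choice * t * o,
    }
```

Under this coding, the interaction predictor favours repeating the choice after rewarded common and non-rewarded rare transitions, which is the pattern described above as characteristic of a model-based strategy.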

Dopamine activity
As expected, calcium activity in VTA and NAc increased at the time of reward, and decreased on reward omission (Fig. 2A). Surprisingly, in DMS the opposite modulation was observed, with lower calcium activity following reward than reward omission.
Dopamine activity in each region was not only modulated at the time of reward delivery, but showed a rich pattern of activity across the different trial stages. To disentangle which behavioural variables were driving dopamine activity at different time points, we used a linear regression analysis predicting trial-by-trial calcium activity as:

$$y(i,t) = \sum_p \beta_p(t)\, X_p(i) + c(t) + \varepsilon(i,t)$$

where $y(i,t)$ is the calcium activity on trial $i$ at time-point $t$, $\beta_p(t)$ is the weight for predictor $p$ at time-point $t$, $X_p(i)$ is the value of predictor $p$ on trial $i$, $c(t)$ is the intercept at time-point $t$, and $\varepsilon(i,t)$ is the residual (unexplained variance). Fig. 2B shows the predictor weights $\beta_p(t)$ obtained by fitting the model to activity in each region. Consistent with the average traces, reward on the current trial strongly increased dopamine activity in VTA and NAc when reward information became available, with a faster timescale in NAc than VTA. Reward had a negative and slower influence on calcium activity in DMS terminals.
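A regression of this form, with a separate set of weights at each time-point, can be fitted in a single least-squares call. The sketch below uses NumPy and assumes the trial-aligned activity is already arranged as a trials × time-points array; the function name and array layout are illustrative assumptions, and ordinary least squares is assumed rather than taken from the text.

```python
import numpy as np


def fit_timecourse_regression(Y, X):
    """Fit y(i,t) = sum_p beta_p(t) * X_p(i) + c(t) by ordinary least squares.

    Y: array (n_trials, n_timepoints) of trial-aligned calcium activity.
    X: array (n_trials, n_predictors) of per-trial predictor values.
    Returns (betas, intercept) with shapes
    (n_predictors, n_timepoints) and (n_timepoints,).
    """
    n_trials = X.shape[0]
    # Append a column of ones so the intercept c(t) is fitted with the weights.
    design = np.column_stack([X, np.ones(n_trials)])
    # lstsq solves every time-point (column of Y) independently.
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[:-1], coef[-1]
```

Each row of the returned `betas` array is then the time course of one predictor's weight, the quantity plotted as β_p(t).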
We next asked how the previous trial's outcome (rewarded or not) affected dopamine activity as a function of whether the second-step state reached on the current trial was the same as or different from that on the previous trial. When the second-step state was the same, reward on the previous trial increased dopamine activity in VTA and NAc at the time when the second-step state was revealed, consistent with an RPE driven by the value of the second-step state. In NAc but not VTA this influence reversed at outcome time, consistent with the second-step state value's influence on the outcome-time RPE. Crucially, however, if the second-step state was different from that on the previous trial, the previous reward had the opposite effect, reducing dopamine activity when the second-step state was revealed. This is consistent with subjects understanding the negative correlation between the reward probabilities and inferring that reward in one second-step state reduces the likelihood that the other state has high reward probability, i.e. that mice were inferring a single latent variable about the state of the reward probabilities rather than independent values for each second-step state. No modulation by previous reward was observed in DMS. We also constructed predictors which coded the direction in which a model-based and a model-free value update on the previous trial would affect the value of the action chosen on the current trial. Dopamine activity in VTA, and less clearly in NAc, was increased at choice time when the model-based value update was positive, consistent with model-based action value estimates contributing to an RPE once the choice is made. The direction of model-free action value updates also weakly influenced VTA activity at the same time-point. In NAc, the loadings on these action value update predictors showed a complex temporal pattern that is difficult to interpret. Neither of these predictors explained activity in DMS terminals.
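The split of the previous trial's outcome by whether the second-step state repeated can be sketched as two predictors. The ±1/0 coding and the function name are assumptions made for illustration and are not taken from the analysis code.

```python
def previous_reward_predictors(second_steps, rewards):
    """Build (same_state, different_state) predictors for trials 2..n.

    second_steps: sequence of second-step states, one per trial.
    rewards: sequence of bools, one per trial.
    The previous outcome is coded +1 (rewarded) or -1 (not rewarded) and is
    assigned to whichever predictor matches the current trial; the other
    predictor is 0.
    """
    rows = []
    # Pair each trial's state and outcome with the next trial's state.
    for prev_state, prev_reward, cur_state in zip(
        second_steps, rewards, second_steps[1:]
    ):
        signed = 1 if prev_reward else -1
        if cur_state == prev_state:
            rows.append((signed, 0))
        else:
            rows.append((0, signed))
    return rows
```

In a regression of activity at the time the second-step state is revealed, the pattern reported above corresponds to a positive weight on the first predictor and a negative weight on the second.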
Finally, we looked at how direction of movement influenced population activity during the trial. Activity in all three areas showed some modulation by whether the choice required an ipsi- or contralateral movement relative to the recording site. This was particularly striking in DMS terminals, which showed a strong increase in activity between trial initiation and choice when mice chose the contralateral poke, suggesting that activity in DMS encoded the initial action choice in a lateralised way, independently of action or state values.

Conclusion
We have presented data from dopamine population recordings during a multi-step probabilistic reversal learning task in mice. Mice were able to track the best option across reversals, exhibiting choice behaviour consistent with model-based reinforcement learning. Photometry recordings from midbrain dopamine neurons and their projections to NAc showed evidence of value information which respected the task structure, including the anti-correlated nature of the reward probabilities and the transition structure linking actions and states. By contrast, dopamine activity in DMS was primarily influenced by the direction of the animals' initial chosen action. Together, these results indicate that dopamine activity carries multiple representations beyond model-free RPEs.