Dissociable Reward and Timing Signals in Human Midbrain and Ventral Striatum

Summary
Reward prediction error (RPE) signals are central to current models of reward-learning. Temporal difference (TD) learning models posit that these signals should be modulated by predictions, not only of magnitude but also timing of reward. Here we show that BOLD activity in the VTA conforms to such TD predictions: responses to unexpected rewards are modulated by a temporal hazard function and activity between a predictive stimulus and reward is depressed in proportion to predicted reward. By contrast, BOLD activity in ventral striatum (VS) does not reflect a TD RPE, but instead encodes a signal on the variable relevant for behavior, here timing but not magnitude of reward. The results have important implications for dopaminergic models of cortico-striatal learning and suggest a modification of the conventional view that VS BOLD necessarily reflects inputs from dopaminergic VTA neurons signaling an RPE.
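The temporal hazard function mentioned in the Summary gives the probability that a reward arrives at a given time, conditional on it not having arrived earlier. As a minimal sketch, assuming a discrete set of possible delivery times, it could be computed as follows. The delivery times and probabilities are hypothetical placeholders, chosen only so that the mean interval is 6 seconds; they are not the distributions used in the experiment, and `hazard_function` is an illustrative name.

```python
def hazard_function(probabilities):
    """Return h[i] = P(reward at step i | no reward before step i).

    Assumes `probabilities` sums to 1 over the possible delivery times.
    """
    hazard = []
    for i, p in enumerate(probabilities):
        # Probability that the reward has not arrived before step i.
        survival = sum(probabilities[i:])
        hazard.append(p / survival if survival > 0 else 0.0)
    return hazard


# Hypothetical CS-US intervals (seconds) with a uniform distribution.
delivery_times = [2, 4, 6, 8, 10]
probabilities = [0.2, 0.2, 0.2, 0.2, 0.2]

hazard = hazard_function(probabilities)
for t, h in zip(delivery_times, hazard):
    print(f"t = {t:2d} s  hazard = {h:.3f}")
```

With a uniform distribution the hazard rises with elapsed time, so a late reward is more expected than an early one; other distributions give differently shaped hazard functions.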


Inventory of Supplemental Information
Summary of behavioural results produced on the task
Behavioural results: Shown are the distributions of all subjects' timing estimates on instrumental test trials with CSs predicting fixed timing (top) and variable timing (bottom). In both cases, subjects' estimates are close to the mean CS-US interval of 6 seconds, showing that subjects acquired a good representation of CS-US timings. In variable timing trials, estimates are given earlier and are more variable.
Figure S2 illustrates VTA noise correction methods underlying the data in Figure 2

Figure S2
A Independent component analysis (ICA) was used to identify and remove physiological and motion artefacts from the data. This greatly increased the sensitivity to responses from VTA. Illustrated is the ratio of signal variances observed in different regions of the brain before versus after ICA correction (1 (dark): no change, >1 (bright): variance reduced). ICA removed variance not only in areas sensitive to physiological noise, but also near boundaries where subject motion introduces substantial variance.
Note how bright the VTA appears in this image in particular. B Regressors for breathing and pulse were included in the general linear model which was applied to the ICA-corrected data. This further reduced the variance in mid-brain regions. Colour code as in A. VTA coordinates were chosen as in  bottom: groupS). The modulations predicted by the hazard function can be observed in the raw data: In both groups, an early unexpected reward leads to a stronger response than one delivered at the most expected time. However, a late unexpected reward leads to a stronger response only in groupU.
Shown is the mean ± SEM.  A+B: BOLD response to unexpected rewards in variable timing trials, averaged across delivery times for VTA (A) and striatum (B). Not a single voxel shows a significant increase to an unexpected reward across both groups in the striatum, while a large overlap can be observed in the mid-brain.
This suggests fundamental differences in processing between the two structures. C+D: BOLD response to all CS signalling fixed timing in VTA (C) and striatum (D); large parts of mid-brain and striatum are active. To maximise selectivity for voxels reflecting dopaminergic processes, the VTA ROI was based on the contrast depicted in A. Because no voxels in VS were significant in B, the VS ROI was based on the contrast shown in D. However, all key findings hold true when the VTA ROI is defined based on the contrast depicted in C, or when VS is defined anatomically instead of functionally, showing that the results of this study are independent of the method of ROI selection. In A-D, the overlap of voxels significant at Z>2.4 in both groups is shown. E BOLD timecourses from VS are split by early, middle and late CS-US intervals as in Figure S3. The modulation over time observed in VTA is not present in VS. F-K BOLD timecourses extracted from an ROI in the ventral putamen (F-H) and dorsal striatum (caudate; I-K) are shown as in Figure 4. The ROIs were defined based on previously reported coordinates. As observed in ventral striatum, BOLD responses in ventral putamen and dorsal striatum also do not encode a reward prediction error, and they are not modulated by the temporal hazard function in either group. E, G, H, J, K all show mean ± SEM.
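The overlap maps in A-D amount to a conjunction: a voxel is kept only if its Z statistic exceeds the threshold in both groups. A minimal sketch, assuming the two group-level Z maps are available as flat lists with one value per voxel in the same order. The values and the name `conjunction_mask` are hypothetical, and this is not the analysis software used in the study.

```python
Z_THRESHOLD = 2.4  # threshold stated in the figure legend


def conjunction_mask(z_group_a, z_group_b, threshold=Z_THRESHOLD):
    """Return True for voxels exceeding the threshold in both maps."""
    if len(z_group_a) != len(z_group_b):
        raise ValueError("Z maps must contain the same number of voxels")
    return [a > threshold and b > threshold
            for a, b in zip(z_group_a, z_group_b)]


# Hypothetical Z values for five voxels in each group.
z_group_s = [3.1, 2.0, 2.5, 0.4, 4.2]
z_group_u = [2.9, 3.3, 2.4, 0.1, 2.6]

mask = conjunction_mask(z_group_s, z_group_u)
n_overlap = sum(mask)  # number of voxels significant in both groups
```

A voxel that passes in only one group, or that sits exactly at the threshold, is excluded from the overlap.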
Table S1 summarizes behavioural performance on the task in Figure 1
Timing estimates in test trials (seconds)

Supplemental Experimental Procedures

Alternative ROI definitions
Whilst our selection method introduced no statistical bias for either the VTA or the VS ROI, readers might nevertheless question the choice of different methods of ROI definition in the two structures. As explained in the main text, and shown in Figure S4A-D, we could not define the VS using a reward-sensitive contrast as it was not reward sensitive in our task. An alternative would be to define the VTA using a cue-sensitive contrast. We decided against this approach because a large region including the VTA and neighbouring structures shows cue sensitivity (Figure S4C), whereas a much more focal region in the immediate vicinity of the VTA is reward sensitive (Figure S4A). A reward-sensitive contrast was therefore much more likely to reflect activity derived from the VTA. However, as shown below, the statistical dissociations that we report are present with either method of VTA definition.
A second alternative would be to define both structures anatomically. We found that all effects reported for VS remain significant if the VS ROI is defined anatomically based on the Harvard Subcortical Structures Atlas, or based on previously reported coordinates (see below). However, for the VTA, an anatomical ROI definition is not straightforward. There are two reasons for this. First, it is a very small structure even on an anatomical image. Second, and most importantly, it lies in a region of the brain that is among the least likely to co-localize between functional and structural scans. As its position is displaced by susceptibility-induced distortions, we acquired field maps to correct this distortion within the limitations of current methodology. Nevertheless, we acknowledge it is unlikely that an anatomical ROI will be as sensitive as a functional localizer.

Additional VTA ROIs
Here, we re-performed all statistical tests in two different ways, both of which define the VS and VTA based on the same contrast.
An alternative VTA ROI was defined using the response to all fixed timing cues, the contrast

VS response to variable timing trials
Furthermore, unlike in VTA, the BOLD signal to unpredictable rewards in variable timing trials did not conform with the group-relevant temporal hazard function (Figure 4C and S4E

Behaviour in groupS versus groupU
To show that the absence of an RPE effect in VS cannot be due to a lack of learning in one of the groups, we directly compared the performance on test trials across groups. The number of successful timing predictions on test trials did not differ between groupU and groupS (t(13) = 1.92, p = 0.14), and there was no difference in average precision (two-sample t-tests on (a) all timing estimates: t(26) = -0.09, p = 0.93; (b) fixed timing estimates: t(26) = 0.29, p = 0.78; (c) variable timing estimates: t(26) = -0.35, p = 0.73). All results described for the combined data of both groups held true when examining each group on its own (Kolmogorov-Smirnov test: p < 0.001 for both groups; average estimates non-significantly different from 6 s: p > 0.5 in both groups). Thus, results in VS cannot be explained by poor learning of CS-US intervals in either group.
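The precision comparisons above are two-sample t-tests. As a sketch of the statistic, assuming equal variances in the two groups (Student's pooled-variance form; the text does not state which variant was used), the t value and its degrees of freedom can be computed as below. The data are made-up placeholders and `two_sample_t` is an illustrative name. Obtaining a p-value would additionally require the t distribution, which is omitted here.

```python
import math
import statistics


def two_sample_t(sample_a, sample_b):
    """Return (t, df) for Student's pooled-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (sample variances use n - 1).
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df


# Hypothetical timing estimates (seconds) for two small groups.
group_s = [5.5, 6.0, 6.5]
group_u = [6.5, 7.0, 7.5]

t_value, df = two_sample_t(group_s, group_u)
print(f"t({df}) = {t_value:.2f}")
```

In this form the degrees of freedom equal the combined sample size minus two.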