Competing neural representations of choice shape evidence accumulation in humans

Making adaptive choices in dynamic environments requires flexible decision policies. Previously, we showed how shifts in outcome contingency change the evidence accumulation process that determines decision policies. Using in silico experiments to generate predictions, here we show how the cortico-basal ganglia-thalamic (CBGT) circuits can feasibly implement shifts in decision policies. When action contingencies change, dopaminergic plasticity redirects the balance of power, both within and between action representations, to divert the flow of evidence from one option to another. When competition between action representations is highest, the rate of evidence accumulation is the lowest. This prediction was validated in in vivo experiments on human participants, using fMRI, which showed that (1) evoked hemodynamic responses can reliably predict trial-wise choices and (2) competition between action representations, measured using a classifier model, tracked with changes in the rate of evidence accumulation. These results paint a holistic picture of how CBGT circuits manage and adapt the evidence accumulation process in mammals.


Introduction
Choice is fundamentally driven by information. The process of deciding between available actions is continually updated using incoming sensory signals, processed at a given accumulation rate, until sufficient evidence is reached to trigger one action over another (2, 3). The parameters of this evidence accumulation process are highly plastic, adjusting to both the reliability of sensory signals (1, 4-7) and previous choice history (8-13), to balance the speed of a given decision with local demands to choose the correct action.
We recently showed how environmental change influences the decision process by periodically switching the reward associated with a given action in a 2-choice task (1). This reward contingency change induces competition between old and new action values, leading to a shift in preference toward the new most rewarding option. This internal competition prompts humans to dynamically reduce the rate at which they accumulate evidence (drift rate in a normative drift diffusion model, DDM (3)) and sometimes also to increase the threshold of evidence they need to trigger an action (boundary height) (1). The result is a shift of the decision policy toward a slower, more exploratory mode. A natural candidate substrate for implementing this adaptation is the cortico-basal ganglia-thalamic (CBGT) circuit, whose pathways are organized into parallel channels that qualitatively reflect independent action representations (27-31).
Within these action channels, activation of the direct pathway, via cortical excitation of D1-expressing spiny projection neurons (SPNs) in the striatum, releases GABAergic signals that can suppress activity in the CBGT output nucleus (the internal segment of the globus pallidus, GPi, in primates, or the substantia nigra pars reticulata, SNr, in rodents) (26, 32-34). This relieves the thalamus from tonic inhibition, thereby exciting postsynaptic cortical cells and facilitating action execution. Conversely, activation of the indirect pathway, via D2-expressing SPNs in the striatum, controls firing in the external segment of the globus pallidus (GPe) and the subthalamic nucleus (STN), resulting in strengthened basal ganglia inhibition of the thalamus. This weakens drive to postsynaptic cortical cells and reduces the likelihood that an action is selected in cortex.
Critically, the direct and indirect pathways converge in the GPi/SNr (35, 36). This suggests that these pathways compete to control whether each specific action is selected (37). The apparent winner-take-all selection policy and action-channel-like coding (27-31) also imply that action representations themselves compete. Altogether, this neuroanatomical evidence suggests that competition both between and within CBGT pathways controls the rate of evidence accumulation during decision making (12, 15, 21).
To illustrate this process, we designed a spiking neural network model of the CBGT circuits, shown in Fig. 1A, with dopamine-dependent plasticity occurring at the corticostriatal synapses (17, 38). The network performed a probabilistic 2-arm bandit task with switching reward contingencies (see Supp. Methods) that followed the same general structure as our prior work (1), with the exception that block switches were deterministic for the model, happening every 10 trials, whereas in the human experiments they were generated probabilistically to increase participants' uncertainty about the timing of outcome switches. In brief, the network selected one of two targets, each of which returned a reward according to a specific probability distribution. The relative reward probabilities for the two targets were held constant at 75% and 25%, and the action-outcome contingency was changed every 10 trials.
For the purposes of this study, we focus primarily on the neural and behavioral effects that occur around the switching of the optimal target. We used four different network instances (see Supp. Methods) as a proxy for simulating individual differences across human participants.
Figure 1B shows the firing rates of dSPNs and iSPNs in the left action channel, time-locked to selection onset (when thalamic units exceed 30 Hz, t = 0), for both fast (<196 ms) and slow (>314.5 ms) decisions (see Fig. 1 - Fig. Supp. 1 for node-by-node firing rates). As expected, the dSPNs show a ramping of activity as decision onset approaches, and the slope of this ramp scales with response speed. In contrast, iSPN firing is sustained during slow movements and ramps weakly during fast movements. However, iSPN firing was relatively insensitive to left versus right decisions. This is consistent with our previous work showing that differences in direct pathway activity track primarily with choice, while indirect pathway activity modulates overall response speed (12, 21), as supported by experimental studies (39-41).
We then modeled the behavior of the CBGT network using a hierarchical version of the DDM (42), a canonical formalism for the process of evidence accumulation during decision making (3) (Fig. 2A). This model returns four key parameters with distinct influences on evidence accumulation. The drift rate (v) represents the rate of evidence accumulation, the boundary height (a) represents the amount of evidence required to cross the decision threshold, the non-decision time (t) is the delay in the onset of the accumulation process, and the starting bias (z) is a bias to begin accumulating evidence for one choice over another (see Methods section).
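To make the roles of these four parameters concrete, here is a minimal simulation of a single DDM trial using Euler-Maruyama integration. The parameter values are illustrative assumptions; this is a sketch of the standard first-passage process, not the hierarchical fitting procedure used in the paper.

```python
import numpy as np

def simulate_ddm_trial(v, a, t_nd, z, dt=0.001, noise=1.0, max_t=5.0, rng=None):
    """Simulate one drift diffusion trial.

    v    : drift rate (rate of evidence accumulation)
    a    : boundary height (evidence required to respond)
    t_nd : non-decision time in seconds
    z    : starting bias as a fraction of a (0.5 = unbiased)
    Returns (choice, reaction_time); choice is 1 for the upper
    boundary, 0 for the lower, and None if no boundary is reached.
    """
    rng = rng or np.random.default_rng()
    x = z * a                              # evidence starts between 0 and a
    n_steps = int(max_t / dt)
    for step in range(n_steps):
        x += v * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if x >= a:
            return 1, t_nd + (step + 1) * dt
        if x <= 0:
            return 0, t_nd + (step + 1) * dt
    return None, None

rng = np.random.default_rng(0)
trials = [simulate_ddm_trial(v=1.5, a=1.0, t_nd=0.3, z=0.5, rng=rng)
          for _ in range(500)]
upper = [rt for c, rt in trials if c == 1]
print(round(len(upper) / len(trials), 2))  # fraction of upper-boundary choices
```

With a positive drift rate and an unbiased start, most trials terminate at the upper boundary, and every reaction time includes the non-decision offset.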
We tracked internal estimates of action value and environmental change using trial-by-trial estimates of two ideal observer parameters: the belief in the value of the optimal choice (∆B) and the change point probability (Ω), respectively (see (1, 4) and Methods for details). Using these estimates, we evaluated how a suspected change in the environment and the belief in the optimal choice value influenced the underlying decision parameters. Consistent with prior observations in humans (1), we found that v and a were the most pliable parameters across experimental conditions for the network. Specifically, the model mapping ∆B to drift rate and Ω to boundary height and the model relating ∆B to drift rate alone provided equivocal best fits to the data across simulated participants (∆DIC_null = −29.85 ± 12.76 and ∆DIC_null = −22.60 ± 7.28, respectively; see (43) and Methods for guidelines on model fit interpretation). All other models failed to provide a better fit than the null model (Supp. File 1). Consistent with prior work (1), we found that the relationship between Ω and the boundary height was unreliable (mean β_a∼Ω = 0.069 ± 0.152; mean p = 0.232 ± 0.366). However, drift rate reliably increased with ∆B in three of four simulated participants (mean β_v∼∆B = 0.934 ± 0.386; mean p < 0.001; 3/4 simulated participants p < 0.001; Supp. File 2).
These effects reflect a stereotyped trajectory around a change point, whereby v immediately plummets and a briefly increases, with a quickly recovering and v slowly growing as reward feedback reinforces the new optimal target (1). Because prior work has shown that the change in v is more reliable than the change in a (1), and because v determines the direction of choice, we focus the remainder of our analysis on the control of v.
To test whether these shifts in v are driven by competition within and between action channels, we predicted the network's decision on each trial using a LASSO-PCR classifier trained on the pre-decision firing rates of the network (see Measuring neural action representations). The choice of LASSO-PCR was based on prior work building reliable classifiers from whole-brain evoked responses that maximize inferential utility (see (44)). The method is used when models are over-parameterized, as when there are more voxels than observations, relying on a combination of dimensionality reduction and sparsity constraints to find the true, effective complexity of a given model. While these are not considerations for our network model, they are for the human validation experiment that we describe next. Thus, we used the same classifier on our model as on our human participants, to directly compare theoretical predictions and empirical observations. Model performance was cross-validated at the run level using a leave-one-run-out procedure, resulting in 45 folds per subject (five runs for each of the nine sessions). We then classified all trials in the hold-out set to evaluate prediction accuracy. The cross-validated accuracy for the four models, simulating individual participants, is shown in Figure 2B as ROC curves. The classifier was able to predict the chosen action with approximately 75% accuracy (72-80%) for each simulated participant, with an average area under the curve (AUC) of approximately 0.75 (range 0.71-0.77).
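The LASSO-PCR approach can be sketched with scikit-learn: projection onto principal components followed by an L1-penalized (sparse) logistic regression, cross-validated with a leave-one-group-out scheme over runs. The data below are synthetic stand-ins; the feature count, signal strength, and run structure are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_trials, n_features = 300, 1000          # e.g., trials x voxels (or units)
X = rng.standard_normal((n_trials, n_features))
y = rng.integers(0, 2, n_trials)          # left/right choice labels
X[y == 1, :50] += 0.8                     # weak signal in a feature subset
runs = np.repeat(np.arange(10), 30)       # run labels for leave-one-run-out

# LASSO-PCR: reduce dimensionality with PCA, then fit a sparse
# (L1-penalized) logistic regression on the component scores.
model = make_pipeline(
    PCA(n_components=50),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
acc = cross_val_score(model, X, y, groups=runs,
                      cv=LeaveOneGroupOut(), scoring="accuracy")
print(round(acc.mean(), 2))               # cross-validated accuracy
```

Each fold holds out one run entirely, so the accuracy estimate is never contaminated by within-run dependence between trials.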
Examining the encoding pattern in the simulated network, we see lateralized activation over left and right action channels (Fig. 1A), with opposing weights in GPi and thalamus, and, to a lesser degree, contralateral encoding in STN and in both indirect and direct SPNs in striatum.
We do not observe contralateral encoding in cortex, which likely reflects the emphasis on basal ganglia structures and lumped representation of cortex in the model design.
To quantify the competition between action channels, we took the unthresholded prediction from the LASSO-PCR classifier, ŷ_t, and calculated its distance from the optimal target (i.e., the target with the highest reward probability) on each trial (Fig. 2C). This provided an estimate of the uncertainty driven by the separability of pre-decision activity across action channels. In other words, the distance from the optimal target should increase with increased co-activation of circuits that represent opposing actions. The decision to model aggregate trial dynamics with a classifier stems from the limitations of the hemodynamic response that we use next to vet the model predictions in humans. The low temporal resolution of the evoked BOLD signal makes finer-grained temporal analysis of the human data impossible, as the signal is a low-pass-filtered version of the aggregate response over the entire trial. We therefore chose to represent the macroscopic network dynamics as classifier uncertainty, which cleanly links the cognitive model results to both behavior and neural dynamics at the trial-by-trial level using only two variables (drift rate and classifier uncertainty). This approach allows us to directly compare model and human results.
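The uncertainty measure itself is simple: the absolute distance between the classifier's unthresholded prediction and the optimal target, as in the worked example from the figure captions (a predicted probability of 0.8 against an optimal value of 1 gives an uncertainty of 0.2). A minimal sketch:

```python
import numpy as np

def classifier_uncertainty(y_hat, optimal):
    """Distance of the unthresholded prediction from the optimal target.

    y_hat   : predicted probability of a left response on each trial
    optimal : 1 if left is the optimal (high-reward) target, else 0
    Uncertainty grows as co-activation of the two action
    representations pulls y_hat away from the optimal target.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    optimal = np.asarray(optimal, dtype=float)
    return np.abs(y_hat - optimal)

# Worked example from the text: left is optimal and y_hat = 0.8,
# so the uncertainty is 0.2.
u = classifier_uncertainty([0.8, 0.55, 0.95], [1, 1, 1])
print(u)
```

Trials where the prediction sits near 0.5, i.e., where the two action representations are least separable, receive the highest uncertainty.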
If the competition in action channels also drives v, then there should be a negative correlation between the classifier's uncertainty and v, particularly around a change point. Indeed, this is exactly what we see (Fig. 2D). In fact, uncertainty and v are consistently negatively correlated across all trials in every simulated participant and in aggregate (Fig. 2E). Thus, in our model of the CBGT pathways, competition between action representations drives changes in v in response to environmental change.

Humans adapt decision policies in response to change
To test the predictions of our model, a sample of humans (N = 4) played a dynamic two-armed bandit task, under experimental conditions similar to those used for the simulated CBGT network and prior behavioral work (1), while whole-brain hemophysiological signals were recorded using functional magnetic resonance imaging (fMRI) (Fig. 3 - Fig. Supp. 1). On each trial, participants were presented with a male and a female Greeble (45). The goal was to select the Greeble most likely to give a reward. Selections were made by pressing a button with the left or right hand to indicate the left or right Greeble on the screen.
Crucially, we designed this experiment such that each participant acted as an out-of-set replication test, having individually performed thousands of trials. Specifically, to ensure we had the statistical power to detect effects on a participant-by-participant basis, we collected an extensive data set comprising 2700 trials over 45 runs from nine separate imaging sessions for each of four participants. In total, we amassed 36 hours of imaging data across all participants, which allowed us to evaluate the replicability of our findings, and to estimate effects, at the single-participant level.
Behaviorally, human performance on the task replicated our prior work (1). Both response speed and accuracy changed across conditions in a way that matched what we observed in Experiment 2 of (1). Specifically, we see a consistent effect of change point on both RT and accuracy that matches the behavior of our network (Fig. 3 - Fig. Supp. 2). To address how a change in the environment shifted the underlying decision dynamics, we used a hierarchical DDM modeling approach (42), as we did with the network behavior (see Methods for details). Given previous empirical work (1) and the results from our CBGT network model showing that only v and, less reliably, a respond to a shift in the environment, we focused our subsequent analysis on these two parameters. We compared models where single parameters changed in response to a switch, pairwise models where both parameters changed, and a null model that predicts no change in decision policy (Supp. File 1 and Supp. File 2). Consistent with the predictions from our CBGT model, we found equivocal fits for the model mapping both ∆B to v and Ω to a and the simpler model mapping only ∆B to v (see Supp. File 1 for average results). This pattern was fairly consistent at the participant level, with 3/4 participants showing ∆B modulating v (Supp. File 2). These results suggest that as the belief in the value of the optimal choice approaches the reward value of the optimal choice, the rate of evidence accumulation increases.
Taken altogether, we confirm that humans rapidly shift how quickly they accumulate evidence (and, to some degree, how much evidence they need to make a decision) in response to a change in action-outcome contingencies. This mirrors the decision parameter dynamics predicted by the CBGT model. We next evaluated how this change in decision policy tracks with competition in neural action representations.

Measuring action representations in the brain
To measure competition in action representations, we first needed to determine how individual regions (i.e., voxels) contribute to single decisions. For each participant, trial-wise responses at every voxel were estimated by means of a general linear model (GLM), with each trial modeled as a separate condition in the design matrix. Therefore, the β_{t,v} estimated at voxel v reflected the magnitude of the evoked response on trial t. As in the CBGT model analysis, these whole-brain, single-trial responses were then submitted to a LASSO-PCR classifier to predict left/right response choices (Fig. 3 - Fig. Supp. 3). The performance of the classifier for each participant was evaluated with a 45-fold cross-validation, iterating through all runs so that each one served as the hold-out test set for one fold. Our classifier was able to predict single-trial responses well above chance for each of the four participants (Fig. 3A and B), with mean prediction accuracy ranging from 65% to 83% (AUCs from 0.72 to 0.92). Thus, as with the CBGT network model, we were able to reliably predict trial-wise responses for each participant. Figure 3C shows the average encoding map for our model as an illustration of the influence of each voxel on our model predictions (Fig. 3 - Fig. Supp. 4 displays individual participant maps). These maps effectively show voxel tuning toward rightward (blue) or leftward (red) responses. Qualitatively, we see that cortex, striatum, and thalamus all exhibit strongly lateralized influences on contralateral response prediction.
Indeed, when we average the encoding weights within the principal CBGT nuclei (Fig. 3D), we confirm that these three regions largely predict contralateral responses. See Figure 3 - Fig. Supp. 5 for a more detailed summary of the encoding weights across multiple cortical and subcortical regions.
These results show that we can reliably predict single-trial choices from whole-brain hemodynamic responses for individual participants. Further, key regions of the CBGT pathway contribute to these predictions. Next, we set out to determine whether competition between these representations for left and right actions correlates with changes in the drift rate, as predicted by the CBGT network model (Fig. 2C).

Competition between action representations may drive drift-rate
To evaluate whether competition between action channels correlates with the magnitude of v on each trial, as the CBGT network predicts (Fig. 2C), we focused our analysis on trials surrounding the change point, following analytical methods identical to those described in the previous section and shown in Fig. 2C.
Consistent with the CBGT network model predictions, following a change point, v shows a stereotyped drop and recovery, as observed in the CBGT network (Fig. 2C) and prior behavioral work (1) (Fig. 4A). This drop in v tracked with a relative increase in classifier uncertainty, and subsequent recovery, in response to a change in action-outcome contingencies (mean bootstrapped β: −0.021 to −0.001; t range: −3.996 to −1.326; p_S1 = 0.057, p_S2 < 0.001, p_S3 < 0.001, p_S4 = 0.080, p_All < 0.001). As with the CBGT network simulations (Fig. 2D), we also observe a consistent negative correlation between v and classifier uncertainty over all trials, irrespective of their position relative to a change point, in each participant and in aggregate (Fig. 4B). These results suggest that, as predicted by our CBGT network simulations and prior work (12, 17, 21), competition between action representations drives changes in the rate of evidence accumulation during decision making in humans.

Discussion
We investigated the underlying mechanisms that drive shifts in decision policies when the rules of the environment change. We first tested an implementation-level theory of how CBGT networks contribute to changes in decision policy parameters. This theory predicted that the rate of evidence accumulation is driven by competition across action representations. Using a high-powered, within-participants fMRI design, where each participant served as an independent replication test, we found evidence consistent with our CBGT network simulations. Specifically, as action-outcome contingencies change, thereby increasing uncertainty about the optimal choice, decision policies shift with a rapid decrease in the rate of evidence accumulation, followed by a gradual recovery to baseline rates as the new contingencies are learned (see also (1)).
These results empirically validate prior theoretical and computational work predicting that competition between neural populations encoding distinct actions modulates how information is used to drive a decision (9,12,14,19,20).
Our findings align with prior work on the role of competition in the regulation of evidence accumulation. In the decision-making context, the ratio of dSPN to iSPN activation within an action channel has been linked to the drift rate of single-action decisions (14-16, 37). In the motor control context, this competition manifests as movement vigor (46-48). Yet our results show how competition across channels drives drift-rate dynamics. How do we reconcile these two effects? Mechanistically, the strength of each action channel is defined by the relative difference between dSPN and iSPN influence. In this way, competition across action channels is defined by the relative balance of direct and indirect pathway activation within each channel. Greater direct versus indirect pathway competition in one action channel, relative to another, makes that action decision relatively slow and reduces the overall likelihood that the action is selected. This mechanism is consistent with prior theoretical (12, 21) and empirical (18) work.
While our current work postulates a mechanism by which changes in action-outcome contingencies drive changes in evidence accumulation through plasticity within the CBGT circuits, the results presented here are far from conclusive. For example, our model of the underlying neural dynamics predicts that the certainty of individual action representations is encoded by the competition between the direct and indirect pathways (see also (12, 21, 38)). Thus, external perturbation of dSPN (or iSPN) firing during decision making, say with optogenetic methods, should causally impact the evidence accumulation rate and, subsequently, speed up (or slow down) the rate at which new action-outcome contingencies are learned. Indeed, there is already some evidence for this outcome (see (18), but also (49) for contrasting evidence).
Our model, however, makes very specific predictions with regard to disruptions of each pathway within an action representation. Disrupting the balance of dSPN and iSPN efficacy should selectively impact the drift rate (and, to a degree, the onset bias; see (21)), while non-specific disruption of global iSPN efficacy across action representations should selectively disrupt the boundary height (and, to a degree, the accumulation onset time; see again (21)).
Careful attention to the effect size of our correlations between channel competition and drift rate shows that the effect is substantially smaller in humans than in the model. This is not surprising and is due to several factors. First, the simulated data are not affected by the same sources of noise as the hemodynamic signal, whose responses can be greatly influenced by factors such as the heterogeneity of cell populations and the properties of the underlying neurovascular coupling. Additionally, our model is not susceptible to non-task-related variance, such as fatigue or lapses of attention, which the humans likely experienced. We could have fine-tuned the model results based on the empirical human data, but that would have contaminated the independence of our predictions. Finally, our simulations used only a single experimental condition, whereas the human experiments varied the relative value of options and the volatility of contingencies, leading to more variance in human responses. Yet, despite these differences, we see qualitative similarities in both the model and human results, providing confirmation of a key aspect of our theory.
Looking at the overall pattern of results, we predict that increasing the difference between dSPN and iSPN firing in the channel representing the new optimal action, say by selective excitation of the relevant dSPNs, should speed the resolution of the credit assignment problem during learning. This would result in faster and more accurate learning following an environmental change and would produce characteristic signatures in the distributions of reaction times and choice probabilities, reflective of a shift in evidence accumulation rate. Of course, testing these predictions is left to future work.

Conclusion
As the world changes and certain actions become less optimal, successful behavioral adaptation requires flexibly changing how sensory evidence drives decisions. Our simulations and hemophysiological experiments in humans show how this process can occur within the CBGT circuits. Here, a shift in action-outcome contingencies induces competition between encoded action plans by modifying the relative balance of direct and indirect pathway activity in CBGT circuits, both within and between action channels, slowing the rate of evidence accumulation to promote adaptive exploration. If the environment subsequently remains stable, this learning process accelerates the rate of evidence accumulation for the optimal decision by increasing the strength of the action representation for the new optimal choice. This highlights how these macroscopic systems promote flexible, effective decision making under dynamic environmental conditions.

Supplementary Figures
Figure 1 - Figure Supplement 1. Each panel shows the firing rates for a specific CBGT nucleus starting 100 ms prior to a left decision. The decision threshold for the thalamus (30 spikes/second) is marked with a horizontal gray line. Note that the y axes have different limits for different nuclei due to differences in the scale of their firing rates.

Task design. The probability of reward for the statistically optimal target (conflict; y-axis) and the rate at which the optimal target shifted (volatility; x-axis) were manipulated according to this design, with each point representing a combination of the two variables. High conflict resulted in a smaller difference in the probability of reward between the optimal and suboptimal targets, while high volatility resulted in frequent switches in the optimal target.

Analysis pipeline. Imaging data underwent preprocessing, and then single-trial hemodynamic responses were estimated. Cross-validated prediction model: we reduced the dimensionality of the trial estimate matrix, X, using singular value decomposition, then conducted logistic regression with a sparsity penalty (L1 norm). Outputs: for the imaging data, we predicted left or right responses, coded here as 0 or 1, and calculated classifier uncertainty from the unthresholded response prediction. The distance of this predicted response from the optimal choice represents the classifier uncertainty for each trial. Here, the predicted probability of a left response ŷ_t1 is 0.8; the distance from the optimal choice on this trial, and thereby the classifier uncertainty, is 0.2. The joint distribution of reaction times and accuracies was also fitted to estimate latent decision parameters using the drift diffusion model (DDM).

Simulations
We simulated neural dynamics and behavior using a biologically based, spiking cortico-basal ganglia-thalamic (CBGT) network model (11, 21). The network representing the CBGT circuit is composed of nine neural populations: cortical interneurons (CxI), excitatory cortical neurons (Cx), striatal D1/D2 spiny projection neurons (dSPNs/iSPNs), striatal fast-spiking interneurons (FSI), the internal (GPi) and external (GPe) segments of the globus pallidus, the subthalamic nucleus (STN), and the thalamus (Th). All of the neuronal populations are segregated into two action channels, with the exception of the cortical (CxI) and striatal (FSI) interneurons. Each neuron was modeled with an integrate-and-fire-or-burst model (50), and a conductance-based synapse model was used for NMDA, AMPA, and GABA receptors. The neuronal and network parameters (inter-nuclei connectivity and synaptic strengths) were tuned to obtain realistic baseline firing rates for all nuclei. The details of the model are described in our previous work (21), as well as in the appendix for the sake of completeness.
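As a simplified illustration of the conductance-based formalism, here is one Euler step of a leaky integrate-and-fire neuron driven by AMPA-like and GABA-like conductances. The parameter values are illustrative assumptions, and the leaky integrate-and-fire form is a deliberate simplification of the integrate-and-fire-or-burst model actually used in the network.

```python
import numpy as np

def lif_conductance_step(v, g_ampa, g_gaba, dt=0.1):
    """One Euler step of a leaky integrate-and-fire neuron with
    conductance-based excitatory (AMPA-like) and inhibitory
    (GABA-like) synapses. v is in mV, conductances in nS, dt in ms.
    Returns (new_voltage, spiked)."""
    C = 200.0                              # membrane capacitance (pF)
    g_leak = 10.0                          # leak conductance (nS)
    e_leak, e_ampa, e_gaba = -70.0, 0.0, -80.0
    v_thresh, v_reset = -50.0, -60.0

    # Each conductance drives the membrane toward its reversal potential.
    i_total = (g_leak * (e_leak - v)
               + g_ampa * (e_ampa - v)
               + g_gaba * (e_gaba - v))
    v_new = v + dt * i_total / C
    spiked = v_new >= v_thresh
    return (v_reset if spiked else v_new), spiked

# Strong excitatory drive makes the neuron fire repeatedly.
v, n_spikes = -70.0, 0
for _ in range(2000):                      # 200 ms of simulated time
    v, spiked = lif_conductance_step(v, g_ampa=15.0, g_gaba=2.0)
    n_spikes += int(spiked)
print(n_spikes > 0)  # True
```

The full model adds a burst mechanism and an NMDA conductance with voltage-dependent gating, but the driving-force structure of the currents is the same.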
Corticostriatal weights for D1 and D2 neurons in the striatum were modulated by phasic dopamine to model the influence of reinforcement learning on network dynamics. The details of the STDP learning rule are described in previous work (38), and key details are given below. As a result of these features, the CBGT network was capable of learning under realistic experimental paradigms with probabilistic reinforcement schemes (i.e., under probabilistic rewards and unstable action-outcome values).

Threshold for CBGT network decisions
A decision between the two competing actions ("left" and "right") was considered to be made when either of the thalamic subpopulations reached a threshold of 30 Hz. This threshold was set based on the network dynamics for the chosen parameters, with the aim of obtaining realistic reaction times. The maximum time allowed to reach a decision was 1000 ms. If neither of the thalamic subpopulations reached the threshold of 30 Hz, no action was considered to be taken, and such trials were dropped from further analysis. Reaction times were calculated as the time from stimulus onset to decision (either subpopulation reaching the threshold). "Slow" and "fast" trials were defined as reaction times ≥ the 75th percentile (314.5 ms) and < the 50th percentile (196.0 ms), respectively, of the reaction time distribution. The firing rates of the CBGT nuclei during the reaction time were used for the prediction analysis, as discussed in our description of single-trial response estimation below.
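The decision rule can be sketched directly: scan the two thalamic rate traces and report the first channel to reach 30 Hz, or no decision if the 1000 ms deadline passes. The traces below are toy ramps, not network output.

```python
import numpy as np

def decide_from_thalamus(rates_left, rates_right, dt_ms=1.0,
                         threshold_hz=30.0, max_t_ms=1000.0):
    """Return (choice, rt_ms) from two thalamic firing-rate traces.

    The decision is the first channel whose rate reaches threshold;
    trials with no crossing within max_t_ms yield (None, None) and
    would be dropped from further analysis.
    """
    rates = np.vstack([rates_left, rates_right])
    n = min(rates.shape[1], int(max_t_ms / dt_ms))
    for i in range(n):
        crossed = np.where(rates[:, i] >= threshold_hz)[0]
        if crossed.size:
            return ("left", "right")[crossed[0]], (i + 1) * dt_ms
    return None, None

# Toy traces: the left channel ramps to threshold, the right does not.
t = np.arange(400)
left = 5 + 0.1 * t                  # reaches 30 Hz at t = 250 ms
right = np.full(400, 8.0)
choice, rt = decide_from_thalamus(left, right)
print(choice, rt)  # left 251.0
```

Reaction time is simply the index of the first threshold crossing converted to milliseconds from stimulus onset.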

Corticostriatal weight plasticity
The corticostriatal weights are modified by a dopamine-mediated STDP rule, where phasic dopamine is modulated by the reward prediction error. An internal estimate of the reward is calculated on every trial by a Q-learning algorithm and is subtracted from the reward delivered by the experimental paradigm to yield a trial-by-trial estimate of the reward prediction error. The effect of dopaminergic release is receptor dependent: a rise in dopamine promotes potentiation for dSPNs and depression for iSPNs. The degree of change in the weights depends on an eligibility trace, which is proportional to the coincident pre-synaptic (cortical) and post-synaptic (striatal) firing rates. The STDP rule is described in detail in (38), as well as in the appendix.
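A minimal sketch of this learning scheme follows, with the eligibility trace and learning rates as free illustrative parameters. The full STDP rule in (38) operates on spike timing; this trial-level abstraction only captures the sign structure of the dopamine-gated updates.

```python
import numpy as np

def update_corticostriatal_weights(q, reward, w_dspn, w_ispn, elig,
                                   alpha_q=0.1, lr=0.05):
    """One trial of the dopamine-gated plasticity sketch.

    q       : current Q-value estimate for the chosen action
    reward  : reward delivered on this trial (0 or 1)
    w_dspn / w_ispn : corticostriatal weights onto the chosen
                      channel's direct / indirect SPNs
    elig    : eligibility trace (proportional to coincident cortical
              and striatal firing), in [0, 1]
    A positive prediction error (phasic dopamine burst) potentiates
    dSPN synapses and depresses iSPN synapses; a negative error
    (dopamine dip) does the opposite.
    """
    rpe = reward - q                        # reward prediction error
    q_new = q + alpha_q * rpe               # Q-learning value update
    w_dspn_new = w_dspn + lr * rpe * elig   # D1: potentiate on bursts
    w_ispn_new = w_ispn - lr * rpe * elig   # D2: depress on bursts
    return q_new, w_dspn_new, w_ispn_new

q, wd, wi = 0.5, 1.0, 1.0
for _ in range(20):                         # repeated rewards for this action
    q, wd, wi = update_corticostriatal_weights(q, 1.0, wd, wi, elig=0.8)
print(wd > wi)  # True: the direct pathway gains the upper hand
```

Repeated rewards shift the within-channel dSPN/iSPN balance toward the direct pathway, which is exactly the quantity the competition analyses track.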

In silico experimental design
We follow the paradigm of a 2-arm bandit task, where the CBGT network learns to consistently choose the rewarded action until the block changes (i.e., the reward contingencies switch), at which point the CBGT network re-learns the rewarded action (reversal learning). Each session consists of 40 trials, with a block change every 10 trials. The reward probabilities represent a conflict of (75%, 25%); that is, in a left block, 75% of left actions are rewarded, whereas 25% of right actions are rewarded. The inter-trial interval in network time is fixed at 600 ms.

Each learning trial in the human experiment presented a male and a female Greeble (45), with the goal of selecting the gender identity of the Greeble that was most rewarding. Because individual Greeble identities were resampled on each trial, the task of the participant was to choose the most rewarding gender identity, rather than the individual identity, of the Greeble.
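The in silico reward schedule described above (40-trial sessions, deterministic block switches every 10 trials, 75%/25% contingencies) can be generated as follows; the function and variable names are ours, for illustration.

```python
import numpy as np

def bandit_session(n_trials=40, block_len=10, p_opt=0.75, seed=0):
    """Reward schedule for the in silico 2-arm bandit.

    Returns the optimal action per trial (0 = left, 1 = right) and an
    (n_trials, 2) array of precomputed reward outcomes, where the
    optimal action pays off with probability p_opt and the other with
    probability 1 - p_opt. The optimal side switches deterministically
    every block_len trials, as in the network simulations.
    """
    rng = np.random.default_rng(seed)
    optimal = (np.arange(n_trials) // block_len) % 2
    rewards = np.empty((n_trials, 2), dtype=int)
    for t, opt in enumerate(optimal):
        p = np.array([1 - p_opt, 1 - p_opt])
        p[opt] = p_opt
        rewards[t] = rng.random(2) < p
    return optimal, rewards

optimal, rewards = bandit_session()
print(optimal[:12])  # [0 0 0 0 0 0 0 0 0 0 1 1]: a switch after trial 10
```

Precomputing both arms' outcomes per trial keeps the schedule independent of the agent's choices, so the same session can be replayed across network instances.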
Probabilistic reward feedback was given in the form of points drawn from the normal distribution N(µ = 3, σ = 1) and converted to an integer. These points were displayed at the center of the screen. For each run, participants began with 60 points and lost one point for each incorrect decision. To promote incentive compatibility (51, 52), participants earned a cent for every point earned. Reaction time was constrained such that participants were required to respond between 0.1 s and 0.75 s after stimulus presentation. If participants responded in ≤ 0.1 s or ≥ 0.75 s, or failed to respond altogether, the point total turned red and decreased by 5 points. Each trial lasted 1.5 s, and reward feedback for a given trial was displayed from the time of the participant's response to the end of the trial. To manipulate change point probability, the gender identity of the most rewarding Greeble was switched probabilistically, with a change occurring every 10, 20, or 30 trials, on average. To manipulate the belief in the value of the optimal target, the probability of reward for the optimal target, P, was set to 0.65, 0.75, or 0.85. Each session combined one value of P with one level of volatility, such that all combinations of change point frequency and reward probability were imposed across the nine sessions. Finally, the position of the high-value target was pseudo-randomized on each trial to prevent prepotent response selections on the basis of location.

Behavioral analysis
Statistical analyses and data visualization were conducted using custom scripts written in R (R Foundation for Statistical Computing, version 3.4.3) and Python (Python Software Foundation, version 3.5.5). Scripts are publicly available (53).
Binary accuracy data were submitted to a mixed effects logistic regression analysis with either the degree of conflict (the probability of reward for the optimal target) or the degree of volatility (mean change point frequency) as predictors. The resulting log-likelihood estimates were transformed to likelihoods for interpretability. RT data were log-transformed and submitted to a mixed effects linear regression analysis with the same predictors as in the previous analysis.
To determine whether participants used ideal observer estimates to update their behavior, two further mixed effects regression analyses were performed. Estimates of change point probability and the belief in the value of the optimal target served as predictors of reaction time and accuracy across groups. As before, we used a mixed logistic regression for accuracy data and a mixed linear regression for reaction time data.

Estimating evidence accumulation using drift diffusion modeling
To assess whether and how much the ideal observer estimates of change point probability (Ω) and the belief in the value of the optimal target (∆B) (1,4) updated the rate of evidence accumulation (v), we regressed the change-point-evoked ideal observer estimates onto the decision parameters using hierarchical drift diffusion model (HDDM) regression (54). These ideal observer estimates of environmental uncertainty served as a more direct and continuous measure of the uncertainty we sought to induce with our experimental manipulations. Using this more direct approach, we pooled change point probability and belief across all conditions and used these values as our predictors of drift rate and boundary height. Responses were accuracy-coded, and the belief in the difference between target values was transformed to the belief in the value of the optimal target: ∆B_optimal(t) = B_optimal(t) − B_suboptimal(t). This approach allowed us to estimate trial-by-trial covariation between the ideal observer estimates and the decision parameters.
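As a worked example of the belief transformation, with illustrative belief values (the real trajectories come from the ideal observer model of (1,4), not from hard-coded numbers):

```python
# Hypothetical belief trajectories over four trials (illustration only).
B_optimal    = [0.50, 0.62, 0.71, 0.78]
B_suboptimal = [0.50, 0.38, 0.29, 0.22]

# Trial-wise belief in the value of the optimal target:
# delta_B(t) = B_optimal(t) - B_suboptimal(t)
delta_B = [bo - bs for bo, bs in zip(B_optimal, B_suboptimal)]

# delta_B and change point probability would then enter the HDDM regression
# as trial-wise predictors of drift rate (model specification not shown).
```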
To find the models that best fit the observed data, we conducted a model selection process using Deviance Information Criterion (DIC) scores. A lower DIC score indicates a model that loses less information. Here, a difference of ≤ 2 points from the lowest-scoring model cannot rule out the higher-scoring model; a difference of 3 to 7 points suggests that the higher-scoring model has considerably less support; and a difference of ≥ 10 points suggests essentially no support for the higher-scoring model (43,55). We evaluated the DIC scores for the set of fitted models relative to an intercept-only regression model (DIC_intercept − DIC_model_i).
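These DIC heuristics can be expressed as a small helper. The 8-9 point range is not classified by the cited thresholds, so the "weak support" label for it is our addition:

```python
def dic_support(dic_model, dic_best):
    """Interpret a model's DIC relative to the lowest-scoring model,
    using the heuristic thresholds of (43,55)."""
    d = dic_model - dic_best
    if d <= 2:
        return "cannot be ruled out"
    if d <= 7:
        return "considerably less support"
    if d >= 10:
        return "essentially no support"
    return "weak support"  # 8-9 points: unclassified by the cited heuristic
```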

MRI Data Acquisition
Neurologically healthy human participants (N = 4, 2 female) were recruited. Each participant was tested in ten separate imaging sessions using a 3T Siemens Prisma scanner. Session 1 included a set of anatomical and functional localizer sequences (e.g., visual presentation of Greeble stimuli with no manual responses, and left vs. right button responses to identify motor networks). Sessions 2-10 each collected five functional runs of the dynamic 2-armed bandit task (60 trials per run). Male and female Greebles served as the visual stimuli for the selection targets (45), with each presented on one side of a central fixation cross. Participants were trained to respond within 1.5 seconds.
To minimize convolution of the hemodynamic response from trial to trial, inter-trial intervals were sampled according to a truncated exponential distribution with a minimum of 4 s between trials, a maximum of 16 s, and a rate parameter of 2.8 s. To ensure that head position was stable within and across sessions, a CaseForge head case was customized and printed for each participant. The task-evoked hemodynamic response was measured using a high spatial (2 mm³ voxels) and high temporal (750 ms TR) resolution echo planar imaging approach. This design maximized recovery of single-trial evoked BOLD responses in subcortical areas, as well as in cortical areas with higher signal-to-noise ratios. During each functional run, eye-tracking (EyeLink, SR Research Inc.) and physiological signals (ECG, respiration, and pulse oximetry via the Siemens PMU system) were also collected, for tracking attention and for artifact removal.
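One plausible reading of this sampling scheme, sketched below: the 2.8 s rate parameter is treated as the exponential's scale (mean), shifted by the 4 s minimum and rejection-sampled against the 16 s maximum. The exact parameterization is an assumption, since the text does not specify it:

```python
import random

def sample_iti(scale_s=2.8, lo=4.0, hi=16.0, rng=None):
    """Sample an inter-trial interval (seconds) from a truncated exponential.
    Parameterization (scale = mean, shift by `lo`) is assumed, not stated."""
    rng = rng or random.Random(0)
    while True:
        x = lo + rng.expovariate(1.0 / scale_s)
        if x <= hi:  # reject samples beyond the 16 s maximum
            return x
```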
Preprocessing
fMRI data were preprocessed using the default fMRIPrep pipeline (56), a standard toolbox for fMRI data preprocessing that is robust to variations in scan acquisition protocols and requires minimal user intervention.

Single-trial response estimation
We used a univariate general linear model (GLM) to estimate within-participant trial-wise responses at the voxel level. Specifically, for each fMRI run, preprocessed BOLD time series were regressed onto a design matrix in which each task trial corresponded to a separate column, modeled using a boxcar function convolved with the default hemodynamic response function given in SPM12. Thus, each column in the design matrix estimated the average BOLD activity within each trial. To account for head motion, the six realignment parameters (3 rotations, 3 translations) were included as covariates. In addition, a high-pass filter (128 s) was applied to remove low-frequency artifacts. Parameter and error variances were estimated using the RobustWLS toolbox, which adjusts for further artifacts in the data by inversely weighting each observation according to its spatial noise (57).
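A single boxcar column of such a design matrix, before HRF convolution, might be built like this (all timing values are illustrative, not the study's):

```python
def boxcar_regressor(n_scans, tr, onset, duration):
    """One pre-convolution design-matrix column: 1 while the trial is
    on screen, 0 elsewhere. Times in seconds; `tr` is the scan interval."""
    return [1.0 if onset <= i * tr < onset + duration else 0.0
            for i in range(n_scans)]
```

In the analysis above, each such column would then be convolved with SPM12's default hemodynamic response function before regression.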
Finally, estimated trial-wise responses were concatenated across runs and sessions and then stacked across voxels to give a matrix, β_{t,v}, of T (trial estimates) × V (voxels) for each participant.

Single-trial response prediction
A machine learning approach was applied to predict left/right Greeble choices from the trial-wise responses. First, using the trial-wise hemodynamic responses, we estimated the contrast in neural activation when the participant made a left versus a right selection. A LASSO-PCR classifier (i.e., an L1-constrained principal component logistic regression) was estimated for each participant according to the procedure below. We note that the choice of LASSO-PCR was based on prior work building reliable classifiers from whole-brain evoked responses that maximize inferential utility (see (44)). This approach is suited to over-parameterized problems, as when there are more voxels than observations, and relies on a combination of dimensionality reduction and sparsity constraints to find the effective complexity of a model. First, a singular value decomposition (SVD) was applied to the input matrix X:

X = U S Vᵀ,

where the product matrix Z = US represents the principal component scores, i.e., the values of X projected into the principal component space, and Vᵀ is an orthogonal matrix whose rows are the principal directions in feature space. Then the binary response variable y (left/right choice) was regressed onto Z, with the estimation of the β coefficients subject to an L1 penalty term C in the objective function:

min_β Σ_{i=1..N} log(1 + exp(−y_i βᵀz_i)) + C ||β||₁,

where β and Z include the intercept term, y_i ∈ {−1, 1}, and N is the number of observations. The estimated β coefficients were projected back to the original feature (voxel) space to yield a weight map ŵ = Vβ, which in turn was used to generate final predictions ŷ:

ŷ = ŵᵀx,

where x denotes the vector of voxel-wise responses for a given trial (i.e., a given row of the X matrix). When visualizing the resulting weight maps, these were further transformed to encoded brain patterns. This step aids correct interpretation in terms of the studied brain process, because interpreting the observed weights of multivariate classification (and regression) models directly can be problematic (58).
Here, competition between left and right neural responses decreases classifier decoding accuracy, as the neural activation associated with these actions becomes less separable. Classifier prediction therefore serves as a proxy for response competition. To quantify this uncertainty, we calculated the Euclidean distance of the decoded responses ŷ from the statistically optimal choice on a given trial, opt_choice. This yielded a trial-wise uncertainty metric derived from the decoded competition between neural responses.
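A minimal sketch of this distance-based uncertainty metric; for scalar decoded outputs the Euclidean distance reduces to an absolute difference:

```python
import math

def classifier_uncertainty(y_hat, opt_choice):
    """Trial-wise uncertainty: Euclidean distance between the decoded
    response and the statistically optimal choice on that trial."""
    return math.dist([y_hat], [opt_choice])
```

A decoded response sitting far from the optimal choice (e.g., near the competing action) yields high uncertainty; a confident, correct decoding yields a value near zero.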
The same analytical pipeline was used to calculate single-trial responses for the simulated data, with the difference that trial-wise average firing rates of all nuclei from the simulations were used in place of fMRI hemodynamic responses.

Neuron model
We used an integrate-and-fire-or-burst model, in which the membrane potential V(t) evolves as

C_m dV(t)/dt = −g_L(V(t) − V_L) − g_T h(t) H(V(t) − V_h)(V(t) − V_T) − I_syn(t) + I_ext,

where C_m is the membrane capacitance, g_L represents the leak conductance, V_L is the leak reversal potential, and the first term g_L(V(t) − V_L) is the leak current; the next term is a low-threshold Ca²⁺ current with maximum conductance g_T, gating variable h(t), and reversal potential V_T, which activates when V(t) > V_h due to the Heaviside function H; I_syn is the synaptic current; and I_ext is the external current. This neuron model can produce post-inhibitory bursts, regulated by the gating variable h(t), which decays with time constant τ_h⁻ when the membrane potential exceeds the threshold V_h and rises with time constant τ_h⁺ otherwise. When g_T is set to zero, the model reduces to a leaky integrate-and-fire neuron. We model the GPe and STN neuronal populations with bursty neurons and the remaining neuronal populations with leaky integrate-and-fire neurons, all with conductance-based synapses.
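A forward-Euler sketch of this membrane equation. All parameter values here are placeholders for illustration; the paper's actual values are listed in Supplementary File 3:

```python
def ifb_step(V, h, dt, I_syn=0.0, I_ext=0.0,
             C_m=0.5, g_L=0.025, V_L=-70.0, g_T=0.06,
             V_T=120.0, V_h=-60.0, tau_h_minus=20.0, tau_h_plus=100.0):
    """One Euler step of the integrate-and-fire-or-burst membrane dynamics.
    Parameter values are illustrative placeholders."""
    H = 1.0 if V > V_h else 0.0   # Heaviside gate on the Ca2+ current
    dV = (-g_L * (V - V_L) - g_T * h * H * (V - V_T) - I_syn + I_ext) / C_m
    # gating variable: decays above V_h, recovers below it
    dh = -h / tau_h_minus if V > V_h else (1.0 - h) / tau_h_plus
    return V + dt * dV, h + dt * dh
```

Setting g_T = 0 removes the Ca²⁺ term, recovering the leaky integrate-and-fire update used for the non-bursty populations.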
The synaptic current I_syn(t) consists of three components, two excitatory currents corresponding to AMPA and NMDA receptors and one inhibitory current corresponding to GABA receptors, and is calculated as

I_syn(t) = g_AMPA s_AMPA(t)(V(t) − V_E) + g_NMDA s_NMDA(t)(V(t) − V_E) / (1 + e^(−0.062 V(t))/3.57) + g_GABA s_GABA(t)(V(t) − V_I),

where g_i represents the maximum conductance corresponding to receptor i ∈ {AMPA, NMDA, GABA}, V_E and V_I represent the excitatory and inhibitory reversal potentials, and s_i represents the gating variable for each current, with dynamics given by

ds_AMPA/dt = −s_AMPA/τ_AMPA + Σ_j δ(t − t_j),
ds_GABA/dt = −s_GABA/τ_GABA + Σ_j δ(t − t_j),
ds_NMDA/dt = −s_NMDA/τ_NMDA + α(1 − s_NMDA) Σ_j δ(t − t_j),

where the sums run over incoming spike times t_j and α scales the spike impact on the NMDA gate. The gating variables for AMPA and GABA act as leaky integrators that are incremented by all incoming spikes, while the additional (1 − s_NMDA) factor for NMDA ensures that the maximum value of s_NMDA remains below 1.
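The gating dynamics can be sketched with a simple Euler update. Time constants and the NMDA saturation factor α below are placeholders, not the paper's values (those are in Supplementary File 5):

```python
def gating_step(s_ampa, s_nmda, s_gaba, dt, exc_spike, inh_spike,
                tau_ampa=2.0, tau_nmda=100.0, tau_gaba=5.0, alpha=0.63):
    """One Euler step for the synaptic gating variables.
    exc_spike/inh_spike are 1 on steps with an incoming spike, else 0.
    Parameter values are illustrative placeholders."""
    s_ampa += -dt * s_ampa / tau_ampa + exc_spike
    s_gaba += -dt * s_gaba / tau_gaba + inh_spike
    # the (1 - s_nmda) factor caps s_nmda below 1, however many spikes arrive
    s_nmda += -dt * s_nmda / tau_nmda + alpha * (1.0 - s_nmda) * exc_spike
    return s_ampa, s_nmda, s_gaba
```

Note that s_AMPA and s_GABA grow without an explicit cap (leaky integration of every spike), whereas s_NMDA saturates below 1 as described above.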
The values of neuronal parameters for all of the nuclei are listed in Supplementary File 3, external inputs to the CBGT nuclei are listed in Supplementary File 4, synaptic parameter values are listed in Supplementary File 5, connectivity types and probabilities are listed in Supplementary File 6, and the number of neurons in each CBGT population is shown in Supplementary File 7.

Spike timing dependent plasticity rule
The plasticity rule we use is a dopamine-modulated STDP rule, also described in (38). All values of the relevant parameters are listed in Supplementary File 8. The weight update of a corticostriatal synapse is controlled by three factors: (1) an eligibility trace, (2) the type of the striatal neuron (dSPN/iSPN), and (3) the level of dopamine.
To compute the eligibility (E) for a given synapse, an activity trace of each neuron in the pre-synaptic and post-synaptic populations is tracked via the equations

τ_PRE dA_PRE/dt = ∆_PRE X_PRE(t) − A_PRE(t),
τ_POST dA_POST/dt = ∆_POST X_POST(t) − A_POST(t),

where X_PRE and X_POST are spike trains, such that A_PRE and A_POST maintain a filtered record of the synaptic spiking of the pre- and post-synaptic neuron, respectively, with spike impact parameters ∆_PRE, ∆_POST and time constants τ_PRE, τ_POST. If a post-synaptic spike follows the spiking activity of the pre-synaptic population closely enough in time, then the eligibility variable E increases and allows plasticity to occur. Conversely, if a pre-synaptic spike follows the spiking activity of the post-synaptic population, then E decreases. In the absence of activity and spikes, the eligibility trace decays to zero with time constant τ_E. Putting these effects together, we obtain the equation

τ_E dE/dt = X_POST(t) A_PRE(t) − X_PRE(t) A_POST(t) − E.
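A discrete-time sketch of the trace and eligibility dynamics, with spike-train deltas treated as per-step jumps. Time constants and spike impacts are placeholders; the paper's values are in Supplementary File 8:

```python
def stdp_step(A_pre, A_post, E, x_pre, x_post, dt,
              tau_pre=10.0, tau_post=10.0, tau_E=100.0,
              d_pre=1.0, d_post=1.0):
    """One Euler step of the activity traces and eligibility variable.
    x_pre/x_post are 1 on steps where that neuron spikes, else 0.
    Parameter values are illustrative placeholders."""
    A_pre  = A_pre  * (1 - dt / tau_pre)  + d_pre  * x_pre
    A_post = A_post * (1 - dt / tau_post) + d_post * x_post
    # post-after-pre pushes E up; pre-after-post pushes it down
    E = E * (1 - dt / tau_E) + (x_post * A_pre - x_pre * A_post) / tau_E
    return A_pre, A_post, E
```

Running a pre-then-post spike pair yields E > 0 (potentiation-eligible), while post-then-pre yields E < 0, matching the sign logic described above.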
Finally, the function f_X(K_DA) converts the level of dopamine, K_DA, into an impact on plasticity in a way that depends on the identity X of the post-synaptic neuron (dSPN or iSPN); the parameter c sets the dopamine level at which f_iSPN reaches its half-maximum. Supplementary File 8 lists the specific parameters used for the STDP rule.

Figures

Fig. 1. Biologically based CBGT network dynamics and behavior. A) Each CBGT nucleus is organized into left and right action channels, with the exception of a common population of striatal fast-spiking interneurons (FSIs) and cortical interneurons (CxI). Values show encoded weights for left and right action channels when a left action is made. Network schematic adapted from Figure 1 of Vich et al. 2022 (21). B) Firing rate profiles for dSPNs (left panel) and iSPNs (right panel) prior to stimulus onset (t = 0) for a left choice. SPN activity in left and right action channels is shown in red and blue, respectively. Slow and fast decisions are shown with dashed and solid lines, respectively. C) Choice probability for the CBGT network model. The reward for left and right actions changed every 10 trials, marked by vertical dashed lines. The horizontal dashed line represents chance performance.

Fig. 2. Competition between action plans should drive evidence accumulation. A) Decision parameters were estimated by modeling the joint distribution of reaction times and responses within a drift diffusion framework. B) Classification performance for single-trial left and right actions, shown as an ROC curve. The gray dashed line represents chance performance. C) 2. D) Change-point-evoked uncertainty (lavender) and drift rate (green). The change point is marked by a dashed line. E) Bootstrapped estimates of the association between uncertainty and drift rate. Results for individual participants are presented along with aggregated results.

Fig. 3. Single-trial prediction of action plan competition in humans. A) Overall classi-

Fig. 4. Competition between action plans drives evidence accumulation in humans. A) Classifier uncertainty (lavender) and estimated drift rate (v; green) dynamics. B) Bootstrapped estimate of the association between classifier uncertainty and drift rate by participant and in aggregate.

Figure 1 - Figure Supplement 1. Simulated CBGT nuclei firing rates for a left decision.

Figure 1 - Figure Supplement 2. Simulated and human behavior. Change-point-evoked reaction times are shown in red and accuracy, or the probability of selecting the optimally rewarding choice, is shown in green. Chance is marked as a green horizontal dashed line. The change point is marked by the vertical gray line. A) Simulated behavior. B) Human behavior.

Figure 3 - Figure Supplement 1. Experimental design. The probability of reward for the

Figure 3 - Figure Supplement 2. Analysis method. Inputs: behavioral responses, in the form of reaction time and accuracy, along with trial-by-trial hemodynamic responses, were collected as participants learned the task. In the case of the simulated CBGT network, this step involved simulating responses to experimental manipulations. Preprocessing: data underwent standard

Figure 3 - Figure Supplement 3. Encoding maps in standardized space for each participant. Rows represent individual participants. Columns refer to left and right views of the whole brain. Thalamus and striatum are shown beneath each cortical map. Values are z-scored.

Figure 3 - Figure Supplement 4. Encoding patterns by CBGT node. A) Simulated CBGT encoding weights. B) Human CBGT encoding weights for comparison with the simulated CBGT network results. Each point represents the average result for each participant. Bars represent participant-averaged data. C) The full set of human CBGT encoding weights for all captured nodes from whole-brain imaging. Gray error bars represent 95% CIs over participants. Left-hemisphere weights are marked in blue and right-hemisphere weights are marked in red.

Figure 4 - Figure Supplement 1. Competition between action plans and boundary height. A) Change-point-evoked classifier uncertainty (lavender) and estimated boundary height (red). Bootstrapped 95% CIs are shown. B) The association between classifier uncertainty and boundary height by participant and in aggregate.