Reward-driven changes in striatal pathway competition shape evidence evaluation in decision-making

Cortico-basal-ganglia-thalamic (CBGT) networks are critical for adaptive decision-making, yet how changes to circuit-level properties impact cognitive algorithms remains unclear. Here we explore how dopaminergic plasticity at corticostriatal synapses alters competition between striatal pathways, impacting the evidence accumulation process during decision-making. Spike-timing dependent plasticity simulations showed that dopaminergic feedback based on rewards modified the ratio of direct and indirect corticostriatal weights within opposing action channels. Using the learned weight ratios in a full spiking CBGT network model, we simulated neural dynamics and decision outcomes in a reward-driven decision task and fit them with a drift diffusion model. Fits revealed that the rate of evidence accumulation varied with inter-channel differences in direct pathway activity while boundary height varied with overall indirect pathway activity. This multi-level modeling approach demonstrates how complementary learning and decision computations can emerge from corticostriatal plasticity.


Introduction
The flexibility of mammalian behavior showcases the dynamic range over which neural circuits can be modified by experience and the robustness of the emergent cognitive algorithms that guide goal-directed actions. Decades of research in cognitive science have independently detailed the algorithms of decision-making (e.g., accumulation-to-bound models, [1]) and reinforcement learning (RL; [2,3]), providing foundational insights into the computational principles of adaptive decision-making. In parallel, research in neuroscience has shown how the selection of actions, and the use of feedback to modify selection processes, both rely on a common neural substrate: cortico-basal ganglia-thalamic (CBGT) circuits [4][5][6][7][8].
Understanding how the cognitive algorithms for adaptive decision-making emerge from the circuit-level dynamics of CBGT pathways requires a careful mapping across levels of analysis [9], from circuits to algorithm (see also [10,11]). Previous simulation studies have demonstrated how the specific circuit-level computations of CBGT pathways map onto subcomponents of the multiple sequential probability ratio test (MSPRT; [5,12]), a simple algorithm of information integration that selects single actions from a competing set of alternatives based on differences in input evidence [13,14]. Allowing a simplified form of RL to modify corticostriatal synaptic weights results in an adaptive variant of the MSPRT that approximates the optimal solution to the action selection process based on both sensory signals and feedback learning [15,16]. Previous attempts at multi-level modeling have largely adopted a "downwards mapping" approach, whereby the stepwise operations prescribed by computational or algorithmic models are intuitively mapped onto plausible neural substrates. Recently, Frank [17] proposed an alternative "upwards mapping" approach for bridging levels of analysis, where biologically detailed models are used to simulate behavior that can be fit to a particular cognitive algorithm. Rather than ascribing different neural components with explicit computational roles, this variant of multi-level modeling examines how cognitive mechanisms are influenced by changes in the functional dynamics or connectivity of those components. A key assumption of the upwards mapping approach is that variability in the configuration of CBGT pathways should drive systematic changes in specific sub-components of the decision process, expressed by the parameters of the drift diffusion model (DDM; [1]). 
Indeed, by fitting the DDM to synthetic choice and response time data generated by a rate-based CBGT network, Ratcliff and Frank [18] showed how variation in the height of the decision threshold tracked with changes in the strength of subthalamic nucleus (STN) activity. Thus, this example shows how simulations that map up the levels of analysis can be used to investigate the emergent changes in information processing that result from targeted modulation of the underlying neural circuitry.
Converging lines of evidence from human and non-human animal experiments have identified the striatum as playing a critical role in the process of accumulating evidence during decision-making [19], particularly in the context of value-based decision tasks [20]. In accumulation-to-bound models like the DDM, the drift rate parameter controls the speed at which evidence accumulates in favor of one choice over another (see [21]). It remains unclear how the dual-pathway organization of corticostriatal inputs contributes to the representation and comparison of evidence for conflicting actions. Recent theoretical models have proposed that decision evidence may be encoded as a dynamic competition between the direct and indirect pathways within a single CBGT action channel [7,[22][23][24]. In this scenario, the strength of evidence for a given action would be computed as the likelihood ratio of two hypotheses: "execute" (direct pathway) versus "suppress" (indirect pathway). Indeed, due to the opposing influence of dopamine (DA) on the sensitivity of direct and indirect MSNs to cortical input, DA-mediated RL could sculpt this competition to bias decisions towards the behaviorally optimal target [15]. In addition to this form of within-channel competition, competition between action channels may also contribute to the drift rate, such that greater differences between the activation of opposing direct pathways lead to faster evidence accumulation.
To see how the state of corticostriatal synaptic weights can impact decision process parameters, we adopted an upwards mapping approach to modeling adaptive choice behavior across neural and cognitive levels of analysis (Fig 1). Using a preliminary spike-timing dependent plasticity (STDP; [25,26]) simulation, we modeled how phasic DA feedback signals [27] can modulate the relative balance of corticostriatal synapses, thereby promoting or deterring action selection. The effects of learning on the synaptic weights were incorporated into a spiking model of the full CBGT network meant to capture the known physiological properties and connectivity patterns of the constituent neurons in these circuits [28]. The performance (i.e., accuracy and response times) of the CBGT simulations was then fit using a hierarchical DDM [29]. This progression from synapses to networks to behavior allows us to show how variability in striatal pathway competition can manifest behaviorally, by mapping how specific features of striatal activity that result from reward-driven changes in corticostriatal synaptic weights could underlie parameters of the fundamental cognitive algorithms of decision-making.

Tuning corticostriatal synaptic weights
Our goal was to fit a DDM to behavioral data from our model CBGT network under different corticostriatal synaptic weight schemes. The first step was to find parameter settings for these synaptic weights, which are known sites of plasticity associated with dopaminergic signals about reward outcomes [27], that correspond to what would be observed after learning different levels of reward conflict in a two-alternative forced choice task. In this case, reward conflict is represented by the similarity in the reward probabilities associated with the two targets. In the absence of clear experimental consensus about how the strengths of direct and indirect pathway corticostriatal synapses vary across conditions, we followed the computational literature and performed a preliminary simulation of a reduced network with dopamine-related STDP in these weights [25,26].
In the reduced network, one of two available actions, which we call left (L) and right (R), was selected by the spiking of model striatal medium spiny neurons (MSNs). These model MSNs were grouped into action channels receiving inputs from distinct cortical sources (Fig 1,  upper right). Every time an action was selected, dopamine was released, after a short delay, at an intensity proportional to a reward prediction error based on comparison of the resulting reward to a value estimate from ongoing Q-learning. All neurons in the network experienced this non-targeted increase in dopamine, emulating striatal release of dopamine by substantia nigra pars compacta neurons, leading to plasticity of corticostriatal synapses (see Subsection STDP network and [30]).
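This feedback loop (action, reward, Q-learning value update, dopamine release proportional to the reward prediction error) can be sketched in a few lines. The code below is a minimal illustration rather than the model's implementation: the learning rate, trial count, and purely random exploration are simplified assumptions, and the returned RPE simply stands in for the phasic dopamine level broadcast to all corticostriatal synapses.

```python
import random

def q_update(q, action, reward, alpha=0.02):
    """Tabular Q-learning update; returns the reward prediction error (RPE),
    which stands in here for the phasic dopamine level."""
    rpe = reward - q[action]       # delta = r - Q(a)
    q[action] += alpha * rpe       # Q(a) <- Q(a) + alpha * delta
    return rpe

random.seed(1)
q = {"L": 0.0, "R": 0.0}
p = {"L": 0.75, "R": 0.25}         # medium-conflict condition: p_L + p_R = 1
for _ in range(4000):
    a = random.choice(["L", "R"])  # random exploration, for illustration only
    r = 1.0 if random.random() < p[a] else 0.0
    dopamine = q_update(q, a, r)   # global, non-targeted dopamine signal
```

After enough trials the value estimates approach the underlying reward probabilities, so the mean RPE (and hence the mean dopamine signal) for each action decays toward zero as learning converges.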
We first performed simulations in which a fixed reward level was associated with each action and confirmed that the reduced network agreed with several experimental benchmarks (see S1 Fig): (a) firing rates in the direct pathway MSNs (dMSNs; firing rates D L and D R ) associated with the more highly rewarded action increased, (b) firing rates of the indirect pathway MSNs (iMSNs; firing rates I L and I R ) remained quite similar [31], and (c) both the dMSNs and the iMSNs associated with a selected action were active during the action selection process [32][33][34].
Given this confirmation of model performance, we next simulated a probabilistic reward task introduced in previous experimental studies on action selection with probabilistic rewards in human subjects [20]. For consistency with experiments, we always used p L + p R = 1, where p L and p R were the probabilities of delivery of a reward of size r = 1 when actions L and R were performed, respectively. Moreover, as in the earlier work, we considered the three cases p L = 0.65 (high conflict), p L = 0.75 (medium conflict) and p L = 0.85 (low conflict). The simulation results show that corticostriatal synaptic weights onto the two dMSN populations clearly separated out over time (Fig 2). The separation emerged earlier and became more drastic as the conflict between the rewards associated with the two actions diminished, i.e., as reward probabilities became less similar. Interestingly, for relatively high conflict, corresponding to relatively low p L , the weights to both dMSN populations rose initially before those onto the less rewarded population eventually diminished (Fig 2A and 2B). This initial increase likely occurred because both actions yielded a reward of 1, leading to a significant dopamine increase, on at least some trials. The weights onto the two iMSN populations, on the other hand, remained much more similar, exhibiting just slight decreases over time (Fig 2A-2C). The ratio of the corticostriatal weights onto the dMSNs relative to those onto the iMSNs, w D /w I , provided a single summary representation of the temporal evolution process, which highlighted the difference between channels within each reward scenario as well as the enhancement of this difference with less conflict between p L and p R (Fig 2D-2F). We therefore extracted the corticostriatal synaptic weight ratios from the three probabilistic reward cases to use in our subsequent simulations of the CBGT network.
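The qualitative effect of the dopamine signal on the weight ratios can be illustrated with a toy update rule. This is a deliberate simplification of the STDP model, not its implementation: rewards are replaced by their expected values for a noise-free demonstration, the learning rates are hand-picked, and dopamine is assumed to act with opposite (and unequal) sign on the two pathways of the chosen channel. The point is only to show why w D /w I separates across channels.

```python
import random

random.seed(7)
alpha_q, eta = 0.05, 0.02
p = {"L": 0.85, "R": 0.15}        # low-conflict reward probabilities
q = {"L": 0.5, "R": 0.5}          # Q-learning value estimates
wD = {"L": 1.0, "R": 1.0}         # Ctx -> dMSN weights (plastic)
wI = {"L": 1.0, "R": 1.0}         # Ctx -> iMSN weights (weakly plastic)

for _ in range(3000):
    # the channel with stronger dMSN drive is chosen more often
    a = "L" if random.random() < wD["L"] / (wD["L"] + wD["R"]) else "R"
    r = p[a]                       # expected reward: noise-free illustration
    da = r - q[a]                  # phasic dopamine ~ reward prediction error
    q[a] += alpha_q * da
    # dopamine acts with opposite sign on the two pathways of the chosen channel
    wD[a] += eta * da
    wI[a] -= 0.1 * eta * da

ratio = {c: wD[c] / wI[c] for c in ("L", "R")}
```

Because the cumulative RPE equals the net change in Q divided by the value learning rate, the more rewarded channel accumulates a net-positive dopamine history: w D /w I rises above 1 for L and falls below 1 for R, mirroring the separation in Fig 2D-2F.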

CBGT dynamics and choice behavior
As illustrated by our STDP simulations, differences in rewards associated with different actions lead to differences in the ratios of corticostriatal synaptic weights to dMSNs versus those to iMSNs, across action channels. Using weight ratios adapted from the STDP model, obtained by varying weights to dMSNs with fixed weights to iMSNs (Fig 2), we next performed simulations with a full spiking CBGT network (Fig 1) to study the effects of this corticostriatal imbalance on the emergent neural dynamics and choice behavior following feedback-dependent learning in the context of low, medium, and high probability reward schedules (2500 trials/condition; see Subsection Neural dynamics for details). In each simulation, cortical inputs featuring gradually increasing firing rates, with identical statistical properties across channels, were supplied to both action channels. These inputs led to evolving firing rates in nuclei throughout the basal ganglia, also partitioned into action channels, with an action eventually being selected when the thalamic firing rate in its channel reached 30 Hz (Fig 3). We found that both dMSN and iMSN firing rates gradually increased in response to cortical inputs, and dMSN firing rates became higher in the channel for the selected action than in the other channel. Interestingly, iMSN firing rates also became higher in the selected channel, consistent with recent experiments (see [35], among others). Similar to the activity patterns observed in the striatum, higher firing rates were also observed in the selected channel's STN and thalamic populations, whereas GPe and GPi firing rates were higher in the unselected channel (Fig 3).

(Fig 2 caption, fragment: action values Q L and Q R estimated by Q-learning are shown alongside the ratio of the corticostriatal weights to the dMSNs that facilitate each action relative to the weights to the iMSNs that interfere with it; weights and ratios are averaged over 8 realizations. A small jump in the Q R trace for p L = 0.65, joined by a dashed line, comes from the time discretization and averaging.)
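The thalamic readout rule, under which the first channel whose thalamic firing rate reaches 30 Hz determines both the choice and the RT, can be sketched as a simple race. The function name, trace format, and toy ramping traces below are illustrative assumptions, not the network's actual API.

```python
import numpy as np

def select_action(thal_rates, dt_ms=1.0, threshold_hz=30.0):
    """Return (channel, RT in ms) for the first thalamic trace to reach
    threshold; thal_rates maps channel names to 1-D firing-rate arrays (Hz)."""
    best = None
    for ch, trace in thal_rates.items():
        above = np.flatnonzero(np.asarray(trace) >= threshold_hz)
        if above.size and (best is None or above[0] < best[1]):
            best = (ch, above[0])
    if best is None:
        return None, None                  # no decision within the trial
    return best[0], best[1] * dt_ms

# Toy ramping traces: the L channel ramps faster and wins the race
t = np.arange(600.0)                       # 600 ms at 1 ms resolution
rates = {"L": 0.1 * t, "R": 0.06 * t}
choice, rt = select_action(rates)
```

With these toy slopes the L trace crosses 30 Hz near 300 ms, well before the R trace (near 500 ms), so L is selected.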
More generally across all weight ratio conditions, dMSNs and iMSNs exhibited a gradual ramping in population firing rates [19] that eventually saturated around the average RT in each condition (Fig 4A). To characterize the relevant dimensions of striatal activity that contributed to the network's behavior, we extracted several summary measures of dMSN and iMSN activity, shown in Fig 4B and 4C. Summary measures of dMSN and iMSN activity in the L and R channels were calculated by estimating the area under the curve (AUC) of the population firing rate between the time of stimulus onset (200 ms) and the RT on each trial. Trialwise AUC estimates were then normalized between values of 0 and 1, including estimates from all trials in all conditions in the normalization, and normalized estimates were averaged over all trials. As expected, increasing the disparity of left and right Ctx-dMSN weights led to greater differences in direct pathway activation between the two channels (i.e., larger D L − D R ; Fig 4B). The increase in D L − D R reflects a form of competition between action channels, where larger values indicate stronger dMSN activation in the optimal channel and/or a weakening of dMSN activity in the suboptimal channel. Similarly, increasing the weight of Ctx-dMSN connections caused a shift in the competition between dMSN and iMSN populations within the left action channel (i.e., an increase in D L − I L ). Thus, manipulating the weight of Ctx-dMSN connections to match those predicted by the STDP model led to both between- and within-channel biases favoring firing of the direct pathway of the optimal action channel in proportion to its expected reward value.
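A minimal sketch of the AUC-based summary measures, using synthetic linear firing-rate ramps in place of the network's simulated rates (the helper names, trace shapes, and RTs here are ours, chosen only to illustrate the onset-to-RT integration and the global min-max normalization):

```python
import numpy as np

def trial_auc(rate, t_ms, onset_ms, rt_ms):
    """Area under a population firing-rate curve from stimulus onset to RT
    (trapezoid rule, computed explicitly)."""
    mask = (t_ms >= onset_ms) & (t_ms <= rt_ms)
    r, tt = rate[mask], t_ms[mask]
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(tt)))

def normalize(aucs):
    """Min-max normalize trialwise AUCs to [0, 1] across ALL trials/conditions."""
    a = np.asarray(aucs, dtype=float)
    return (a - a.min()) / (a.max() - a.min())

# Toy example: two trials, linear ramps for the left/right dMSN populations
t = np.arange(0.0, 601.0)                 # 0-600 ms at 1 ms resolution
dL = 0.05 * t                             # faster ramp in the optimal (L) channel
dR = 0.03 * t
auc_L = [trial_auc(dL, t, 200.0, rt) for rt in (400.0, 500.0)]
auc_R = [trial_auc(dR, t, 200.0, rt) for rt in (400.0, 500.0)]
norm = normalize(auc_L + auc_R)           # pooled normalization, as in the text
D_diff = norm[:2].mean() - norm[2:].mean()  # analog of D_L - D_R
```

Because the L ramp is steeper, its normalized AUCs exceed those of the R channel on matched trials, yielding a positive D_L − D_R analog.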
Interestingly, although the weights of Ctx-iMSN connections were kept constant across conditions, iMSN populations showed reliable differences in activation between channels (Fig 4C). Similar to the observed effects on direct pathway activation, higher reward conditions were associated with progressively greater differences in the L and R indirect pathway firing rates (i.e., increased I L − I R ). At first glance, greater indirect pathway activation in more rewarded action channels seems to be at odds with the similarity of cortical synaptic weights to both indirect pathway channels that we obtained in the STDP model (Fig 2A), and also appears to be at odds with canonical theories of the roles of the direct and indirect pathways in RL and decision-making. This finding can be explained, however, by the presence of channel-specific excitatory inputs to the striatum from the thalamus in the CBGT network. That is, the strengthening and weakening of Ctx-dMSN weights in the L and R channels, respectively, translated into relatively greater downstream disinhibition of the thalamus in the L channel, which increased the excitatory feedback to L-dMSNs and L-iMSNs while reducing thalamostriatal feedback to R-MSNs in both pathways.
Finally, we examined the effects of reward probability on all iMSN firing rates, combined across action channels (I all ; Fig 4C). Observed differences in I all across reward conditions were notably more subtle than those observed for other summary measures of striatal activity, with greatest activity in the medium reward condition, followed by the high and low reward conditions, respectively.
We also examined the distribution of RTs for L responses across reward conditions (Fig 4E). All conditions showed a rightward skew in the distribution of RTs, an empirical hallmark of simple choice behavior and a useful check of the suitability of accumulation-to-bound models like the DDM for modeling behavioral data sets. Moreover, the degree of positive skew in the RT distributions (i.e., greater probability mass at lower than at higher values) for L responses became more pronounced with increasing reward probability, suggesting that the observed decrease in mean RT at higher levels of reward was driven by a change in the shape of the distribution and not, for instance, by a temporal shift in its location. While the skewed shape of the RT distributions produced by the CBGT network is qualitatively consistent with that typically observed in human choice experiments, it is worth noting that the network-simulated RTs appeared to follow an exponential distribution, whereas empirical RTs often exhibit a mode closer to the center of mass.
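The skew comparison can be made concrete with a quick check on synthetic RTs: an exponential-like distribution (mode at the left edge, as the network produced) versus a symmetric control. The sample-skewness helper below is standard, but the RT distributions are fabricated purely for illustration.

```python
import numpy as np

def skewness(x):
    """Fisher-Pearson sample skewness; positive values indicate a right tail."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Exponential-like RTs (mode at the lower edge) vs. a symmetric control
rt_exp = 300.0 + rng.exponential(scale=150.0, size=5000)   # shifted, in ms
rt_sym = rng.normal(loc=500.0, scale=50.0, size=5000)
```

An exponential distribution has theoretical skewness 2, so the first sample comes out strongly right-skewed while the Gaussian control hovers near zero; empirical RTs typically fall between these extremes, skewed but with an interior mode.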

CBGT-DDM mapping
We performed fits of a normative DDM to the CBGT network's decision-making performance (i.e., accuracy and RT) to understand the effects of corticostriatal plasticity on emergent changes in decision behavior. This process was implemented in three stages. First, we compared models in which only one free DDM parameter was allowed to vary across levels of reward probability (single parameter DDMs). Next, a second round of fits was performed in which a second free DDM parameter was included in the best-fitting single parameter model identified in the previous stage (dual parameter DDMs). Finally, the two best-fitting dual parameter models were submitted to a third and final round of fits with the inclusion of trialwise measures of striatal activity (see Fig 4B and 4C) as regressors on designated parameters of the DDM.
All models were evaluated according to their relative improvement in fit compared to a null model in which all parameters were fixed across conditions. To identify which single parameter of the DDM best captured the behavioral effects of alterations in reward probability, as represented by Ctx-dMSN connectivity strength, we compared the deviance information criterion (DIC) of DDMs in which either the boundary height (a), the onset delay (tr), the drift rate (v), or the starting-point bias (z) was allowed to vary across conditions. Fig 5A shows the difference between the DIC score of each model (DIC M ) and that of the null model (ΔDIC = DIC M − DIC null ), with lower values indicating a better fit to the data (see Table 1 for additional fit statistics). Conventionally, a DIC difference (ΔDIC) of magnitude 10 or more is regarded as strong evidence in favor of the model with the lower DIC value [41]. Compared to the null model as well as the alternative single parameter models, allowing the drift rate v to vary across conditions afforded a significantly better fit to the data (ΔDIC = −960.79).

(Fig 5A caption, fragment: the ΔDIC score of the best-fitting model at each stage is plotted in green; the best overall fit was provided by DDM regression model III. See Table 2.)

Table 1. Single- and dual-parameter DDM goodness-of-fit statistics. DIC is a complexity-penalized measure of model fit, DIC = D(θ) + pD, where D(θ) is the deviance of model fit under the optimized parameter set θ and pD is the effective number of parameters. ΔDIC is the difference between each model's DIC and that of the null model for which all parameters are fixed across conditions. Asterisks denote models providing best fits within the single-parameter group (*) and across both groups (**).
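A DIC comparison of this kind can be computed from posterior deviance samples as sketched below. The deviance draws and the deviance-at-the-posterior-mean values are made up for illustration; only the arithmetic (pD as mean deviance minus deviance at the posterior mean, DIC as mean deviance plus pD) reflects the standard definition.

```python
import numpy as np

def dic(deviance_samples, deviance_at_mean):
    """DIC = mean(D) + pD, with pD = mean(D) - D(theta_bar) estimated from
    posterior deviance samples. Returns (DIC, pD)."""
    d_bar = float(np.mean(deviance_samples))
    p_d = d_bar - deviance_at_mean        # effective number of parameters
    return d_bar + p_d, p_d

rng = np.random.default_rng(3)
# Hypothetical posterior deviance draws for a null and an alternative model
d_null = rng.normal(1200.0, 5.0, 4000)
d_alt = rng.normal(900.0, 5.0, 4000)
dic_null, _ = dic(d_null, 1195.0)
dic_alt, _ = dic(d_alt, 892.0)
delta_dic = dic_alt - dic_null            # < -10: strong support for alt
```

With these fabricated numbers ΔDIC comes out near −297, far beyond the conventional magnitude-10 threshold for strong evidence.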
Examination of posterior distributions of v in the best-fitting single parameter model revealed a significant increase in v with successively higher levels of reward probability (v Low = 0.35; v Med = 1.61; v High = 2.71), capturing the observed increase in speed and accuracy across conditions by increasing the rate of evidence accumulation toward the upper (L) decision threshold.

To investigate potential interactions between the drift rate and other parameters of the DDM, we performed another round of fits in which a second free parameter (either a, tr, or z), in addition to v, was allowed to vary across conditions (Fig 5A). Compared to the alternative dual-parameter models, allowing v and a to vary together across conditions provided the greatest improvement in model fit over the null model (ΔDIC = −1174.07), as well as over the best-fitting single parameter model. The second best-fitting dual parameter model, in which v and z were left free across conditions, also afforded a significant improvement over the drift-only model. Thus, both the v, a and the v, z dual parameter models were considered in a third and final round of fits. The third round was motivated by the fact that, while behavioral fits can yield reliable and informative insights about the cognitive mechanisms engaged by a given experimental manipulation, recent studies have effectively combined behavioral observations with coincident measures of neural activity to test more precise hypotheses about the neural dynamics involved in regulating different cognitive mechanisms [20,42,43]. To this end, we refit the v, a and v, z models to the same simulated behavioral dataset (i.e., accuracy and RTs produced by the CBGT network) as in the previous rounds, but now with different trialwise measures of striatal activity included as regressors on one of the two free parameters in the DDM.
For each regression DDM (N = 24 models, corresponding to the 24 ways to map 2 of the 4 striatal activity measures to the v, a and v, z models), one of the summary measures shown in Fig 4B and 4C was regressed on v, and another regressed on either a or z, with separate regression weights estimated for each level of reward probability. Model fit statistics are shown for each of the 24 regression models in Table 2, along with information about the neural regressors included in each model and their respective parameter dependencies. The relative goodness-of-fit afforded by all 24 regression models is visualized in Fig 5A (lower panel), identifying what we have labelled as model III as the clear winner, with an overall DIC = −18860.37 and ΔDIC = −9716.17 compared to the null model. In model III, the drift rate v on each action selection trial depended on the relative strength of direct pathway activation in the L and R action channels (e.g., D L − D R ), whereas the boundary height a on that trial was computed as a function of the overall strength of indirect pathway activation across both channels (e.g., I all ). To determine how these parameter dependencies influenced the boundary height and the rate of evidence accumulation across levels of reward probability, intercept and regression coefficient posteriors were transformed into posterior estimates of a and v for each condition j as

a_j = β_a,0 + β_a,j · I_j and v_j = β_v,0 + β_v,j · ΔD_j,

where ΔD_j and I_j are the mean values of D L − D R and I all in condition j (see Fig 4B and 4C). Posterior estimates of v increased across conditions, reaching their highest values in the high reward condition (μ_v High = 5.10; σ_v High = 0.09). Thus, increasing the disparity of dMSN activation between the L and R action channels led to faster and more frequent leftward actions by increasing the rate of evidence accumulation towards the correct decision boundary.
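The transformation from regression posteriors to condition-wise parameter estimates amounts to applying the linear model sample-by-sample. A sketch for the drift rate, with fabricated posterior draws and condition means (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000                                       # posterior samples (hypothetical)
# Intercept and condition-specific slope posteriors for the drift rate v
beta0_v = rng.normal(0.2, 0.05, n)
beta_v = {"low": rng.normal(1.0, 0.1, n),
          "med": rng.normal(2.0, 0.1, n),
          "high": rng.normal(3.0, 0.1, n)}
delta_D = {"low": 0.2, "med": 0.5, "high": 0.8}  # mean D_L - D_R per condition

# v_j = beta0_v + beta_vj * mean(D_L - D_R)_j, applied draw-by-draw so the
# result is itself a posterior distribution over the condition-wise drift rate
v_post = {j: beta0_v + beta_v[j] * delta_D[j] for j in beta_v}
v_mean = {j: float(v_post[j].mean()) for j in v_post}
```

The same transform, with the overall iMSN measure in place of ΔD, yields the condition-wise boundary height posteriors. Applying it per draw (rather than to posterior means) preserves uncertainty for downstream comparisons.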
Also consistent with parameter estimates from the best-fitting dual parameter model (i.e., v, a), inclusion of trialwise values of I all led to an increase in the boundary height in the medium (μ_a Med = 1.03; σ_a Med = 0.01) and high (μ_a High = 1.02; σ_a High = 0.01) conditions compared to estimates in the low condition (μ_a Low = 0.93; σ_a Low = 0.01). However, in contrast with boundary height estimates derived from the non-hierarchical DDMs that simply looked at condition-wise effects (see Table 1), estimates of a in model III showed no significant difference between the medium and high levels of reward probability.
Examination of trialwise covariation of striatal regressors with the raw choice and RT data of the network revealed additional insights into the influence of striatal dynamics on decision outcomes, based on the summary measures of competition between (D L − D R ) and within (D L − I L ) channels (see Fig 4B). In particular, the finding that decisions were faster on trials with greater overall iMSN activation (I all ) appears to contradict the role of the indirect pathway in our model as having a suppressive influence on the decision process. However, this relationship can be understood as a consequence of heightened excitatory feedback to the striatum from the thalamus. On trials in which the sampled stimulus is strong, heightened channel-specific feedback from the thalamus causes an increase in the activity of both dMSN and iMSN populations within the favored channel. This additional excitatory drive to that channel's dMSN population leads to faster decision times, despite the coupled increase in iMSN activity on those trials. Thus, the paired increase in iMSN activation with decision speed does not imply a causal relationship but rather a correlation driven by thalamic feedback acting on multiple targets.

Table 2. DDM regression models and goodness-of-fit statistics. Asterisk denotes best performing model.

Next, we evaluated the extent to which the best-fitting regression model (i.e., model III) was able to account for the qualitative behavioral patterns exhibited by the CBGT network in each condition. To this end, we simulated 20,000 trials in each reward condition (each trial producing a response and RT given a parameter set sampled from the model posteriors) and compared the resulting RT distributions, along with mean speed and accuracy measures, with those produced by the CBGT model (Fig 5D and 5E). Parameter estimates from the best-fitting model captured the increasing positive skew of the RT distributions as well as the concurrent increase in mean decision speed and accuracy with increasing reward probability.
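Forward-simulating the DDM from sampled parameters, as in the 20,000-trial comparison, can be sketched with a simple Euler-Maruyama scheme. This is a generic DDM simulator, not the fitting package's own sampler, and the parameter values below are only loosely inspired by the reported single-parameter fits.

```python
import numpy as np

def simulate_ddm(v, a, z=0.5, tr=0.25, dt=0.001, sigma=1.0,
                 n_trials=1000, seed=0):
    """Forward-simulate the DDM: evidence x starts at z*a and drifts at rate v
    with Gaussian noise until hitting 0 or a. Returns (choices, RTs in s)."""
    rng = np.random.default_rng(seed)
    choices = np.empty(n_trials, dtype=int)
    rts = np.empty(n_trials)
    for i in range(n_trials):
        x, t = z * a, 0.0
        while 0.0 < x < a:
            x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            t += dt
        choices[i] = 1 if x >= a else 0   # 1 = upper (L) boundary
        rts[i] = t + tr                   # add non-decision time
    return choices, rts

ch_hi, rt_hi = simulate_ddm(v=2.7, a=1.0, n_trials=500)   # "high reward"-like
ch_lo, rt_lo = simulate_ddm(v=0.35, a=1.0, n_trials=500)  # "low reward"-like
```

As expected from the fits, raising the drift rate simultaneously increases accuracy (more upper-boundary hits) and shortens RTs, while leaving the characteristic right-skewed RT shape intact.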
In summary, by leveraging trialwise measures of simulated striatal MSN subpopulation dynamics to supplement RT and choice data generated by the CBGT network, we were able to 1) substantially improve the quality of DDM fits to the network's behavior across levels of reward probability compared to models without access to neural observations and 2) identify dissociable neural signals underlying observed changes in v and a across varying levels of reward probability associated with available choices.

Robustness analysis
To ensure that the conclusions derived from simulations performed with the original single CBGT network were not disproportionately influenced by the chosen connectivity parameters, we performed a subsequent round of simulations and analyses aimed at assessing the robustness of our major findings. Robustness of neural and behavioral outcomes across reward conditions and of subsequent fits to the DDM were probed by re-simulating the decision experiment with multiple "subject" networks (N = 15). For every projection connecting two nuclei in the CBGT network, each subject was parameterized using a randomly sampled connection efficacy (see Methods section Network sampling procedure).
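The subject-sampling idea, jittering each projection's efficacy around a base value, can be sketched as below. The projection names, base values, and coefficient of variation here are all hypothetical; the actual sampling procedure is described in the Methods section Network sampling procedure.

```python
import numpy as np

def sample_subject(base_efficacies, cv=0.1, seed=None):
    """Create one 'subject' network by jittering each projection's efficacy
    around its base value (truncated at zero). cv is an assumed spread."""
    rng = np.random.default_rng(seed)
    return {proj: max(rng.normal(g, cv * g), 0.0)
            for proj, g in base_efficacies.items()}

# Hypothetical base efficacies for a few CBGT projections
base = {"Ctx->dMSN": 0.02, "Ctx->iMSN": 0.02, "dMSN->GPi": 0.03,
        "iMSN->GPe": 0.03, "STN->GPi": 0.04, "Thal->MSN": 0.01}
subjects = [sample_subject(base, cv=0.1, seed=s) for s in range(15)]
```

Each of the 15 resulting dictionaries parameterizes one subject network, so between-subject behavioral variability can be traced back to the sampled efficacies.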
Similar to the original behavioral outcomes, the sampled CBGT networks showed a trending increase in both speed and accuracy with increasing probability of reward for the better action (Fig 6A). Separate repeated-measures ANOVA tests were used to assess the statistical reliability of changes in RT and accuracy across levels of reward, revealing a marginal effect on RT, F(2, 28) = 3.29, p = 0.052, and a significant effect on accuracy, F(2, 28) = 465.53, p < 0.00001. Importantly, surrounding the subject-averaged behavioral estimates, substantial variability was observed in the accuracy and RT means of individual networks. This variation in behavior across subject networks confirms that the efficacy sampling procedure had the intended effect for the purposes of assessing outcome robustness. Despite the apparent variability in behavioral means, several other patterns were conserved from the original CBGT simulations. Notably, similar to the RT distributions of the original network (see Fig 4E), the sampled network simulations led to RT distributions with an increasingly positive skew at higher levels of reward (Fig 6B), with similar ranges of RTs observed in the original and sampled networks.
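The one-way repeated-measures ANOVA used here (15 subjects × 3 reward levels, hence F(2, 28)) can be computed directly from the sums of squares. The data below are synthetic, constructed with a subject-specific offset and a condition effect; only the F-test arithmetic is standard.

```python
import numpy as np

def rm_anova(data):
    """One-way repeated-measures ANOVA.
    data: (n_subjects, k_conditions) array; returns (F, df1, df2)."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * np.sum((data.mean(axis=0) - grand) ** 2)   # condition effect
    ss_subj = k * np.sum((data.mean(axis=1) - grand) ** 2)   # subject blocking
    ss_tot = np.sum((data - grand) ** 2)
    ss_err = ss_tot - ss_cond - ss_subj                      # residual
    df1, df2 = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df1) / (ss_err / df2), df1, df2

rng = np.random.default_rng(11)
subj_offset = rng.normal(0.0, 0.1, (15, 1))       # stable per-subject bias
cond_effect = np.array([0.0, 0.3, 0.6])           # accuracy-like increase
acc = 0.5 + cond_effect + subj_offset + rng.normal(0.0, 0.02, (15, 3))
F, df1, df2 = rm_anova(acc)
```

Because subject offsets are removed as a blocking factor, the residual term reflects only the small within-subject noise, which is why strong condition effects produce very large F values like the accuracy result reported above.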
Furthermore, normalized summary measures of striatal dynamics showed consistent patterns of change across levels of reward in the sampled network simulations (Fig 6C and 6D) compared to those of the original network (see Fig 4B and 4C). As in the original simulations, the normalized difference between left and right dMSN activation (D L − D R ), as well as the within-channel difference between the left dMSN and iMSN activation (D L − I L ), both showed an increase across reward conditions ( Fig 6C). Also similar to the original CBGT simulations, results with the sampled networks showed an increase in the relative activation of left and right iMSN populations (I L − I R ) with increasing reward, with the aggregate activation across all iMSNs (I all ) staying relatively stable across conditions (Fig 6D). Consistent with the largely mirrored effects of reward on striatal summary statistics in the original and sampled networks, the time course of firing rates in striatal populations of the sampled networks showed highly similar patterns to those of the original network in each reward condition (S2 Fig).
We next sought to confirm that the mapping identified between CBGT and DDM parameters was not altered by introducing variability to the connection weights between CBGT nuclei. We found that hierarchical fits to the behavior of multiple sampled networks supported the same best-fitting single- and dual-parameter models as those selected from fits to the original single network's behavior (Fig 7A; cf. Fig 5A). As observed in the original fits, allowing the parameter v to vary across reward conditions afforded substantially better fits to the sampled network data when compared with all other single-parameter models (Fig 7A, left). Also consistent with the single network fits, allowing both v and a to vary across reward conditions further improved the DIC score over that of the drift-only model (Fig 7A, right). Finally, we fit regression DDM models I-XII (i.e., all v, a models) to the data generated by the sampled networks to evaluate the robustness of the originally identified mapping between trialwise striatal dynamics and parameters of the DDM. Specifically, in the best-fitting regression model for the original network (i.e., model III), trialwise fluctuations in v and a were driven by changes in D L − D R and I all , respectively. Compared to alternative regression models, this same model best accounted for the behavior of the sampled networks (Fig 7B), with drift rate and boundary height both increasing with the reward probability of the more rewarded action (Fig 7C), suggesting that this particular mapping is reasonably robust to variability in the underlying network connectivity. Similar to the original network (S3A Fig), decisions in the sampled networks were faster on trials with greater overall iMSN activation. As noted previously, this counter-intuitive increase in speed coupled with greater overall iMSN activation can be understood as a consequence of thalamic feedback to the striatum, which becomes elevated in proportion to the motor-facilitating effects of the direct pathway on the output of the basal ganglia.
As with our original simulations, parameter estimates associated with model III provided an excellent fit to RT distributions generated by the CBGT network ( Fig 7D).

Discussion
Reinforcement learning in mammals alters the mapping from sensory evidence to action decisions. Here we set out to understand how this adaptive decision-making process emerges from underlying neural circuits using a modeling approach that bridges across levels of analysis, from plasticity at corticostriatal synapses to CBGT network function to quantifiable behavioral parameters [11,12,15,18]. As a preliminary step, we showed that a simple, DA-mediated STDP rule alters the ratio of direct and indirect pathway corticostriatal weights within action channels in response to reward-driven feedback. With this result in hand, we simulated the network-level dynamics of CBGT circuits, as well as behavioral responses, under different levels of conflict in reward probabilities. As reward probability for the optimal target increased, the asymmetry of dMSN firing rates between action channels grew, as did the overall activity of iMSNs across both action channels. By fitting the DDM to the simulated decision behavior of the CBGT network, we found that changes in the rate of evidence accumulation tracked with the difference in dMSN population firing rates across action channels, while the level of evidence required to trigger a decision tracked with the overall iMSN population activity. These findings show how, at least within this specific framework, plasticity at corticostriatal synapses induced by phasic changes in DA can have a multifaceted effect on cognitive decision processes.
Our theoretical experiments rest on two critical assumptions: 1) evidence accumulation models can reliably capture value-based decision processes, and 2) the CBGT pathways accumulate evidence for competing actions in order to identify the most contextually appropriate response. As for the first assumption, multiple lines of evidence support the appropriateness of accumulation-to-bound models for describing value-based decisions. First, RTs in value-based decision tasks, such as the n-armed bandit task, have distributional properties similar to those found in perceptual decision tasks. If the action selection policy were simply to choose the action with the highest value, given a single comparison of the values of all possible actions, then the low effort required in the selection process should manifest as lower variability in the RT distribution across trials than is actually observed [20,45]. Second, under a sequential sampling framework, like that used by the DDM, the signal-to-noise ratio of evidence accumulation can effectively control the greediness of the action selection policy [45]. Finally, it has been shown that sequential sampling models with collapsing bounds satisfy several of the optimality conditions for value-based choices [46]. Taken together, the previous literature makes clear that evidence accumulation models are viable algorithmic descriptions of value-based choice.
The second assumption, that the accumulation of evidence process relies on CBGT pathways, is also supported by a growing body of empirical and theoretical evidence. Initial causal evidence for the link between CBGT pathways and perceptual decision-making was reported by Ding and Gold [47], who showed that microstimulation of the caudate nucleus in macaques could modify perceptual choices in a saccade task and that this shift in choice was reflected as a change in the starting point bias z in a standard DDM. More recently, Yartsev et al. [19] showed that, in rodents performing an auditory discrimination task, the anterior dorsolateral striatum satisfied three fundamental criteria for establishing causality in the evidence accumulation process: (1) inactivation of the striatum impaired the animal's discrimination performance on the task, (2) perturbation of striatal neurons during the temporal window of evidence accumulation had predictable and reliable effects on trialwise behavioral reports, and (3) gradual ramping, proportional to the strength of evidence, was observed in both single unit and population firing rates of the striatum (however, see also [48]). Consistent with these empirical findings, Caballero et al. [16] recently proposed a novel computational framework, capturing perceptual evidence accumulation as an emergent effect of recurrent activation of competing action channels. This modeling work builds on previous studies showing how the architecture of CBGT loops is ideal for implementing a variant of the sequential probability ratio test [5,12]. Taken together, these converging lines of evidence point to CBGT pathways as being causally involved in the accumulation of evidence for decision-making.
The idea that an accumulation of evidence algorithm can be implemented via network-level dynamics within looped circuit architectures stands in sharp contrast to cortical models of decision-making that presume a more direct isomorphism between accumulators and neural activity (for review see [21]). Early experimental work showed how population-level firing rates in area LIP displayed the same ramp-to-threshold dynamics as predicted by an evidence accumulation process [49][50][51]. This simple relation between algorithm and implementation has now come into question. Follow-up electrophysiological experiments showed how this population-level accumulation may, in fact, reflect the aggregation of step-functions across neurons that resemble an accumulator when summed together, yet lack accumulation properties at the level of individual units [52]. In addition, recent results from intervention studies are inconsistent with the causal role of cortical areas in the accumulation of evidence. For instance, Katz et al. [53] found that inactivation of area LIP in macaques had no effect on the ability of monkeys to discriminate the direction of motion stimuli in a standard random dot motion task. In contrast to the presumed centrality of LIP in sensory evidence accumulation, these findings and supporting reports from [54] and [55] suggest that cortical areas like LIP provide a useful proxy for the deliberation process but are unlikely to have a causal role in the decision itself.
The recent experimental [19] and theoretical [16] revelations of CBGT involvement in decision-making are particularly exciting, not only for the purposes of identifying a likely neural substrate of perceptual choice, but also for their implications for integrating accumulation-to-bound models (e.g., action selection mechanisms) with theories of RL (e.g., feedback-dependent learning of action values). We previously proposed a Believer-Skeptic framework [7] to capture the complementary roles played by the direct and indirect pathways in the feedback-dependent learning and the moment-to-moment evidence accumulation leading up to action selection (see also [23,24]). This competition between opposing control pathways can be characterized as a debate between a Believer (direct pathway) and a Skeptic (indirect pathway), reflecting the instantaneous probability ratio of evidence in favor of executing and suppressing a given action respectively. Because the default state of the basal ganglia pathways is motor-suppressing (e.g., [56,57]), the burden of proof falls on the Believer to present sufficient evidence for selecting a particular action. In accumulation-to-bound models like the DDM, this sequential sampling of evidence is parameterized by the drift rate. Thus, the Believer-Skeptic model assumes that this competition should be reflected, at least in part, in the rate of evidence accumulation. As for the role of learning in the Believer-Skeptic competition, multiple lines of evidence suggest that dopaminergic feedback during learning systematically biases the direct-indirect competition in a manner consistent with increasing the drift rate for more rewarding actions [7,20,36,38,45,58]. Indeed, the STDP simulations in the current study showed opposing effects of dopaminergic feedback on corticostriatal synapses in the direct pathway when comparing across the optimal and suboptimal action channels.
In support of the biological assumptions underlying the CBGT network, several important empirical properties naturally emerged from our simulations. First, both dMSN and iMSN striatal populations were concurrently activated on each trial (see [31,44,59]) and exhibited gradually ramping firing rates that often saturated before the response on each trial [19,48]. Second, in contrast with the relatively early onset of ramping activity in the striatum, recipient populations in the GPi sustained high tonic firing rates throughout most of the trial, with activity in the selected channel showing a precipitous decline near the recorded RT [28,60,61]. This delayed change in GPi activation is caused by the opposing influence of concurrently active dMSN and iMSN populations in each channel, such that the influence of the direct pathway on the GPi is temporarily balanced out by activation of the indirect pathway (see [28]). To represent low, medium, and high levels of reward probability conflict, we manipulated the weights of cortical input to dMSNs in each channel, increasing and decreasing the ratio of direct pathway weights to indirect pathway weights for L and R actions, respectively. As expected, increasing the difference in the associated reward for L and R actions led to stronger firing in L-dMSNs and weaker firing of R-dMSNs. Consistent with recently reported electrophysiological findings [31,44], we also observed an increase in the firing of iMSNs in the L action channel, which in our simulations may arise from channel-specific feedback from the L component of the thalamus. Behaviorally, the choices of the CBGT network became both faster and more accurate (e.g., higher percentage of L responses) at higher levels of reward, suggesting that the observed increase in L-iMSN firing did not serve to delay or suppress L selections. 
These changes in neural dynamics also produced corresponding changes in value-based decision behavior, consistent with previous studies linking parameters of the DDM with experiential feedback.
One of the critical outcomes of the current set of experiments is the mechanistic prediction of how variation in specific neural parameters relates to changes in parameters of the DDM. Consistent with past work (see [7,20]), the DDM fits to the CBGT-simulated behavior showed an increase in drift rate toward the higher valued decision boundary with increasing expected reward. Additionally, we found that greater disparity in the expected values of alternative actions led to an increase in the boundary height. Indeed, the co-modulation of drift rate and boundary parameters observed here has also been found in human and animal experimental studies of value-based choice [20,36,38]. For example, experiments with human subjects in a value-based learning task showed that selection and response speed patterns were best described by an increase in the rate of evidence for more valued targets, coupled with an upwards shift in the boundary height for all targets [36]. Moreover, in healthy human subjects, but not Parkinson's disease patients, reward feedback was found to drive increases in both rate and boundary height parameters, effectively breaking the speed-accuracy tradeoff [36]. To identify more precise links between the relevant neural dynamics underlying the observed drift rate and boundary height effects, we performed another round of model fits with striatal summary measures included as regressors to describe trial-by-trial variability. Behavioral fits were substantially improved by estimating trialwise values of drift rate as a function of the difference between L- and R-dMSN activation and trialwise values of boundary height as a function of the iMSN activation across both channels.
These relationships stand both as novel predictions arising from the current study and as refinements to the competing striatal pathways framework of decision uncertainty [7,23,24], implying that the decision promoting component (i.e., the Believer) relies on a competition between action channels while the decision suppressing component (i.e., the Skeptic) involves a cooperative aspect across all action channels.
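The mapping described in this section can be sketched in a few lines of code. Below is a minimal, illustrative simulation of a single DDM trial in which the drift rate is a linear function of the between-channel dMSN difference (D_L − D_R) and the boundary height a linear function of aggregate iMSN activity (I_all), as in regression model III; the coefficients (b_v, b_a, a0), noise level, and time step are hypothetical placeholders, not fitted values from our models.

```python
import numpy as np

def simulate_ddm_trial(d_diff, i_all, rng, b_v=2.0, b_a=0.5, a0=1.0,
                       dt=0.001, sigma=1.0, max_t=3.0):
    """One Euler-Maruyama DDM trial with striatal regressors:
    drift rate scales with the dMSN asymmetry and boundary height
    with overall iMSN activity. All coefficients are illustrative."""
    v = b_v * d_diff              # drift tracks dMSN asymmetry
    a = a0 + b_a * i_all          # boundary tracks overall iMSN activity
    x, t = 0.0, 0.0               # unbiased start, midway between bounds
    while t < max_t:
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= a / 2:
            return "L", t         # upper bound -> optimal (L) choice
        if x <= -a / 2:
            return "R", t         # lower bound -> suboptimal (R) choice
    return None, max_t            # no decision within the trial window

rng = np.random.default_rng(0)
choices = [simulate_ddm_trial(0.4, 0.6, rng)[0] for _ in range(200)]
```

With a positive dMSN asymmetry, most simulated trials terminate at the upper (L) bound, mirroring the faster, more frequent optimal choices observed at higher reward levels.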
While our present findings provide key insights into the links between implementation mechanisms and cognitive algorithms during adaptive decision-making, they are constrained by the nature of the multi-level modeling approach itself. Our goal was to evaluate a specific hypothesis under the competing striatal pathways framework about the combined role of corticostriatal pathways in learning and decision-making, and our simulations demonstrate that strengthening corticostriatal synapses is one way that the brain can adjust striatal firing to shape the drift rate and accumulation threshold, promoting faster and more frequent selection of actions with a higher expected value. We do not presume, however, that the impacts of dopaminergic plasticity at corticostriatal synapses on striatal activity are singularly responsible for setting the drift rate during value-based decision-making. Indeed, the manipulation of a single CBGT parameter can have a compounding effect on the downstream network dynamics (e.g., amplified thalamostriatal feedback in the optimal action channel) that ultimately contributes to parametric changes at the cognitive level (e.g., an increase in the drift rate, favoring the optimal choice). Moreover, because the CBGT network has many more parameters than the DDM, many different properties of the CBGT network, aside from corticostriatal weights and measures of striatal activity, could potentially be manipulated to cause analogous behavioral patterns and inferred effects on the drift rate and boundary height parameters in the DDM. For instance, in contrast to the striatal iMSN modulation of boundary height observed in the current study, Ratcliff and Frank [18] found that simulated changes in STN firing were also capable of describing a change in the boundary height, raising the threshold in the context of high decision conflict.
In fact, experimental evidence suggests the existence of both striatal [62][63][64] and subthalamic [42,64,65] mechanisms for adjusting the boundary height. It remains for future work to study how multiple mechanisms such as these work together to impact decision behavior as well as to consider more complex decision-making tasks that may help to expose distinct roles for these aspects of CBGT activity. Another open direction is to generalize our approach to include more detailed representations of neurons in CBGT populations, such as Hodgkin-Huxley-type models, and additional detail about BG neuronal subpopulations and pathways, such as distinct representations of arkypallidal and prototypical GPe neurons and the GPe projection to the striatum. Finally, while we established the robustness of our results to extensive variation of CBGT connection strengths, it is important to note that the predictions of our study depend on the assumptions and parameter choices inherent in our model construction. For instance, the STDP model used to tune corticostriatal weights treats dopamine in a simplified way, neglecting tonic dopamine as well as enhanced iMSN sensitivity to negative fluctuations in dopamine [23].
Our simulations make several novel predictions for future experiments. First, while the learning-related changes in L and R direct pathway corticostriatal weights were mirrored by the relative firing rates of L-and R-dMSNs, the iMSN firing rates expressed channel-specific differences, despite the invariance of the corticostriatal weights to iMSNs across conditions. Critically, the observed increase in iMSN firing disparity between the L and R channels in our simulations emerged due to the thalamostriatal feedback assumed in the CBGT network, where dMSN activation leads to disinhibition of the thalamus, thereby increasing excitatory feedback to both MSN subtypes within a given channel. Thus, by design, our model assumes that feedback connections from the thalamus should adhere to a channel-specific (e.g., focal) topology. This assumption of focality should be examined by future neuroanatomical research. In addition, if the topological predictions of our model are confirmed, the simulations presented here make the second prediction that activating the thalamic cells during the decision window should counterintuitively increase iMSN, as well as dMSN, firing rates while at the same time increasing decision speeds. One important caveat to these predictions is that thalamostriatal feedback was not incorporated into the STDP simulations used to generate the post-learning corticostriatal weights of the full CBGT network in each reward condition. Thus, additional potentiation of both dMSN and iMSN populations arising from increased thalamic gain in the optimal channel could interact with dopaminergic feedback during the learning process and thus lead to alternative downstream effects. Indeed, little is known about the role that thalamostriatal inputs play during reinforcement learning, requiring future experimental data to address these questions. 
Third, our simulations showed that the difference in firing rates of the dMSNs across action channels modulated the rate of value-based evidence accumulation. According to our simulations, increasing the relative magnitude of dMSN activity in the R channel compared to the L channel, for example using optogenetic stimulation of dMSNs in the dorsolateral striatum, should speed and facilitate the selection of contralateral lever presses. Choice and RT data could then be fit with the DDM to determine if the behavioral effects of laterally-biased dMSN stimulation were best described by a change in the drift rate. Analogous experiments targeting iMSNs but without channel specificity could be used similarly to evaluate our prediction that overall iMSN activity level modulates DDM boundary height.

Conclusion
Here we characterize the effects of dopaminergic feedback on the competition between direct and indirect CBGT pathways and how this plasticity impacts the evaluation of evidence for alternative actions during value-based choice. Using simulated neural dynamics to generate behavioral data for fitting by the DDM and determining how measures of striatal activity influence this fit, we show how the rate of evidence accumulation and the decision boundary height are modulated by the direct and indirect pathways, respectively. This multi-level modeling approach affords a unique combination of biological plausibility and mechanistic interpretability, providing a rich set of testable predictions for guiding future experimental work at multiple levels of analysis.

Methods
In our work, we generate behavioral data under several reward scenarios using a spiking cortico-basal ganglia-thalamic (CBGT) network, comprising neurons and synaptic connections from the key cortical and subcortical areas within the CBGT computational loops that take sensory evidence from cortex and select one of two available responses. We fit these data with the drift diffusion model (DDM), a cognitive model of decision-making that describes the accumulation-to-bound dynamics underlying the speed and accuracy of simple choice behavior [1], and we harness this fit in a hierarchical Bayesian framework (which we jointly refer to as a hierarchical DDM) to establish an upwards mapping from measures of striatal activity to DDM parameters. In this section, we first briefly describe a spike-timing dependent plasticity (STDP) network consisting of striatal neurons and their cortical inputs, with corticostriatal synaptic plasticity driven by phasic reward signals resulting from simulated actions and their consequent dopamine release; we use this network to select corticostriatal synaptic weights for our CBGT network. We then present the details of the CBGT and DDM models, along with the computational approaches that we use in simulating and analyzing them.
All simulation and analysis code reported in this work is publicly available at: https://github.com/CoAxLab/CBGT.

STDP network
Here we outline the general STDP model and simulation procedures. See [30] for full details of the model implementation.
To set the corticostriatal synaptic weights across reward conditions, we consider a simplified computational model of the striatum consisting of two different populations that receive different inputs from the cortex (see Fig 1, left). Although they do not interact directly, they compete with each other to be the first to select a corresponding action. In this model, each striatal population contains two different types of units: (i) dMSNs, which facilitate action selection, and (ii) iMSNs, which suppress action selection. Each of these neurons is represented with the exponential integrate-and-fire model [66]. The inputs from the cortex to each MSN neuron within a population are generated using a collection of oscillatory Poisson processes. Each of these cortical spike trains, which we refer to as daughters, is generated from a baseline oscillatory Poisson process, which we call the mother train. We consider two different mother trains to generate the cortical daughter spike trains for the two different MSN populations. Each dMSN neuron or iMSN neuron receives input from a distinct cortical daughter train, with the corresponding transfer probabilities p_D and p_I, respectively. As shown in [67], the cortex to iMSN release probability exceeds that of cortex to dMSN. Hence, we set p_D < p_I. Values of these and other parameters are selected based on properties of striatal firing determined by experimental work [68][69][70][71][72][73][74], on fits of network behavior to qualitative benchmarks from behavioral experiments involving action selection in a probabilistic reward environment [20], and on other computational papers involving related model components [26,66].
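The mother/daughter construction can be sketched by thinning a single mother Poisson train: each mother spike is passed to each daughter independently with the population's transfer probability. This sketch simplifies the mother trains to homogeneous (non-oscillatory) Poisson processes, and the rates and probabilities below are illustrative; the only constraint taken from the text is p_D < p_I.

```python
import numpy as np

def daughter_trains(rate_hz, t_max_s, n_daughters, p_transfer, rng):
    """Generate daughter spike trains by thinning one mother Poisson
    train: each mother spike transfers to each daughter independently
    with probability p_transfer. Homogeneous sketch of the paper's
    oscillatory mother trains."""
    n_spikes = rng.poisson(rate_hz * t_max_s)
    mother = np.sort(rng.uniform(0.0, t_max_s, n_spikes))
    return [mother[rng.uniform(size=mother.size) < p_transfer]
            for _ in range(n_daughters)]

# illustrative values only; the constraint from the text is p_D < p_I
rng = np.random.default_rng(1)
p_D, p_I = 0.3, 0.5
d_inputs = daughter_trains(40.0, 15.0, 10, p_D, rng)  # dMSN cortical drive
i_inputs = daughter_trains(40.0, 15.0, 10, p_I, rng)  # iMSN cortical drive
```

Because p_D < p_I, iMSN daughter trains carry more spikes on average than dMSN daughter trains, while daughters within a population remain correlated through their shared mother train.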
These model components were coupled with an action selection criterion based on MSN spike timing, a simulated dopamine signal reflecting the comparison between actual and expected reward received upon the implementation of an action, and a learning rule for updating corticostriatal synaptic weights based on an eligibility trace determined using an STDP rule and on dopamine level. The general framework builds on previous work [23,26,75]. The idea of this plasticity scheme is that an eligibility trace (cf. [76]) represents a neuron's recent spiking history and hence its eligibility to have its synapses modified, with changes in eligibility following an STDP rule that depends on both the pre- and post-synaptic firing times. Plasticity of corticostriatal synaptic weights depends on this eligibility together with dopamine levels, which in turn depend on the reward consequences that follow neuronal spiking. One new feature is that the synaptic weight update scheme depends on the type of MSN neuron involved in the synapse, consistent with the observations that dMSNs tend to have less activity than iMSNs at resting states [68,69,71,73] and are more responsive to phasic changes in dopamine than iMSNs, which are largely saturated by tonic dopamine.
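The core of this scheme, an eligibility trace shaped by spike timing and converted into a weight change only when phasic dopamine arrives, can be sketched as follows. The time constants, learning rate, and trace amplitudes are illustrative, and the sketch omits the MSN-type dependence of the update rule described above.

```python
import numpy as np

class DAStdpSynapse:
    """Minimal dopamine-gated STDP synapse (illustrative constants).
    Spike pairings shape an eligibility trace; a phasic dopamine
    signal converts eligibility into an actual weight change."""

    def __init__(self, w=0.5, tau_pre=20.0, tau_post=20.0,
                 tau_e=1000.0, lr=0.01):
        self.w, self.e = w, 0.0
        self.x_pre, self.x_post = 0.0, 0.0
        self.tau_pre, self.tau_post = tau_pre, tau_post
        self.tau_e, self.lr = tau_e, lr

    def step(self, dt, pre_spike=False, post_spike=False, dopamine=0.0):
        # passive exponential decay of spike and eligibility traces (dt in ms)
        self.x_pre *= np.exp(-dt / self.tau_pre)
        self.x_post *= np.exp(-dt / self.tau_post)
        self.e *= np.exp(-dt / self.tau_e)
        if pre_spike:
            self.x_pre += 1.0
            self.e -= self.x_post   # post-before-pre pairing: depressing eligibility
        if post_spike:
            self.x_post += 1.0
            self.e += self.x_pre    # pre-before-post pairing: potentiating eligibility
        # dopamine gates the conversion of eligibility into a weight change
        self.w = max(0.0, self.w + self.lr * dopamine * self.e)

# pre spike, post spike 5 ms later, then a phasic dopamine pulse -> potentiation
syn = DAStdpSynapse()
syn.step(1.0, pre_spike=True)
syn.step(5.0, post_spike=True)
syn.step(1.0, dopamine=1.0)
```

Reversing the pairing order (post before pre) leaves negative eligibility, so the same positive dopamine pulse instead depresses the weight.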
Each dMSN facilitates performance of a specific action. We specify that an action occurs, and so a decision is made by the model, when at least three different dMSNs of the same population spike in a small time window of specified duration. When this condition occurs, a reward is delivered and the dopamine level is updated correspondingly based on comparison of the reward level to the maximal action value learned so far [77]; action values are updated via standard Q-learning. The updated dopamine level impacts all neurons in the network, depending on eligibility. Then, the spike counting and the initial window time are reset, and cortical spikes to all neurons are turned off over the next 50 ms before resuming again as usual. We assume that iMSN activity within a population counters the performance of the action associated with that population [78]. We implement this effect by specifying that when an iMSN in a population fires, the most recent spike fired by a dMSN in that population is suppressed. Note that this rule need not contradict the observed activation of both dMSNs and iMSNs preceding a decision [32]; see Subsection Tuning corticostriatal synaptic weights. For convenience, we refer to the action implemented by one population of neurons as "left" or L and the action selected by the other population as "right" or R. Based on action selection, reward delivery is probabilistic: every time that an action occurs, the reward r_i is set to 1 with probability p_i or 0 otherwise, for i ∈ {L, R}. We consider three different probability pairs such that p_L + p_R = 1 and p_L > p_R, keeping the action L as the preferred one. Specifically, we take p_L = 0.85, 0.75, or 0.65 to allow comparison with previous results [20].
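The reward and value-update portion of this rule can be sketched on its own. In the sketch below, the spike-based action selection is replaced by a random choice for brevity, and the Q-learning rate alpha is an assumed placeholder; the dopamine signal follows the rule above, i.e., delivered reward minus the maximal action value learned so far.

```python
import numpy as np

def run_bandit(p_L, n_trials, alpha=0.05, rng=None):
    """Probabilistic reward delivery with Q-learned action values.
    Action choice here is random; in the full model it is determined
    by which population's dMSNs spike first. alpha is illustrative."""
    rng = rng or np.random.default_rng()
    q = {"L": 0.0, "R": 0.0}
    p = {"L": p_L, "R": 1.0 - p_L}
    da_trace = []
    for _ in range(n_trials):
        action = rng.choice(["L", "R"])
        reward = 1.0 if rng.uniform() < p[action] else 0.0
        dopamine = reward - max(q.values())        # phasic DA signal
        q[action] += alpha * (reward - q[action])  # standard Q-learning
        da_trace.append(dopamine)
    return q, da_trace

q, da = run_bandit(p_L=0.85, n_trials=2000, rng=np.random.default_rng(2))
```

Over trials the action values approach the underlying reward probabilities, so dopamine responses to rewarded L choices shrink as that outcome becomes expected.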
The network is integrated computationally by using the Runge-Kutta (4,5) method in Matlab (ode45) with time step δt = 0.01 ms. Different realizations lasting 15 s are computed to simulate variability across different subjects in a learning scenario.
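For readers working in Python, the same Runge-Kutta (4,5) scheme is available through SciPy. The right-hand side below is a stand-in leaky membrane equation, not the exponential integrate-and-fire model itself, and serves only to show the solver configuration; `max_step` mirrors the δt = 0.01 ms step.

```python
import numpy as np
from scipy.integrate import solve_ivp

def leak_rhs(t, y, tau=20.0, v_rest=-70.0):
    """Stand-in leaky membrane dynamics: dV/dt = (V_rest - V)/tau, mV/ms."""
    return (v_rest - y) / tau

# method="RK45" is SciPy's analogue of Matlab's ode45
sol = solve_ivp(leak_rhs, t_span=(0.0, 100.0), y0=[-55.0],
                method="RK45", max_step=0.01)
```

After 100 ms (five membrane time constants), the solution has relaxed to within about 0.1 mV of the resting potential.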

CBGT network
The spiking CBGT network is adapted from previous work [28]. Like the STDP model described above, the CBGT network simulation is designed to decide between two actions, a left or right choice, based on incoming sensory signals (Fig 1). The full CBGT network comprised six interconnected brain regions (see Table 3), including populations of neurons in the cortex, striatum (STR; medium spiny neurons, MSNs), external segment of the globus pallidus (GPe), internal segment of the globus pallidus (GPi), subthalamic nucleus (STN), and thalamus. Because the goal of the full spiking network simulations was to probe the consequential effects of corticostriatal plasticity on the functional dynamics and emergent choice behavior of CBGT networks after learning has already occurred, CBGT simulations were conducted in the absence of any trial-to-trial plasticity and did not include dopaminergic projections from the substantia nigra pars compacta. Rather, corticostriatal weights were manipulated to capture the outcomes of STDP learning as simulated with the learning network (Subsection STDP network) under three different probabilistic feedback schedules (see Table 4), each maintained across all trials for that condition (N = 2500 trials each). Table 3. Synaptic efficacy (g) and probability (P) of connections between populations in the CBGT network, as well as postsynaptic receptor types (AMPA, NMDA, and GABA). The topology of each connection is labeled as either diffuse, to denote connections with a P > 0 of projecting to left and right action channels, or focal, to denote connections that were restricted to within each channel.

Neural dynamics. To build on previous work on a two-alternative decision-making task with a similar CBGT network and to endow neurons in some BG populations with bursting capabilities, all neural units in the CBGT network were simulated using the integrate-and-fire-or-burst model [79]. Each neuron's membrane dynamics were determined by:

C dV/dt = −g_L (V − V_L) − g_T h H(V − V_h)(V − V_T) − I_syn. (3)

In Eq 3, parameter values are C = 0.5 nF, g_L = 25 nS, V_L = −70 mV, V_h = −0.60 mV, and V_T = 120 mV. When the membrane potential reaches a boundary V_b, it is reset to V_r. We take V_b = −50 mV and V_r = −55 mV. The middle term on the right hand side of Eq 3 represents a depolarizing, low-threshold T-type calcium current that becomes available when h grows and when V is depolarized above a level V_h, since H(V) is the Heaviside step function. For neurons in the cortex, striatum (both MSNs and FSIs), GPi, and thalamus, we set g_T = 0, thus reducing the dynamics to the simple leaky integrate-and-fire model. For bursting units in the GPe and STN, rebound burst firing is possible, with g_T set to 0.06 nS for both nuclei. The inactivation variable, h, adapts over time, decaying when V is depolarized and rising when V is hyperpolarized according to the following equations:

dh/dt = −h / τ_h^−, for V > V_h, (4)
dh/dt = (1 − h) / τ_h^+, for V ≤ V_h, (5)

with τ_h^− = 20 ms and τ_h^+ = 100 ms for both GPe and STN. For all units in the model, the synaptic current I_syn reflects both the synaptic inputs from other explicitly modeled populations of neurons within the CBGT network, as well as additional background inputs from sources that are not explicitly included in the model. This current is computed using the equation

I_syn = g_1 s_1 (V − V_E) + g_2 s_2 (V − V_E) + g_3 s_3 (V − V_I). (6)

The reversal potentials are set to V_E = 0 mV and V_I = −70 mV. The synaptic current components correspond to AMPA (g_1), NMDA (g_2), and GABA_A (g_3) synapses. The gating variables s_i for AMPA and GABA_A receptor-mediated currents satisfy:

ds_i/dt = −s_i / τ + Σ_j δ(t − t_j), (7)

while NMDA receptor-mediated current gating obeys:

ds_i/dt = −s_i / τ + α (1 − s_i) Σ_j δ(t − t_j). (8)

In Eqs 7 and 8, t_j is the time of the j-th spike and α = 0.63. The decay constant, τ, was 2 ms for AMPA, 5 ms for GABA_A, and 100 ms for NMDA-mediated currents. A time delay of 0.2 ms was used for synaptic transmission.

Network architecture. The CBGT network includes six of the nodes shown in Fig 1, excluding the dopaminergic projections from the substantia nigra pars compacta that are simulated in the STDP model.
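A minimal forward-Euler step of these membrane dynamics can be written as follows. Parameter defaults follow the values quoted above; the time step, demo current, and function name are illustrative, and with g_T = 0 the update reduces to leaky integrate-and-fire.

```python
import numpy as np

def ifb_step(V, h, I_syn, dt=0.1, C=0.5, g_L=25.0, V_L=-70.0,
             g_T=0.0, V_h=-0.60, V_T=120.0,
             tau_h_minus=20.0, tau_h_plus=100.0,
             V_b=-50.0, V_r=-55.0):
    """One Euler step (dt in ms) of the integrate-and-fire-or-burst
    model; C in nF, conductances in nS, voltages in mV, I_syn in pA."""
    t_current = g_T * h * np.heaviside(V - V_h, 0.0) * (V - V_T)
    dVdt = (-g_L * (V - V_L) - t_current - I_syn) / C   # pA/nF = mV/s
    V = V + dVdt * dt * 1e-3                            # convert ms -> s
    # h decays when depolarized above V_h, recovers when hyperpolarized
    if V > V_h:
        h = h - dt * h / tau_h_minus
    else:
        h = h + dt * (1.0 - h) / tau_h_plus
    spiked = V >= V_b
    if spiked:
        V = V_r                                         # reset at boundary
    return V, h, spiked

# leaky relaxation demo (g_T = 0): V decays from -55 mV toward V_L = -70 mV
V, h = -55.0, 0.0
for _ in range(2000):                                   # 200 ms at dt = 0.1 ms
    V, h, _ = ifb_step(V, h, I_syn=0.0)
```

With no synaptic input the membrane relaxes to V_L with the 20 ms time constant set by C/g_L, while a sustained depolarizing current drives repeated threshold crossings and resets.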
The membrane dynamics, projection probabilities, and synaptic weights of the network (see Table 3) were adjusted to reflect empirical knowledge about local and distal connectivity associated with different populations, as well as resting and task-related firing patterns [28,61].
The cortex included separate populations of neurons representing sensory information for L (N = 270) and R (N = 270) actions that approximate the processing in the intraparietal cortex or frontal eye fields. On each trial, L and R cortical populations received excitatory inputs from an external source, sampled from a truncated normal distribution with a mean and standard deviation of 2.5 Hz and 0.06, respectively, with lower and upper limits of 2.4 Hz and 2.6 Hz. Critically, L and R cortical populations received the same strength of external stimulation on each trial to ensure that any observed behavioral effects across conditions were not the result of biased cortical input. Excitatory cortical neurons also formed lateral connections with other cortical neurons with a diffuse topology, or a non-zero probability of projecting to recipient neurons within and between action channels (see Table 3 for details). The cortex also included a single population of inhibitory interneurons (CtxI; N = 250 total) that formed reciprocal connections with left and right sensory populations. Along with external inputs, cortical populations received diffuse ascending excitatory inputs from the thalamus (Th; N = 100 per input channel).
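The per-trial external drive described above can be reproduced with simple rejection sampling from the truncated normal distribution; the function name is illustrative, and both L and R populations would receive the same draw on a given trial.

```python
import numpy as np

def sample_cortical_drive(n_trials, mean=2.5, sd=0.06, lo=2.4, hi=2.6,
                          rng=None):
    """Draw per-trial external input rates (Hz) from a truncated normal
    via rejection sampling, matching the stimulation parameters above."""
    rng = rng or np.random.default_rng()
    out = np.empty(n_trials)
    for i in range(n_trials):
        x = rng.normal(mean, sd)
        while not (lo <= x <= hi):   # resample until inside [lo, hi]
            x = rng.normal(mean, sd)
        out[i] = x
    return out

rates = sample_cortical_drive(2500, rng=np.random.default_rng(3))
```

The draws stay within the 2.4-2.6 Hz limits while averaging 2.5 Hz, so trial-to-trial variability in drive strength cannot systematically favor either action channel.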
L and R cortical populations projected to dMSN (N = 100/channel) and iMSN (N = 100/ channel) populations in the corresponding action channel; that is, cortical signals for a L action projected to dMSN and iMSN cells selective for L actions. Both cortical populations also targeted a generic population of FSIs (N = 100 total) providing widespread but asymmetric inhibition to MSNs, with stronger FSI-dMSN connections than FSI-iMSN connections [80]. Within each channel, dMSN and iMSN populations also formed recurrent and lateral inhibitory connections, with stronger inhibitory connections from iMSN to dMSN populations [80]. Striatal MSN populations also received channel-specific excitatory feedback from corresponding populations in the thalamus. Inhibitory efferent projections from the iMSNs terminated on populations of cells in the GPe, while the inhibitory efferent connections from the dMSNs projected directly to the GPi.
In addition to the descending inputs from the iMSNs, the GPe neurons (N = 1000/channel, as in past work [28], reflecting divergence of connections from striatum to GPe [81,82]) received excitatory inputs from the STN. GPe cells also formed recurrent, within-channel inhibitory connections that supported stability of activity. Inhibitory efferents from the GPe terminated on corresponding populations in the STN (i.e., long indirect pathway) and GPi (i.e., short indirect pathway). Both the projections from GPe to STN and the feedback projections from STN to GPe were sparse relative to other connection profiles in the network [83]. We did not include arkypallidal projections (i.e., feedback projections from GPe to the striatum; [84]) as it is not currently well understood how this pathway contributes to basic choice behavior.
STN populations were composed of burst-capable neurons (N = 1000/channel as in GPe [28], reflecting low synaptic divergence of GPe inputs to STN [83]) with channel-specific inhibitory inputs from the GPe as well as excitatory inputs from cortex (the hyperdirect pathway). Since no cancellation signals were modeled in the experiments (see Subsection Simulations of experimental scenarios), the hyperdirect pathway was simplified to background input to the STN. Unlike the striatal MSNs and the GPe, the STN did not feature recurrent connections. Excitatory feedback from the STN to the GPe was assumed to be sparse but channel-specific, whereas projections from the STN to the GPi were channel-generic and caused diffuse excitation in both L- and R-encoding populations.
Populations of cells in the GPi (N = 100/channel) received inputs from three primary sources: channel-specific inhibitory afferents from dMSNs in the striatum (i.e., direct pathway) and the corresponding population in the GPe (i.e., short indirect pathway), as well as excitatory projections from the STN shared across channels (i.e., long indirect and hyperdirect pathways; see Table 3). The GPi did not include recurrent feedback connections. All efferents from the GPi consisted of inhibitory projections to the motor thalamus. The efferent projections were segregated strictly into pathways for L and R actions.
Finally, L- and R-encoding populations in the thalamus were driven by two primary sources of input, integrating channel-specific inhibitory inputs from the GPi and diffuse (i.e., channel-spanning) excitatory inputs from cortex. Outputs from the thalamus delivered channel-specific excitatory feedback to corresponding dMSN and iMSN populations in the striatum as well as diffuse excitatory feedback to cortex.
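The projection topology described in this section can be summarized in compact form (an illustrative sketch only; the "channel" vs. "diffuse" labels follow the text above, while population sizes, connection probabilities, and conductances from Table 3 are omitted):

```python
# Sketch of CBGT projection topology: each (source, target) pair is labeled
# "channel" (channel-specific) or "diffuse" (spanning both action channels).
CBGT_PROJECTIONS = {
    ("Ctx", "dMSN"): "channel",   # excitatory
    ("Ctx", "iMSN"): "channel",   # excitatory
    ("Ctx", "FSI"): "diffuse",    # excitatory
    ("FSI", "dMSN"): "diffuse",   # inhibitory, stronger than FSI-iMSN
    ("FSI", "iMSN"): "diffuse",   # inhibitory
    ("dMSN", "GPi"): "channel",   # direct pathway, inhibitory
    ("iMSN", "GPe"): "channel",   # indirect pathway, inhibitory
    ("GPe", "STN"): "channel",    # long indirect pathway, inhibitory, sparse
    ("GPe", "GPi"): "channel",    # short indirect pathway, inhibitory
    ("STN", "GPe"): "channel",    # excitatory feedback, sparse
    ("STN", "GPi"): "diffuse",    # excitatory, shared across channels
    ("GPi", "Th"): "channel",     # inhibitory
    ("Th", "dMSN"): "channel",    # excitatory feedback
    ("Th", "iMSN"): "channel",    # excitatory feedback
    ("Th", "Ctx"): "diffuse",     # excitatory feedback
}
```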
Simulations of experimental scenarios. Because the STDP simulations did not reveal strong differences in Ctx-iMSN weights across reward conditions, only Ctx-dMSN weights were manipulated across conditions in the full CBGT network simulations. In all conditions the Ctx-dMSN weights were higher in the left (higher/optimal reward probability) than in the right (lower/suboptimal reward probability) action channel (see Table 4). On each trial, external input was applied to L- and R-encoding cortical populations, each projecting to corresponding populations of dMSNs and iMSNs in the striatum, as well as to a generic population of FSIs. Critically, all MSNs also received input from the thalamus, which was reciprocally connected with cortex. Due to the suppressive effects of FSI activity on MSNs, sustained input from both cortex and thalamus was required to raise the firing rates of striatal projection neurons to levels sufficient to produce an action output. Due to the convergence of dMSN and iMSN inputs in the GPi, and their opposing influence over BG output, co-activation of these populations within a single action channel served to delay action output until activity within the direct pathway sufficiently exceeded the opposing effects of the indirect pathway [28]. The behavioral choice, as well as the time of that decision (i.e., the RT), were determined by a winner-take-all rule, with the first action channel to cause the average firing rate of its thalamic population to rise above a threshold of 30 Hz being selected.
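The winner-take-all readout can be sketched as follows (a simplified stand-in operating on pre-computed average thalamic firing-rate traces; function names and the sampling step are ours):

```python
import numpy as np

def winner_take_all(th_L, th_R, dt=1.0, threshold=30.0):
    """Return (choice, RT) given average thalamic firing-rate traces (Hz)
    for the L and R channels, sampled every dt ms. The first channel whose
    rate crosses threshold determines the choice; RT is the crossing time.
    Returns (None, None) if neither channel crosses within the trace."""
    cross_L = np.argmax(th_L >= threshold) if np.any(th_L >= threshold) else np.inf
    cross_R = np.argmax(th_R >= threshold) if np.any(th_R >= threshold) else np.inf
    if cross_L == np.inf and cross_R == np.inf:
        return None, None
    if cross_L <= cross_R:
        return 'L', cross_L * dt
    return 'R', cross_R * dt
```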

Drift diffusion model
To understand how altered corticostriatal weights influence decision-making behavior, we fit the simulated behavioral data from the CBGT network with a DDM [1,85] and compared alternative models in which different parameters were allowed to vary across reward probability conditions. The DDM is an established model of simple two-alternative choice behavior, providing a parsimonious account of both the speed and accuracy of decision-making in humans and animal subjects across a wide variety of binary choice tasks [85]. Importantly, this accumulation-to-bound process affords predictions about the average accuracy, as well as the distribution of response times, under a given set of model parameters.
Parameter optimization. The DDM assumes that evidence is stochastically accumulated as the log-likelihood ratio for two competing decision outcomes. Evidence is tracked by a single decision variable θ until reaching one of two decision thresholds, representing the evidence criterion for committing to a choice. The dynamics of θ are given by

dθ = v·dt + σ·dW,

where v is the mean strength of the evidence and σ is the standard deviation of a white noise process W, representing the degree of noise in the accumulation process. The choice and RT on each trial are determined by the first passage of θ through one of the two decision boundaries (a, 0). In this formulation, θ remains fixed at a predefined starting point z/a ∈ [0, 1] until time t_r, resulting in an unbiased evidence accumulation process when z = a/2. In perceptual decision tasks, v reflects the signal-to-noise ratio of the stimulus. However, in a value-based decision task, v can be taken to reflect the difference between Q-values for the left and right actions. Thus, an increase (decrease) in Q_L − Q_R from 0 would correspond to a proportional increase (decrease) in v, leading to more rapid and frequent terminations of θ at the upper (lower) boundary a (0). Fits of the DDM were performed using HDDM, an open-source Python package for Bayesian estimation of DDM parameters (see [86] for details). Each model was fit by drawing 2000 Markov chain Monte Carlo (MCMC) samples from the joint posterior probability distribution over all parameters, with acceptance based on the likelihood of the observed accuracy and RT data given each parameter set (see below). A burn-in period of 1200 samples was implemented to ensure that model selection was not influenced by samples drawn prior to convergence.
Sampling chains were also visually inspected for signs of convergence failure; however, parameters in all models showed normally distributed posterior distributions with little autocorrelation between samples suggesting that sampling parameters were sufficient for convergence.
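The accumulation process defined above can be illustrated with a simple Euler-Maruyama simulation of a single DDM trial (a sketch, not the HDDM fitting code; parameter defaults and function names are ours):

```python
import numpy as np

def simulate_ddm_trial(v, a, z=0.5, tr=0.3, sigma=1.0, dt=1e-3, rng=None, max_t=10.0):
    """Simulate one DDM trial via Euler-Maruyama: d(theta) = v*dt + sigma*dW.
    theta starts at z*a (z is the relative start point) after non-decision
    time tr, and is absorbed at a (upper, choice 1) or 0 (lower, choice 0).
    Returns (choice, RT); (None, None) if no crossing before max_t."""
    rng = rng if rng is not None else np.random.default_rng()
    theta = z * a
    t = 0.0
    while t < max_t:
        theta += v * dt + sigma * np.sqrt(dt) * rng.normal()
        t += dt
        if theta >= a:
            return 1, tr + t   # upper boundary crossed
        if theta <= 0.0:
            return 0, tr + t   # lower boundary crossed
    return None, None
```

With a positive drift, most simulated trials terminate at the upper boundary, and RTs shift earlier as v grows, matching the qualitative behavior described above.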
For each step in the MCMC sampling procedure, the likelihood that the first boundary crossing occurs at the lower boundary at the observed RT t under the sampled parameters was estimated using the closed-form solution proposed by Navarro and Fuss [87] (large-time series representation, with σ normalized to 1):

f(t | v, a, z) = (π / a²) exp(−vz − v²t/2) Σ_{k=1}^{∞} k exp(−k²π²t / (2a²)) sin(kπz / a). (10)

The analogous expression for the upper boundary is obtained from Eq (10) by replacing each instance of z with a − z and replacing drift rate v with −v. Finally, the resulting likelihood was multiplied by a prior distribution to yield the updated joint posterior over all parameters (see [86] for additional details regarding parameter optimization in HDDM).
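As an illustration, the large-time series for the lower-boundary first-passage density described by Navarro and Fuss (σ normalized to 1) can be evaluated as follows (a truncated-series sketch; the fixed truncation point k_max is ours, not the adaptive truncation criterion of [87]):

```python
import numpy as np

def fpt_density_lower(t, v, a, z, k_max=200):
    """Large-time series for the Wiener first-passage density at the lower
    boundary (sigma = 1): drift v, boundary separation a, absolute starting
    point z in (0, a). Truncated at k_max terms."""
    k = np.arange(1, k_max + 1)
    series = np.sum(k * np.exp(-k**2 * np.pi**2 * t / (2 * a**2))
                    * np.sin(k * np.pi * z / a))
    return (np.pi / a**2) * np.exp(-v * z - v**2 * t / 2) * series
```

Integrating this density over t recovers the probability of terminating at the lower boundary, which for drift v, separation a, and start z equals (e^(−2vz) − e^(−2va)) / (1 − e^(−2va)).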
Model comparison. To narrow the subset of possible DDM models considered, DDM fits to the CBGT model behavior were conducted in three stages using a forward stepwise selection process. First, we compared models in which a single parameter in the DDM was free to vary across reward conditions. For these simulations all the DDM parameters were tested. Next, additional model fits were performed with the best-fitting model from the previous stage, but with the addition of a second free parameter. Finally, the two best-fitting dual parameter models were submitted to a final round of fits in which trialwise measures of striatal activity (see Fig 4B and 4C) were included as regressors on the two designated parameters of the DDM. All CBGT regressors were normalized between values of 0 and 1. Each regression model included one regression coefficient capturing the linear effect of a given measure of neural activity on one of the free parameters (e.g., a, v, or z), as well as an intercept term for that parameter, resulting in a total of four free parameters per selected DDM parameter, or eight free parameters altogether. For example, in a model where drift rate is estimated as a function of the difference between dMSN firing rates in the left and right action channels, the drift rate on trial t in condition j is given by

v_j(t) = β_{0,j} + β_j · X_j(t),

where β_{0,j} is the intercept, β_j is the beta coefficient for reward condition j, and X_j(t) is the observed difference in dMSN firing rates between action channels on trial t in condition j; see also Eqs (1) and (2) in Section CBGT-DDM mapping. A total of 24 separate regression models were fit, testing all possible combinations between the two best-fitting dual parameter models and the four measures of striatal activity summarized in Fig 4B and 4C.
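The linear mapping from normalized striatal activity to a trialwise DDM parameter can be sketched as follows (an illustrative stand-in; min-max normalization to [0, 1] follows the text, while function names are ours):

```python
import numpy as np

def normalize01(x):
    """Min-max normalize a neural regressor to the range [0, 1]."""
    x = np.asarray(x, float)
    return (x - x.min()) / (x.max() - x.min())

def trialwise_param(intercept, beta, regressor):
    """Trialwise DDM parameter (e.g., v or a) as a linear function of a
    normalized neural regressor: intercept + beta * x(t)."""
    return intercept + beta * normalize01(regressor)
```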

Network sampling procedure
To assess the robustness of our findings to alterations in the connection weights used in the CBGT network, we repeated the simulation and fitting procedures described in sections Simulations of experimental scenarios and Drift diffusion model, respectively, with additional variability in the parameterization of CBGT connection strengths. The functional outputs of the spiking CBGT network depend on two critical parameters:
1. P_ij: the probability that a cell in population i projects to a cell in population j;
2. g_ij: the conductance strength associated with inputs from cells in population i to cells in population j.
The product of these two values can be taken to represent the weight (ω_ij) of connections from population i to population j. The core simulations in this work were performed using a single ω_ij for each connection that was held constant for all simulated trials. We also introduced variability in ω_ij in additional simulations to explore the robustness of our results to variability in the underlying network connectivity.
For our robustness check, rather than simulating 2500 trials per condition with a single set of CBGT connectivity parameters, we generated multiple "subject" networks, simulating 200 trials per condition with each. For each subject k (N = 15), g(k)_ij was independently sampled from a normal distribution with mean μ_g,ij and standard deviation σ_g,ij = 0.05·μ_g,ij, i.e., g(k)_ij ~ N(μ_g,ij, σ_g,ij). For each connection included in the random sampling procedure, μ_g,ij was set equal to the value of g_ij in Table 3. The total strength of the connection from population i to population j in subject network k was given by ω(k)_ij = P_ij · g(k)_ij.
As in the original single-network simulations, the dopaminergic effects on synaptic strengths of corticostriatal inputs to left (L, optimal) and right (R, suboptimal) action channels were determined by scaling the baseline value of g_Ctx-MSN by the proportional change in synaptic strength ϕ observed in the STDP simulations in reward condition C (see Table 4). Thus, for subject network k, the weight of Ctx-dMSN connections in condition C was given by

ω_D(k, C) = P_D · g_D(k) · ϕ(C),

where P_D is the connection probability of cortical inputs to L and R dMSN populations (see Table 3), g_D(k) is the sampled synaptic efficacy for subject network k, and ϕ(C) is the proportional change due to DA-mediated STDP in condition C indicated in Table 4. Because Ctx-iMSN weights were less sensitive to dopaminergic feedback in STDP simulations, manipulations of corticostriatal weights across levels of reward were restricted to Ctx-dMSN connections.
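The subject-network sampling and weight scaling can be sketched together as follows (a minimal illustration; argument names are ours, and ϕ = 1 would apply to connections outside the corticostriatal manipulation):

```python
import numpy as np

def sample_subject_weights(mu_g, P, phi, n_subjects=15, cv=0.05, seed=0):
    """Sample per-subject synaptic efficacies g(k) ~ N(mu_g, cv*mu_g) for one
    connection type, then return the total weights w(k) = P * g(k) * phi,
    where P is the connection probability and phi is the condition-specific
    STDP scaling factor (phi = 1 for unmanipulated connections)."""
    rng = np.random.default_rng(seed)
    g = rng.normal(loc=mu_g, scale=cv * mu_g, size=n_subjects)
    return P * g * phi
```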

Hierarchical fits to sampled networks
To confirm that variability in the underlying connection weights of the network did not alter the upwards mapping between CBGT and DDM parameters, all single and dual static parameter models and regression models I-XII were re-fit to the data generated by the sampled networks. In contrast to the original model fits, which were performed on data from a single network, model fits to the sampled network data were performed hierarchically. Hierarchical model fits allow for simultaneous and conditional estimation of parameters at the individual and group levels. At the individual level, samples were drawn from the joint posterior distribution given each individual subject's data. At the group level, these individual subject samples were aggregated to derive estimates of the mean and variance for each parameter. Finally, these estimates of group-level summary statistics served as prior distributions at the subject level, reducing the weight of outlier subjects that deviate significantly from the group. All single and dual parameter models were defined in the same way as those fit to the original network, but with all free parameters estimated for each individual network and with corresponding estimates of the mean and variance of each parameter at the group level. Regression DDM fits were also performed hierarchically, with a random intercept for drift rate and boundary height estimated for each individual subject network, and with regression weights estimated as fixed effects (i.e., shared across subject networks at the group level).
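The shrinkage effect of the hierarchical fits can be illustrated with a simplified empirical-Bayes stand-in (this is not the conditional MCMC scheme used by HDDM, only a sketch of how group-level statistics down-weight outlier subjects):

```python
import numpy as np

def shrink_to_group(subject_means, subject_vars):
    """Shrink per-subject parameter estimates toward the group mean, weighting
    each subject by relative precision: low-variance group estimates pull
    noisy or outlying subjects more strongly toward the group mean."""
    subject_means = np.asarray(subject_means, float)
    group_mean = np.mean(subject_means)
    group_var = np.var(subject_means)
    w = group_var / (group_var + np.asarray(subject_vars, float))
    return w * subject_means + (1 - w) * group_mean
```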