A potential reset mechanism for the modulation of decision processes under uncertainty

Humans and other mammals flexibly select actions in noisy, uncertain contexts, quickly using feedback to adapt their decision policies to either explore other options or to exploit what they know. Drawing inspiration from the plasticity of cortico-basal ganglia-thalamic circuitry, we recently developed a cognitive model of decision-making that uses both a value-driven learning signal to update an internal estimate of state action-value (i.e., conflict in the probability of reward between two choices) and a change-point-driven learning signal that adapts to changes in reward contingencies (i.e., a previously high value target becoming devalued). In this work, we expand on previous results from our group (Bond, Dunovan, & Verstynen, 2018) to more carefully detail how these environmental signals drive changes in the decision process. Across nine separate behavioral testing sessions, we independently manipulated the level of value-conflict and volatility in action-outcome contingencies. Using a hierarchical drift diffusion model, we found that the belief in the value difference between options had the greatest influence on decision processes, impacting drift rate, while estimates of environmental change had a smaller, but detectable influence on the decision threshold. Taken together, these findings bolster our previous work showing how separate environmental signals impact different aspects of the decision algorithm.


Introduction
In natural contexts, successful behavior in a dynamic environment requires making fast, accurate decisions and updating those decisions based on an internal model of the state of the environment. Drawing inspiration from the computational architecture of cortico-basal ganglia-thalamic circuitry (Dunovan & Verstynen, 2016), we previously proposed a cognitive model that 1) updates the rate of evidence accumulation using estimates of value differences between possible actions and 2) updates the threshold of decision processes using estimates of change point probability. Using an adaptive decision-making algorithm that unifies drift diffusion models and reinforcement learning (Dunovan & Verstynen, 2017; Pedersen, Frank, & Biele, 2017), we modeled decision processes under more expansive conditions of value-conflict, or the proximity of the probability of reward between two choices, and feedback volatility, or the instability of action-value associations. We sought to replicate our previously observed effects showing that value-conflict decreases the rate of evidence accumulation and that volatility in action-value associations decreases the amount of evidence needed to make a decision. For this replication, we adopted a high-power, within-subject design (N = 4 subjects, 3600 trials/subject) in which we independently manipulated the degree of conflict in reward values and volatility in action-outcome contingencies.

Task
Participants. Four participants were recruited from the Paid Psychology Subject Pool and the local community. They were paid $10 per session in addition to a performance bonus. These experiments were approved by the Institutional Review Board at Carnegie Mellon University.
Stimuli and procedure. Each participant completed nine sessions of 400 trials each, for a total of 3600 trials per subject. Data were collected from four participants in accordance with a replication-based design, with each participant serving as a replication experiment. Participants completed these sessions across three weeks in randomized order. Each trial presented a male and a female greeble (Gauthier & Tarr, 1997), and the goal was to select the sex identity of the most profitable greeble. Individual greeble identities were resampled on each trial; thus, the task of the participant was to choose the most rewarding sex identity rather than the most rewarding individual greeble (Figure 1). Probabilistic reward feedback was given in the form of points drawn from the normal distribution N(µ = 3, σ = 1), and these points were displayed at the center of the screen. Participants began with 200 points and lost one point for each incorrect decision. To promote incentive compatibility, participants earned one cent for every point. If participants responded in less than 0.1 s, in more than 1 s, or failed to respond altogether, the point total turned red and decreased by 5 points. Each trial lasted 1.5 s, and reward feedback for a given trial was displayed from the participant response to the end of this window. Reaction time was constrained such that participants were required to respond between 0.1 and 0.75 s from stimulus presentation.
To manipulate change point probability, the sex identity of the most rewarding greeble was switched probabilistically. To manipulate the belief in the value of the optimal target, the probability of reward for the optimal target was manipulated. Further, the position of the high-value target was pseudorandomized on each trial to prevent prepotent response selections on the basis of location.
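The two manipulations described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the reward probability of the optimal target (`p_optimal`), the per-trial switch probability (`hazard`), the complementary reward probability for the other target, and the convention that unrewarded outcomes yield zero points are all hypothetical choices, not values taken from the study. The point penalties described in the procedure are omitted.

```python
import numpy as np

def simulate_session(n_trials=400, p_optimal=0.75, hazard=0.05, seed=0):
    """Simulate reward contingencies for one session.

    p_optimal and hazard are hypothetical illustration values.
    Returns the optimal target per trial, a change-point flag per trial,
    and the points available for each of the two targets.
    """
    rng = np.random.default_rng(seed)
    optimal = np.empty(n_trials, dtype=int)        # 0 or 1: the better target
    change_point = np.zeros(n_trials, dtype=bool)
    current = int(rng.integers(2))
    for t in range(n_trials):
        # Volatility: the better target switches with probability `hazard`.
        if t > 0 and rng.random() < hazard:
            current = 1 - current
            change_point[t] = True
        optimal[t] = current
    # Conflict: the closer p_optimal is to 0.5, the harder the discrimination.
    # Assumption: the other target pays with probability 1 - p_optimal.
    is_optimal = optimal[:, None] == np.arange(2)
    p_reward = np.where(is_optimal, p_optimal, 1.0 - p_optimal)
    rewarded = rng.random((n_trials, 2)) < p_reward
    # Rewarded outcomes draw points from N(3, 1); unrewarded ones give 0.
    points = rng.normal(3.0, 1.0, size=(n_trials, 2)) * rewarded
    return optimal, change_point, points
```

Raising `hazard` increases volatility, while moving `p_optimal` toward 0.5 increases value-conflict, so the two factors can be varied independently.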
Throughout the task, the head-stabilized diameter of the left pupil was measured with an Eyelink 1000 at 1000 Hz from within a custom-built booth designed to eliminate the influence of ambient sources of luminance. Because the dynamic range of the pupillary response is known to be highly sensitive to a variety of influences (Sirois & Brisson, 2014), participants were exposed to a sinusoidal variation in luminance prior to the reward-learning task to establish the dynamic range of the pupillary response for that session. During the reward-learning task, all stimuli were rendered isoluminant with the background of the display to further prevent luminance-related confounds of the task-evoked pupillary response. To minimize the convolution of the task-evoked pupillary response from trial to trial, the inter-trial interval was sampled from a truncated exponential distribution with a minimum of 4 s, a maximum of 16 s, and a rate parameter of 2.
The pupillometry data are not presented at this time, but will be used in follow-up analyses.
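For readers who want to reproduce the inter-trial interval scheme, the following is a minimal sketch of sampling from a truncated exponential by inverting its cumulative distribution function. It assumes the distribution is an exponential shifted to start at the 4 s minimum and that "rate parameter of 2" means a rate of 2 per second; the text does not say which parameterization the authors' software used, so treat both as assumptions.

```python
import numpy as np

def sample_iti(n, lo=4.0, hi=16.0, rate=2.0, seed=0):
    """Draw n inter-trial intervals (seconds) from an exponential
    distribution shifted to start at `lo` and truncated at `hi`."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)                          # uniform on [0, 1)
    # Probability mass of the untruncated exponential inside [lo, hi].
    mass = 1.0 - np.exp(-rate * (hi - lo))
    # Inverse CDF of the truncated distribution.
    return lo - np.log1p(-u * mass) / rate
```

Under these assumptions most intervals fall close to the 4 s minimum, with a mean near 4.5 s; a smaller rate would spread the intervals further toward the 16 s maximum.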

Cognitive Model
Here we propose that the drift rate (v) and the decision threshold (a) are modulated on a trial-by-trial basis according to two estimates of uncertainty from an ideal observer.
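To make the roles of v and a concrete, the sketch below simulates a single drift diffusion trial with a simple Euler scheme. It is a generic textbook-style simulation, not the authors' fitting code; the unbiased starting point, unit noise, non-decision time `t0`, step size `dt`, and time limit `max_t` are illustrative assumptions.

```python
import numpy as np

def simulate_ddm_trial(v, a, rng, t0=0.2, dt=0.001, max_t=5.0):
    """Simulate one two-boundary drift diffusion trial.

    v: drift rate; a: decision threshold (boundary separation).
    Returns (choice, rt): choice is 1 for the upper boundary, 0 for the
    lower boundary, and -1 if no boundary is reached before max_t.
    """
    x = a / 2.0                  # assumed unbiased starting point
    t = 0.0
    sqrt_dt = np.sqrt(dt)
    while 0.0 < x < a and t < max_t:
        # Deterministic drift plus Gaussian noise with unit variance per second.
        x += v * dt + sqrt_dt * rng.standard_normal()
        t += dt
    if x >= a:
        choice = 1
    elif x <= 0.0:
        choice = 0
    else:
        choice = -1
    return choice, t + t0
```

In this kind of simulation, a larger drift rate produces faster and more accurate responses, whereas a lower threshold produces faster but less accurate responses, which is why the two parameters can be distinguished from choice and reaction time data.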
Updating action-values. To model how learners update action-values, we calculate an estimate of how often the same action will give a different reward. We call this learning signal change point probability (Ω). The change point probability approaches 1 as the probability that a sample comes from a uniform distribution, relative to a Gaussian distribution, increases.
Relative action-value. Along with estimates of the stability of action-value contingencies, feedback signals also drive the belief in the reward of an action. We call this signal B, and it is learned separately for each action target. Let c denote the chosen target and u the unchosen target. The model updates its belief in the mean of the distribution of reward differences on the next trial, while the value of the unchosen target decays to the pooled expected value of both targets, E(r). The signed belief in the reward difference between targets is the difference in belief for targets 0 and 1.
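The change point probability can be illustrated with the form commonly used in reduced Bayesian observer models: the posterior probability that an observation was generated by a uniform distribution (a change point) rather than by the current Gaussian predictive distribution. This is a sketch under assumptions; the prior `hazard`, the uniform bounds `lo` and `hi`, and the exact expression are not given in the text and may differ from the authors' equation.

```python
import math

def change_point_probability(x, belief, sigma, hazard, lo, hi):
    """Posterior probability that observation x reflects a change point.

    x: observed reward difference; belief, sigma: mean and standard
    deviation of the current Gaussian predictive distribution;
    hazard: assumed prior probability of a change point;
    lo, hi: assumed bounds of the uniform distribution after a change.
    """
    uniform = 1.0 / (hi - lo)
    gauss = math.exp(-0.5 * ((x - belief) / sigma) ** 2) / (
        sigma * math.sqrt(2.0 * math.pi)
    )
    change = uniform * hazard            # evidence for "a change occurred"
    stay = gauss * (1.0 - hazard)        # evidence for "no change occurred"
    return change / (change + stay)
```

An observation close to the current belief yields a value near 0, while a surprising observation far from the belief yields a value near 1.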

Update rules
The learning rate of the model, α, is determined by the change point probability, Ω, and the model confidence, φ. The learning rate will be high if either 1) a change in the mean of the distribution of the difference in expected values is likely (Ω is high) or 2) the estimate of the mean is highly imprecise (σ_n² is high). The prediction error, δ, is the difference between the model belief and the observed reward difference. The model also maintains an estimate of the variance, σ².
We propose that the belief in the relative reward for the two choices, B, updates the drift rate, v, or the speed of evidence accumulation, and that the change point probability, Ω, decreases the decision threshold, a, or the amount of evidence needed to make a decision. We adapted the above ideal observer calculations from a previous study (Vaghi et al., 2017).
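A minimal sketch of these update rules follows. The learning-rate expression is one common form from the reduced Bayesian observer literature, chosen because it is high when Ω is high or when confidence φ is low; it is an assumption rather than the authors' recovered equation. The linear mapping onto v and a, the sign convention for δ, and all parameter values (`v0`, `a0`, `beta_v`, `beta_a`) are likewise hypothetical.

```python
def learning_rate(omega, phi):
    """Learning rate alpha from change point probability omega and model
    confidence phi (both in [0, 1]). Assumed form: high when a change is
    likely or when the current estimate is imprecise (low confidence)."""
    return omega + (1.0 - omega) * (1.0 - phi)

def update_belief(belief, observed_diff, omega, phi):
    """Move the belief toward the observed reward difference.
    Returns the updated belief, the prediction error, and alpha."""
    delta = observed_diff - belief        # prediction error (assumed sign)
    alpha = learning_rate(omega, phi)
    return belief + alpha * delta, delta, alpha

def ddm_parameters(belief, omega, v0=0.0, a0=1.0, beta_v=1.0, beta_a=-0.3):
    """Hypothetical linear mapping: belief B scales the drift rate and
    change point probability lowers the threshold (beta_a < 0)."""
    v = v0 + beta_v * belief
    a = a0 + beta_a * omega
    return v, a
```

For example, with a belief of 0, an observed reward difference of 2, Ω = 0.5, and φ = 0.5, the learning rate is 0.75 and the belief moves to 1.5. A fully confident observer that detects no change (Ω = 0, φ = 1) does not update at all.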

Results
Reaction times decreased as change point probability increased in the majority of cases (p < 0.03 in 3/4 replicates, β = −0.02 ± 0.01, Figure 3). The belief in the value of the optimal target had minimal impact on reaction times (β = 0.00 ± 0 in 4/4 subjects).
The RT distributions generated from each participant were then submitted to hierarchical drift diffusion model regression (Wiecki, Sofer, & Frank, 2013). For these regressions, we evaluated the fit of either our hypothesized update rule or the inverse model to the data, with Ω and B as predictors of either a or v. Consistent with our hypothesis, we found strong evidence for the model that mapped drift-rate updates onto trial-wise changes in the belief in the value of the optimal target and decision threshold updates onto changes in change point probability (the hypothesized model best accounted for the data in 3/4 cases; ΔDIC = −31 points; Figure 4).
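For reference, the deviance information criterion used in this comparison can be computed from posterior samples with the standard definition, DIC = mean deviance + effective number of parameters. The sketch below shows only that arithmetic; the function name and inputs are illustrative and are not part of any fitting package.

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    """Deviance information criterion.

    deviance_samples: deviance evaluated at each posterior sample.
    deviance_at_posterior_mean: deviance evaluated at the posterior
    mean of the parameters.
    """
    mean_deviance = float(np.mean(deviance_samples))
    # Effective number of parameters.
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d
```

Lower DIC indicates a better trade-off between fit and complexity, so a negative difference between the hypothesized and inverse models favors the hypothesized model.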
Using the posterior probability distributions of the regression coefficients, we found that the drift rate increased with the belief in the value of the optimal target (p(β_v < 0) < .01 in all cases; mean p(β_v < 0) = 0 ± 0). We found a weak effect of change point probability on the decision threshold (mean observed p(β_a > 0) = 0.27 ± 0.11).
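Posterior probabilities such as p(β_v < 0) are typically estimated as the fraction of posterior samples on one side of zero. The helper below is a hypothetical illustration of that calculation and does not reproduce the study's samples.

```python
import numpy as np

def posterior_prob_below_zero(samples):
    """Fraction of posterior samples of a coefficient that fall below 0."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples < 0.0))

def posterior_prob_above_zero(samples):
    """Fraction of posterior samples of a coefficient that fall above 0."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples > 0.0))
```

A value of p(β_v < 0) near 0 means that almost all of the posterior mass for the drift-rate coefficient lies above zero, which is the sense in which the drift rate reliably increases with belief.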

Conclusions
Using a high-powered within-subject design, we replicated and expanded our previous work to show that different environmental signals modulate different aspects of the accumulation-of-evidence process during decision making. Future work will explore how pupil responses, as a proxy for noradrenergic activity, track with estimates of environmental volatility as a possible mechanism for the dynamic modulation of decision threshold.