Expressions for Bayesian confidence of drift diffusion observers in fluctuating stimuli tasks

We introduce a new approach to modelling decision confidence, with the aim of enabling computationally cheap predictions while taking into account, and thereby exploiting, trial-by-trial variability in stochastically fluctuating stimuli. Using the framework of the drift diffusion model of decision making, along with time-dependent thresholds and the idea of a Bayesian confidence readout, we derive expressions for the probability distribution over confidence reports. In line with current models of confidence, the derivations allow for the accumulation of “pipeline” evidence that has been received but not processed by the time of response, the effect of drift-rate variability, and metacognitive noise. The expressions are valid for stimuli that change over the course of a trial with normally distributed fluctuations in the evidence they provide. A number of approximations are made to arrive at the final expressions, and we test all approximations via simulation. The derived expressions contain only a small number of standard functions, and need to be evaluated only once per trial, making trial-by-trial modelling of confidence data in stochastically fluctuating stimuli tasks more feasible. We conclude by using the expressions to gain insight into the confidence of optimal observers, and into empirically observed patterns.


A Product of normal distributions
We will repeatedly encounter a situation where we are interested in the probability distribution over z in the following equation,

p(z) ∝ ∫ N(y; z, σ²_A) N(y; µ, σ²_B) dy,

where N denotes the normal distribution. The product of two normal distributions is a scaled normal distribution (over y; Bromiley, 2014). The scaling is given by,

N(z; µ, σ²_A + σ²_B).

Once we have integrated over y, only the scaling is left, as the normal distribution over y integrates to one. Hence,

p(z) ∝ N(z; µ, σ²_A + σ²_B).
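The identity can be checked numerically. The sketch below (with arbitrary illustrative values, not parameters from the paper) compares direct quadrature of the left-hand side against the scaled-normal result.

```python
# Numerical check of the product-of-normals identity used throughout:
# integral of N(y; z, s_A^2) * N(y; mu, s_B^2) over y equals
# N(z; mu, s_A^2 + s_B^2). Parameter values are arbitrary illustrations.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

z, mu, s_A, s_B = 0.7, -0.3, 1.2, 0.8

lhs, _ = quad(lambda y: norm.pdf(y, z, s_A) * norm.pdf(y, mu, s_B),
              -np.inf, np.inf)
rhs = norm.pdf(z, mu, np.sqrt(s_A**2 + s_B**2))
print(lhs, rhs)  # the two values agree
```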

B The observer
This supplement section describes in more detail the observer that was outlined in Section 2. In particular, working from the mathematical description of the task in Section 2, we first derive the inference problem that the observer solves. We then find the computations used by a Bayesian observer to determine their choices and confidence.
Although observers could in principle leverage knowledge that evidence is constant throughout each frame, we assume that observers ignore this additional structure. This will be a valid approximation when the evidence frames used in an experiment are sufficiently short. In this case, the approximation will lead to little discrepancy between the observer's estimate of the probability of being correct and the true probability. We test this claim below, once we have derived an expression for the probability of being correct (Fig. 8, and equation 54). Instead of accounting for the constant evidence over the course of a frame, we assume that observers treat the fraction of evidence received in a short time step, j, as generated according to,

δE_j ∼ N(µ δt, σ²_E δt / t_f).

Importantly, the observer incorrectly assumes that the δE_j at different time steps, j, are independent (i.e., the observer ignores the fact that evidence is constant through all time steps in a single frame).
We assume that observers correctly model the effect of internal noise (accumulator noise, and drift-rate scaling variability). We address the plausibility of this assumption in Section 4. Hence observers infer that increments in the accumulator are related to the underlying signal as follows,

p(δx_j | µ, φ) = ∫ d(δE_j) p(δx_j | δE_j, φ) p(δE_j | µ)
             = ∫ d(δE_j) N(δx_j; δE_j φ, σ²_acc δt) N(δE_j; µ δt, σ²_E δt / t_f)
             = N(δx_j; φ µ δt, [σ²_acc + φ² σ²_E / t_f] δt).

To reach the final line we rearranged and applied the result in Supplement Section A. In words, this equation states that evidence measurements have a mean that matches the underlying stimulus signal multiplied by the drift-rate scaling, but are variable due to both internal accumulator noise and the fact that evidence in the stimulus is itself variable. It is interesting to note that the drift-rate scaling, φ, not only affects the mean increment, but also the variability of increments, through the term φ² σ²_E / t_f. This effect arises because a high drift-rate scaling increases the effect of the stimulus on the accumulation, amplifying the effects of both the stimulus mean and the variability in the stimulus. We assume that participants ignore this difference between stimulus variability and internal variability (which is not affected by drift-rate scaling). Hence we assume that observers believe evidence measurements are corrupted by the same amount of variability regardless of the level of drift-rate scaling:

p(δx_j | µ, φ) ≈ N(δx_j; φ µ δt, [σ²_acc + σ²_E / t_f] δt).

We test the effect of this assumption on the calibration of confidence below (Fig. 8).
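The marginalisation above is just a normal likelihood integrated against a normal distribution over δE_j. A Monte Carlo sketch (the values m, v and w are illustrative stand-ins for the means and variances in the text, not the paper's parameters) confirms that the drift-rate scaling φ enters the marginal variance through a φ² term:

```python
# If dE ~ N(m, v) and dx | dE ~ N(phi * dE, w), then marginally
# dx ~ N(phi * m, w + phi**2 * v): the drift-rate scaling phi multiplies
# the mean and enters the variance squared. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
phi, m, v, w = 1.5, 0.4, 0.09, 0.04
n = 2_000_000

dE = rng.normal(m, np.sqrt(v), n)          # stimulus evidence fraction
dx = rng.normal(phi * dE, np.sqrt(w))      # accumulator increment given dE

print(dx.mean(), phi * m)           # means agree
print(dx.var(), w + phi**2 * v)     # variances agree
```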
Having specified the properties of the task and the beliefs of the observer, we can infer the rule used by a Bayesian observer to translate all evidence measurements gathered so far into a decision and confidence. We denote the vector of all evidence measurements, δx_1, δx_2, ..., δx_N, as δx. Inferring the posterior probability over S, given δx, in the environment described by (6), (7), and (8), is an inference problem that has been considered before. Moran (2015), using a result from Drugowitsch et al. (2012), derived the posterior probability over the two options S = 1 and S = 2. We also provide a derivation in Supplement Section C, and simply state the result in this supplement section. Using Bayes rule with all evidence samples, the log-posterior ratio is given by (Supplement Section C; Moran, 2015), where x = Σ_{i=1}^{N} δx_i, and t is the time spent accumulating evidence. We will often work with a scaled version of the log-posterior ratio, where x_lp denotes the scaled version of the log-posterior ratio. θ() provides an abbreviation for the purpose of making future equations less cluttered. K is the scaling constant. Note that the log-posterior ratio without any scaling is given by x_lp / K.
The observer can use the (scaled) log-posterior ratio in the computation of confidence, as it is monotonically related to the probability that they are correct through, if the observer reports R = 1. (This expression can be found by rearranging p(S = 1|δx) + p(S = 2|δx) = 1.) A similar expression holds for R = 2, the only difference being that x is replaced with −x, and x_lp with −x_lp.
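Rearranging p(S = 1|δx) + p(S = 2|δx) = 1 together with the definition of the log-posterior ratio yields the standard logistic form, p(S = 1|δx) = 1/(1 + exp(−x_lp/K)). A minimal sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def prob_correct(x_lp, K):
    """Probability correct when R = 1, given the scaled log-posterior
    ratio x_lp; the unscaled ratio is x_lp / K. Illustrative helper,
    not code from the paper."""
    return 1.0 / (1.0 + np.exp(-x_lp / K))

# Sanity checks: zero evidence gives 0.5, and flipping the sign of x_lp
# (the R = 2 case) gives a probability that complements the R = 1 case.
print(prob_correct(0.0, 2.0))                              # 0.5
print(prob_correct(1.3, 2.0) + prob_correct(-1.3, 2.0))    # 1.0
```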

Testing observer model approximations
Above, we assumed that observers approximate the true generative model for the evidence measurements. Specifically, we assumed that (a) observers ignore the fact that evidence is constant within each frame, and (b) observers ignore the increased effect of variability in the stimulus when the drift-rate scaling is high, and the decreased effect when the drift-rate scaling is low. While these assumptions may be plausible, it is also plausible that, if given feedback, observers could learn to closely approximate the mapping from x and t_e to probability correct (Kiani & Shadlen, 2009). To adequately capture this situation we need to ensure that, under our assumptions about the observer, confidence remains closely related to performance.
To test the two approximations, we simulated the diffusion process using small time steps. At each time step the accumulator increment, δx_i, was determined by drawing a value consistent with (4), until a decision threshold was crossed (Tuerlinckx et al., 2001). Accumulation continued until all evidence measurements had been processed, at which point a simulated confidence was determined in accordance with (54). Full details of the simulation can be found in Supplement Section G. Fig. 8 shows accuracy against confidence for two cases, one in which there is no variability in drift-rate scaling, φ, and one with variability. Full details of the process used to plot the data are also provided in Supplement Section G. Only one approximation is relevant when there is no variability in drift-rate scaling, approximation (a). This approximation appears to have little effect on the mapping between accuracy and confidence. An additional approximation is relevant when there is variability in drift-rate scaling, assumption (b). We can see that this causes a moderate mismatch between accuracy and confidence (Fig. 8).

Figure 8: Accuracy as a function of confidence, where confidence is calculated using the observer's generative model of their evidence measurements. This model only approximates the true generative model. When there is variability in drift-rate scaling, the approximations cause the observer's confidence to show some deviations from calibration. Simulation and plotting details in Supplement Section G.
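A minimal sketch of the kind of simulation used here: a diffusion advanced in small time steps until a linearly collapsing threshold is crossed. All parameter names and values below are illustrative placeholders, not those of Supplement Section G, and pipeline evidence is omitted for brevity:

```python
# Euler-style simulation of a drift diffusion trial with a linearly
# collapsing, symmetric threshold. Illustrative values only.
import numpy as np

def simulate_trial(mu=1.0, phi=1.0, sigma_acc=1.0,
                   thresh0=1.5, thresh_slope=-0.5,
                   dt=0.001, t_max=10.0, rng=None):
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while t < t_max:
        # Increment with mean phi*mu*dt and standard deviation sigma_acc*sqrt(dt).
        x += rng.normal(phi * mu * dt, sigma_acc * np.sqrt(dt))
        t += dt
        thresh = max(thresh0 + thresh_slope * t, 0.05)  # collapsing bound, floored
        if abs(x) >= thresh:
            return (1 if x > 0 else 2), t
    return None, t_max  # no response within t_max

rng = np.random.default_rng(1)
results = [simulate_trial(rng=rng) for _ in range(500)]
acc = np.mean([r == 1 for r, _ in results])
print(acc)  # mostly correct responses for positive drift
```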
It is interesting to note that the latter approximation, that observers ignore the effect of drift-rate scaling on stimulus variability, is implicitly present in much previous work. This is because stochastically fluctuating stimuli are often used, but stimulus variability is not treated separately from accumulator variability in the computational models applied (e.g. Kiani et al., 2014; Ratcliff and McKoon, 2008; van den Berg et al., 2016; although see Zylberberg et al., 2016). Fig. 8 suggests this approximation does not cause large issues.

C Bayesian observer's policy
At some time t, the observer has gathered N sensory samples, δx_1, δx_2, ..., δx_N, collectively denoted δx. The samples are generated according to (6), (7), and (8). Using Bayes rule the observer can infer the probability of the two possible values for S. Using a result from Drugowitsch et al. (2012), Moran (2015) derived the posterior probability over the two options S = 1 and S = 2, for exactly the case we have. For completeness, we rederive the posterior here.
Given all the evidence measurements so far, where we have used the flat prior over S in (1). Looking at the product in (56), using (8), and using standard formulae for the product of normal distributions (e.g. see Bromiley, 2014), we have, where the proportionality sign indicates proportionality with respect to ν, x = Σ_{i=1}^{N} δx_i, and we have used that t = Σ_{i=1}^{N} δt. Hence in (56) we have, where the sign on ν_0 is determined by S. Using the result in Supplement Section A, and rearranging, we find the (not scaled) log-posterior ratio is given by,

D Effect of posterior ratio scaling
In the main text we considered a model in which the observer uses a noisy readout of the scaled log-posterior ratio to determine their confidence. A noisy readout of the scaled log-posterior ratio (x_lp) is equivalent to a scaled version of a noisy readout of the unscaled log-posterior ratio (x_lp / K). First consider the readout in the main text in (10): Then consider a noisy readout of the same form, but of the unscaled log-posterior ratio. Denote this new readout x̃_c. We have, where σ̃_m is the standard deviation of the metacognitive noise affecting this readout.

Our claim is that the distribution over x̃_c, when scaled by K, is identical to the distribution over x_c. Consider a scaled version of x̃_c, x'_c = K x̃_c. Using the rule for changing the variable of a probability density function, where the relationship between the variables being swapped is monotonic, we have, Hence, the scaled readout of the unscaled log-posterior ratio, x'_c, has the same distribution as the readout of the scaled log-posterior ratio, x_c. The metacognitive noise parameters in the two cases are related to each other through σ_m = K σ̃_m.
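The equivalence can also be verified by Monte Carlo simulation. Below, the tilde quantities denote the readout of the unscaled log-posterior ratio; all values are illustrative:

```python
# Check that scaling a noisy readout of the unscaled log-posterior ratio
# by K matches a noisy readout of the scaled ratio, when
# sigma_m = K * sigma_m_tilde. All values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
K, sigma_m_tilde = 3.0, 0.4
sigma_m = K * sigma_m_tilde
x_lp = 1.1  # a fixed scaled log-posterior ratio; x_lp / K is unscaled
n = 1_000_000

x_c = x_lp + rng.normal(0.0, sigma_m, n)                   # readout of scaled LPR
x_c_tilde = x_lp / K + rng.normal(0.0, sigma_m_tilde, n)   # readout of unscaled LPR
scaled = K * x_c_tilde

print(x_c.mean(), scaled.mean())  # same mean
print(x_c.std(), scaled.std())    # same standard deviation
```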

E Details of interrogation condition derivation
The steps leading to (11) were described in words in the main text. In equations, these steps are as follows.
For now, consider just p(x_lp | E), which we need in order to find (12). We marginalise over φ, E and φ are independent of each other (when not conditioned on other variables; see Fig. 4). Hence, where we have used (5).
Recall from (4) that, Using standard formulae for the sum of normally distributed random variables, we have that x = Σ_i δx_i will be given by, Using (9) we have, Substituting these results into (70) gives us, We can apply the result in Supplement Section A. Simplifying, we find, Returning to (12), we now have, In the case in which we have no metacognitive noise, x_c = x_lp. We can express this using the Dirac delta function as p(x_c | x_lp) = δ(x_c − x_lp). Then we have, Using (11) and our expression for Ψ(x_c), where we have used that p(R = 2 | x_lp = x_c) is zero anywhere x_lp is inconsistent with the response R = 2 (i.e. x_lp < 0), and one elsewhere. In the case where R = 1 an identical expression holds, except that the limits become min{0, −d_{i+1}} and min{0, −d_i}, and we divide by p(R = 1|E) rather than p(R = 2|E).
We can compute the probability of the response, p(R|E), using, where a = −∞, b = 0 for R = 1, or a = 0, b = ∞ for R = 2. These limits again come from using that p(R|x_lp) = 1 when R and x_lp are consistent, and p(R|x_lp) = 0 otherwise.
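Because p(x_lp | E) is normal with mean µ_lp and variance σ²_lp (equations 15 and 16), each of these integrals reduces to a single cumulative normal evaluation. A sketch with illustrative values:

```python
# Response probabilities as normal CDF evaluations: integrating the normal
# density for x_lp over (-inf, 0) for R = 1 and (0, inf) for R = 2.
# mu_lp and sigma_lp values are illustrative.
from scipy.stats import norm

mu_lp, sigma_lp = 0.8, 1.3

p_R1 = norm.cdf(0.0, mu_lp, sigma_lp)        # integral over (-inf, 0)
p_R2 = 1.0 - norm.cdf(0.0, mu_lp, sigma_lp)  # integral over (0, inf)
print(p_R1 + p_R2)  # the two response probabilities sum to one
```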
Let us now consider what happens in the presence of metacognitive noise, when x_c no longer equals x_lp, but is a noisy version of it as in (10). Returning to (78), in this case Ψ(x_c) becomes, As discussed in the main text, the product of p(R|x_lp) and the normal distribution over x_lp is a normal distribution truncated to the region where R and x_lp are consistent, because p(R|x_lp) is one when R is consistent with x_lp, and zero otherwise. Note that the product of these two terms is not a probability distribution over x_lp, because it is not normalised. However, we can write this product as a scaled truncated normal distribution. Using J(x_lp) as an abbreviation, We have used (84) to get the final line. As before, a = −∞, b = 0 or a = 0, b = ∞, depending on the response made. µ_lp and σ²_lp are the mean and variance of the distribution prior to truncation (see equations 15 and 16). (Parameterisation of the truncated normal distribution, TN, is described in the main text.) Returning to (85), we now have, This expression is the convolution of a normal distribution and a truncated normal distribution. We can perform the convolution by writing the distributions out in full, collecting all terms containing y into a single exponential, and integrating (S. Turban, personal communication, December, 2019): where Φ() indicates the standard cumulative normal distribution, 1 is an indicator function (which is one when its argument is true, and zero otherwise), and α = . In our case, a and b take very specific values. Let, then we can write, and using (11) our result for confidence is, We have also used L in the above expression to account for the different limits of integration when R = 1 and R = 2.
We can solve the integral in (96) via numerical integration. It is also possible to rearrange the expression into a form that is faster to evaluate numerically. Consider a change of variables, Then we have, and, where ϕ() indicates the standard normal distribution. Also denote, Then, Hence, in (96) we have, where, Using result 10010.1 from Owen (1980) gives us, where, is the bivariate cumulative normal distribution, corresponding to a distribution with mean and covariance, This integral can then be numerically evaluated using standard functions.
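The Owen (1980) result 10010.1 used here states that an integral of the form ∫_{−∞}^{h} Φ(a + bx) ϕ(x) dx equals a bivariate cumulative normal with unit variances and correlation ρ = −b/√(1 + b²). A numerical check with arbitrary test values:

```python
# Numerical check of Owen (1980) result 10010.1:
# int_{-inf}^{h} Phi(a + b*x) phi(x) dx
#   = BvN(h, a / sqrt(1 + b**2); rho = -b / sqrt(1 + b**2)).
# a, b and h are arbitrary test values.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, multivariate_normal

a, b, h = 0.5, -1.2, 0.9

lhs, _ = quad(lambda x: norm.cdf(a + b * x) * norm.pdf(x), -np.inf, h)

rho = -b / np.sqrt(1.0 + b**2)
rhs = multivariate_normal(mean=[0.0, 0.0],
                          cov=[[1.0, rho], [rho, 1.0]]).cdf(
    [h, a / np.sqrt(1.0 + b**2)])
print(lhs, rhs)  # the two values agree
```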

F Details of free response condition derivation
In this section we cover certain details of the derivations for the free response condition, starting with how we arrived at the distribution over φ in (26).
For the purpose of inferring φ, we approximate the evidence stream, E, by its average prior to the time of the decision, Ē, where we have used (25) in the first line. Precisely, Ē is given by Ē = Σ_i δE_i / t_d, where the summation is taken over all time steps prior to the decision.
Using Bayes rule, ∝ p(φ|Ē) In this equation, Ω is the set of all paths that do not cross the decision threshold prior to t_d, and arrive at the threshold at t_d (Moreno-Bote, 2010). δx^(ω) is the δx vector for the particular path ω.
As discussed in the main text, we also approximate p(φ|Ē) ≈ p(φ). The approximation gives, Returning to (4), but now using our approximation replacing the full evidence stream with the average evidence, we have that δx_i is approximately distributed as,

N(δx_i; φ Ē δt, σ²_acc δt). (119)

The proportionality is with respect to φ.
Using standard results for the product of normal distributions we find, where we have used t_d = Σ_i δt, x_d = Σ_i δx_i, and K^(ω) is the constant of proportionality (see Drugowitsch et al., 2012, for closely related calculations). This constant will be different for different paths, so we explicitly indicate the path, ω. Hence, in (117) we have, ∝ p(φ) N(φ; This is the expression given in the main text that we were aiming to derive. However, we can simplify this expression further. Recalling (5), φ is distributed as, giving, p(φ|R, t_r, Ē) ∝ N(φ; 1, σ²_φ) N(φ; where we have used standard formulae for the product of normal distributions. In the main text we derived an expression for the first term in the integral in (23), and in this supplement section we have derived an expression for the second term. These expressions can be used to perform the integral in (23) as follows, Applying the result in Supplement Section A and rearranging, we produce (30).
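The final product-of-normals step is a standard precision-weighted combination: multiplying the prior N(φ; 1, σ²_φ) by a normal likelihood N(φ; m, s²) gives a normal whose precision is the sum of the two precisions. A sketch (m and s are illustrative stand-ins for the likelihood terms derived above, not the paper's values):

```python
# Precision-weighted combination of two normals over phi, checked against
# a brute-force normalisation of their product on a fine grid.
import numpy as np
from scipy.stats import norm

sigma_phi, m, s = 0.5, 1.4, 0.3

# Closed form: precisions add; the mean is precision-weighted.
prec = 1.0 / sigma_phi**2 + 1.0 / s**2
post_var = 1.0 / prec
post_mean = post_var * (1.0 / sigma_phi**2 + m / s**2)

# Brute-force check on a grid.
phi = np.linspace(-3.0, 5.0, 200_001)
dphi = phi[1] - phi[0]
prod = norm.pdf(phi, 1.0, sigma_phi) * norm.pdf(phi, m, s)
prod /= prod.sum() * dphi
grid_mean = (phi * prod).sum() * dphi
grid_var = ((phi - grid_mean)**2 * prod).sum() * dphi
print(post_mean, grid_mean)  # agree
print(post_var, grid_var)    # agree
```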

G Simulation and plotting details
We simulated the task described in the main text with the parameters shown in Table 4. The only difference between the simulation and the setup described in the main text is that evidence in a frame could not go below 0 or above 3096. Evidence within a frame was resampled until it met these constraints.
The midpoint between the means of the two evidence signals, referred to as the "reference value" and denoted µ/2, was sampled each trial. It was sampled from a normal distribution centred on 1000, with a standard deviation of 100, that was truncated at 500 and 1500. The parameters used for the observer are shown in Table 5. Within a simulation, all participants were simulated using the same parameters. Except for Fig. 7, half of the trials were simulated for the free response condition, and half for the interrogation condition. For Fig. 7 all trials were from the free response condition, but two different versions of the free response condition were used, as described in the main text.
A linear decision threshold, symmetric for both options, was used in the free response condition. At the onset of evidence accumulation the threshold was at the value given in Table 5 for threshold height, and it subsequently changed at the rate given by the threshold slope. In the interrogation condition, stimulus presentation duration was drawn from a normal distribution of mean 0.8 s and standard deviation 0.3 s, truncated at 0.2 and 4 s.

Table 5: Parameters used for the observer (column headers: Variable, Expression, Fig. 8; table entries not recovered).

The evolution of the observer's accumulator was simulated using small time steps of duration 0.1 ms. Each accumulator increment was determined by sampling from the distribution in (4). Simulation of the accumulation continued until a decision threshold was crossed and all remaining evidence in the pipeline had been processed. The only exception was the simulation for Fig. 7. For the "Immediate conf." condition in this plot, accumulation terminated as soon as the decision threshold was reached, consistent with the idea that the manipulation used by Kiani et al. (2014) ensured confidence was not based on pipeline evidence.
Confidence reports were produced in accordance with the model (see (54), (10) and Fig. 2). Where a plot refers to binned confidence, the confidence values produced by the model have been divided into 10 bins of equal numbers of data points on a participant-by-participant basis.
To make predictions for confidence using the derived expressions, we set the values of the parameters in these expressions (σ_acc, σ_φ, σ_m, I, threshold height and threshold slope) to the true values used to simulate the data. We set the bin edge parameters (d_i) to the values that were used for dividing the simulated confidence reports into 10 bins.
Prior to plotting, data for the x-axis were binned separately for each participant and data series. For each participant, data series, and bin, the average value of the x-variable was computed. Using these averages, the average value across participants was computed, and this determined the x-location of each bin. Ten bins were used, apart from in Fig. 7, where 5 bins were used. The y-variable was calculated separately for each participant and bin, before averaging across participants. Error bars, and the width of error shading, represent ±1 standard error of the mean across participants.
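The equal-count binning described above can be sketched as follows (illustrative, not the paper's code): quantile edges split one participant's confidence reports into bins of near-equal size.

```python
# Equal-count binning of one participant's confidence reports using
# quantile edges. Data and bin count are illustrative.
import numpy as np

rng = np.random.default_rng(3)
confidence = rng.uniform(0.5, 1.0, 1000)  # one participant's reports
n_bins = 10

edges = np.quantile(confidence, np.linspace(0.0, 1.0, n_bins + 1))
bin_idx = np.clip(np.searchsorted(edges, confidence, side="right") - 1,
                  0, n_bins - 1)
counts = np.bincount(bin_idx, minlength=n_bins)
print(counts)  # roughly 100 points per bin
```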

H Testing the approximations in other models
In Fig. 6 we compared the mean and variance of confidence produced via simulations with confidence produced through the predictions derived in the main text. The aim was to test whether the approximations made during the derivations were reasonable. We found little discrepancy between the simulated and predicted confidence. For completeness, we ran a similar analysis using a model not considered in the main text. Specifically, we considered an observer who did not use a Bayesian readout for confidence. Instead, their confidence was directly determined by the final state of the accumulator: more accumulated evidence favouring the choice generated higher confidence. We adjusted the reported derivations to cover this case by setting θ(t) = 1 (see equation 9).
For this model we found similar discrepancies between the simulations and the derived predictions. Figure 9 provides an example. Again, discrepancies were in general small. Nevertheless, we should bear in mind that just because an approximation works well for some model, or at some parameter values, does not mean it is guaranteed to work well in all models or at all parameter values.

Figure 9: Mean and variance of binned confidence, produced via simulation of an alternative model (error bars), and produced through the derived predictions adjusted for an alternative model (shading). The alternative model featured confidence that was a direct readout of the evidence accumulated, rather than a Bayesian readout. "Average evidence" here refers to the absolute value of the following quantity: the difference in dots summed over all presented frames, divided by the duration of stimulus presentation.

Table 4: Parameters for the task used in the simulation.