The Autocorrelated Bayesian Sampler: A Rational Process for Probability Judgments, Estimates, Confidence Intervals, Choices, Confidence Judgments, and Response Times

Normative models of decision-making that optimally transform noisy (sensory) information into categorical decisions qualitatively mismatch human behavior. Indeed, leading computational models have only achieved high empirical corroboration by adding task-specific assumptions that deviate from normative principles. In response, we offer a Bayesian approach that implicitly produces a posterior distribution of possible answers (hypotheses) in response to sensory information. We assume, however, that the brain has no direct access to this posterior and can only sample hypotheses according to their posterior probabilities. Accordingly, we argue that the primary problem of normative concern in decision-making is integrating stochastic hypotheses, rather than stochastic sensory information, to make categorical decisions. This implies that human response variability arises mainly from posterior sampling rather than sensory noise. Because human hypothesis generation is serially correlated, hypothesis samples will be autocorrelated. Guided by this new problem formulation, we develop a new process, the Autocorrelated Bayesian Sampler (ABS), which grounds autocorrelated hypothesis generation in a sophisticated sampling algorithm. The ABS provides a single mechanism that qualitatively explains many empirical effects in probability judgments, estimates, confidence intervals, choices, confidence judgments, response times, and their relationships. Our analysis demonstrates the unifying power of a perspective shift in the exploration of normative models. It also exemplifies the proposal that the “Bayesian brain” operates using samples, not probabilities, and that variability in human behavior may primarily reflect computational rather than sensory noise.

Theorists have taken steps toward a unified model from two starting points, normative and descriptive. Existing normative models are elegant, parsimonious, and easily extendable to all six measures. But these models fail to provide a satisfactory account of many important qualitative effects observed in the empirical data. By contrast, various descriptive models, which systematically deviate from normative assumptions, capture the empirical effects both qualitatively and quantitatively for up to three of these six measures, but no single model makes predictions across all measures.
Here, we develop a simple and consistent computational process that can account for a surprising variety of qualitative findings across all six measures. To achieve this goal, we build on a strong normative foundation for all six measures, rooted in a sampling approximation to Bayesian inference. This approach also implies a radical shift in viewpoint concerning the nature of the decision-making process and the origin of variability in human behavior. In the perceptual decision-making literature, existing normative models generally operate on noisy sensory information and evaluate the relative probability that this noisy information is generated by the different hypotheses (corresponding to choice options; e.g., Green & Swets, 1966; Ratcliff & Rouder, 1998). We argue, by contrast, that noise in perceptual decision-making arises primarily not through uncertainty about sensory information, but because of the inherently stochastic nature of the cognitive process that underpins Bayesian inference.
Our starting point is that exact Bayesian computations are generally intractable, and hence that a Bayesian brain can, at best, only approximate these computations. One of the most widely used approaches to such approximation in statistics and machine learning is to draw samples from posterior probabilities, which is often much easier than calculating those probabilities exactly. Inspired by this approach, many theorists in the Bayesian tradition have argued that cognitive processes operate over these samples, rather than over representations of probabilities (Dasgupta et al., 2017; Griffiths et al., 2012; Lieder et al., 2018; Sanborn & Chater, 2016; Vul et al., 2014; Zhu et al., 2020). But this process of sampling will inevitably be noisy: different samples will be drawn on different occasions. Thus, in this type of model, the main source of variability does not arise from sensory noise, but from computational noise caused by the process of sampling. In other words, instead of evaluating evidence from the sensory system or memory, we propose that the cognitive system operates on stochastically generated hypotheses.
Our aim in this article is to outline a general process that can be applied to a wide variety of measures and tasks when equipped with a task-specific representation. Our focus is to show that this process provides a unified framework that captures a wide range of qualitative phenomena across measures and tasks, rather than to produce a comprehensive quantitative model of a particular task. The article is structured as follows. First, we review the traditional probabilistic view of normative decision-making and note its limitations in explaining psychological data. Then we propose an alternative, sampling-based approximation approach to alleviate the computational burden associated with the normative models, which in turn suggests a shift in the target problem of normative concern from accumulating sensory data to integrating stochastic hypotheses. We next demonstrate the unifying power of this perspective shift by applying this rational process model, which we call the Autocorrelated Bayesian Sampler (ABS), to the six behavioral measures, emphasizing the qualitative predictions of the model. Finally, we show how to use the ABS framework to create complete cognitive models after exploring the general judgment and decision-making process in detail.

Overview of Probabilistic Decision-Making
The idea that the human decision-making process is an optimal, perhaps Bayesian, process is attractive in light of its potential justification as the end-state of evolution and/or learning (Bogacz et al., 2006; Drugowitsch et al., 2019; Gold & Shadlen, 2002; Green & Swets, 1966; Hawkins & Heathcote, 2021; Moran, 2015; Pleskac & Busemeyer, 2010; Ratcliff & Rouder, 1998; Tickle et al., 2023). There are many task-specific Bayesian models in psychology, but in the area of cognitive and perceptual decision-making they are often elaborations of the general decision process of signal detection theory (SDT), which describes how sensory evidence can be transformed into optimal behavior (Green & Swets, 1966).
To illustrate our discussion, we shall consider the following trial in a perceptual task as a running example: a cloud of 24 dots briefly appears on-screen (this is the stimulus, s). Participants might be asked to report the probability that the number of dots falls within a certain window (a, probability judgments). They may also be asked to estimate the exact number of dots (b, estimates) or to provide a confidence interval for the estimate (c, confidence intervals). Alternatively, participants could be asked to decide whether the number of dots was greater or smaller than some predefined boundary (d, choices), for example, whether there were more than 25 dots on the screen, and the experimenter might record the elapsed times for making such choices (e, RT). The participant may also be asked to rate their confidence in their choice (f, confidence in the decision).
Importantly, while numerosity judgment provides a concrete illustration, and one that connects naturally to existing models such as SDT, the general approach applies quite broadly. For a wide range of tasks, the six behaviors above can be collected and modeled. For example, participants might be asked memory-based questions about how much their last grocery bill was (e.g., "Your last grocery bill exceeded £150. What is the probability that this proposition is correct?"), or might be asked about one-off future events, such as how many years they expect to live. Thus, while we use the numerosity example in Figure 1 because it is simple and straightforward to relate to SDT, our approach applies to a wide range of cognitive and perceptual tasks, as we will see below.
More formally, in SDT, we wish to choose between options A and B based on a total of T units of sensory input (s₁, s₂, …, s_T), typically assumed to be accumulated over time. Assuming that both options are equally likely a priori (i.e., P(A) = P(B)), the key variable is the summed log-likelihood ratio over the evidence from each individual unit of sensory input:

$$L(T) = \sum_{t=1}^{T} \log \frac{P(s_t \mid A)}{P(s_t \mid B)}, \tag{1}$$

where P(s_t | A) is the likelihood of sensory evidence s_t when option A is the correct choice (and similarly for P(s_t | B)). The probability of choosing A over B should be a function of the summed log-likelihood ratios. Provided with imperfect evidence (e.g., detecting a ship on a noisy radar image), SDT is a principled way to filter out irrelevant sensory noise and pick out the useful signal (e.g., whether the image contains a ship). The approach can be applied in many areas of psychology, including memory, perception, and reasoning (Kellen et al., 2021; McCarley & Benjamin, 2013; Rotello, 2018; Trippas et al., 2018).

SDT, however, makes no explicit commitment about the time course over which evidence is generated and/or collected, and so makes no predictions for response time. This issue can be addressed with a dynamic extension of SDT: the sequential probability ratio test (SPRT), which postulates that the stream of sensory evidence arrives steadily and sequentially over time (Bogacz et al., 2006; Edwards, 1965; Laming, 1968). To deal optimally with the incoming sensory evidence in, for example, binary choice, the evidence should be continuously integrated into the log-likelihood ratio between the two options until a fixed threshold is reached, and RT are predicted to depend on the amount of evidence accumulated before the threshold is reached. More formally, the log-likelihood ratio for choosing option A over B is recursively updated after the arrival of each new piece of sensory evidence (s_T):

$$L(T) = L(T-1) + \log \frac{P(s_T \mid A)}{P(s_T \mid B)}. \tag{2}$$

Once the log-likelihood ratio reaches a threshold (assuming symmetric thresholds: L(T) > δ or L(T) < −δ), the evidence accumulation process stops and the response depends on whether the positive or negative threshold is reached. Increasing the magnitude of the threshold (δ) produces a slower but more accurate response because more sensory evidence, on average, is accumulated before either threshold is reached. The SPRT is optimal in the sense that the expected amount of evidence (i.e., T) is minimized for any fixed probability of deciding incorrectly (Wald & Wolfowitz, 1948). In other words, following the SPRT allows for the fastest response time for a particular error rate. Because the sensory inputs are assumed to be independent of one another, the SPRT is a random walk model whose starting point is L(0) = 0 and which has two absorbing states: −δ and δ (see Figure 2A).

While intuitive and simple, the SPRT also makes decisions that take time, as people do, which is an advantage over SDT in modeling empirical data. Indeed, the SPRT can produce human-like speed-accuracy trade-offs: requiring faster decisions reduces accuracy, while requiring more accurate decisions reduces speed. This is captured in the model by assuming that people control the magnitude of the thresholds to suit their objectives. In response to an experimental emphasis on speed (accuracy), people are assumed to decrease (increase) the decision threshold; the model's guarantee of optimal performance implies that these two measures will trade off against one another.
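To make the SPRT dynamics concrete, the following minimal Python sketch simulates Equation 2 with Gaussian evidence. The means, noise level, and threshold below are illustrative choices for exposition, not values from any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sprt_trial(mu_a=0.5, mu_b=-0.5, sigma=2.0, true_option="A",
               threshold=3.0, max_steps=10_000):
    """Accumulate log-likelihood ratios (Equation 2) to symmetric thresholds.

    Evidence is assumed Gaussian: s_t ~ Normal(mu_a, sigma) if A is correct,
    Normal(mu_b, sigma) if B is correct. All parameter values are illustrative.
    """
    mu_true = mu_a if true_option == "A" else mu_b
    L = 0.0  # L(0) = 0
    for t in range(1, max_steps + 1):
        s = rng.normal(mu_true, sigma)
        # log P(s|A) - log P(s|B) for equal-variance Gaussian likelihoods
        L += (s * (mu_a - mu_b) - 0.5 * (mu_a**2 - mu_b**2)) / sigma**2
        if L > threshold:
            return "A", t   # positive threshold reached
        if L < -threshold:
            return "B", t   # negative threshold reached
    return "undecided", max_steps

results = [sprt_trial() for _ in range(1000)]
choices, times = zip(*results)
print("accuracy:", np.mean(np.array(choices) == "A"))
print("mean evidence units to decide:", np.mean(times))
```

Raising `threshold` in this sketch raises both accuracy and the mean number of evidence units, which is the speed-accuracy trade-off described above.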
Unfortunately, the SPRT does not easily explain other psychological relationships between choice and RT. In binary choice, for example, the SPRT predicts identical response time distributions for choosing either of the two options (assuming an unbiased starting point, L(0) = 0, and symmetric thresholds), contradicting the empirical observation that mean RTs differ for correct and incorrect decisions (Ratcliff & Rouder, 1998; Stone, 1960). This is far from the only issue: Table 1 summarizes several qualitative effects of choice, response time, and confidence, the majority of which cannot be accommodated by the SPRT.
These stylized facts have been used to motivate descriptive models, including the family of models known as drift diffusion models (DDMs), that relax the normative SPRT framework to better describe human data, specifically regarding three key measures: choice, response time, and confidence.While such approaches have been highly successful, our focus here remains on approaches closely tied to normative depictions of behavior, though we return to DDMs and other common descriptive models below.

A Representation for Producing Estimates and Confidence Intervals
The categorical-hypotheses representations used in the SPRT can produce choice, response time, and confidence measures, but are not fine-grained enough to produce probability judgments (e.g., judging the probability that the number of dots was greater than 25), estimates (e.g., reporting how many dots were on the screen), or confidence intervals (e.g., placing a 95% confidence interval around the estimate). What is needed is an extension of the hypothesis space beyond the categorical hypotheses used when making a choice. In principle, within a Bayesian framework, this is straightforward, although the resulting model looks very different. Instead of simply using two categorical hypotheses (e.g., whether or not there are more than 25 dots on the screen), the model can instead represent the fine-grained hypotheses relevant for estimates (e.g., with a hypothesis corresponding to each exact number of dots on the screen). With such a representation, estimates and confidence intervals can simply be a function (e.g., the mean and quantiles, respectively) of this distribution. The probabilities of categorical hypotheses used to produce choices, confidence judgments, and RT can be calculated simply by summing the posterior probability of the fine-grained hypotheses that are consistent with each choice (e.g., summing the probability of all the hypotheses in which the number of dots is more than 25).

Figure 1 Illustrations of the Variety of Behavioral Measures for a Single Task
This representational change, however, does not allow a probabilistic model to account for many of the empirical effects found with estimates and confidence intervals. For estimates, anchoring effects demonstrate a dependence on preceding choices even when the choice question that provides an "anchor" transparently contains no information (Tversky & Kahneman, 1974). Moreover, in the empirical data, estimated confidence intervals are typically far too narrow and are strikingly different depending on whether participants produce or evaluate them (Juslin et al., 2007). In addition, a long line of empirical work shows that probability judgments are systematically biased and incoherent (e.g., subadditivity, conjunction fallacies, partition dependence), which seems to argue against all purely probabilistic models (e.g., Dasgupta et al., 2017; Tentori et al., 2013; Tversky & Kahneman, 1983; Tversky & Koehler, 1994; Zhu et al., 2020).
Exact probabilistic models also show fundamental mismatches with the results of recent investigations into the source of noise in human judgment and decision-making. While probabilistic models assume a noise-free inference process using precise probabilities, there is growing empirical evidence suggesting that much, or even most, variability in decision-making in fact arises from "computational noise" (i.e., variability in the precision and approximation used to perform inference) rather than "sensory noise" (i.e., variability in relevant sensory features) or "decision noise" (i.e., variability associated with action selection; Drugowitsch et al., 2016; Findling & Wyart, 2021; Stengård & van den Berg, 2019; Wyart & Koechlin, 2016). Clearly, then, there are problems with the descriptive adequacy of all purely probabilistic models, including SDT and the SPRT, which may stem from the psychologically implausible assumption of exact calculation of probabilities and the lack of a mechanism to account for the stochasticity of the inference process. In the next section, we propose how to address these fundamental problems, before evaluating how far the proposed solution produces a better qualitative match to a wide range of regularities in human behavior.

A Sampling-Based Approximation Perspective for Rational Decision-Making
Assuming imprecise probabilities does not necessarily mean abandoning probabilistic models. While exact Bayesian computation is often out of reach for real-world computational mechanisms, including the human brain (Anderson, 1991; Aragones et al., 2005; Kwisthout et al., 2011), computer scientists and statisticians have devised a number of sophisticated, general-purpose approximations for producing useful answers with a more reasonable amount of computational time and effort. It is therefore interesting to explore whether the brain has hit on similar solutions. One major family of general-purpose approximations in computer science and statistics is based on sampling. Following the Bayesian approach, we propose that people solve cognitive tasks by building an internal model and a posterior distribution over fine-grained hypotheses, which can support the responses for all six of the aforementioned behavioral measures. But, because the exact representation of the posterior probabilities of hypotheses is typically computationally intractable, we further hypothesize that the posterior probability distribution is not computed exactly, but is approximated by drawing representative samples from it.

This sampling hypothesis is attractive because (a) sampling algorithms offer a tractable means of implementing the Bayesian models of cognition that have been successful at the computational level (Anderson, 1991; Sanborn et al., 2010), (b) sampling algorithms show much of the same behavioral variability and deviations from ideal probabilistic inference as observed in people across a range of domains (Dasgupta et al., 2017; Griffiths et al., 2012; Lieder et al., 2018; Sanborn & Chater, 2016; Vul et al., 2014; Zhu, Newall, et al., 2022), and (c) the variability of sampling algorithms has been found to match neural variability in the cortex (Fiser et al., 2010; Haefner et al., 2016; Hoyer & Hyvärinen, 2003). These observations suggest that sampling-based explanations can connect with all three of Marr's (1982) celebrated explanatory levels: computational (through implementing Bayesian inference), algorithmic (via a tractable computational mechanism), and implementational (through potentially mapping onto neural activity).

Taking a sampling-based approximation perspective on choices suggests that decision-making should be conceptualized as the problem of integrating a sequence of stochastic hypotheses into categorical decisions. The key distinction from other probabilistic models such as SDT and the SPRT is that we specifically define the "evidence" as samples of hypotheses, abstracting away from the noisy sensory percepts or memory traces used in previous models, including computational-level models (see Figure 2B). This implies that it is computational noise in the inference process that is the primary source of variability in behavior.

Table 1 (Excerpt): Qualitative Effects, Empirical Patterns, Sources, and ABS Explanations

Conjunction fallacy. The judged probability of a conjunctive event is higher than that of its constituents.

Confidence in decisions: positive relationship between confidence and stimulus discriminability. Confidence increases as stimulus discriminability increases (Baranski & Petrusic, 1998; Garrett, 1922; Johnson, 1939; Vickers, 1979; Vickers & Packer, 1982). ABS explanation: higher stimulus discriminability partitions the hypothesis space to more unevenly favor the correct response; thus, the hypothesis samples that are converted into evidence become more homogenous as discriminability increases.

Resolution of confidence: positive relationship between choice accuracy and confidence judgments (Ariely et al., 2000; Baranski & Petrusic, 1998; Garrett, 1922; Johnson, 1939; Vickers, 1979; Vickers & Packer, 1982). ABS explanation: optionally terminating an autocorrelated sampling process makes correct responses faster and thereby yields greater confidence.

Metacognitive inefficiency: metacognitive efficiency (as measured by meta-d′/d′) decreases for higher confidence ratings (Shekhar & Rahnev, 2021a, 2021b). ABS explanation: confidence is based on the proportion of response-consistent binomial samples (see Appendix D for details); the sample autocorrelation amplifies the metacognitive inefficiency.

Negative (cross-trial) relationship between confidence and RT: within the same condition, confidence decreases as RT increases (Baranski & Petrusic, 1998; Henmon, 1911; Johnson, 1939; Vickers & Packer, 1982). ABS explanation: confidence is based on the proportion of response-consistent samples under an optional stopping rule.

Positive (cross-condition) relationship between confidence and RT: people are more confident in conditions in which they take more time to make a choice (Vickers & Packer, 1982). ABS explanation: the decision threshold is greater in the accuracy condition than in the speed condition, so choices take longer while the greater difference in sample counts at threshold leads to higher confidence.

Confidence intervals: strong overconfidence in self-produced confidence intervals. Self-generated confidence intervals are far too narrow (Juslin et al., 2007).

The Autocorrelated Bayesian Sampler
Here we outline a rational process for producing probability judgments, estimates, confidence intervals, choices, confidence judgments, and RT based on a sampling approximation of the posterior probability of fine-grained hypotheses, which we call the ABS. Our key theoretical contribution is to create links between the sampling process and each of the six behavioral measures. This is possible because samples of the fine-grained hypotheses contain all of the relevant information needed to produce these (and indeed many other) aspects of behavior.
Continuing our numerosity example (see Figures 1 and 2), the ABS produces behavior based on the posterior probability of the hypotheses, P(h|s), which is calculated using Bayes' rule:

$$P(h \mid s) = \frac{P(s \mid h)\,P(h)}{P(s)}, \tag{3}$$

where h is a hypothesis, s is a stimulus, P(s|h) is the likelihood of a stimulus given a hypothesis, P(h) is the prior probability of a hypothesis, and P(s) is the overall probability of observing the stimuli across all possible hypotheses included in the internal model. In the numerosity task, for example, the hypothesis space reflects all possible numbers of dots that may have appeared on-screen, while the posterior distribution could be represented as a Gaussian distribution with mean equal to 24, the true number of dots. The variance of the distribution may stem from many kinds of uncertainty including, for example, perceptual noise and/or uncertainty associated with the processing of information.
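As a concrete illustration of Equation 3, the sketch below computes a discrete posterior over numerosity hypotheses. The uniform prior, Gaussian likelihood, and hypothesis range of 1 to 50 dots are our own illustrative assumptions, not the paper's fitted choices.

```python
import numpy as np

def numerosity_posterior(stimulus=24.0, noise_sd=4.0,
                         hypotheses=np.arange(1, 51)):
    """Posterior over fine-grained numerosity hypotheses via Bayes' rule
    (Equation 3), assuming a uniform prior and a Gaussian likelihood
    centred on each hypothesised numerosity."""
    prior = np.full(len(hypotheses), 1.0 / len(hypotheses))                # P(h)
    likelihood = np.exp(-0.5 * ((stimulus - hypotheses) / noise_sd) ** 2)  # prop. to P(s|h)
    unnormalized = likelihood * prior
    return hypotheses, unnormalized / unnormalized.sum()                   # P(h|s)

h, posterior = numerosity_posterior()
print("posterior mode:", h[np.argmax(posterior)])  # near 24, the true count
```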
This general framework applies far beyond numerosity, of course. For example, it can be applied to intuitive physics, where h is a complete object trajectory and s is the initial movement of an object (e.g., Battaglia et al., 2013; Hamrick et al., 2015; Sanborn et al., 2013), language production, where h is the next word in a sentence and s are the preceding words (e.g., Chater & Manning, 2006; Levy et al., 2008), and common-sense reasoning about other minds, where h is a social goal of other agents and s is a sequence of actions performed by those agents (e.g., Baker et al., 2008, 2009). Similarly, Bayesian models have also been successfully applied to explain effects in other areas of psychology, such as vision (e.g., Yuille & Kersten, 2006), motor control (Körding & Wolpert, 2004), causal reasoning (e.g., Abbott & Griffiths, 2011; Bramley et al., 2017), reading (Norris, 2006), and learning (e.g., Courville & Daw, 2007; Gershman et al., 2010).
Next, a set of hypotheses is sampled from the fine-grained posterior distribution, and these samples directly and straightforwardly support all six of our measures (Figure 2B). Probability judgments are based on the relative proportion of the samples (e.g., the proportion of samples with numerosities greater than 25). Estimates are based on a summary statistic of the samples (e.g., the mean sampled numerosity or the value of the most recent sample). Confidence intervals are based on the quantiles of the samples (e.g., ordering five samples and using the numerosities of the second and fourth samples as the bounds of a 60% confidence interval). Choices are based on the preponderance of the samples (e.g., depending on whether more than half the samples have numerosities greater than 25). Confidence judgments are (like probability judgments) based on the relative proportion of the samples that agree with the choice. RT are a function of the number of samples drawn (e.g., assuming that, on average, drawing four samples takes longer than drawing three).
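The following sketch shows how, in principle, a single set of samples can yield all six measures. The function name, the sampling rate, and the use of the raw response-consistent proportion for confidence (i.e., before the prior on responses introduced below) are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def behaviors_from_samples(samples, boundary=25, ci_level=0.6, rate=4.0):
    """Read the six behavioral measures off one set of hypothesis samples.

    `rate` is an assumed sampling rate (samples/second) for the Poisson
    timing process described below; np.quantile interpolates linearly
    between order statistics, in the spirit of Juslin et al. (2007).
    """
    samples = np.asarray(samples, dtype=float)
    p_above = np.mean(samples > boundary)
    lo, hi = np.quantile(samples, [(1 - ci_level) / 2, (1 + ci_level) / 2])
    return {
        "probability_judgment": p_above,                      # proportion of samples
        "estimate": samples[-1],                              # most recent sample
        "confidence_interval": (lo, hi),                      # sample quantiles
        "choice": "greater" if p_above > 0.5 else "smaller",  # preponderance
        "confidence": max(p_above, 1 - p_above),              # response-consistent share
        "rt_seconds": rng.gamma(len(samples), 1 / rate),      # Erlang(N, rate) draw
    }

print(behaviors_from_samples(rng.normal(24, 4, size=9)))
```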
To generate concrete predictions from the model, and to assess the match with human behavior, we need to outline three further aspects of the model: the choice of sampling algorithm, the prior on responses, and the stopping rule, to which we now turn.

The Sampling Algorithm
We assume that the mind conducts sampling-based approximations by drawing samples of hypotheses in proportion to the posterior probabilities associated with each hypothesis (e.g., Chater et al., 2020; Dasgupta et al., 2017; Griffiths et al., 2012; Vul et al., 2014; Zhu et al., 2020). Rather than reviewing the extensive literature on sampling algorithms in statistics and computer science (see Andrieu et al., 2003, for an overview), we focus on algorithms that have previously been shown to match human behavior in some area of psychology.
The simplest sampling algorithm is direct sampling, in which independent and identically distributed (i.i.d.) samples are drawn (e.g., Vul et al., 2014). However, a lot must be known about the target distribution to draw i.i.d. samples: people need (at least implicitly) to know the posterior probability of every hypothesis, which fails to alleviate the computational intractability problem that motivates the need for sampling-based approximations. Another difficulty for direct sampling is descriptive. Human hypothesis generation is not a process of drawing independent samples, as direct sampling requires. Instead, what comes to mind now depends on what came to mind in the past (Dasgupta et al., 2017; Gilden et al., 1995; Zhu, León-Villagrá, et al., 2022).
In light of this, researchers have recently started to explore a family of more sophisticated sampling algorithms known as Markov chain Monte Carlo (MCMC; Robert & Casella, 2004). MCMC algorithms explore the hypothesis space using only local knowledge about the probability distribution, greatly reducing the knowledge required to generate samples. The key idea of MCMC is that, in its simplest form, it represents only a single hypothesis at a time and probabilistically transitions between hypotheses in proportion to their posterior probabilities. The local transitions induce a serial dependence between samples, akin to the local transitions in human hypothesis generation (Bramley et al., 2017; Fränken et al., 2021).
In our own work, we have found that an extension of MCMC, named MC³, provides a close match to the dynamics of repeated human judgment, capturing the observed long-range autocorrelations between estimates, as well as the heavy-tailed distribution of absolute differences between successive estimates (Spicer et al., 2022b; Zhu et al., 2019, 2021; Zhu, León-Villagrá, et al., 2022). We therefore use MC³ as the sampling algorithm in the present model, though the specific mechanics of this algorithm, beyond the dependencies between samples, are not necessary for almost all of the behaviors targeted here (see Appendix A for algorithmic details). In other words, with the exception of the cross-trial autocorrelation results, which require a quantitative characterization of the dependence between samples, the key condition for a sampler to reproduce the qualitative model behaviors (e.g., comparing average model behaviors between experimental conditions) is simply that sampling is local and autocorrelated. Thus, most model predictions can be replicated using many other MCMC sampling algorithms, including the widely used Random Walk Metropolis algorithm, so long as samples are positively correlated across time.
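For intuition, here is a minimal Random Walk Metropolis sketch. We show it rather than MC³ (whose details are in Appendix A) because, as just noted, local, positively autocorrelated sampling is all that most of the qualitative predictions require; the target distribution and proposal width are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_walk_metropolis(log_post, x0, n_samples, proposal_sd=1.0):
    """Minimal Random Walk Metropolis sampler: propose local Gaussian
    moves and accept with probability min(1, P(new)/P(old)), yielding the
    positively autocorrelated chains the ABS relies on."""
    x = x0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + rng.normal(0.0, proposal_sd)
        if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
            x = proposal  # accept the local move
        chain[i] = x      # on rejection the current hypothesis repeats
    return chain

def log_post(h):
    # Gaussian posterior from the numerosity example (mean 24, sd 4)
    return -0.5 * ((h - 24.0) / 4.0) ** 2

chain = random_walk_metropolis(log_post, x0=10.0, n_samples=500)
print("lag-1 autocorrelation:", np.corrcoef(chain[:-1], chain[1:])[0, 1])
```

Starting the chain at 10 (far from the posterior mean) also illustrates the starting-point bias on early samples discussed next.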
Using dependent samples influenced our choice of how the ABS produces estimates. In past work, estimates have been based on the most recently generated sample or on an average over samples (Lieder et al., 2018; Vul et al., 2014). While the mean of a set of independent samples is clearly a better estimate of the underlying mean than a single sample, with dependent samples, earlier samples are more likely to be biased by the starting point than later samples. For this reason, we chose to use the most recent sample as our estimate. However, these two approaches do not predict qualitatively different behavior in aggregate (see Appendix E for details).

Producing confidence intervals, however, requires more than a single sample, and can instead reflect statistics of the entire set of samples: for example, the 2.5% and 97.5% quantiles of the samples can represent a 95% confidence interval of the target distribution. This approach can only be applied directly for large samples. With small samples, we produce more fine-grained intervals by following Juslin et al. (2007) and using linear interpolation to fill in the gap between the two quantiles of the samples.
We also assume that sampling takes time. For simplicity, we model the time necessary to produce N samples as a Poisson process: while the time taken to produce a new sample is random, samples are generated at a constant rate (λ samples per second). In a Poisson process, the waiting time between samples is exponentially distributed, and the time necessary to generate N samples follows an Erlang distribution:

$$f(t) \sim \text{Erlang}(N, \lambda). \tag{4}$$

The mean and variance of RT for a sample size of N are E[t] = N/λ and V[t] = N/λ², respectively. Using a Poisson process allows us to more closely link our approach to existing models such as the Poisson random walk (PRW) model (Blurton et al., 2020; discussed below), though the results in this article would be qualitatively the same under a wide variety of assumptions about how long it takes to generate each additional sample. This is because many empirical results only require assuming that the samples were generated sequentially and that generating more samples typically takes more time. Exponential waiting times are assumed here to explain the shape of RT distributions, particularly the observation that the RT for probability judgments (which we assume to have been produced using a fixed number of samples) are positively skewed (see Appendix F).

The Bayesian Monte Carlo Prior on Responses

Samples of the fine-grained hypotheses generated from our sampling algorithm can readily be used to make a choice. In our numerosity example, if asked to decide whether the number of dots that appeared on-screen is greater than 25, the hypothesis space should be partitioned into two subspaces with 25 on the boundary line. Each sample then supports whichever alternative corresponds to the subspace it falls in. In other words, evidence is directly translated from the samples, here taking one of two values. Also, unlike the evidence used in the SPRT, there is no inherent uncertainty about which alternative each sample supports. In the numerosity example, a generated sample can denote any number of dots in the hypothesis space, but it can only support one alternative in decisions: if the sample was 23, it only supports the hypothesis that there were fewer than 25 dots on screen. Similarly, for M-alternative choices (M > 2), the hypothesis space can be partitioned into M subspaces, with hypothesis samples from each subspace supporting the corresponding alternative.
These samples implicitly carry information about the probability that each choice alternative is correct. For example, when asked about the probability that the number of dots is greater than 25, the relative frequency of evidence in favor of, rather than against, the event should inform the probability estimate. But, as explored in Zhu et al. (2020), people should not directly use the relative frequency of the hypotheses as a probability estimate. This is especially true when sample sizes are small, because the relative frequency tends to be extreme. Indeed, a single sample would lead to a probability estimate of either 0 or 1. This problem can be solved by incorporating a prior on responses to temper the relative frequency in the estimates of the probability that each choice alternative is correct, an approach that in statistics is called Bayesian Monte Carlo (Gelman et al., 2013; Rasmussen & Ghahramani, 2002).

The Bayesian Sampler model of Zhu et al. (2020) used a fixed prior on responses and, for mathematical simplicity, this was chosen to be a Beta distribution, because this is the conjugate prior for probability estimates. The Beta distribution is bounded by 0 and 1 and has two parameters, α₀ and α₁, which determine its shape: when both parameters exceed 1, the Beta distribution is unimodal with a peak in the middle of the range (i.e., at (α₀ − 1)/(α₀ + α₁ − 2)); when both parameters equal 1, it is uniform; and when both parameters are less than 1, it is bimodal with peaks at both 0 and 1. Most critically, using the Beta distribution as the prior enables evidence to act as pseudo-counts in the parameters. For S(A) pieces of evidence for event A, S(¬A) for event not-A, and a Beta(α₀, α₁) prior, people will have a posterior distribution for probability estimates that is distributed according to Beta(α₀ + S(A), α₁ + S(¬A)). The Bayesian Sampler model used the expected value of this posterior distribution as its probability estimate, which is also simple to calculate:

$$\hat{P}(A) = \frac{S(A) + \alpha_0}{N + \alpha_0 + \alpha_1}, \tag{5}$$

where N = S(A) + S(¬A) denotes the total number of samples that were generated and translated into evidence. Both the prior parameters (α₀ and α₁) and the sampling process (which affects S(A) and N) affect the expected value. As the prior parameters are defined to be nonnegative (i.e., α₀, α₁ ≥ 0), the Bayesian Sampler's estimated probabilities tend to avoid extreme values and regress to the mean of the prior (i.e., α₀/(α₀ + α₁)).
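In code, Equation 5 is a one-line tempering of the raw sample proportion; the parameter values below are illustrative.

```python
def bayesian_sampler_estimate(s_a, n, alpha0=1.0, alpha1=1.0):
    """Probability estimate from Equation 5: the mean of the posterior
    Beta(alpha0 + S(A), alpha1 + S(not-A)), with the prior pseudo-counts
    tempering the raw sample proportion s_a / n."""
    return (s_a + alpha0) / (n + alpha0 + alpha1)

# With a uniform Beta(1, 1) prior, a single sample (n = 1) no longer
# forces an estimate of 0 or 1:
print(bayesian_sampler_estimate(s_a=1, n=1))  # 2/3 rather than 1
print(bayesian_sampler_estimate(s_a=0, n=1))  # 1/3 rather than 0
```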
Here, we generalize the prior on responses used in the original Bayesian Sampler in two ways. The first is to make it multivariate: in many situations, people can be asked to judge a multivariate event, where the hypothesis space should be partitioned into many subspaces. For example, when asked "What is the probability that the hottest day of the week will be Sunday?," there are seven options to consider ("Sunday hottest," "Monday hottest," and so on). In this case, the Dirichlet distribution, a multivariate generalization of the Beta distribution, is the natural conjugate prior. For an M-variate Dirichlet prior, Dir(α), with α = (α₀, α₁, … , α_{M−1}), people report the mean of the posterior distribution as their probability estimate:

$$\hat{P}(A_m) = \frac{S(A_m) + \alpha_m}{N + \sum_{j=0}^{M-1} \alpha_j}. \tag{6}$$

This view of probability estimates implies an indifference point (where the underlying probability and the estimated probability match) that depends on the number of alternatives (see Figure 3A).

The second way in which we generalize the prior on responses of the Bayesian Sampler is to allow it to adapt to experience (e.g., the trial history in an experiment). In Bayesian data analysis, when no prior information is available, a default prior is typically recommended (Gelman et al., 2013). However, for many real-world applications, and especially for everyday cognitive tasks, historical data (e.g., past experiences of the same task, data from previous similar tasks, or observations of others performing the task) are available, which can help people construct an appropriate prior. For example, if repeatedly choosing between the same two alternatives, historical choice data should provide useful information such as the base rate, which in turn can help construct a prior on responses to guide future decisions. How to construct an adaptive prior based on historical data is a topic of debate in statistics and computer science, because it is difficult to determine how much to generalize previous experience to new situations (Chen et al., 2000; Diaconis & Ylvisaker, 1979; Ibrahim et al., 2015). For simplicity, we assume that people only use information from the immediately previous trial to develop their adaptive prior for the present trial: in binary choice, a noninformative, uniform prior (Beta(1,1)) is adjusted to favor the option the feedback indicated was correct, thus becoming either Beta(2,1) or Beta(1,2).
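A sketch of both generalizations, assuming (as above) a uniform starting prior and a single pseudo-count update from feedback; the function names are ours.

```python
import numpy as np

def dirichlet_estimate(counts, alphas):
    """Mean of the Dirichlet posterior: the M-variate analogue of
    Equation 5 (Equation 6). counts[m] = S(A_m); alphas are the Dirichlet
    prior parameters."""
    counts, alphas = np.asarray(counts, float), np.asarray(alphas, float)
    return (counts + alphas) / (counts.sum() + alphas.sum())

def adapt_prior(feedback_correct_option):
    """One-trial adaptive prior for binary choice, as assumed in the text:
    start from uniform Beta(1, 1) and add one pseudo-count to whichever
    option the previous trial's feedback marked correct."""
    return (2.0, 1.0) if feedback_correct_option == 0 else (1.0, 2.0)

# Seven-way partition ("hottest day of the week"): with no samples, every
# estimate regresses to 1/7.
print(dirichlet_estimate(np.zeros(7), np.ones(7)))
print(adapt_prior(0))  # Beta(2, 1): biased toward option 0 on the next trial
```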
The adaptive prior on responses, in conjunction with the generated samples, then determines the model's estimated probability that a categorical alternative is correct. This estimated probability is used both as the model's probability estimate and as its confidence judgment in whether a choice is correct. The equivalence between the two is not unique to our model: it has previously been posited as the Bayesian Confidence Hypothesis (Kepecs & Mainen, 2012; Mamassian, 2016; Pouget et al., 2016), and has attracted both support (Calder-Travis et al., 2020) and criticism (Li & Ma, 2020).

The Stopping Rule
Any model of judgment or decision that depends on the sequential accumulation of evidence needs a rule determining at what point to stop collecting evidence and make a decision.When to stop drawing samples should depend on both the costs (e.g., metabolic, opportunity, etc.) of sampling as well as the task-specific benefits of additional samples for providing a good response.
For probability judgments, estimates, and confidence intervals, in the absence of a clear alternative stopping rule, we make the simplest possible assumption: that a fixed number of samples is drawn to answer each question (though we revisit this point in Appendix F). A fixed number of samples allows the model to produce indifferent probability judgments (e.g., judging a binary event to have a probability of 0.5), as is often observed in the human data, and fits with the even distribution across reaction times observed in our own experiments (see Figure F1 in Appendix F).

Footnote 6: The Bayesian Monte Carlo prior on responses is different from the prior on hypotheses in Equation 3. Specifically, the Bayesian Monte Carlo prior should capture full or partial information about the frequencies of the relevant behavioral outcomes in past trials. Consider again the binary choice in the numerosity example, where people were asked to judge whether the number of dots is greater than 25. In this case, the prior on responses should reflect, to some extent, prior belief in the different probabilities that each response is correct; this prior knowledge could be acquired through feedback from correctly choosing greater-than-25 and from correctly choosing the alternative, less-than-or-equal-to-25. An additional difference is that when knowledge of the probabilities is precise, even if there is uncertainty in the hypotheses, the effect of the Bayesian Monte Carlo prior is reduced. Thus, in Equation 5, as the sample size N approaches infinity, P̂(A) tends to P(A).

Footnote 7: Incorporating historical information into new situations is known as the power prior in the statistics literature, which is also closely related to the ideas of meta-learning (or learning-to-learn) and hierarchical Bayesian modeling.

Footnote 8: Confidence judgments are often made on various ordinal rather than probability scales, though analyses of confidence judgments often just assume that they are monotonically related (Li & Ma, 2020; Shekhar & Rahnev, 2021a), as we do here. For comparability across different ordinal scales, we present all of the model predictions on the probability scale rather than specifying those relationships.

Footnote 9: A reanalysis of the four probability judgment experiments presented in Sundh et al. (2021) shows that participants produced indifferent probability judgments (i.e., exactly 50 on a 0-100 scale) on about 6% of trials.
However, for making decisions, rather than probability judgments, a fixed sample size is likely to be too simple. If the samples so far leave the evidence finely balanced between the possible decisions, then it is likely that more data will be collected. While it is possible in principle to derive an optimal stopping rule for the sampling process in this model, unlike with the SPRT, the optimal rule is not analytically tractable and can instead only be computed using dynamic programming (see Appendix C). So, again for simplicity, we use a well-known heuristic stopping rule instead: the max-minus-next rule, which counts the difference in evidence between the top two hypotheses and terminates the sampling process whenever the accumulated difference exceeds a threshold. This simple heuristic stopping rule has been shown to approach the performance of an optimal SPRT even in multi-alternative settings (Dragalin et al., 1999, 2000). For binary choices, this reduces to just the difference in the number of samples in favor of each alternative, which has been proposed in past work (Hamrick et al., 2015; Vul et al., 2014). The decision-making panels of Figure 2B demonstrate the max-minus-next stopping rule with a threshold value of 2.
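A minimal sketch of the max-minus-next rule, with the hypothesis space partitioned by a list of boundaries; the threshold value and the Gaussian sample stream are illustrative.

```python
import numpy as np

def max_minus_next_choice(sample_stream, boundaries, threshold=2):
    """Optional stopping with the max-minus-next rule: stop as soon as the
    evidence count for the leading alternative exceeds the runner-up's by
    `threshold`. `boundaries` partitions the hypothesis space into
    len(boundaries) + 1 alternatives via np.digitize."""
    counts = np.zeros(len(boundaries) + 1, dtype=int)
    n = 0
    for h in sample_stream:
        n += 1
        counts[np.digitize(h, boundaries)] += 1
        sorted_counts = np.sort(counts)
        if sorted_counts[-1] - sorted_counts[-2] >= threshold:
            break
    return int(np.argmax(counts)), n  # chosen alternative, samples drawn

rng = np.random.default_rng(3)
choice, n_used = max_minus_next_choice(rng.normal(24, 4, 200), boundaries=[25])
print(choice, n_used)  # e.g., alternative 0 ("25 or fewer") after a few samples
```

For two alternatives this is exactly the difference-of-counts rule mentioned above; with more boundaries it handles multi-alternative choice unchanged.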
While the choice of stopping rule does not change how samples are used to produce the different measures, it does influence the content of the samples and the variability of the sample size, and hence response times. So, for example, in our model, while the RT for a probability judgment (which is assumed to have a fixed sample size) will follow an Erlang distribution (see Appendix F for further justification), the RT for a choice (which assumes optional stopping) will follow a mixture of Erlang distributions (see Appendix B for details).
Explaining Key Empirical Targets in Probability Judgments, Estimates, Confidence Intervals, Choices, Confidence Judgments, and RT

We now demonstrate the explanatory power of the ABS. We focus on behavioral results that deviate from the Bayesian ideal embodied in models like the SPRT (see Table 1), simulating these using a consistent set of parameters (detailed in Appendix A). To clarify the active ingredients of the model, we also show results from three restricted variants of the full ABS model. The no-prior variant removes the adaptive prior (this is equivalent to fixing the prior to Beta(0,0)) while keeping the remaining components. The direct-sampling variant uses independent samples instead of autocorrelated samples while keeping the remaining components. The fixed-sample-size variant always uses a fixed number of samples (N = 5) for both judgments and decisions.

Biases in Probability Judgments
The wide range of systematic biases in probability judgments is perhaps the most direct evidence against purely normative probabilistic models as the basis for a descriptive psychological account. We find that the prior on responses and the local nature of the sampling algorithm of the ABS (which reduce the computational burden of the model by reusing old but useful computations and by using only local knowledge of the posterior distribution, respectively) suffice to produce many of these biases. Note that in the ABS, there are no biases in the underlying posterior probabilities; biases arise solely from the algorithmic process by which the posterior is sampled and judgments and decisions are generated.
Using prior knowledge to temper the probability estimates was the basis of the Bayesian Sampler model (Zhu et al., 2020). The ABS works in the same way, except that it uses autocorrelated, rather than independent, samples (see Footnote 10 below). As shown in Figure 3A, the prior in the Bayesian Sampler produces a linear bias toward conservative judgments (Zhu et al., 2020), whereby people avoid the extremes in their probability judgments (Erev et al., 1994; Fiedler, 1991; Peterson & Beach, 1967). This type of conservatism captures the results of a series of probabilistic identities investigated by Costello, Watts, and colleagues (Costello et al., 2018; Costello & Watts, 2014, 2018), which were constructed by adding and subtracting various mean judgments across combinations of events. While all of these identities would equal zero if participants reported coherent probabilities (on average), mean judgments were zero for some identities and substantially different from zero for others. The results from the entire set of identities, including conditional probability judgments of dependent events, were well fit by the Bayesian Sampler's linear conservatism bias (Zhu et al., 2020). As the average behavior of the ABS is approximated by the Bayesian Sampler, especially when the effects of local sampling are not strong (e.g., where there are random initializations of the local sampler), the ABS will produce these results as well.

Footnote 10: The autocorrelation of the samples does not qualitatively alter the overall model predictions for probability judgments (with the exception of implicit subadditivity and implicit superadditivity, discussed below), because autocorrelated sample sizes can be corrected to produce the effective sample size of independent samples using the following equation:

$$N_{\text{eff}} = \frac{N}{1 + 2\sum_{t=1}^{\infty} \rho_t},$$

where ρ_t is the degree of autocorrelation at lag t. For the parameters we used in the simulations, the effective sample size is on average 16.80% of the autocorrelated sample size (95% CI [16.40%, 17.21%]). In addition, in the studies we refer to, there is very rarely any feedback. Without feedback, we assume the ABS prior does not change from trial to trial, making it identical to the fixed prior of the Bayesian Sampler for binary events.
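As a quick illustration of the effective-sample-size correction in Footnote 10, the sketch below estimates N_eff from a chain, truncating the autocorrelation sum at the first non-positive lag; this cutoff is a standard practical choice and an assumption of ours, not necessarily the paper's procedure.

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """N_eff = N / (1 + 2 * sum_t rho_t), with the sum over lag-t
    autocorrelations truncated at the first non-positive value."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    var = np.mean(x ** 2)
    rho_sum = 0.0
    for t in range(1, min(max_lag, len(x) - 1)):
        rho = np.mean(x[:-t] * x[t:]) / var  # lag-t autocorrelation
        if rho <= 0:
            break
        rho_sum += rho
    return len(x) / (1 + 2 * rho_sum)

# Applied to the Random Walk Metropolis chain sketched earlier, N_eff comes
# out well below the nominal chain length, in the same spirit as the
# footnote's roughly 17% figure.
```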
The sample size and prior on responses of the ABS can be dissociated by examining the mean-variance relationship in probability judgments. When probability judgments are based on sampled outcomes, the relationship between the mean probability estimates and the variance of the estimates will follow an inverse U-shaped ("rainbow-shaped") curve (see Figure 4A). The prior on responses then constrains the range of possible probability estimates that an agent can produce, thereby lowering the relative variance and pulling the curve both inward and downward (see Figure 4B). For example, for a binary event with a uniform prior, if a single sample is drawn, probability judgments will be either 0.33 or 0.67, and the total variance will be lower than for the raw proportions of sampled outcomes (now taking account of the prior on responses).
Overall, the Bayesian Sampler predicts a shrinkage of the mean-variance curve for probability judgments, and this was empirically validated in four experiments (Sundh et al., 2021). For the same reasons as the Bayesian Sampler, the ABS model predicts this shrinkage of the mean-variance curve as well (see Figure 4B).
Moreover, explicit subadditivity in probability judgments also occurs as a direct consequence of using the prior on responses. Explicit subadditivity is when the estimated probability of an event (A₀) is lower than the sum of the estimated probabilities for events (A₁, A₂, … , A_{M′}), where A₀ is the disjunction of those M′ mutually exclusive events. That is:

$$\hat{P}(A_0) < \sum_{m=1}^{M'} \hat{P}(A_m), \tag{7}$$

where probability theory requires that these should be equal. An explicit subadditivity bias has generally been observed in between-participants designs in which participants were asked explicitly to judge the probability of each of the M′ events and their disjunction, A₀, so that a total of M′ + 1 probability estimates were recorded (e.g., Tversky & Koehler, 1994). According to the sampling account, for each query, because participants do not know the full range of questions to be asked, they will treat the event to be judged as a binary event; that is, participants will sample instances and noninstances of that event (i.e., A_m vs. not-A_m), thus requiring a Beta prior on responses. The resulting estimate of each P(A_m) will therefore be inflated by regression to the mean. The regression-to-mean effect then applies multiple times on aggregate to the right-hand side of Equation 7 and only once to the left-hand side, predicting a subadditivity bias for low-probability events. As a corollary, the greater the number of disjunctive hypotheses, the more probability judgments will be queried on the right-hand side of Equation 7, which should lead to a greater degree of subadditivity bias. For M′ component hypotheses, the predicted difference between the sum of the M′ probability estimates and the probability estimate of the disjunction can be derived as:

$$\sum_{m=1}^{M'} \hat{P}(A_m) - \hat{P}(A_0) = (M' - 1)\,\frac{\alpha_0}{N + 2\alpha_0}, \tag{8}$$

where the assumptions are a fixed sample size (N) and a symmetric prior on responses, Beta(α₀, α₀). Indeed, the empirical findings suggest a positive relationship between M′ and the degree of explicit subadditivity bias, and the Bayesian Sampler correctly captures this relationship (see Figure 5). Moreover, when the probability of the disjunction of M′ mutually exclusive events was exactly 1 (so that one of the disjunctive hypotheses must be true by logic), participants were sometimes asked to judge only the probabilities of the M′ component hypotheses but not their disjunction. In this case, model predictions can be analytically approximated as $(M' - 2)\frac{\alpha_0}{N + 2\alpha_0}$ under the same assumptions as before. This prediction also matches the empirical pattern known as binary complementarity: on average, no subadditivity bias was observed for mutually exhaustive events when M′ = 2 (Tversky & Koehler, 1994). The ABS model inherits these predictions from the Bayesian Sampler.
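Equation 8 can be checked by simulation. The sketch below uses independent samples, a fixed sample size, and a symmetric Beta prior; the values of N, α₀, and the component probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def subadditivity_gap(m, p_disjunction=0.2, n=5, alpha0=1.0, trials=100_000):
    """Monte Carlo check of Equation 8 under fixed sample size n and a
    symmetric Beta(alpha0, alpha0) prior. Each of the m mutually exclusive
    components is assigned probability p_disjunction / m."""
    p_comp = p_disjunction / m
    # Sum of m tempered component estimates, each from n binary samples
    s_comp = rng.binomial(n, p_comp, size=(trials, m))
    sum_components = ((s_comp + alpha0) / (n + 2 * alpha0)).sum(axis=1)
    # Tempered estimate of the disjunction itself
    s_disj = rng.binomial(n, p_disjunction, size=trials)
    disjunction = (s_disj + alpha0) / (n + 2 * alpha0)
    return sum_components.mean() - disjunction.mean()

for m in (2, 3, 5):
    predicted = (m - 1) * 1.0 / (5 + 2 * 1.0)  # Equation 8 with n=5, alpha0=1
    print(m, round(subadditivity_gap(m), 3), "predicted:", round(predicted, 3))
```

The simulated gaps track the analytic values (1/7, 2/7, and 4/7 here), growing with the number of components M′ as the text describes.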
Similarly, this regression-to-mean effect predicts the conjunction fallacy (Costello & Watts, 2017; Zhu et al., 2020). The conjunction fallacy arises when the estimated probability of a conjunctive event is greater than that of a constituent event, $\hat{P}(A_0 \cap A_1) \ge \hat{P}(A_0)$, whereas probability theory requires the probability of a conjunctive event to be less than or equal to that of its constituents, A₀ and A₁ (Tversky & Kahneman, 1983). The conjunction fallacy occurs in the Bayesian Sampler when the regression-to-mean applies more to the conjunctive event than to the constituent events. Specifically, it is assumed that fewer samples of the more complex conjunctive events can be generated or tallied in a fixed amount of time, and the prior thus produces a greater regression-to-mean effect for the smaller sample sizes (see Equation 6). This allows the Bayesian Sampler to predict above-chance levels of conjunction fallacies when the conjunction and constituent event both have low probability (Zhu et al., 2020), as is often the case in experiments (Costello & Watts, 2017). The ABS model also inherits this prediction from the Bayesian Sampler.
In contrast with the explicit judgments of M′ + 1 probabilities above, both subadditivity and its opposite effect, superadditivity, have been observed in so-called implicit experimental designs. In implicit designs, only two probability judgments are made: one for the unpacked descriptor (e.g., "baby bottles and other bottles made of glass") and one for the simple disjunctive descriptor (e.g., "bottles made of glass"; Dasgupta et al., 2017; Sloman et al., 2004). Unpacking to typical examples (e.g., a baby bottle in the category of bottles made of glass) leads to subadditivity, with the unpacked descriptor judged more probable than the simple disjunctive descriptor; unpacking to atypical examples (e.g., a shampoo bottle in the category of bottles made of glass) leads to superadditivity, with the unpacked descriptor judged less probable (Dasgupta et al., 2017; Sloman et al., 2004). Again, because A₀ was unpacked into M mutually exclusive events (A₀ = A₁ ∪ A₂ ∪ … ∪ A_M), probability theory requires the two probability estimates to be equal. Previous work with autocorrelated sampling models (Dasgupta et al., 2017; Sanborn & Chater, 2016) accounted for this effect by assuming that the descriptor influences the local sampler's starting point: typical unpacking initializes the sampler in a high-probability region of the hypothesis space, while atypical unpacking initializes it in a low-probability region. As a result, the proportion of hypotheses supporting the event's occurrence will be highest for typical unpacking, intermediate for the simple disjunctive descriptor (assuming it results in a random starting point), and lowest for atypical unpacking. We believe that the ABS will inherit this prediction because it produces autocorrelated samples, though we do not reproduce it here because auxiliary assumptions about the locations and probabilities of hypotheses are needed to do so. This explanation of implicit subadditivity and superadditivity depends on local sampling. These effects cannot be predicted by the Bayesian Sampler model (Zhu et al., 2020; see a similar argument against a "regressive model" in Tversky & Koehler, 1994), which assumes independent sampling.
Interestingly, people's probability estimates also exhibit so-called "partition dependence." That is, they regress to 1/M, where M is the number of alternatives that people are encouraged to consider (see Figure 3B for a summary; Attneave, 1953; Bardolet et al., 2011; Fox & Rottenstreich, 2003; Varey et al., 1990). For example, asking "What is the probability that Sunday will be hotter than any other day next week?" encourages participants to treat the event as binary, and their estimates were observed to be biased toward 1/2, while asking "What is the probability that the hottest day of the week will be Sunday?" encourages participants to consider seven possible outcomes, and estimates were observed to be biased toward 1/7 (Fox & Rottenstreich, 2003). In the ABS, framing the probability query as judging an M-alternative event invokes a Dirichlet prior with M parameters, Dir(α₀, α₁, … , α_{M−1}), which for a binary event reduces to a Beta prior, Beta(α₀, α₁). Partition dependence effects can be explained by assuming that people have no a priori reason to believe one event occurs more often than another: α₀ = α₁ = ⋯ = α_{M−1}, and so probability estimates are predicted to be biased toward 1/M (see Figure 3A). In the ABS, the impact of this noninformative prior should be more pronounced in situations where people are less knowledgeable about the probability estimation task or less confident in a learning context (reflecting fewer samples), matching the empirical results (See et al., 2006).

Choice Accuracy and RT
The Bayesian Monte Carlo process for choice and RT correctly predicts four key relationships between choice and RT. First, and in common with many other evidence accumulation models, the ABS predicts a trade-off between accuracy and speed, where increasing decision thresholds leads, on average, to more evidence being accumulated (and thus higher accuracy) as well as longer RT. This trade-off between accuracy and speed has been widely documented in the literature (Garrett, 1922; Johnson, 1939; Pachella, 1974; Ratcliff & Rouder, 1998; Wickelgren, 1977).
Second, unlike many models, the ABS predicts that correct and incorrect responses have unequal average RT. The empirical result is that, when accuracy is emphasized (or in difficult tasks), errors are usually slower than correct responses. By contrast, when speed is emphasized (or in easy tasks), errors are usually faster (Luce, 1986; Ratcliff et al., 2003; Ratcliff & Rouder, 1998; Swensson, 1972). This empirical pattern is surprisingly difficult to match for models that accumulate relative evidence to symmetric bounds: these models predict that the response time distributions for correct responses and errors will always be the same, regardless of choice accuracy (Link & Heath, 1975; Vickers, 1979). To produce slow errors, the usual route is to add variability to the strength of the "signal," or the drift rate in DDMs (Ratcliff & Rouder, 1998). While, for any fixed signal strength, correct and error responses have equal mean RT, weak signals are both more error-prone and slower. So, with an equal mixture of strong and weak signals, there will be more slow errors and more fast correct responses.
The ABS produces slow errors in a different way. Instead of adding cross-trial variation to the signal strength, or independent within-trial variation to the signal strength (Diederich & Oswald, 2016), slow errors result from the local sampling algorithm producing autocorrelated samples. For example, if the sampling algorithm begins far above the decision boundary (e.g., in the red subspace of the posterior of hypotheses illustrated in Figure 2B), then the initial samples will almost all favor the correct response, while if the sampling algorithm begins far below the decision boundary (e.g., in the blue subspace of the posterior of hypotheses), then the initial samples will almost all favor the incorrect response. Slow errors also require optional stopping, because with a fixed stopping rule the response distribution is itself fixed. This can be seen in the simulation in Figure 6B: both autocorrelation and optional stopping (i.e., the no-prior variant) are needed to produce errors that are, on average, slower than correct responses.
Fast errors, often found in easy tasks, are produced in a different way. The usual route to producing fast errors is to assume variability in the starting point of the evidence accumulation process (Laming, 1968; Ratcliff & Rouder, 1998; Ratcliff et al., 2003). In the ABS, the adaptive prior on responses is assumed to change in response to the outcomes of the preceding trial. This encourages repeating past successes, but also introduces cross-trial variability in the starting point of the accumulator. This is because the accumulator will be biased toward whichever response was correct on the last trial, and assuming that (as is usual in experiments) the correct response randomly varies between trials, it will sometimes be closer to the correct threshold and sometimes closer to the error threshold. For the latter trials, the amount of evidence required to reach the error threshold is reduced, leading to a shortened mean RT for errors. As with slow errors, optional stopping is also necessary: only with both the adaptive prior on responses and optional stopping (i.e., the direct sampling variant) do fast errors appear in the simulation in Figure 6D.
The differences between the simulations of the "difficult-accuracy" condition (Figure 6B) and the "easy-speed" condition (Figure 6D) track the conditions in which slow errors and fast errors are found. We assume that: (a) greater emphasis on accuracy causes the threshold to be higher, and consequently more pieces of evidence are needed to terminate the sampling algorithm,11 and (b) easier stimuli make the evidence more homogeneous (e.g., samples are more likely to point to the same response). As a result, the "easy-speed" condition involves integration over more homogeneous but smaller amounts of evidence than the "difficult-accuracy" condition. In other words, the starting point of the accumulator has more influence, and the degree of autocorrelation less influence, on the predicted behavior in the "easy-speed" condition than in the "difficult-accuracy" condition. Across Figure 6B and D, only the full ABS model matches the empirical observations that slow errors are more common in the "difficult-accuracy" condition, while fast errors are more common in the "easy-speed" condition.
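The mechanisms described in the last three paragraphs can be illustrated with a small simulation. The sketch below is our own simplified stand-in for the full ABS, not the article's implementation: it assumes a Gaussian posterior over hypotheses, a random-walk Metropolis chain as the autocorrelated local sampler, and pseudo-counts for the adaptive prior on responses; all numerical settings (mu, delta, step) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def abs_choice_trial(mu, delta, prior=(1, 1), step=0.3, max_samples=5000):
    """One ABS binary-choice trial (simplified, illustrative sketch).

    Hypotheses are drawn from a posterior N(mu, 1) by a Metropolis chain
    (the autocorrelated local sampler); each sample becomes one piece of
    evidence via its sign (correct response: h > 0).  Sampling stops when
    prior pseudo-counts plus evidence counts differ by delta (the
    max-minus-next rule).  Returns (correct?, RT in number of samples)."""
    n_pos, n_neg = prior                    # adaptive prior on responses
    h = rng.normal(mu, 1.0)                 # chain starts at a posterior draw
    for t in range(1, max_samples + 1):
        if h > 0:
            n_pos += 1
        else:
            n_neg += 1
        if abs(n_pos - n_neg) >= delta:
            return (n_pos > n_neg), t
        prop = h + rng.normal(0.0, step)    # local proposal
        if rng.random() < np.exp(min(0.0, ((h - mu)**2 - (prop - mu)**2) / 2)):
            h = prop                        # Metropolis acceptance
    return (n_pos > n_neg), max_samples

def report(label, trials):
    ok = np.array([c for c, _ in trials]); rt = np.array([t for _, t in trials])
    print(f"{label}: acc={ok.mean():.2f} "
          f"RT(correct)={rt[ok].mean():.1f} RT(error)={rt[~ok].mean():.1f}")

# Difficult stimulus, accuracy emphasis: high threshold, sluggish chain;
# errors arise mostly from near-boundary starting points and so are slow.
report("difficult-accuracy",
       [abs_choice_trial(mu=1.0, delta=10, step=0.2) for _ in range(5000)])

# Easy stimulus, speed emphasis: low threshold, faster-mixing chain, and an
# adaptive prior favoring the previously correct response; on the half of
# trials where that response is now wrong, the head start toward the error
# boundary produces fast errors.
easy = ([abs_choice_trial(mu=2.0, delta=2, prior=(2, 1), step=1.0)
         for _ in range(2500)] +
        [abs_choice_trial(mu=2.0, delta=2, prior=(1, 2), step=1.0)
         for _ in range(2500)])
report("easy-speed", easy)
```

The key design point mirrors the text: autocorrelation makes the chain's starting region act like a trial-specific drift rate (yielding slow errors under high thresholds), while the asymmetric prior acts like starting-point variability (yielding fast errors under low thresholds).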
As in other models (e.g., Blurton et al., 2020), the assumptions of an exponential waiting time between consecutive samples and the optional stopping rule correctly reproduce many distributional properties of RTs, including that (a) there tends to be one mode in the distribution and (b) distributions with higher means are more positively skewed. Further regularities in the shapes of RT distributions were stressed by Ratcliff et al. (2015) using quantile-quantile (Q-Q) plots (see Figure 7A). Plotting the quantiles of RT from one difficulty condition against the quantiles from another difficulty condition, the empirical Q-Q plots reveal near-linear relationships and a fan shape: increasing task difficulty has its greatest impact on the tails of the distribution, with the near linearity suggesting similar RT distribution shapes across difficulty conditions. As shown in Figure 7B, the ABS captures both the fan shape and the near-linear regularity. The direct-sampling variant shows results that are even closer to linear, as would be expected if the autocorrelation in samples causes the upper tails of the RT distributions to spread out further in harder tasks. Also of interest is the fixed-sample-size variant: because it always collects the same number of samples at all difficulty levels, it produces identical quantiles between RTs from one level of difficulty and those from another, and thus does not match the empirical data.

Confidence in Decisions
From a Bayesian perspective, it is natural to map decision confidence onto the posterior probability that the decision is correct, a mapping which has been called the Bayesian Confidence Hypothesis (Kepecs & Mainen, 2012; Mamassian, 2016; Pouget et al., 2016). For the SPRT, the posterior probability is updated as sensory samples are observed, and its value at the time of choice is simple: it is the posterior probability when the threshold is reached, because evidence collection stops once this occurs (see Figure 2A, confidence). The SPRT thus predicts that decision confidence is determined only by the threshold values, because the threshold captures the amount of evidence favoring one hypothesis or the other. Given that the threshold value is fixed prior to, and independent of, the characteristics of a particular trial, this means that confidence will be the same for all trials on which the same hypothesis is chosen.12 Unlike the SPRT, the ABS does not have direct access to the posterior probability that a response is correct (i.e., its confidence). Instead, it needs to estimate this probability given a set of samples (see Figure 2B, confidence). Fortunately, the form of the adaptive prior on responses (a Beta distribution in the case of binary choice) makes this estimate easy to update as samples are sequentially generated. At the start of the trial, the adaptive prior on responses reflects the prior belief in the different probabilities that each response is correct. Using the binary choice example, assume a prior for choice A of Beta(i, j) (and a prior for choice B of Beta(j, i)). When coming to a decision, samples in favor of each response, S(A) and S(B), are sequentially collected until the decision process is terminated by the stopping rule. The confidence after N samples (i.e., N = S(A) + S(B)), writing the case in which A is chosen, is then

Conf = (S(A) + i) / (N + i + j). (8)

The max-minus-next heuristic stopping rule terminates the sampling algorithm when the quantity of evidence favoring one choice exceeds a threshold, Δ = |(i + S(A)) − (j + S(B))| > 0. The final decision confidence can then be rewritten as

Conf = 1/2 + Δ / (2(N + i + j)), (9)

where the confidence judgments predicted by the ABS are decided by both the threshold value (Δ) and the amount of evidence accumulated (N; this is the same as the number of samples generated, because evidence is directly mapped from hypothesis samples; for example, a sample of 27 dots will be converted into a piece of evidence for the proposition that the number of dots is greater than 25): the greater the number of samples generated before a decision is reached, the lower the confidence in that decision. This is because the ABS embodies a prior over the strength of signal in the Bayesian Monte Carlo process, and the longer the sampling process continues, the more likely it is that the signal is weak, so the posterior probability that the decision is correct correspondingly decreases. This is in contrast to the SPRT: in the SPRT, confidence is unaffected by additional sampling because confidence is determined by a fixed decision threshold.
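As a minimal illustration of Equations 8 and 9, assuming a symmetric Beta(1, 1) prior (i.e., i = j = 1; the helper function and values are our own toy example):

```python
def abs_confidence(s_chosen, n_total, i=1, j=1):
    """Equation 8: confidence that the chosen response is correct after
    s_chosen supporting samples out of n_total, under a Beta(i, j) prior."""
    return (s_chosen + i) / (n_total + i + j)

# With threshold delta = 2, a decision reached after 2 samples is held with
# more confidence than the same decision reached after 100 samples:
print(abs_confidence(2, 2))     # (2 + 1) / (2 + 2)    = 0.75
print(abs_confidence(51, 100))  # (51 + 1) / (100 + 2) ~ 0.51
```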
The first key effect, the positive relationship between confidence and stimulus discriminability, follows from variation in the strength of the signal in the ABS. More discriminable stimuli will result in more homogeneous evidence supporting one alternative (i.e., samples will more consistently support one response alternative over the other), and because decision confidence is a transformation of the proportion of samples that support the chosen response, more discriminable stimuli will on average produce higher confidence judgments (see Equation 8). Conversely, on more difficult trials, the evidence will be more heterogeneous, and so the ABS predicts lower average decision confidence. Figure 8A shows this qualitative effect arising in ABS model simulations, in which confidence is expressed on a probability scale that is ordinally related to the scale with which the empirical data were collected. This pattern is produced by all model variants (see Table 2).
Second, average confidence ratings tend to be higher for correct responses than for incorrect responses (e.g., Ariely et al., 2000; Baranski & Petrusic, 1998; Vickers, 1979, 2014; Vickers & Packer, 1982). This so-called "resolution-of-confidence" effect holds even if stimulus difficulty is held constant (Baranski & Petrusic, 1998) and even if choice and confidence are simultaneously elicited from participants (Kiani et al., 2014; Ratcliff & Starns, 2009; Van Zandt, 2000). Once again, the SPRT cannot properly explain this effect given that its thresholds are fixed prior to, and independently from, the characteristics of particular trials (e.g., the threshold is constant across all trials or randomly drawn from a fixed distribution). However, if we assume that people have the correct generative model of the task (i.e., the probability of generating a sample that supports the correct alternative is the largest among all alternatives), the ABS predicts that correct responses will on average be made with higher confidence. This is tied to the explanation for slow errors above: autocorrelations cause errors to be slower on average, and slower responses produce lower confidence judgments (see Equation 9). Therefore, the ABS predicts a resolution-of-confidence effect in experimental conditions that produce slow errors (see Figure 8B). As this effect requires both optional stopping and autocorrelated samples, as are also required to produce slow errors, only the full model and the no prior variant produce it (see Table 2).
Third, studies have shown that the metacognitive judgments in confidence ratings generally carry less information about the accuracy of a decision than would be predicted by a purely normative account like the SPRT. Thus, there seems to be a systematic deficit in "metacognitive efficiency" (Shekhar & Rahnev, 2021a, 2021b). To give an intuition, imagine a participant is asked to make a decision whether to respond A or B to a stimulus. The participant's ability to discriminate between the alternatives (i.e., d′) can be calculated, based on SDT, by using the percentage of A stimuli that are correctly identified (i.e., hits) and the percentage of B stimuli that are incorrectly identified as A stimuli (i.e., false alarms). This standard d′ measure can also be extended to metacognition by choosing a confidence criterion and recalculating the hit and false alarm rates from confidence judgments that exceed this criterion to produce a meta_d′.13 SDT predicts that d′ equals meta_d′ for any confidence criterion, and so predicts that metacognitive judgments are always efficient (while the SPRT predicts constant confidence judgments and so cannot be evaluated using this measure). By contrast, a value of meta_d′/d′ < 1 would indicate that information available for the decision is lost, in part or in whole, when making confidence judgments. Empirically, metacognition has been found to be inefficient, and moreover meta_d′ decreases relative to d′ as the confidence criterion increases, meaning that higher confidence ratings are less informative than lower confidence ratings (Shekhar & Rahnev, 2021a, 2021b). Metacognitive inefficiency has been explained by adding additional noise to confidence judgments (Shekhar & Rahnev, 2021a).

13 Informativeness of choices and confidence ratings is measured as stimulus sensitivity d′ and meta_d′, respectively (Fleming & Lau, 2014; Maniscalco & Lau, 2012). More specifically, d′ = ϕ−1(hit rate) − ϕ−1(false alarm rate), where ϕ−1 is the inverse of the cumulative Gaussian distribution. meta_d′ is calculated in the same manner but with the hit rate and the false alarm rate tallied according to a criterion value that partitions confidence ratings: the hit rate is the proportion of trials in which participants reported high confidence given a correct response, whereas the false alarm rate is the proportion of trials in which participants reported high confidence given an incorrect response; the confidence criterion value determines whether a confidence judgment is considered high or low.
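For concreteness, here is a small sketch of the d′ and criterion-based meta_d′ computations described in Footnote 13. The synthetic confidence data and criterion values are purely illustrative, and note that the full meta_d′ measure of Maniscalco and Lau (2012) is estimated by fitting an SDT model rather than computed directly as done here.

```python
import numpy as np
from scipy.stats import norm

def dprime(hit_rate, fa_rate):
    """d' = phi^{-1}(hit rate) - phi^{-1}(false alarm rate)."""
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def meta_dprime(confidence, correct, criterion):
    """Criterion-based analogue of meta_d' (see Footnote 13): hit rate is
    P(high confidence | correct), false alarm rate is
    P(high confidence | error), with "high" defined by the criterion."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=bool)
    hit = np.mean(confidence[correct] >= criterion)
    fa = np.mean(confidence[~correct] >= criterion)
    return dprime(hit, fa)

# Illustrative synthetic data in which confidence carries some, but not all,
# of the information in the decision:
rng = np.random.default_rng(3)
correct = rng.random(20_000) < 0.75
confidence = np.clip(0.5 + 0.25 * correct
                     + rng.normal(0.0, 0.15, correct.size), 0.5, 1.0)
for crit in (0.7, 0.8, 0.9):
    print(f"criterion {crit}: meta_d' = "
          f"{meta_dprime(confidence, correct, crit):.2f}")
```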
While it would be straightforward to add noise to the ABS confidence judgments, surprisingly this additional noise is not necessary to produce such metacognitive inefficiency; in fact, the ABS in its current specification already offers multiple routes to this effect. A first route derives from more informed decisions, based on larger numbers of samples, being overall less confident decisions. For example, imagine using a stopping rule with Δ = 2 and a symmetric Beta(1, 1) prior on responses. If a decision is made based on only a total of two samples, then both will have to be in favor of the chosen response and confidence will be 75% (i.e., plugging these values into Equation 8: (2 + 1)/(2 + 2) = 75%). However, if a decision is made based on a total of 100 samples, then only 51 can have supported the chosen alternative (because the stopping rule requires Δ = 51 − 49 = 2) and confidence will be about 51% (i.e., plugging these values into Equation 8: (51 + 1)/(100 + 2) ≈ 51%). Thus, with optional stopping, lower confidence decisions will be based on more samples (and so have longer RTs) and hence will be more informative (see Equation 9). A second route derives from basing confidence judgments on discrete samples of hypothesis counts rather than the Gaussian distributed sensory evidence assumed by SDT; this applies even if samples are independent, a fixed number of samples are generated, and no prior is used (see Appendix D). Therefore, the ABS predicts decreasing metacognitive efficiency for more extreme confidence judgments not only for the full model (see Figure 8C), but also for all its variants (see Table 2).
Finally, confidence is empirically observed to systematically vary with RTs, with positive (cross-condition) and negative (cross-trial) relationships between confidence and RTs (see Figure 8D). When people are forced to respond more quickly in a particular experimental condition, their confidence is reduced, which is in line with the standard speed-accuracy trade-off, assuming that confidence positively covaries with accuracy (Irwin et al., 1956; Vickers & Packer, 1982). Both the SPRT and the ABS can capture the positive (cross-condition) relationship simply by varying the threshold according to experimental conditions: emphasizing accuracy moves the threshold further away from the starting point of the accumulator (and vice versa in the speed condition). Higher threshold values in the SPRT lead to more extreme final log odds and therefore higher confidence readouts. Higher threshold values in the ABS (i.e., larger Δ) naturally lead to higher confidence, as shown in Equation 9. However, within a condition, people are more confident in decisions they reach quickly; intuitively, the "easy" trials are decided quickly and with high confidence (Baranski & Petrusic, 1998; Vickers & Packer, 1982). As noted above, the SPRT cannot explain this because the strength of evidence at which a decision is made depends only on the threshold, which is determined prior to, and hence independently from, the characteristics of any particular trial. The ABS can explain this negative (cross-trial) relationship because earlier termination (for a fixed threshold Δ) implies that there will be a higher proportion of evidence supporting the chosen alternative. As a result, the ABS predicts that within a condition, faster decisions will be given with higher confidence.

[Table 2 fragment. Confidence in decisions: positive relationship between confidence and stimulus discriminability; positive (cross-condition) relationship between confidence and RT; little or no overconfidence in evaluation of confidence intervals. Cross-trial autocorrelation in RT and estimates: large long-range autocorrelation in estimation time series; lesser long-range autocorrelation in RT time series. Note. N/A denotes that the model is not applicable to the empirical effect because it does not produce the relevant behavioral measure. SPRT = sequential probability ratio test; RT = response times.]

Confidence Intervals
So far, we have considered confidence in decisions. But confidence reports can also be elicited for estimates by asking for confidence intervals. Commonly, a participant is given a probability first and then asked to produce an interval (by giving upper and lower bounds) that corresponds to that probability (e.g., "give the smallest interval which you are 60% certain to include the number of dots which appeared on-screen: between ____ and ____ dots"). However, this procedure can also be reversed: participants can be shown an interval of some quantity of interest and then asked to evaluate the probability that the stimulus falls within that interval (e.g., "what is the probability that the number of dots which appeared on-screen falls in the range of 23 to 25?"; Juslin & Persson, 2002).
In the ABS, confidence interval production and evaluation are both driven by mechanisms very similar to those underlying the naïve intuitive statistician model of Juslin et al. (2007). Taking a set of samples, a confidence interval can be produced by using the lower and upper bounds of the sample coverage (i.e., empirical quantiles of the samples) as the lower and upper bounds of the confidence interval. When the values of the quantiles are not explicitly represented in the sample (e.g., deriving a 93% CI based on 5 samples), linear interpolation is assumed to fill in the gap between the samples (Juslin et al., 2007). This mechanism correctly predicts the considerable overconfidence in interval production found empirically (see Figure 9A, dots; Juslin et al., 2003, 2007). This is because, for small sample sizes, the empirical quantile range of the sample will be narrower than the corresponding confidence interval of the posterior, because distributional tails tend to be underrepresented in a few samples. Therefore, the interval generated from the sample will be too small, producing an overconfident interval in our simulations (see Figure 9B, dots). One might then question why interval production overconfidence is not corrected in the same manner described for probability judgments above, where useful prior knowledge is incorporated; this lack of correction for confidence interval production is what was "naïve" about the naïve intuitive statistician model. Corrections for intervals, however, depend on the functional form of the distribution, so a general correction process is difficult to establish in the ABS. While the standard computation of a confidence interval assumes a Gaussian distribution, for unknown distributions confidence intervals are usually produced by bootstrapping. For the purposes of producing the confidence interval for a sample, as opposed to producing the confidence interval for a mean, bootstrapping is essentially what the ABS does.
In contrast to confidence interval production, confidence interval evaluation shows very different empirical results: here there is little to no overconfidence, with only a small degree of conservatism at the extremes of subjective probability (see Figure 9A, squares; Juslin et al., 2003). This arises in the ABS (see Figure 9B, squares) using the simplest possible assumption (following Juslin et al., 2007) that people answer this question by generating samples and calculating the proportion that fall within the provided interval. As noted by Juslin et al. (2007), this proportion is an unbiased estimator, and hence shows good calibration.14
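The contrast between the two formats can be sketched with a small Monte Carlo, assuming a standard-normal stand-in for the posterior, five samples per judgment, and linear interpolation between empirical quantiles (the default behavior of np.quantile); all values are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n_samples, n_trials, target = 5, 50_000, 0.60

# Interval production: empirical quantiles of a few posterior samples.
covered = 0
for _ in range(n_trials):
    s = rng.normal(0.0, 1.0, n_samples)          # stand-in posterior samples
    lo, hi = np.quantile(s, [(1 - target) / 2, (1 + target) / 2])
    covered += lo <= rng.normal(0.0, 1.0) <= hi  # does a fresh draw fall inside?
print(f"'{target:.0%}' intervals actually cover ~{covered / n_trials:.0%} "
      "(overconfidence: tails are underrepresented in small samples)")

# Interval evaluation: report the proportion of samples inside a given
# interval; this is an unbiased estimate of the interval's true coverage.
lo, hi = norm.ppf((1 - target) / 2), norm.ppf((1 + target) / 2)
judged = np.mean([np.mean((s >= lo) & (s <= hi))
                  for s in rng.normal(0.0, 1.0, (n_trials, n_samples))])
print(f"interval with true coverage {target:.0%} judged as ~{judged:.0%} "
      "(well calibrated)")
```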

Decisions Affecting Later Estimates
Besides eliciting confidence judgments after choices, experimenters have often asked participants to provide separate secondary responses to the same stimulus. One example is the decision-estimation task, where people first choose, say, whether the number of dots which appeared on-screen was greater or smaller than 25, and then are asked, immediately following the choice, to estimate the number of dots. In this setting, an estimate can be influenced by the preceding choice (e.g., Jazayeri & Movshon, 2007; Tversky & Kahneman, 1974). In cognitive judgments, estimates have often been observed to be pulled toward a preceding arbitrary value, the well-known phenomenon of anchoring bias (Epley & Gilovich, 2006; Tversky & Kahneman, 1974). For example, in the famous study of Tversky and Kahneman (1974), participants were first asked to choose whether the percentage of African countries in the United Nations was higher or lower than a value (h*), and then to give an estimate of that percentage. The comparison value used in the choice, h*, was seen to be generated randomly and so should have been irrelevant to the distribution of hypotheses (and thus irrelevant to the estimate too), but estimates were nonetheless biased toward h*.
However, in an almost identical decision-estimation paradigm, perceptual judgments show the opposite effect: estimates of stimulus properties such as dot orientation or direction of motion have been observed to be pushed away from h* (Jazayeri & Movshon, 2007; Luu & Stocker, 2018; Zamboni et al., 2016). This phenomenon is better known as the repulsion effect found in perceptual tasks.
Existing models of anchoring cannot predict the repulsion effect, and vice versa. This is because they only predict one direction of bias (e.g., Jazayeri & Movshon, 2007; Luu & Stocker, 2018; Strack & Mussweiler, 1997; Tversky & Kahneman, 1974), and thus fail to capture the co-occurrence of anchoring and repulsion. While this would be tenable if anchoring and repulsion were each specific to their respective (cognitive or perceptual) domains, a recent empirical investigation suggests otherwise (Spicer et al., 2022a). In this work, we noted that the location of the comparison value, h*, relative to the distribution of hypotheses has not typically been the same across cognitive and perceptual paradigms. Indeed, it was found empirically that the relative location of h* determines whether the subsequent estimates will be pulled toward or pushed away from it in both cognitive and perceptual tasks. Specifically, estimates of the stimulus value are drawn toward values of h* which are distant from the true value of the stimulus (replicating the anchoring effect) but pushed away from values of h* which are near to this true value (replicating the repulsion effect; Spicer et al., 2022a). This finding is consistent with a common general-purpose algorithm underlying decision-making in both cognition and perception.
The anchoring effect, the repulsion effect, and their dependence on the relative position of h* are captured by the ABS by assuming that the set of samples generated to make the choice is then reused to produce the estimate, rather than expending further cognitive resources on generating new samples, thus creating a link between these responses. Each effect is then attributable to one of the core components of the model when making the initial choice: anchoring is produced by the autocorrelated sampling algorithm, and repulsion by the optional stopping rule. To explain anchoring, the ABS follows the approach of Lieder et al. (2018) and assumes that the local sampling algorithm uses h* as its initial hypothesis. For a small number of iterations, the local sampler will then be biased toward this initial hypothesis. Anchoring is accordingly produced in our simulations by the full model and all variants except the direct sampling variant (see Figure 10 and Table 2).
To explain repulsion, we first note that in the ABS h* effectively partitions the hypothesis space into two binary response regions. The sampling algorithm is adaptively terminated when a sufficient number of samples support one alternative over the other, with the amount determined by the threshold parameter (i.e., Δ). This adaptive stopping rule produces a repulsion bias if the estimate is also based on the same set of samples (Zhu et al., 2019), because the sampling process will terminate only when the weight of evidence favors one hypothesis rather than when the evidence is finely balanced: in effect, optional stopping for choice biases the subsequent estimate away from indifference (i.e., the decision boundary). Thus, repulsion is produced by the full model and all of the variants except for the fixed sample size variant (see Figure 10 and Table 2), and a larger sample size for the initial decision would reduce both the anchoring and repulsion effects.
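These two opposing biases can be seen in a toy simulation of the decision-then-estimate task. The sketch below is our own illustration, not the article's simulation: it assumes a Gaussian posterior over the number of dots, a Metropolis chain initialized at h* (following the anchoring account of Lieder et al., 2018), max-minus-next stopping, and an estimate that simply averages the reused choice samples; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)

def decide_then_estimate(mu, h_star, delta=4, step=3.0, sd=3.0, max_n=2000):
    """Choice ("more than h_star dots?") followed by an estimate that
    reuses the choice samples.  Posterior over the quantity: N(mu, sd).
    Returns the estimate (mean of the samples collected for the choice)."""
    h, samples = float(h_star), []       # chain initialized at h*
    above = below = 0
    for _ in range(max_n):
        samples.append(h)
        if h > h_star:
            above += 1
        else:
            below += 1
        if abs(above - below) >= delta:  # max-minus-next stopping
            break
        prop = h + rng.normal(0.0, step)
        if rng.random() < np.exp(min(0.0, ((h - mu)**2 - (prop - mu)**2)
                                   / (2 * sd**2))):
            h = prop
    return np.mean(samples)

mu = 27.0                                # posterior mean number of dots
for h_star in (10.0, 25.0):
    est = np.mean([decide_then_estimate(mu, h_star) for _ in range(5000)])
    print(f"h* = {h_star:4.0f}: mean estimate = {est:.1f} "
          f"(posterior mean {mu:.0f})")
# Distant h* (10): estimates pulled well below 27, toward h*  -> anchoring.
# Near h* (25):    estimates pushed slightly above 27, away from h* -> repulsion.
```

The design choice mirrors the text: initialization at h* drives anchoring when h* is far from the posterior mass, whereas the stopping rule's requirement for an evidence imbalance over-represents samples on the chosen side of the boundary, driving repulsion when h* is near the posterior mass.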

Cross-Trial Autocorrelations in Estimates and RTs
Substantial cross-trial autocorrelations are an important, and often unexplained, aspect of human behavior. For example, long-range dependencies in estimates and in RTs, known as 1/f noise, have been observed in many cognitive and perceptual tasks and can explain more variance in behavior than the experimental manipulations (Gilden, 2001; Gilden et al., 1995; Wagenmakers et al., 2004).15 In these tasks, participants were instructed to repeatedly estimate fixed physical quantities (e.g., a 1-s temporal interval or a 1-in. spatial interval) or to repeatedly choose between two options. The statistical features of the time series produced by participants were analyzed in the frequency domain, with the high-frequency components corresponding to trials that are close together, whereas low-frequency components correspond to trials that are well separated. The power of each of these components for explaining the time series is then calculated (using a spectral density analysis; Gilden, 2001; Gilden et al., 1995; Sheu & Ratcliff, 1995). Standard statistical processes show different relationships between frequency and power: in a random walk, power falls off as 1/f² noise (i.e., a slope of −2 in log-log power spectra), whereas white noise (also called independent sampling or direct sampling) has a flat power spectrum (i.e., 1/f⁰ noise and a slope of 0). In a time series containing long-range serial dependence, as is typical in human data, power spectra typically have a slope between −1.5 and −0.5, and are thus categorized as 1/f noise. The long-range autocorrelations in 1/f noise are not straightforward to produce, generally requiring complex processes to do so (Gardner, 1978). Further complicating the picture, while time series of estimates have long-range autocorrelations that are classed as 1/f noise (Gilden, 2001; Gilden et al., 1995; Wagenmakers et al., 2004; Zhu et al., 2021), RT time series fluctuate as 1/f noise but with a log-log slope that is shallower than that of estimates (Van Orden et al., 2003; Wagenmakers et al., 2004).

As shown in Figure 11, the ABS qualitatively reproduces the observed autocorrelations in time series of RTs and estimates. The cross-trial autocorrelation in estimates is predicted by the cross-trial carryover of the sampler's location in the autocorrelated MC³ algorithm (Zhu et al., 2018; Zhu, León-Villagrá, et al., 2022): the initial location of the sampler on the present trial is the last sample from the preceding trial. In comparison, the RT time series is predicted to be less autocorrelated because samples generated by the MC³ algorithm are accumulated to a threshold to produce the RT; this is a nonlinear transformation of the autocorrelated samples, which "whitens" the power spectrum. Simulations of the full model demonstrate both of these effects, and as the effects are driven by the MC³ algorithm, they occur for all variants except the direct sampling variant (see Figure 11 and Table 2).
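The spectral analysis itself is easy to reproduce. The sketch below estimates the log-log slope of the power spectrum for two reference processes (white noise and a random walk); reproducing the intermediate, 1/f-like slopes of the ABS itself would require the full MC³ simulation, so only the classification tool is shown, with illustrative series.

```python
import numpy as np

def spectral_slope(x):
    """Slope of the log-log power spectrum of a time series, fitted over
    the low-frequency range: ~0 for white noise, ~-2 for a random walk;
    1/f noise lies in between."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size)
    keep = (freqs > 0) & (freqs <= 0.1)       # low-frequency components
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(power[keep]), 1)
    return slope

rng = np.random.default_rng(13)
white = rng.normal(size=2**14)                # independent (direct) sampling
walk = np.cumsum(rng.normal(size=2**14))      # unbounded random walk
print(f"white noise slope: {spectral_slope(white):+.2f} (flat spectrum, ~0)")
print(f"random walk slope: {spectral_slope(walk):+.2f} (~-2)")
# Human estimate series typically fall in between, with slopes around -1
# (1/f noise); RT series are shallower because thresholded accumulation of
# the samples partially whitens the spectrum.
```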

Summary
Using a consistent set of parameter values, we have shown that the ABS qualitatively captures empirical results and relationships observed across probability judgments, estimates, confidence intervals, choices, confidence judgments, and RT (see Table 2). The wide range of predicted behaviors is based on an internal probabilistic model using a fine-grained set of hypotheses. The process of inferring the posterior probability of the hypotheses is governed by Bayes' rule and approximated using an autocorrelated sampling algorithm. While the autocorrelation in the sampling algorithm is motivated by making the sampling process computationally efficient, it turns out to be crucial for explaining many empirical effects such as slow errors, anchoring, and cross-trial autocorrelations. Assuming each sample generated is costly, turning these samples into choices relies on an optional stopping rule that trades the benefits of larger samples against the cost of sampling. In turn, the optional stopping rule helps explain empirical effects such as the repulsion effect, the resolution of confidence, and metacognitive inefficiency. The probabilistic model also learns from trial history, using the adaptive prior. This prior helps explain effects such as conservatism, the conjunction fallacy, partition dependence, and fast errors. As shown in Table 2, all components are necessary to explain the full range of behavior. The ability of this model to account for such a wide assortment of human behavior is, as we discuss further below, evidence for this rational process: that people generate samples from a probabilistic representation and then make simple and sensible use of the samples to produce behavior.

Comparison With Competing Models
There are many models that can produce at least a subset of the empirical effects that the ABS does, and many were briefly mentioned in the text above. Here we compare the ABS first to other models of probability judgments and then to drift-diffusion models of choice, response time, and confidence.

Models of Probability Judgments
Intensive modeling efforts have also been directed at explaining human probabilistic judgments, spurred on by the identification of biases, particularly those summarized in Table 1, demonstrating that people's judgments systematically deviate from the rules of probability theory (e.g., Costello & Watts, 2014; Dasgupta et al., 2017; Hilbert, 2012; Nilsson et al., 2009; Peterson & Beach, 1967; Tversky & Koehler, 1994; Zhu et al., 2020). Many models have assumed that probability judgments follow a deterministic process, albeit one that violates the rules of probability theory. For example, one type of model, geared toward accounting for conjunction fallacies, assumes that probability estimates of conjunctions are the weighted average of the probabilities of their constituent events, which produces above-chance conjunction fallacy rates and can reproduce several probabilistic identities (Fantino et al., 1997; Nilsson et al., 2009, 2016). However, these models require additional mechanisms to match the empirically observed combination of above-chance and below-chance rates of conjunction fallacies that the ABS can produce (Fisk & Pidgeon, 1996; Nilsson et al., 2009).
A different type of deterministic approach, at least in the way it has been implemented to explain conjunction fallacies, is based on quantum probability. Here, probabilities are based on projections of event subspaces. If the events are compatible, probability judgments are indistinguishable from classical probability theory, but if the events are incompatible, then interference produces probability judgments that deviate from classical probability theory. These deviations are such that conjunction and disjunction fallacies will occur at rates above chance, and in this way quantum probability can produce both above-chance and below-chance conjunction fallacies, depending on how the events are represented (Busemeyer et al., 2011). Quantum probability has explained a wide range of probabilistic biases, including some not covered here (Pothos & Busemeyer, 2022). However, there are also probabilistic identities that quantum probability cannot reproduce but that are predicted by sampling-based models, including the ABS (Costello & Watts, 2018; Zhu et al., 2020).
A third deterministic approach is support theory, which was developed to explain subadditivity biases. The core assumption of support theory is that the probability of event descriptions is evaluated rather than the probability of the events themselves, so judgments do not incorporate the probabilities of events that are not immediately available (e.g., those not mentioned in the description of events). This approach elegantly explains a range of implicit subadditivity results, as well as why subadditivity does not occur for mutually exclusive binary events. However, it does not produce the later finding that an atypical unpacking of events produces implicit superadditivity (Sloman et al., 2004), and it requires additional mechanisms such as an "ignorance prior" which pulls probability judgments toward indifference between the available responses (Fox & Rottenstreich, 2003). These different mechanisms have echoes in the ABS. In the ABS, a hypothesis is "available" only if it has been sampled, and the event description influences the starting point of the local sampler. Further, the prior on responses is a principled version of the ignorance prior, one that is uncertain about the underlying probabilities because often only a small number of samples is available.
Recent approaches have rejected purely deterministic accounts and explored the alternative possibility that stochastic mechanisms explain the biases in probability judgments. For example, simple unbiased response noise has been shown to produce subadditivity (Bearden et al., 2007; Brenner, 2003). However, unbiased response noise alone does not explain why subadditivity still occurs for median judgments. A more promising alternative is to consider corruptive noise in memory or evidence accumulation, which can produce stronger biases (e.g., Costello & Watts, 2014; Erev et al., 1994; Hilbert, 2012). In a leading stochastic model, Probability Theory plus Noise (PT+N), people are assumed to first draw independent samples from a probabilistic representation, and unbiased "counting noise" is added to individual samples to reflect an error-prone cognitive system (Costello & Watts, 2014). This counting noise pulls probability judgments toward indifference and allows the PT+N model to capture empirical results such as explicit subadditivity, the conjunction fallacy, and a wide range of probabilistic identities (Costello et al., 2018; Costello & Watts, 2014, 2017, 2018). The PT+N model has impressive empirical coverage, although it has recently been argued that it does not fully reproduce the mean-variance relationship in probability judgments (Sundh et al., 2021): while it will produce the inverted U-shaped relationship between the mean and variance of judgments, the curve will not be pulled inward, eliminating extreme (near 0 or 1) probability judgments, as is observed in the empirical data and as the ABS predicts. This mean-variance relationship also stands as a challenge to deterministic models because it is not easily produced by simply adding response noise to a deterministic model.
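A minimal sketch of the PT+N reading mechanism may help: each of n independent samples is read with some probability d of being flipped, so the expected judgment is p(1 − 2d) + d, regressing toward 0.5. The parameter values below are our own illustration, not those of Costello and Watts (2014).

```python
import numpy as np

rng = np.random.default_rng(17)

def ptn_judgment(p_true, n=20, d=0.1):
    """One PT+N-style probability judgment: n independent samples of the
    event, each read with counting-noise probability d of being flipped."""
    samples = rng.random(n) < p_true
    flipped = samples ^ (rng.random(n) < d)   # noisy readout of each sample
    return flipped.mean()

for p in (0.1, 0.5, 0.9):
    judged = np.mean([ptn_judgment(p) for _ in range(20_000)])
    print(f"true p = {p:.1f} -> mean judgment {judged:.2f}")
# Mean judgments follow p(1 - 2d) + d: extreme probabilities are pulled
# toward 0.5, which yields subadditivity and, for suitable event pairs,
# conjunction fallacies.
```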

Drift-Diffusion Models
One important family of models that deserves more extensive discussion is the family of DDMs (Drugowitsch et al., 2012; Gold & Shadlen, 2007; Krajbich & Rangel, 2011; Ratcliff, 1978; Ratcliff et al., 2016). While there are many members of this family, they all describe decision-making as a stochastic process similar to a biased random walk (or a biased diffusion process, in continuous time) in which the path of the accumulator is, on average, biased by the drift rate (Bogacz et al., 2006; Ratcliff & McKoon, 2008). For binary choices, the drift rate is determined by the difference in the evidence signals supporting the two alternatives. In line with the stopping rule of the SPRT, the threshold reached in the DDM determines the choice, and the time taken to reach it determines the response time. However, unlike the static summation of log-likelihood ratios in the SPRT, the accumulator in the DDM also diffuses because it is corrupted by noise (typically white noise). In perceptual tasks, the drift rate is related to which choice is objectively correct (Ratcliff & McKoon, 2008), whereas in high-level cognitive tasks, where people are choosing their preferred option, the drift rate is assumed to be related to the relative appeal of the alternatives (Krajbich et al., 2010; Krajbich & Rangel, 2011).
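As a point of comparison with the ABS sketches above, a minimal Euler-discretized DDM trial looks like this; the threshold, drift, and noise values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(19)

def ddm_trial(drift, threshold=1.0, noise=1.0, dt=0.001, max_t=10.0):
    """One biased-random-walk trial: evidence x accumulates drift*dt plus
    Gaussian diffusion noise until it crosses +threshold (choice A) or
    -threshold (choice B)."""
    x, t = 0.0, 0.0
    while abs(x) < threshold and t < max_t:
        x += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    return (x >= threshold), t   # (chose A?, response time in seconds)

trials = [ddm_trial(drift=1.0) for _ in range(2000)]
print(f"P(choose A) = {np.mean([c for c, _ in trials]):.2f}, "
      f"mean RT = {np.mean([t for _, t in trials]):.2f} s")
# With symmetric bounds and a fixed drift, correct and error RT
# distributions are identical (Link & Heath, 1975); cross-trial drift or
# starting-point variability is what produces slow or fast errors.
```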
There are generally strong theoretical links between these models and the normative framework of the SPRT: in the continuous limit, the SPRT converges on the DDM, and the drift rate and the corruptive noise in the DDM can jointly mimic the calculation of likelihood ratios in the SPRT (Bogacz et al., 2006). However, implementing an optimal statistical decision test in the form of the DDM also generates a number of useful theoretical and empirical insights that were not originally part of the SPRT. First, the psychologically implausible assumption that people are required to have global knowledge of the generative model of the task to calculate the exact likelihoods (e.g., P(s_t|A) or P(s_t|B)) is implicitly relaxed by the DDM, because the drift rate and diffusion noise are free parameters that are recovered from fitting to behavioral data. Thus, the DDM does not need to calculate the exact cumulative differences in evidence supposed by the SPRT, greatly improving the DDM's computational plausibility, given that the exact likelihood ratios are almost always impractical to compute in real time except in simple toy problems.
Second, extensions of the DDM can also account for empirical features, such as those noted above, which are not accounted for by the SPRT. For example, slow and fast errors (Ratcliff & Rouder, 1998; Townsend & Ashby, 1983) can be produced by further assuming that model parameters vary across trials (Laming, 1968; Ratcliff, 1981; Ratcliff & Rouder, 1998; Rouder, 1996). In particular, varying the drift rate trial-by-trial generates slow errors, while varying the starting point of the accumulator predicts fast errors (Laming, 1968; Ratcliff & Rouder, 1998). The ABS works in a similar fashion, as autocorrelation acts like variability in drift rates and a biased prior on responses acts like variability in the starting point of the accumulator.
Moreover, the benefits of using the DDM instead of the SPRT also apply to explaining confidence judgments. The SPRT predicts that confidence ratings will be identical for correct and incorrect responses because, for a fixed symmetric threshold, there will always be the same difference in accumulated evidence favoring the selected option,16 and the probability that the chosen option is correct is simply read out from the final state of the accumulator. As outlined above, such predictions are contradicted by empirical data in which choice accuracy and confidence are positively related (Baranski & Petrusic, 1998; Dougherty, 2001; Vickers, 1979; Yeung & Summerfield, 2014). To reconcile the confidence data with the SPRT, one kind of DDM introduced the additional assumption that the same drift-diffusion process continues to run for a period of time after the choice has been made but before the confidence judgment (Pleskac & Busemeyer, 2010). As the accumulator has a bias toward the correct choice, this continued accumulation after the choice and before the confidence judgment, no longer bounded by the fixed threshold of the decision, will drive the confidence ratings toward supporting the correct choice. Hence, with this additional assumption, this DDM can correctly predict that people should report higher confidence in correct responses than in errors, and moreover that the resolution-of-confidence effect grows with the delay between choosing and reporting confidence. However, since the temporal structure supposed by this assumption is that confidence occurs after the choice, this DDM cannot explain why the relationship between choice accuracy and confidence also appears when confidence judgments are given simultaneously with a decision (e.g., Kiani et al., 2014; Li & Ma, 2020). In explaining these data, researchers have assumed that confidence decreases with response time (Calder-Travis et al., 2020; Kiani et al., 2014), as is also predicted by the ABS.
Alternative versions of the DDM, such as the RTCON model, have been developed to capture no-choice confidence ratings (Ratcliff & Starns, 2009). RTCON assumes that each confidence rating has an independent diffusion process and that the first diffusion process to reach the threshold determines the confidence rating. So, for seven confidence ratings, there are seven diffusion processes racing to the threshold. RTCON captures many key empirical relationships between confidence and RT (Ratcliff & Starns, 2009), but appears to have difficulty capturing the interaction between confidence and choice when these judgments are made sequentially (Pleskac & Busemeyer, 2010). The approach of RTCON has been generalized to continuous-response paradigms, in which there are an infinite number of potential responses: the circular diffusion model (Smith, 2016) and the spatially continuous diffusion model (Ratcliff, 2018). The two models have recently been integrated into a unified framework in which geometric similarity among response options is represented (Kvam & Turner, 2021).
Overall, the family of models encapsulated by the DDM successfully describes behavior in tasks far removed from the perceptual tasks for which it was initially developed (Bogacz et al., 2006; Ratcliff et al., 2016), including tasks in which there is little to no perceptual noise, such as value-based decisions (Busemeyer & Townsend, 1993; Milosavljevic et al., 2010; Usher & McClelland, 2004) and recognition memory tasks (Ratcliff et al., 2004, 2011; Starns & Ratcliff, 2014). Because of its strong normative underpinnings in the SPRT, and its psychologically plausible assumptions, the DDM has been widely used in psychology, economics, and neuroscience (Drugowitsch et al., 2012; Fehr & Rangel, 2011; Ratcliff & Smith, 2015; Ratcliff et al., 2016). Indeed, the DDM has become the default framework in many areas of decision-making research. That being said, the broader scope of behavioral responses captured by the ABS, including choice, RT, confidence, and estimates, has not yet been united within a single implementation of the DDM. This is partly because the task representation needed for choice is different from that needed for estimates or confidence intervals (crucially, the DDM represents an accumulated value, a summary statistic of the sample, but does not retain the sample itself). Thus, while the ABS and the DDM share similar descriptive capabilities, the ABS arguably has the advantage in terms of parsimony, given the breadth of behavior covered within its single framework, capturing relationships, such as the combination of anchoring and repulsion effects described above, not explained by current DDM approaches.

Toward Complete Task-Specific Models
The ABS so far has been described as a generic judgment and decision-making process. But how might it be applied to specific tasks? Fortunately, we can relate our approach to existing models in the literature that take quite a similar approach to that taken here, and that have indeed helped inspire our work. Here we describe two successful existing models, Nosofsky and Palmeri's (1997) exemplar-based random-walk (EBRW) model and Blurton et al.'s (2020) visual attention model, and we outline how using the ABS would involve only minor modifications to them (such as adding autocorrelations). Thus, we can view these existing models as complete task-specific models in the ABS framework.
The EBRW operates in a hypothesis space of exemplars which represent categories (Nosofsky & Palmeri, 1997). This representational assumption is inherited from the generalized context model, where each individual exemplar is situated as a point in a multidimensional psychological space, and similarity between exemplars decreases with the distance between points in the space (Nosofsky, 1984; Shepard, 1987). Building on this representation, the EBRW further assumes that exemplars are retrieved sequentially as in a random walk process, predicting the time course of categorization and recognition decision-making (Nosofsky & Palmeri, 1997). Similar ideas can be found in the PRW model of visual attention (Blurton et al., 2020; Bundesen, 1990). In this model, a series of tentative categories (i.e., hypotheses) is proposed and accumulated until one category has accrued sufficiently more samples than any other category (Blurton et al., 2020). The generation of tentative categories is governed by the theory of visual attention (Bundesen, 1990).
Across the two models, there is a common mechanism that integrates over hypotheses for response selection. While neither model was originally motivated by normative principles, both can be seen as special cases within our framework in which decisions are driven by samples of hypotheses (see Appendix B for detailed comparisons), albeit without added features such as autocorrelated samples. More specifically, the exemplar-based representation of hypotheses in the EBRW can be adopted by the ABS when modeling categorization tasks, suggesting how the ABS could be applied to explain the effects of similarity and practice on categorization and RT, which have been captured by the EBRW. When combined with a Bayesian theory of visual attention, the ABS could also be generalized to account for human eye movements and object localization. In addition, the correspondence between the PRW and the direct sampling variant of the ABS provides a bridging condition that allows the ABS to account for detailed fits of response time distributions. Furthermore, adding autocorrelations and an adaptive prior to both the EBRW and PRW models generalizes them to capture a wider range of empirical effects, such as those found in confidence judgments. This demonstrates how the ABS can be applied to, and work well in, specific tasks.

Discussion
The ABS is a step toward a unified rational process model of human behavior. Through our analysis, we have identified two key ideas that are necessary for such a unified framework: probabilistic models and approximate inference. Approximate inference via accumulating hypothesis samples means that the ABS views response time in a different way from most existing approaches: as primarily determined by the time required to mentally sample hypotheses, rather than to accumulate more sensory data. We discuss this further below, pointing to possible reconciliations between these views of the role of time, as well as possible reconciliations between diffusion processes (and noisy probability judgment models) and the ABS. Next, we explore how the ABS could be extended to multialternative tasks and how complex probabilistic representations could be incorporated. Finally, we discuss the extent to which the ABS has a rational, Bayesian basis, and the prospect of quantitatively fitting the model to psychological data.

Contrasting Views on the Role of Time
The existing dominant view on the role of time in decision-making is that it serves to collect and integrate sensory evidence. In the binary choice example, the likelihood ratio between the two alternatives is represented exactly at any moment, and the odds of the correct response are constantly updated in light of new sensory evidence or newly retrieved memories. This view has been adopted by many models, including SDT, the SPRT, and, by and large, DDM approaches.
By contrast, the ABS takes a very different view on the role of time because it sees perception and cognition as emerging from probabilistic representations and computations. Instead of coding a single value of the sensory input, people are assumed to implicitly encode multiple values of the sensory input together with their subjective uncertainty about those values. But this posterior is difficult to represent exactly for virtually all cognitive tasks; thus, approximation is needed to access it. The passage of time is then viewed as being used for generating more samples from the posterior to refine this approximation. Thus, time matters because it allows the computational process of sampling the posterior to unfold, not because additional sensory data must be accumulated. In the limit, the gradual refinement of the posterior belief should lead to convergence on the optimal choice. These contrasting perspectives were studied in more detail in Lengyel et al. (2015), which presents empirical evidence supporting the view of posterior approximation. It is, however, important to note that the two views on the role of time are not mutually exclusive. One possible reconciliation could be that the brain first conducts evidence integration, and then a posterior based on the sensory evidence is approximated with sampling; alternatively, these processes could overlap. Further analysis of the two views on the role of time may be an important topic for future research.

Diffusion, Noise, and Sampling
While we have contrasted other models with the ABS, there are also potential links between these approaches. Considering stochastic models of probability judgments, it may be possible to extend a sampling model with noisy counting, such as the PT+N, to explain the vast majority of the effects that we explored above. In our previous work, we have shown that the PT+N and the Bayesian Sampler models can mimic one another's predictions of average probability judgments (Zhu et al., 2020). Building on this, it is possible to envision generalizing the PT+N in the same way we generalized the Bayesian Sampler to the ABS. First, rather than using independent samples, the PT+N could instead use a local, autocorrelated sampler such as MCMC or MC³. Indeed, Costello and Watts (2018) have begun to explore this possibility by introducing a type of autocorrelation or carry-over effect allowing earlier samples to influence later probability judgments. Second, when making decisions, the PT+N could also use optional stopping rather than a fixed number of samples to account for RT data. This could be a promising alternative to the ABS, although more work is required to develop this rough sketch into a formal model and to determine how well it reproduces human data.
For diffusion models, an interesting starting point is the recent interest in continuous diffusion models (Kvam, 2019; Kvam & Busemeyer, 2020; Kvam et al., 2022; Ratcliff, 2018). The main focus of the continuous diffusion models has been to describe the cognitive processes underlying tasks that involve continuous responses, such as orientation estimation (Ratcliff, 2018) and pricing (Kvam & Busemeyer, 2020). The accumulator is typically depicted in a two-dimensional space. Without any bias in starting point, the accumulator initializes in state [0, 0] (i.e., the origin). The amount of evidence accumulated is described by the distance from the origin, and there is a directional bias toward the option that is most favored at that moment. A continuous absorbing threshold (e.g., a semicircle whose center is the origin) defines the space of possible trajectories; when the threshold is reached, the diffusion process is terminated and a response is triggered. The size of the threshold regions that correspond to each response can be adjusted to make some responses more or less likely (Kvam & Turner, 2021).
This kind of mechanism could link a continuous diffusion process to the ABS. As long as the size of the threshold region associated with each discrete fine-grained hypothesis is proportional to its posterior probability, the continuous diffusion process will effectively be sampling from the posterior over fine-grained hypotheses.17 If the fine-grained hypotheses that are sampled are then processed to produce judgments and decisions as in the ABS, this would formally link the models, by substituting a diffusion-based sampling algorithm for the local sampling algorithm currently used. Thus, the distinction between diffusion and sampling may not be clear cut, and the two may potentially be viewed as parts of a larger framework.

Multialternative ABS
Asking people to choose among more than two alternatives is often a useful strategy for testing the generalizability of computational models developed to explain binary choice data. As previously noted, many aspects of the ABS are readily applicable to such choices: the hypothesis space can be divided into as many regions as required by the query, and the Beta prior generalizes to a Dirichlet distribution when considering more than two options. The optimal stopping rule derived from dynamic programming is, however, more difficult to calculate within realistic times when choosing between more than two alternatives (see Appendix C for detail). As we have noted, the max-minus-next heuristic is often considered a good approximation to the optimal stopping rule in multialternative choices (Dragalin et al., 1999, 2000) and, indeed, there is computational work suggesting that humans adopt the max-minus-next stopping rule to choose among many alternatives (Brown et al., 2009). Recent work on evidence accumulation models finds that a variety of binary and multiple-choice phenomena can be modeled as accumulating "advantage", the difference in evidence supporting one option versus another, which is conceptually related to the max-minus-next stopping rule (Miletić et al., 2021; van Ravenzwaaij et al., 2020), and the ABS could be equivalently implemented in this framework. This heuristic stopping rule can also explain the best-known empirical result on the relationship between choice and RT, Hick's Law (Brown et al., 2009): that RT increases logarithmically with the number of alternatives (Hick, 1952; Proctor & Schneider, 2018).
Multialternative choice tasks with confidence have also recently been used to argue against the Bayesian confidence hypothesis, that is, the hypothesis that confidence in a choice is the posterior probability of that choice. Li and Ma (2020) found that the best-fitting model for confidence ratings in a three-alternative choice task was not the posterior probability of the chosen option, but the difference between the probability of the chosen option and the probability of the second most probable option. This result can be reconciled with the Bayesian confidence hypothesis through the ABS, assuming a stopping rule such as the max-minus-next heuristic. More formally, consider a three-alternative choice with options A, B, and C, with respective accumulated evidence i, j, and k. Further assuming that i > j > k, option A is chosen when i − j = Δ, following the max-minus-next stopping rule. The confidence difference between the best and the second-best option is thus Diff = Conf_A − Conf_B = (i − j)/(i + j + k), and the total amount of evidence is related to this difference by i + j + k = (i − j)/Diff. Given that, we can rewrite the confidence for the chosen option A as follows:

Conf_A = i/(i + j + k) = (i/Δ) × Diff.

Thus, the best-fitting model in Li and Ma (2020), which was used to argue against the Bayesian confidence hypothesis, is proportional to the predictions of the ABS using the max-minus-next stopping rule.
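A quick numeric check of this identity, with illustrative evidence counts (priors omitted for simplicity):

```python
def three_alt_confidence(i, j, k):
    """Sample-based confidence in three options from evidence counts,
    with i > j > k (the chosen option is the first)."""
    total = i + j + k
    return i / total, j / total, k / total

# With the max-minus-next rule, i - j equals the threshold delta at stopping,
# so the chosen option's confidence equals (i / delta) * Diff:
i, j, k, delta = 11, 8, 5, 3          # illustrative counts with i - j = delta
conf = three_alt_confidence(i, j, k)
diff = conf[0] - conf[1]
assert abs(conf[0] - (i / delta) * diff) < 1e-12
print(f"Conf_A = {conf[0]:.3f}, Diff = {diff:.3f}, "
      f"(i/delta)*Diff = {(i / delta) * diff:.3f}")
```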

Extending to Complex Representations
The behaviors discussed here are low-dimensional, with responses being situated within one- or two-dimensional spaces. As a result, the transformation from hypothesis samples to behavior is relatively simple, sometimes just a linear mapping. But the scope and diversity of human behavior is much broader: many complex human behaviors are both high-dimensional and embedded in hierarchically organized spaces. Drawing, for example, or even copying line-drawings, requires sophisticated mental processes that represent a description of a drawing's parts (e.g., lines and circles) and higher-order relations (e.g., repetition and hierarchy; Tian et al., 2020; Van Sommers, 1984). The motor system that implements the routines and trajectories for turning these rich, structured representations into motor commands is also doing a more complex task than in low-dimensional behavior (e.g., complex trajectories may need to be segmented into discrete, hierarchically organized actions). Thus, the hypothesis space from which outputs are selected can be open-ended and hierarchically organized at a range of levels of abstraction. While it is difficult to see how sensory accumulation models such as the SPRT or the DDM might extend to such cases, it is at least possible in principle to see how a Bayesian, sampling-based approach might operate. For example, many existing models of cognition, perception, and motor control involve Bayesian inference over compositional symbolic representations (e.g., Lake et al., 2017), and the relevant computations can only be approximated, often using sampling (frequently, standard MCMC). An interesting direction for future research is how far these Bayesian models can be mapped onto fine-grained data relating detailed measures of high-dimensional output (including, for example, accuracy and variability) to fine-grained performance features, such as timing and autocorrelations across trials. In general, where a Bayesian cognitive model can be defined, a sampling approximation to that model can be created and compared with detailed process data from experiments. Thus, the ABS provides a possible bridge from simple, but intensively studied, decisions concerning binary choice or one-dimensional estimation, to models of cognition operating at full scale.

Assessing the Rationality of the ABS
Bayesian models of cognition, pitched at Marr's computational level, combine all available trial information (that is, prior knowledge of hypotheses, p(h), and the likelihood of the data presented in the trial given a hypothesis, p(s | h)) using the rules of probability theory. In doing so, Bayesian models fully extract trial information with 100% efficiency (Zellner, 2002).
We see the ABS as Bayesian in two ways. First, the ABS provides a sample-based approximation to an underlying Bayesian representation of a task. Combining the ABS with a Bayesian representation produces a rational process model (Griffiths et al., 2012). In other words, it is an algorithmic approximation of a computational-level model, which transforms it into a process model. A rational process model does not align perfectly with the underlying computational-level model because sampling approximations inevitably lead to loss of information. As a result, sampling models predict mistakes, systematic biases, and variability in behavior that are due to using stochastic samples.
Although we developed the ABS with a Bayesian perspective in mind, the underlying model does not necessarily have to be Bayesian. The probabilistic representation could simply be the relative frequencies of past events, without an associated probabilistic model (e.g., Costello & Watts, 2014). Alternatively, the probabilistic representation might not even be described as optimal or rational (Tauber et al., 2017). All that is required is that the representation can be written as a probability distribution, which covers a wide range of representations: after all, any finite set of nonnegative numbers can be normalized to become a probability distribution.
The second way in which the ABS is Bayesian is that the Bayesian Monte Carlo approach is used to interpret the samples that are generated by the model itself. This allows the ABS to take advantage of context-free expectations about the underlying probabilities, which improves probability (and confidence) judgments when a small number of samples give only imprecise information about those probabilities. We assume a conjugate prior on responses, making this process computationally very simple. Our analysis suggests that people do incorporate this prior on responses in forming behaviors; as we have seen, this assumption explains many classical empirical effects in probability judgments. Simple adaptivity in constructing the prior on responses (e.g., adapting to immediate feedback from the preceding trial) also helps explain human data such as fast errors.
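As a minimal sketch of this conjugate updating (an illustration only; the Beta(2, 1) prior and the five binary samples mirror the illustrative example accompanying Figure 12, and the helper name probability_judgment is hypothetical):

```python
def probability_judgment(samples, a0=2.0, b0=1.0):
    """samples: 0/1 hypothesis samples supporting the queried event;
    (a0, b0): Beta prior on responses, here the asymmetric Beta(2, 1)."""
    s = sum(samples)                             # samples supporting the event
    # Conjugacy: Beta(a0, b0) prior + binomial sample counts -> Beta posterior
    return (a0 + s) / (a0 + b0 + len(samples))   # posterior mean = judgment

# Three of five samples support the event: Beta(2, 1) -> Beta(5, 3),
# whose mean 5/8 = 0.625 is reported as the probability judgment.
print(probability_judgment([1, 1, 0, 1, 0]))
```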
This raises the question of whether the ABS makes optimal use of the sampling mechanism it has available. This is a different kind of normative concern than fully extracting trial information; here, the question is whether the ABS is resource rational (Bhui et al., 2021; Lieder & Griffiths, 2020). Aside from the Bayesian Monte Carlo prior on responses, which could be argued to be resource rational, there is the smaller scale temporal tradeoff between generating another sample, which takes time, and stopping to make do with the samples collected so far. The optimal stopping rule requires solving a difficult dynamic programming problem (detailed in Appendix C). Indeed, because the optimal stopping rule is generally computationally intractable, we adopted the max-minus-next rule as a further approximation; this rule has been independently suggested to approximate the performance of the optimal stopping rule well in the information theory literature (Dragalin et al., 1999). This optional stopping rule helps explain empirical patterns such as repulsion, slow errors, and the resolution of confidence.
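A toy simulation (with assumed values for accuracy, thresholds, and sampling rate; samples are drawn i.i.d. here for simplicity) makes this tradeoff explicit for a binary choice: raising the max-minus-next threshold Δ buys accuracy at the cost of more samples, and hence more time, when each sample arrives after an exponentially distributed interstep interval:

```python
import numpy as np

rng = np.random.default_rng(1)

def trial(p_correct, delta, rate=10.0):
    """Track the difference in sample counts until it reaches +/- delta."""
    diff, n = 0, 0
    while abs(diff) < delta:
        diff += 1 if rng.random() < p_correct else -1
        n += 1
    rt = rng.gamma(shape=n, scale=1.0 / rate)   # sum of n Exp(rate) intersteps
    return diff > 0, rt                         # (correct choice?, response time)

for delta in (1, 2, 4):
    out = [trial(0.6, delta) for _ in range(20000)]
    acc = np.mean([c for c, _ in out])
    mrt = np.mean([t for _, t in out])
    print(f"Delta={delta}: accuracy={acc:.3f}, mean RT={mrt:.3f}s")
```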
In short, we argue that exploiting algorithmic approximations to the optimal solution is the key feature of the ABS that justifies it as a rational process model. Our proposal nonetheless does not address the metalevel theoretical question of how the mind allocates cognitive resources across these approximation algorithms (e.g., the local sampler, the Bayesian Monte Carlo process, and the approximate dynamic-programming solution for stopping). A fully resource-rational analysis would further specify the balance between the time spent on each approximation algorithm and the incentives from the task. Whether an optimal allocation of limited cognitive resources is at play merits future investigation.

The Prospects for Quantitatively Fitting the ABS
Quantitatively fitting a psychological model and determining meaningful parameter values are crucial for evaluating and comparing models. However, it can be challenging to apply standard fitting methods to the ABS because the assumptions of the model do not align with those of the fitting methods. We elaborate on these discrepancies below and suggest fitting methods that can overcome them.
First, standard likelihood-based fitting methods, such as maximum likelihood, assume independence of behavioral data across trials, while the ABS assumes positive correlations both across and within trials, making it difficult to obtain a robust fit. Second, the ABS's autocorrelated sampling process (as illustrated in Figure 12, top) is inherently stochastic, which means that even with the same inputs, predicted behaviors will not be identical. Moreover, there is no closed-form solution or accurate approximation for the distribution of the predicted behaviors under the ABS, which prevents a closed-form likelihood function of the behavioral data given the ABS.
One way to improve the robustness of the fit is to use groups of trials instead of individual trials. Grouping can be helpful because it makes the data more independent at the group level. Various methods, such as quantiles, can be used to group behavioral data. For example, Heathcote et al. (2002) demonstrated that grouping RT data by sample quantiles produces a more efficient and less biased estimator. To take this method further, one could use likelihood-free techniques such as approximate Bayesian computation (ABC) as a more principled way to evaluate models with group-level data or any other summary statistics (Turner & Van Zandt, 2012). With the ABC method, we can simulate a series of behaviors predicted by the ABS using a set of parameter values and compare the summary of the simulated behaviors with the summary of human behavioral data. We then adjust the parameter values based on the similarity of the two summaries and repeat the process until a satisfactory threshold is reached. The ABC method appears to be the most suitable fitting approach for our model because it can handle issues such as autocorrelation and the absence of a likelihood function, as well as control for differing model flexibility. Some initial work has been done in fitting estimates using ABC (Spicer et al., 2022b; Zhu, León-Villagrá, et al., 2022) and, relatedly, in fitting summary statistics of probability judgments using linear regression (Sundh et al., 2021). While the parameters of the prior are uniquely identifiable from probability judgments (Sundh et al., 2021), it would need to be established that the full range of model parameters is uniquely identifiable before the ABS could be used as a measurement model to interpret observed behavior in the way that DDMs are.
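Schematically, an ABC rejection loop of this kind might look as follows (a sketch under assumptions, not the fitting pipeline of the studies cited; simulate_abs and prior_sampler are hypothetical placeholders for a forward simulation of the ABS and a prior over its parameters, and the quantile summary follows Heathcote et al., 2002):

```python
import numpy as np

rng = np.random.default_rng(2)

def summary(rts):
    """Group-level summary statistics: RT quantiles (cf. Heathcote et al., 2002)."""
    return np.quantile(rts, [0.1, 0.3, 0.5, 0.7, 0.9])

def abc_rejection(observed_rts, simulate_abs, prior_sampler,
                  n_draws=5000, tol=0.1):
    obs = summary(observed_rts)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)               # draw candidate parameters
        sim_rts = simulate_abs(theta, rng)       # forward-simulate behavior
        # Keep parameters whose simulated summaries lie close to the data's
        if np.linalg.norm(summary(sim_rts) - obs) < tol:
            accepted.append(theta)
    return np.array(accepted)                    # approximate posterior sample

def toy_sim(theta, rng):
    """Toy stand-in for an ABS simulation: Erlang RTs with time scale theta."""
    return rng.gamma(3.0, theta, size=500)

post = abc_rejection(toy_sim(0.2, rng), toy_sim,
                     lambda rng: rng.uniform(0.05, 0.5))
print(post.mean())   # roughly recovers the generating theta = 0.2
```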

Conclusions
We have outlined a rational process model of human behavior: the ABS. The ABS is rooted in a Bayesian framework, in which the cognitive system is presumed to have an internal probabilistic model describing how sensory data are generated in the real world. But rather than representing and computing with probabilities, we assume that the cognitive system uses a tractable approximation of the posterior, realized via a local sampling algorithm. Distinct aspects of these posterior samples are relevant for different types of query and, through simple and natural transformations, they provide a unified explanation of probability judgments, estimates, confidence intervals, choices, and confidence judgments, while the time course of the posterior sampling accounts for RT.
Our framework shifts the locus of explanation from the accumulation of sensory input to computation (through sampling) over internal hypotheses. Thus, in our framework, the time course, and variability, of behavior is primarily explained in terms of an internal, noisy, computational process (involved in sampling from the hypothesis space), rather than through perfect Bayesian computation using noisy sensory data. We demonstrate the usefulness of our theory by reproducing key pairwise relationships and stylized facts for the six behavioral measures, and point the way toward extending the approach to the complex probabilistic models required to describe the richness of human behavior.

Appendix B

The basic PRW assumes that the interstep time between two successive tentative categories is exponentially distributed with rate C. Thus, for N samples generated in this way, the response time follows an Erlang distribution (the sum of N exponentially distributed intervals),

F(t | N, C) = 1 − ∑_{m=0}^{N−1} e^{−Ct} (Ct)^m / m!, (B2)

where, in the binary case, the average speed of generating a tentative category is C = v_A + v_B, with v_A and v_B representing the strengths of categories A and B, respectively. In other words, the probability of a tentative category being A is p_A = v_A/(v_A + v_B) and that of being B is p_B = v_B/(v_A + v_B). If we consider category A as the upper bound and category B as the lower bound, the random walk of the accumulator has an increment of +1 with probability p_A and −1 with probability p_B. Coupling the Poisson process of tentative-category generation with an optional stopping rule whose accumulator starts at z (0 < z < a) and has evidence thresholds k_A = a and k_B = 0, we can derive the cumulative distribution function of the first passage time to state k_B (Equation A5 of Blurton et al., 2020):

G_B(t) = ∑_n P(n | p_A, a, z) F(t | n, v_A + v_B), (B3)

where F(t | n, v_A + v_B) is the cumulative distribution function of the Erlang distribution above. Because of the fixed thresholds (0, a) and starting point (z), the number of samples (or number of increments) n in the random walk needed to reach one of the thresholds is a random variable. n can be as small as z (for reaching threshold B) or a − z (for reaching threshold A), if all the tentative categories generated in the process are for the same category. But due to the randomness in generating tentative categories, n can also be far greater than those numbers, and the probability of n has a closed-form solution (Blurton et al., 2020, Equation A1; Feller, 1968). Marginalizing over all possible n produces the cumulative distribution function (CDF) of RT.
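The following minimal simulation (with arbitrary illustrative values for the strengths, thresholds, and starting point; not code from the cited work) implements this basic PRW by forward simulation, with each response time drawn as Erlang given the realized number of steps n:

```python
import numpy as np

rng = np.random.default_rng(3)
v_A, v_B, a, z = 3.0, 1.0, 5, 2          # illustrative strengths and thresholds
C, p_A = v_A + v_B, v_A / (v_A + v_B)    # Poisson rate and step probability

def prw_trial():
    """Random walk between absorbing thresholds 0 and a, starting at z."""
    x, n = z, 0
    while 0 < x < a:
        x += 1 if rng.random() < p_A else -1
        n += 1
    rt = rng.gamma(shape=n, scale=1.0 / C)   # Erlang(n, C) response time
    return x == a, rt                        # (chose category A?, RT)

trials = [prw_trial() for _ in range(20000)]
print("P(choose A) =", np.mean([c for c, _ in trials]))
print("mean RT     =", np.mean([t for _, t in trials]))
```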
This basic version of the PRW is mathematically very close both to the decision process of the EBRW (as noted in Blurton et al., 2020) and to a restricted version of the ABS that uses direct sampling and no adaptive prior. First, we can interpret p_A and p_B as the posterior probabilities of categories A and B, respectively, and they also normalize to 1. Second, the Erlang distribution of RT given N samples is also assumed in the ABS (compare Equations 6 and B2). Third, the random walk process with increments of +1/−1 is equivalent to the heuristic stopping rule used in the ABS, where the difference in the number of samples is tracked. While p_A, p_B, and z are free parameters for the PRW model, our reinterpretation constrains their values: p_A and p_B, as posterior probabilities, are further constrained by the mental representation of the stimuli.
Beyond this basic formulation, the PRW also includes extensions that improve its fit to the data and its scope. To produce fast or slow average errors (in comparison to average correct responses), the PRW incorporates, following the approach used in DDMs, trial-by-trial variability in category strengths and starting points. To explain the variability in the leading edge of response-time distributions, the PRW allows processing rates to vary within a trial. The full version of the ABS produces fast or slow errors because of an adaptive prior and autocorrelated sampling, respectively. While we did not explore the ABS's predictions for variability in the leading edge of response-time distributions, it would be interesting to see whether autocorrelated sampling (effectively varying processing rates within a trial) produces this effect as well, or whether other mechanisms, such as a variable nondecision time, are needed.
The PRW has also been extended beyond two-alternative choice tasks to any number of alternatives by assuming that evidence for one response suppresses evidence for all other responses. The suppression is strong enough that only one counter has nonzero counts at any one time. This extension differs from our proposal in the Discussion of using a max-minus-next stopping rule for multialternative choice, which involves no inhibition; future work is needed to explore what contrasting predictions these mechanisms make.

The cumulative Gaussian distributions that respectively intersect the hit and false-alarm distributions (which are cumulative binomials) become closer together as the confidence criterion becomes more extreme: meta-d′(x) < meta-d′(y) for x > y ≥ Np. The quantitative difference between the two meta-d′ values of 0.16 is small but noticeable in this example (see Figure D1). However, this bias becomes smaller when N is large, because the data-generating binomial distribution is then better approximated by a Gaussian distribution.
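This bias is straightforward to reproduce numerically. The sketch below (with illustrative criterion values; the computed quantity is a simple d′-style Gaussian-assumed sensitivity standing in for the full meta-d′ procedure, so the values need not match the 0.16 difference quoted above) derives hit and false-alarm rates from the symmetric binomials and shows the measure shrinking as the criterion becomes more extreme:

```python
from scipy.stats import binom, norm

N, p = 12, 0.8                              # as in the Figure D1 example
for criterion in (8, 9, 10):                # larger = more extreme criterion
    hit = binom.sf(criterion - 1, N, p)     # P(X >= criterion | Bin(N, p))
    fa = binom.sf(criterion - 1, N, 1 - p)  # P(X >= criterion | Bin(N, 1-p))
    # Sensitivity computed under the (incorrect) Gaussian assumption
    d_gauss = norm.ppf(hit) - norm.ppf(fa)
    print(f"criterion={criterion}: Gaussian-assumed d'={d_gauss:.3f}")
```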

Figure D1
An Illustration of Metacognitive Inefficiency Arising Not From Loss of Information in Confidence Judgment, but From Incorrectly Assuming Gaussian Generating Distributions. Note. Two confidence criteria are shown (x and y, where x > y). Hit rates were calculated from the binomial distribution Bin(N, p) (black circles) given its intersection with a confidence criterion, and false-alarm rates were similarly calculated using the symmetric binomial distribution Bin(N, 1 − p) (black squares). In this illustrative example, N = 12 and p = .8. Using a Gaussian to compute meta-d′ (the difference in the horizontal positions of the solid curves) leads to a decrease in value as the confidence criterion increases: meta-d′(x) < meta-d′(y). See the online article for the color version of this figure.

Figure 2. Schematic Illustrations of the Computational Mechanisms and Potential Behavioral Outputs of the SPRT (A) and the ABS (B)
Figure 3. Relationship Between the Underlying Probabilities of Event A and the Average Probability Judgments of A Predicted by the ABS
Figure 4. Mean-Variance Relationships in Probability Judgments
Figure 5. Bias in Explicit Subadditivity, Computed as the Sum of the Probability Estimates of Each Component Hypothesis Minus the Probability Estimate of Their Disjunction, Increases as the Number of Component Hypotheses Increases
Figure 6. Slow and Fast Errors
Figure 7. Sample Quantile-Quantile Plots of Response-Time Distributions for Different Levels of Task Difficulty
Figure 8. ABS Simulations Showing Effects of Confidence in Decisions
Figure 9. Generating and Evaluating Confidence Intervals
Figure 10. Anchoring and Repulsion Effects
Figure 11. Power Spectra for Time Series of Estimates and RT
Figure 12. Further Illustrations of the Autocorrelated Sampling Process (Top) and the Bayesian Monte Carlo Process (Bottom), Expanding the Illustrative Example of Numerosity in Figure 2B. Note. Here, the sampler was automatically terminated when five samples had been generated, while the dashed lines denote potential future samples if sampling had continued. Samples were compared to a decision boundary of 25 (red dots: evidence for fewer than 25 dots; blue dots: evidence for 25 or more). The five samples were then integrated with a prior on responses (here, an asymmetric prior, Beta(2, 1)), reaching a posterior of Beta(5, 3). The mean of this posterior on responses was then used to generate probability judgments or confidence judgments in decision-making. See the online article for the color version of this figure.
Table 1. Key Empirical Effects of the Six Behavioral Measures
Table 2. Which Models Reproduce the Key Empirical Targets