Bayesian reinforcement learning: A basic overview

We and other animals learn because there is some aspect of the world about which we are uncertain. This uncertainty arises from initial ignorance, and from changes in the world that we do not perfectly know; the uncertainty often becomes evident when our predictions about the world are found to be erroneous. The Rescorla-Wagner learning rule, which specifies one way that prediction errors can occasion learning, has been hugely influential as a characterization of Pavlovian conditioning and, through its equivalence to the delta rule in engineering, in a much wider class of learning problems. Here, we review the embedding of the Rescorla-Wagner rule in a Bayesian context that is precise about the link between uncertainty and learning, and thereby discuss extensions to such suggestions as the Kalman filter, structure learning, and beyond, that collectively encompass a wider range of uncertainties and accommodate a wider assortment of phenomena in conditioning.


Introduction
How organisms learn is a long-standing question of great interest in the field of behaviour and cognition. The Rescorla-Wagner (RW) rule, which was introduced just over 50 years ago, explained learning, particularly the acquisition of associations between conditioned and unconditioned stimuli, through the gradual reduction of prediction errors (Le Pelley, 2004; Rescorla & Wagner, 1972). The model posits that the strength of a learned association between conditioned stimuli (CS) and an unconditioned stimulus (UCS) is a function of the prediction error elicited by the UCS, i.e., the difference between the experienced and predicted outcome. This use of the prediction error to occasion learning, and indeed various of the ways that the predictions are made, were prefigured by the delta rule (Widrow & Hoff, 1960), which is a core algorithm in machine learning (Barto et al., 1981).
RW successfully accounts for critical aspects of multiple phenomena in classical conditioning, including forward blocking and conditioned inhibition (Rescorla & Wagner, 1972), and has been widely adopted. However, a number of subsequent studies have revealed its limitations in explaining phenomena such as extinction and backward blocking (see below; Miller et al., 1995; Robbins, 1990; Shanks, 1985). Since then, alternative associative models have been proposed that can explain some of these phenomena (Van Hamme & Wasserman, 1994; Dickinson & Burke, 1996; Wagner, 1981). These models have advanced the field beyond the initial learning models of Rescorla-Wagner (1972), Mackintosh (1975) and Pearce-Hall (1980). However, the field has continued to search for a more comprehensive and principled explanation of learning (Alonso, 2014; Alonso & Mondragón, 2014).
In line with the recognition that learning is essential for adaptive behaviour in multiple domains, including perception and attention, there has been emphasis on what the goal of learning is and why it is appropriate (a question living at the computational level of explanation in the terms of Marr, 1982). This complements the question of the processes involved in learning (which lie at the algorithmic level of explanation in Marr's framework; e.g., the RW rule). This emphasis led to the adoption of Bayesian or approximately Bayesian treatments of conditioning (Sutton, 1992; Dayan et al., 2000; Kruschke, 2008; Jacobs & Kruschke, 2011).
Bayesian treatments start from the simple premises that animals learn because they find themselves to be uncertain about some underlying state of the environment (the link between CSs and USs) and make potentially noisy observations that pertain to this state (presentations of CSs and USs). Learning involves processing the observations to unearth what they imply about the state, thereby changing the degree of uncertainty. Bayesian treatments provide a formal benchmark for what, and how fast, it is possible to learn; at least if they are correctly specified. Owing to this, they permeate the whole of decision theory (Ferguson, 2014), and many fields in machine learning (Bishop & Nasrabadi, 2006; MacKay, 2003).
More specifically, Bayesian treatments involve two intimately coupled components. The first is a so-called generative model of the environment, which can also be seen as a probabilistic form of cognitive map (Tolman, 1949). The generative model specifies various properties of the environment, such as its volatility (as examined, for instance, by Mathys et al., 2011; Piray & Daw, 2020), of the observations in general, such as their noisiness or stochasticity, and of the observations in particular, here in the form of the links between CSs and the US(s). With this information, the generative model provides an a priori predictive scheme. The second component is a so-called recognition model, which maintains the animal's beliefs about the state of the environment and updates them in the light of the observations. In the Bayesian jargon, the beliefs are often considered to be prior expectations; the observations are defined by likelihoods; and the updating is inference, in which priors and likelihoods are combined to make posterior expectations. The posterior is then adjusted in the light of how the generative model indicates the environment might change, and forms the prior for the next observation. The operation of the recognition model is the learning algorithm, in the sense of Marr (1982); the generative model just licenses the way that recognition works. Note that, although the Bayesian treatment is suffused with probability distributions, the recognition operation is actually deterministic (although it can be approximated using stochastic algorithms: Daw & Courville, 2008).
Non-Bayesian treatments of conditioning (including RW) have been reviewed elsewhere (e.g., Pearce, 1994, 2013; Le Pelley et al., 2016). They usually consider that animals make and update point estimates of the state of the environment and, though it lies outside the scope of this paper, in some limits, can closely or exactly approximate Bayesian inference. Those that reference uncertainty to determine the course of learning, such as Pearce and Hall (1980), are even closer, as we see later.
Here, we provide a basic overview of learning models that are conceptually linked with the RW rule but have adopted a whole-hearted Bayesian approach. Our starting point will be a fundamental Bayesian embellishment of RW with rich links to phenomena in classical conditioning. Subsequently, we will investigate variations of this model. We do not attempt to provide all the equations for the generative or recognition models, but rather focus on the concepts concerned, employing a practical example (consuming ice cream) to illustrate these abstract concepts in a more relatable manner (Fig. 1). There are rich links between some of these models and suggestions about learning that come from other quarters (Gallistel & Gibbon, 2000; Grossberg, 1982), but which we unfortunately do not have the space to examine.

Canonical models
Bayesian models of conditioning are based on the simple premise that animals learn because they are uncertain about hidden or latent characteristics of the world (such as the associations between stimuli and affectively charged outcomes); with this uncertainty either coming ('inborn') from initial ignorance, or ('thrust upon them') from incompletely known changes of the world (Dayan et al., 2000; Kruschke, 2008). The same uncertainty limits the animals' predictive capacities and licenses plasticity (which is often known as conditioning). Different Bayesian models represent this uncertainty accurately (in closed form) or via various deterministic (Daw et al., 2008; Kruschke, 2011) or stochastic (Daw & Courville, 2008; Gordon et al., 1993) approximations. Different models of the latent characteristics, of their lability over time, and of their connection to observations that the animals can make (such as pairings between stimuli and outcomes) provide accounts of many different results in conditioning. Learning in Bayesian models is equivalent to inference about the latent characteristics based on probabilistic accounts of the various aspects listed above.
Perhaps the simplest Bayesian model in the realm of classical conditioning is the Kalman filter model. The Kalman filter was first introduced as an algorithm to estimate states over time given incomplete and noisy observations (Kalman, 1960; although Kalman himself decried the Bayesian interpretation) and is widely used in the fields of control theory and engineering (Auger et al., 2013). Following an important analysis from Sutton (1992), it was adopted to account for classical conditioning (Dayan et al., 2000; Dayan & Kakade, 2000; Gluck et al., 1992), because it offers a paradigmatic account of learning.
In terms of the options above, the generative model underlying the standard Kalman filter assumes that the latent characteristics are weights or associations connecting stimuli with outcomes in an additive linear manner, like the experience of having a bite of an ice cream with multiple ingredients such as fruit and chocolate bits (Fig. 1A), with these weights potentially changing over time (as the ingredients change characteristics). The Kalman filter can accommodate a few options for these changes: least common is deterministic drift (as if the weights are assumed to rise or fall steadily); another possibility is a form of forgetting, in which the weights relax or shrink back to 0 over time; the most important is a form of Gaussian random walk, in which the weights undergo unpredictable changes. Although all three of these change processes mandate persistent plasticity, only the third requires information to be continually acquired from the environment. Observations are pairings of one or more stimuli with a quantity of an outcome, according to the linear mapping, and with added Gaussian outcome noise. Inference is computationally straightforward because of the simple form of the model, and turns out to hinge on the same prediction error as the RW rule: the difference between the outcome and the summed linear prediction of its mean. In contrast to the RW rule, the Kalman filter takes the animal's uncertainty into account in the learning process.
Let us consider the Kalman filter model for classical stimulus-reward conditioning. Conditioned stimuli (e.g., lights and tones; denoted with i) are represented as a vector, x(n) = {x_i(n)}, typically with binary entries, which capture their presence and absence in trial n. r(n) is the reward (for example, delivery of food pellets) in that trial. In the model, the true relationship (the latent characteristic) between conditioned stimuli and reward can be represented by a set of weights w(n) = {w_i(n)}, which are assumed to change in the current specification according to a Gaussian random walk with variance ω². This variance is analogous to the variability in the quality of ingredients for ice cream (Fig. 1B). The distribution over the reward based on the stimuli and the parameters is the Gaussian:

p(r(n) | x(n), w(n)) = N( Σ_i w_i(n) x_i(n), η² )   (1)

with mean Σ_i w_i(n) x_i(n) and aleatoric (i.e., irreducible) uncertainty η², which is a form of expected uncertainty (Yu & Dayan, 2005; see below). Inference in the Kalman filter is based on a Gaussian posterior distribution over w with mean ŵ(n) and covariance matrix Φ(n), which reports the joint uncertainty about the weights:

p(w(n) | observations up to trial n − 1) = N( ŵ(n), Φ(n) )   (2)

Often, the covariance matrix is approximated by just its diagonal entries ϕ_i²(n), with zero covariance between the weights for stimulus A and B (Φ_AB = 0). In this case, the learning rule for the mean weight ŵ_i associated with stimulus i's relationship with the unconditioned stimulus is:

ŵ_i(n + 1) = ŵ_i(n) + κ_i(n) δ(n)   (3)

where the prediction error:

δ(n) = r(n) − Σ_j ŵ_j(n) x_j(n)   (4)

is the same as in the Rescorla-Wagner model (if we ignore the Rescorla-Wagner model's sometime adjustment of the magnitude of r(n)). In both models, learning occurs proportionally to the size of the prediction error, thus accounting for blocking (which occurs when an initial set of learning trials pairing one stimulus, A, with reward leads to a perfect prediction, minimizing the potential for a subsequently added second stimulus, B, to learn anything about the reward). When multiple elements are associated with the same outcome, such as when an ice cream contains two ingredients that are consistently paired (e.g., cookies and cream; Fig. 1C), their predictions of the outcome can be intrinsically linked. In the diagonal approximation, the learning rate κ_i(n) in Eq. (3) is:

κ_i(n) = ϕ_i²(n) x_i(n) / ( η² + Σ_j ϕ_j²(n) x_j(n)² )   (5)

where j sums over all the stimuli (here, both A and B). In this formulation, the learning rates are high for stimuli whose weights are relatively uncertain; but are all reduced if the environment is very noisy (i.e., if η² is large). In the Kalman filter, the rule for updating the uncertainties, ϕ_i²(n), is, perhaps oddly, independent of the prediction error: ϕ_i²(n) tends to decrease with each observation and to increase because of the random walk variance ω². Nevertheless, this can already capture some aspects of the attention-based learning model of Pearce and Hall (Pearce & Hall, 1980), according to which the associability of stimuli is associated with their uncertainty. We defer a more comprehensive treatment to the later section on volatility.
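The diagonal Kalman filter described above can be simulated in a few lines. The following is a minimal sketch rather than a reimplementation of any published study: the function name, the parameter values η² = 0.1 and ω² = 0.01, and the trial counts are all illustrative assumptions. It reproduces forward blocking: after stimulus A alone comes to predict the reward, the subsequently added stimulus B acquires almost no weight.

```python
import numpy as np

def kalman_step(w_hat, phi2, x, r, eta2=0.1, omega2=0.01):
    """One trial of the diagonal (independent-weights) Kalman filter.

    w_hat: posterior mean weights; phi2: per-weight posterior variances.
    eta2 and omega2 are illustrative choices for the outcome noise and
    random-walk variance, not values from any specific experiment."""
    phi2 = phi2 + omega2                              # diffusion: random walk inflates uncertainty
    delta = r - np.dot(w_hat, x)                      # RW-style prediction error
    kappa = phi2 * x / (eta2 + np.dot(phi2 * x, x))   # per-stimulus learning rates (Kalman gains)
    w_hat = w_hat + kappa * delta                     # mean update
    phi2 = phi2 * (1 - kappa * x)                     # uncertainty shrinks only for presented stimuli
    return w_hat, phi2

# Forward blocking: A -> reward alone, then compound AB -> same reward.
w, p = np.zeros(2), np.ones(2)
for _ in range(40):                                   # stage 1: A alone predicts reward
    w, p = kalman_step(w, p, np.array([1.0, 0.0]), 1.0)
for _ in range(40):                                   # stage 2: compound AB, reward unchanged
    w, p = kalman_step(w, p, np.array([1.0, 1.0]), 1.0)
```

Because A already predicts the reward perfectly when B is introduced, the prediction error in stage 2 is near zero, so B learns almost nothing despite its high remaining uncertainty.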
The off-diagonal components of the covariance matrix Φ(n) can help to explain phenomena with which the Rescorla-Wagner model struggles, such as backwards blocking (Daw et al., 2008; Dayan & Kakade, 2000) or highlighting (Kruschke, 2006). Note, though, that, by themselves, they do not extend as far as other effects such as sensory preconditioning (Brogden, 1939). Backwards blocking is just like forwards blocking, except that the conditions are presented in reverse order, so that the compound of stimuli A and B is first associated with reward (say r(n) = 1), and then A alone is presented with reward. The consequence is that the association of B with reward becomes attenuated (backwards blocked). Such paradigms thus involve what is known as retrospective revaluation, because it appears that an existing association with stimulus B is changed retrospectively. As this happens based on trials on which stimulus B is not presented, the RW rule cannot explain the phenomenon, given that it confines learning to those stimuli that are present on a trial.
There are various alternative ideas for how this might happen (e.g., Dickinson & Burke, 1996; Le Pelley & McLaren, 2001; Miller & Matzel, 1988; Van Hamme & Wasserman, 1994). For instance, Van Hamme & Wasserman (1994) suggested that the learning rate for absent stimuli (here, B) might be non-zero. However, this certainly cannot be true for all absent stimuli (since only one out of an infinite collection of possible, but not presented, stimuli appears to be affected), and the logic for the learning rate to be negative (as it needs to be for backwards blocking to work) is not completely clear. Aitken and Dickinson (2005) analyse the modified version of Wagner's Sometimes Opponent Process (SOP; Wagner, 1981) suggested by Dickinson and Burke (1996) (called MSOP) to account for retrospective revaluation. This suggests that learning can occur to B when it is not presented because it is associatively predicted by A (in much the same way as the reward itself). Associative prediction is an important possibility, and it is related to work on structure learning and cognitive maps that we cover later. However, Aitken and Dickinson (2005) use simulations to show that while MSOP can account for some forms of retrospective revaluation, it cannot account for backwards blocking.
From a Bayesian perspective, we should think about the exchangeability of trials. If the variance of the random walk governing the change of w(n) were ω² = 0, then there would be no basis for a difference in inference based on trial order. Thus, forwards and backwards blocking should lead to the same outcome. The way the Kalman filter copes with this in backwards blocking is to note that in the first set of trials, the animal might learn that w_A + w_B ∼ 1, but be completely uncertain about the value of w_A − w_B. If it then learns (in the second set of trials) that, in fact, w_A ∼ 1 all by itself, then it should conclude that w_B ∼ 0. Formally, the component of the learning rule in Eq. (3) that we wrote as κ_i(n) generalizes, when the full covariance matrix is retained, to the vector of learning rates:

κ(n) = Φ(n) x(n) / ( η² + x(n)ᵀ Φ(n) x(n) )   (6)

During the first set of backwards blocking trials, the animal should learn that the off-diagonal term Φ_AB(n) is negative (representing the uncertainty about w_A − w_B), which means that there can be 'unlearning' about stimulus B even when only stimulus A is presented. To see this, we can be explicit about the numerator of Eq. (6) when just stimulus A is presented, so that x(n) is the vector [1, 0]ᵀ:

Φ(n) x(n) = [ Φ_AA(n), Φ_AB(n) ]ᵀ

Hence, even though stimulus B is not present (evident in the vector x(n) = [1, 0]ᵀ), there is a negative learning rate for B, which is what arranges for the backwards blocking effect. The denominator of Eq. (6) is positive, and so only scales the learning for both. The extra learning term is restricted to absent stimuli whose predictions are correlated with that of the stimulus that is actually present (stimulus A). This provides a precise account of the rationale for the negative learning rates suggested by Van Hamme and Wasserman (1994).
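This account of backwards blocking can be sketched with a full-covariance version of the filter. As before, the parameter values and trial counts are illustrative assumptions. Compound AB training splits the credit and drives the off-diagonal term Φ_AB negative; subsequent A-alone training then 'unlearns' B through the vector of gains, even though B is never presented in the second stage.

```python
import numpy as np

def kalman_full(w_hat, Phi, x, r, eta2=0.1, omega2=0.01):
    """Full-covariance Kalman filter update for one conditioning trial.

    Retains the joint uncertainty Phi over all weights, so learning rates
    for absent stimuli can be non-zero (and negative) via off-diagonal terms."""
    Phi = Phi + omega2 * np.eye(len(w_hat))       # random-walk diffusion of uncertainty
    delta = r - w_hat @ x                         # shared prediction error
    kappa = Phi @ x / (eta2 + x @ Phi @ x)        # vector of learning rates
    w_hat = w_hat + kappa * delta
    Phi = Phi - np.outer(kappa, x @ Phi)          # posterior covariance shrinks
    return w_hat, Phi

# Backwards blocking: compound AB -> reward first, then A alone -> reward.
w, P = np.zeros(2), np.eye(2)
for _ in range(40):                                # stage 1: AB with reward; credit split
    w, P = kalman_full(w, P, np.array([1.0, 1.0]), 1.0)
wB_after_stage1 = w[1]                             # roughly 0.5 by symmetry
for _ in range(40):                                # stage 2: A alone with reward
    w, P = kalman_full(w, P, np.array([1.0, 0.0]), 1.0)
```

After stage 2, B's weight has been attenuated relative to its stage-1 value, despite B never appearing in stage 2; the negative Φ_AB entry is doing the work.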
One of the characteristics of backwards blocking is that the attenuation of B's association with reward is less than in forward blocking (Miller & Matute, 1996; Shanks, 1985). This arises in an approximate version of the Kalman filter characterization (with partial incorporation of this off-diagonal term, and an allowance for the weights to change over trials, i.e., ω² > 0) (Daw et al., 2008). The same approximation has been applied to another form of primacy in classical conditioning known as highlighting (Kruschke, 2006).
The Kalman filter model has been rather widely adopted to capture uncertainty in the environment in decision neuroscience (Chakroun et al., 2020; Daw et al., 2006; Jepma et al., 2018). For example, it has been used to capture the uncertainty necessary to manage the trade-off between exploration and exploitation in a multi-armed bandit task (Daw et al., 2006). It also helps capture individual differences in uncertainty estimation in clinical settings (Addicott et al., 2021; Wiehler et al., 2021) and in developmental trajectories of exploration and exploitation behaviours across adolescence (Jepma et al., 2020).
One of the key assumptions in the generative model underlying this simple linear Gaussian version of the Kalman filter is that the predictions associated with the stimuli can be added to make a net prediction (as in the summed prediction of Eq. (4); Fig. 1A). There are also other generative models which express related forms of additivity, such as a noisy-logic model (Kruschke, 2008). Additivity in prediction is directly associated with the sort of competition in learning that is evident in conventional blocking.
One can also imagine different circumstances in which the predictions associated with the stimuli instead compete to create a net prediction (Fig. 1D; think, for instance, of stock market analysts making predictions about the level of the stock market in one week's time, or of the possibility that the ice-cream eater chooses subconsciously, and perhaps randomly, which of multiple ingredients to focus on for a particular bite). Alternative statistical models license such competition, including the additive (Erickson & Kruschke, 1998) or the factorial (Dayan & Long, 1998; Jacobs et al., 1991) mixture of experts, in which there are either learned weights determining the influence of particular stimuli, or in which these weights arise as a result of modelling the reliability of the predictions associated with each stimulus (Ahmadi, 2020), with the more reliable predictor exerting a greater influence over the net prediction. In these cases, the responsibility for any prediction error, and so adjustment of the associated weight, might be assigned differentially to those stimuli that were more responsible for making the prediction. Effects along these lines have been used to capture the suggestion from Mackintosh (1975) that stimuli that are more reliable predictors (rather than the more uncertain ones that Pearce and Hall, 1980, favoured) should be preferred in learning, and, more generally, paradigms such as downwards unblocking (Holland & Kenmuir, 2005; Mackintosh, 1975) that cannot be readily explained using the conventional Kalman filter model that we described above. Note, though, that competition in prediction is less directly associated with competition in learning, so explanations of conventional blocking become more complicated.
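One hypothetical way to realize such competition, loosely in the spirit of a mixture of experts but not a faithful implementation of any of the cited models, is to weight each stimulus's prediction by its reliability (inverse variance) and to assign credit for the outcome in proportion to that responsibility. All names and values below are illustrative assumptions.

```python
import numpy as np

def mixture_predict(m, s2, present):
    """Competitive (rather than additive) prediction.

    m: each stimulus's own predicted outcome; s2: its predictive variance;
    present: binary indicator of which stimuli are shown. Responsibilities
    are proportional to precision, so reliable predictors dominate."""
    resp = present / s2
    resp = resp / resp.sum()           # normalized responsibilities
    return resp @ m, resp

m = np.array([0.9, 0.2])               # stimulus A predicts well, B poorly
s2 = np.array([0.05, 1.0])             # A is far more reliable than B
pred, resp = mixture_predict(m, s2, np.array([1.0, 1.0]))

# Learning: each predictor moves toward the outcome in proportion to its
# responsibility, so the reliable stimulus also learns the most.
r, alpha = 1.0, 0.3
m = m + alpha * resp * (r - m)
```

Note the contrast with the additive Kalman filter: here the net prediction is a weighted average rather than a sum, and credit assignment tracks responsibility rather than uncertainty, echoing Mackintosh's emphasis on reliable predictors.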
We observed above that in the conventional Kalman filter, the schedule for the uncertainty ϕ_i²(n) about the association of stimulus x_i(n) only depends on the schedule with which x_i(n) is presented, and not on the prediction error associated with the presentation of the stimulus. This is particular to the specific generative model underlying the Kalman filter and requires full quantitative knowledge of the sources of uncertainty. When this is not true, estimates of ϕ_i²(n) that drive learning can be influenced by δ²(n), for instance in the original approximate scheme from Sutton (1992) whence we began.
More systematic treatments of this arise in cases in which the learner is not only ignorant about the association between conditioned and unconditioned stimuli because of a lack of initial knowledge and change, but is also ignorant about the uncertainties concerning this association in the generative model for the same reasons. This arises, for instance, in environments in which the association is constant for a while, but then makes a step change; here, uncertainty has a richer temporal profile. There are many investigations of this in which the sort of interactions among potentially predictive stimuli that underpin the RW rule have not been the focus (e.g., Yu & Dayan, 2005; Behrens et al., 2007; Nassar & Gold, 2013). However, there have also been some more generic attempts to develop learning models for volatile environments (Fig. 1B). Two treatments of this case are the Hierarchical Gaussian Filter (Mathys et al., 2011; Stephan & Mathys, 2014) and the rather less approximate volatile Kalman filter model (Piray & Daw, 2020, 2021).
Roughly speaking, in models of this form, the variance of the random walk ω²(n) is itself allowed to vary over trials and is estimated through the process of inference, being boosted by large prediction errors that change the weights substantially. In other words, the learning rate is updated based on trial-by-trial volatility, as opposed to the fixed volatility in the traditional Kalman filter model. Similarly, the irreducible (aleatoric) noise, η², can be estimated, as a form of baseline prediction error. Learning in these models accelerates in volatile environments, and decreases with aleatoric uncertainty, phenomena related to the dynamic learning rates proposed in the Pearce-Hall (Pearce & Hall, 1980) and Mackintosh (Mackintosh, 1975) models. These models have yet to be applied to the full range of conditioning phenomena. However, Piray and Daw (2021) used as test cases conditioned suppression (Pearce & Hall, 1980) and partial reinforcement extinction effects (Gibbon et al., 1980; Haselgrove et al., 2004; Rescorla, 1999). In conditioned suppression, omission of the outcome after pretraining (indicating high volatility but no change in stochasticity) facilitates learning compared to a control condition with no omission of the outcome. The model successfully captured the increased learning rate and high volatility in the omission group (Piray & Daw, 2021). Conversely, partial reinforcement, compared to full reinforcement, reflects higher stochasticity but no change in volatility, resulting in a lower learning rate for the partial reinforcement group. This effect was also explained by the model (Piray & Daw, 2021).
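A crude sketch of this idea, emphatically not the actual HGF or volatile Kalman filter updates (whose volatility inference is more principled), is to let the running estimate of ω² track recent squared prediction errors in excess of what the known noise would explain. The decay constant, noise values, and reward sequences below are all illustrative assumptions.

```python
def adaptive_kalman(rewards, eta2=0.1, lam=0.9):
    """Scalar Kalman-style learner whose drift variance omega2 is itself
    estimated online from unexplained squared prediction errors, so the
    effective learning rate rises after an unexpected step change."""
    w, phi2, omega2 = 0.0, 1.0, 0.01
    lrs = []
    for r in rewards:
        phi2 += omega2                      # inflate uncertainty by current volatility estimate
        delta = r - w                       # prediction error (single stimulus, x = 1)
        kappa = phi2 / (eta2 + phi2)        # learning rate
        w += kappa * delta
        phi2 *= (1 - kappa)
        # volatility estimate: boosted when delta**2 exceeds the aleatoric noise
        omega2 = lam * omega2 + (1 - lam) * max(delta**2 - eta2, 0.0)
        lrs.append(kappa)
    return w, lrs

stable = [1.0] * 30                          # association constant throughout
step = [1.0] * 15 + [3.0] * 15               # step change in the association
w_s, lr_s = adaptive_kalman(stable)
w_v, lr_v = adaptive_kalman(step)
```

In the stable environment the learning rate decays; after the step change the inflated ω² estimate transiently raises it again, so the learner re-acquires the new association quickly.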

Kalman TD model
Although the Kalman filter model is a canonical model that accounts for diverse phenomena of classical conditioning, it cannot explain within-trial value updates, as it considers time only at the trial level. This implies that phenomena such as secondary conditioning, which depend on time within a trial, cannot be accommodated. In secondary conditioning, an animal learns that stimulus B predicts a reward; and then that stimulus A predicts stimulus B. It can then be shown to expect a reward following stimulus A. The RW rule would expect, if anything, that stimulus A would predict the absence of reward (a form of conditioned inhibition). Sutton (1988) and Sutton & Barto (1990) suggested the temporal difference (TD) learning rule to address such phenomena. This is based on the idea that the prediction V̂(t, n) associated with a stimulus should really be of the summed future rewards within a trial, where the worth of rewards in the future is discounted at a rate of γ per time step. This would be like the cumulative delight experienced from all the time steps of consuming the ice cream (Fig. 1E). Algorithmically, it had the effect of replacing the RW prediction error of Eq. (4) by the temporal difference prediction error:

δ(t, n) = r(t, n) + γ V̂(t + 1, n) − V̂(t, n)   (7)

at time t in trial n, where the stimulus and reward evolve over time. As is extensively documented elsewhere (e.g., Niv & Schoenbaum, 2008), this rule offers a more comprehensive view of many conditioning phenomena, and this prediction error has also been associated with the activity of the neuromodulator dopamine (Montague et al., 1996). However, standard TD learning is not Bayesian in that it does not maintain uncertainties about the weights. A TD-based extension to the Kalman filter model has been suggested (Geist & Pietquin, 2010; Gershman, 2015). Very crudely, this replaces the generative model in Eq. (1) by:

p(r(t, n) | x(t, n), w(n)) = N( V(t, n) − γ V(t + 1, n), η² )   (8)

(with V(t, n) = Σ_i w_i(n) x_i(t, n)) on the grounds that the reward at time t in a trial should be the difference between two successive discounted predictions of the summed reward in the whole trial at time t and time t + 1. It has been demonstrated that this model outperforms the Kalman filter model without the TD extension and the TD model without the Kalman filter in explaining second-order extinction (Gershman, 2015). In second-order conditioning, first stimulus B is directly associated with reward. Then stimulus A is associated with stimulus B. As a result of the A->B association and the B->reward association, A comes to elicit a conditioned response even though it has never been paired with reward. In second-order extinction, extinguishing stimulus B decreases responding to stimulus A.¹ This phenomenon involves within-trial value updates, which require a real-time temporal difference learning framework, and also includes the change of associations of absent stimuli, enabled by adopting the Kalman filter model. Thus, combining Kalman-filter and temporal difference models explains this type of conditioning better than each model alone.
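The TD prediction error above can be illustrated with a tabular simulation of second-order conditioning. This is a minimal sketch with illustrative learning rate, discount factor, and trial counts; state values stand in for the stimulus-specific weights, and no Kalman-style uncertainties are tracked.

```python
def td_update(V, states, rewards, alpha=0.3, gamma=0.95):
    """Tabular TD(0) over within-trial time steps: each state's value moves
    toward the reward plus the discounted value of the next state."""
    for t in range(len(states)):
        s = states[t]
        v_next = V[states[t + 1]] if t + 1 < len(states) else 0.0
        delta = rewards[t] + gamma * v_next - V[s]   # TD prediction error
        V[s] += alpha * delta
    return V

V = {'A': 0.0, 'B': 0.0}
# stage 1: stimulus B is followed by reward
for _ in range(50):
    V = td_update(V, ['B'], [1.0])
vB_stage1 = V['B']
# stage 2 (brief): A is followed by B, with no reward anywhere in the trial
for _ in range(5):
    V = td_update(V, ['A', 'B'], [0.0, 0.0])
```

A acquires positive value purely by preceding the valued stimulus B, even though A is never paired with reward, which is the secondary conditioning effect that the trial-level RW rule cannot produce. (Stage 2 is kept brief because it also extinguishes B.)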

Structure learning -Latent cause models
Another family of learning models assumes that outcomes depend not only on the relation between stimuli and outcomes but also on the context (or the structure) in which learning takes place (Braun et al., 2010a; Braun et al., 2010b; Collins & Frank, 2013; Collins & Koechlin, 2012; Courville et al., 2004, 2006; Gershman & Niv, 2010; Lloyd & Leslie, 2013; Redish et al., 2007). There are several variants of such models (Braun et al., 2010a; Braun et al., 2010b; Courville et al., 2004, 2006). There is also a relationship to occasion setting (Fraser & Holland, 2019), which is the way that conditioning has treated contexts, albeit ones that are signaled by explicit rather than implicit or internal cues.
An influential subset of this family of models involves the idea of latent causes. Models along these lines suggest that learning comprises two processes: associative learning and structure learning. Associative learning is often a form of what we have so far discussed, inferring the relationship between CSs and USs. However, these associations might differ under different causal circumstances; and structure learning is a process that figures out which latent causes generated the observed associations (Gershman & Niv, 2012; Lloyd & Leslie, 2013). Latent causes, which are a simple case of what Tolman (1949) might have called a cognitive map, might last for a while, then change, and then recur (e.g., understanding the context related to ice cream quality, such as different assortments of ingredient suppliers, see Fig. 1F; or the same shop at different times, see Fig. 1B), a possibility of particular importance for characterizing phenomena such as extinction. One sophisticated version of these models involves a form of Bayesian nonparametric structure called a Dirichlet process mixture or Chinese restaurant process (Griffiths et al., 2003; Teh et al., 2007), which allows new causes to be inferred when the existing collection is inadequate.
Although more fully Bayesian methods have also been suggested, one way that classical conditioning might arise in a latent cause model is via an intuitive two-step Expectation-Maximization (E-M) algorithm (Dempster et al., 1977; Gershman et al., 2017; Neal & Hinton, 1998). This is based on the observations (a) that if the animal knew which cause was active, then conditioning could carry on as above; and (b) that the resulting prediction errors are a hint as to which cause best fits the current predictive circumstance. Thus, a form of self-consistency can be used for learning latent causes.
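A toy version of this self-consistent scheme, with just two fixed candidate causes, hard assignments, and equal priors rather than the Dirichlet-process machinery, already shows why inferring a new cause can leave an original association intact. All function names and parameters below are illustrative assumptions.

```python
import numpy as np

def latent_cause_trial(w, r, alpha=0.3):
    """One trial of a crude hard-assignment E-M scheme with two candidate causes.

    E-step: pick the cause whose prediction best explains the outcome
    (smallest squared prediction error; equal priors assumed).
    M-step: delta-rule update of the chosen cause's association only."""
    errs = r - w                         # prediction error under each cause
    k = int(np.argmin(errs**2))          # hard E-step
    w[k] += alpha * errs[k]              # M-step within the chosen cause
    return w, k

w = np.array([0.0, 0.0])                 # one association per candidate cause
# acquisition: reward present; cause 0 comes to carry the CS-US association
for _ in range(30):
    w, k = latent_cause_trial(w, 1.0)
# extinction: reward absent; these trials fit the other, 'empty' cause better,
# so the original association is preserved rather than unlearned
for _ in range(30):
    w, k = latent_cause_trial(w, 0.0)
```

Because extinction trials are assigned to the second cause, cause 0's association survives extinction untouched, which is the structural ingredient behind renewal and spontaneous recovery in these models.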
Latent cause models can explain challenging phenomena, such as extinction, in a characteristic manner. Extinction under the latent cause model can proceed in two ways: either changing the association within the current cause as a form of unlearning, or inferring a new cause, which would be unsullied by past associations. In the latter case, learning and extinction are driven by different latent causes. For instance, if latent cause C1 is associated with the stimulus-reinforcer context and latent cause C2 with no relationship between stimulus and reinforcer (consistent with extinction), then responding appropriate to latent cause C1 would resume if the animal infers that this latent cause is active. The latter was suggested as underpinning recovery from extinction on a new day or the effect of reminders (Gershman & Niv, 2012; Lloyd & Leslie, 2013; Gershman et al., 2015, 2017; Soto et al., 2014; see also Bouton et al., 2012), when the animal could be persuaded that C1 was active. True extinction would depend on the animal learning that the association in C1 had been abolished, something that was suggested as arising when the abolition of the association happened gradually. Such a process has also been observed in motor (Pekny et al., 2011; Taylor et al., 2011) and visual learning (Preminger et al., 2007, 2009). Note the subtle difference from the suggestion that extinction arises when a new stimulus-no-reinforcer association is actively learned that can compete with the original association (Pearce and Hall, 1980; Esber & Haselgrove, 2011; Wagner, 1981).

Discussion
We reviewed representative computational models for classical conditioning that followed on from the Rescorla-Wagner model, mainly with regard to their guiding principles (Fig. 1). By working within a Bayesian framework at a computational level, we raised a set of issues particularly concerning continuous and discrete change and contexts in the world, and additive and competitive prediction.
A number of studies have shown that Bayesian computations, and approximations to them, offer useful explanations of behavioral and neural phenomena. Algorithmic suggestions include Markov chain Monte Carlo methods that depend on stochastic sampling (Daw & Courville, 2008). Studies on the neural substrates of uncertainty computations have implicated the cingulate cortex, rostrolateral prefrontal cortex (frontal pole), and dorsolateral prefrontal cortex (Pearson et al., 2011; Tomov et al., 2020), as well as the dopaminergic system (Chakroun et al., 2020). Cholinergic, dopaminergic and noradrenergic systems, as well as mesolimbic and cingulate cortical areas, have also been associated with uncertainty and volatility processing (Behrens et al., 2007; Diaconescu et al., 2014; Iglesias et al., 2021; Powers et al., 2017; Yu & Dayan, 2005). Furthermore, growing evidence supports the suggestion that the hippocampal-entorhinal system and vmPFC play critical roles in the representation of more sophisticated relational knowledge and the structure of different types of task dimensions (Behrens et al., 2018; Chan et al., 2021; Garvert et al., 2017; McKenzie et al., 2014; Samborska et al., 2022; Schuck et al., 2016; Whittington et al., 2022; Wikenheiser & Schoenbaum, 2016; Wilson et al., 2014; Zhou et al., 2019).
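The spirit of such sampling-based approximations can be illustrated with a toy example: approximating the posterior mean of an association weight by self-normalized importance sampling, and checking it against the exact conjugate-Gaussian answer. This is a generic Monte Carlo sketch with invented numbers, not an algorithm from any of the cited studies.

```python
import numpy as np

# Sketch: approximate Bayesian inference by stochastic sampling.
# Prior over an association weight w ~ N(0, 1); one noisy outcome is
# observed with likelihood N(w, 0.5^2). We draw samples from the prior and
# weight them by the likelihood (self-normalized importance sampling).
rng = np.random.default_rng(0)

prior_mean, prior_sd = 0.0, 1.0      # prior over w
obs, obs_sd = 0.8, 0.5               # observed outcome and its noise sd

samples = rng.normal(prior_mean, prior_sd, size=100_000)
log_w = -0.5 * ((obs - samples) / obs_sd) ** 2   # Gaussian log-likelihood
weights = np.exp(log_w - log_w.max())            # stabilized weights
posterior_mean_mc = np.sum(weights * samples) / np.sum(weights)

# Exact conjugate-Gaussian posterior mean, for comparison:
precision = 1 / prior_sd**2 + 1 / obs_sd**2
posterior_mean_exact = (obs / obs_sd**2) / precision   # = 0.64 here

assert abs(posterior_mean_mc - posterior_mean_exact) < 0.01
```

The point of the comparison is that the sampler never computes the posterior in closed form, yet its weighted average converges on the exact answer; sampling schemes of this general kind are one proposal for how brains might approximate otherwise intractable Bayesian computations.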
Of course, Bayesian computations can readily become computationally intractable, and so these algorithms and implementations will typically amount to environmentally adapted heuristics. Unearthing what those heuristics approximate, in terms of the sorts of frameworks we have offered here, is a tantalizing task for the future.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. Illustration of learning models. To elucidate the learning processes in each model, we consider a scenario in which you are enjoying a delectable ice cream and are learning about the relationship between the ingredients and the overall subjective utility. The experience of savoring the ice cream serves as an illustrative example of a reward-based learning process. A) The delight of each mouthful may represent an additive combination of its constituent ingredients, such as fruit, chocolate bits, or sprinkles, reflecting the global prediction in the Rescorla-Wagner model. B) If one is the owner of an ice cream store, it is essential to monitor the quality of the ingredients in order to make accurate reward predictions and maintain the quality of the finished product. Suppliers providing these ingredients may alter their quality over time. When the quality of the ingredients fluctuates substantially over time, the learning rate should be increased so that the owner can adapt accordingly. These changes could happen at a steady rate over time, as in a fixed-variance random walk (the standard Kalman filter); they could have different speeds over time (as modelled in the hierarchical Gaussian filter, Mathys et al., 2011, or the volatile Kalman filter, Piray & Daw, 2020); or they could be contextually dependent (e.g., when ingredients are either rotten or good, as in a latent cause model; Gershman & Niv, 2010, 2012). C) When individuals are presented with an ice cream containing two ingredients that are consistently paired, they are unable to discern the individual delightfulness of each ingredient in isolation. The Kalman filter captures such a situation through anticorrelation in its estimates of the individual predictions. To combine new information with what has been learned so far, the Kalman filter weighs the prediction error by the associability term α_i(n), called the Kalman gain (Equation (5)), which acts as the learning rate does in the Rescorla-Wagner model; the critical difference is that the learning rate for stimulus i in the Kalman filter is adjusted over time as a function of the uncertainty. D) In the case of certain flavors, where the ingredients are relatively coarse, e.g., cookie pieces or chunks of strawberry, the pleasure derived from each mouthful may be dominated by a single ingredient. However, it remains uncertain which specific ingredient dominates in each mouthful, resembling the competitive combination principle in versions of the Pearce-Hall and Mackintosh learning rules. E) While the moment of ice cream consumption is brief, the experience of delight unfolds over several time steps before reaching its culmination. The cumulative delight experienced across these time steps can be viewed as the total reward, which can be estimated using the machinery of temporal difference learning (Sutton, 1988). F) When people choose between two ice cream shops, they may encounter shops that use different assortments of suppliers. What seem to be the same ingredients might come from different suppliers, and so be associated with very different degrees of delectability. Here, the ice cream shop is a sort of context (setting the occasion for the qualities of the ingredients; Fraser & Holland, 2019); but it could equally be the same shop at different times, as in (B), requiring inference about a latent cause.