Learning and forgetting using reinforced Bayesian change detection

Agents living in volatile environments must be able to detect changes in contingencies while refraining from adapting to unexpected events that are caused by noise. In Reinforcement Learning (RL) frameworks, this requires learning rates that adapt to the past reliability of the model. The observation that behavioural flexibility in animals tends to decrease following prolonged training in a stable environment provides experimental evidence for such adaptive learning rates. However, in classical RL models, the learning rate is either fixed or scheduled and thus cannot adapt dynamically to environmental changes. Here, we propose a new Bayesian learning model, using variational inference, that achieves adaptive change detection through Stabilized Forgetting, updating its current belief based on a mixture of fixed, initial priors and previous posterior beliefs. The weight given to these two sources is optimized alongside the other parameters, allowing the model to adapt dynamically to changes in environmental volatility and to unexpected observations. This approach is used to implement the “critic” of an actor-critic RL model, while the actor samples the resulting value distributions to choose which action to undertake. We show that our model can emulate different adaptation strategies to contingency changes, depending on its prior assumptions of environmental stability, and that model parameters can be fit to real data with high accuracy. The model also exhibits trade-offs between flexibility and computational costs that mirror those observed in real data. Overall, the proposed method provides a general framework to study learning flexibility and decision making in RL contexts.

A third form of update distribution can be used: the agent can consider that the new value x̃_{j+1} would likely be drawn from a mixture of the prior and posterior approximate marginal distributions, similar to the distribution implemented in Eq 9 (and possibly Eq 13). The next section details the update equations in these three cases.

C.2 Approximate posterior updating of non-selected actions
In order to develop the update equations of the variational parameters in the case of counterfactual learning, one must first derive an ELBO for this specific case. Recall that, if x_{j+1} were observed, the ELBO would take its usual form. In the case of an unobserved datapoint, we instead take the expected value of the ELBO under some model of the evolution of x_{j+1}. One can easily see that, under the optimistic assumption that the environment will not change, the update of the variational parameters of the action not taken leaves them unchanged: θ_{j+1} = θ_j.
Note that the posterior predictive variance of x_{j+1} is equal to the sum of the expected variance and the variance of the mean. A similar update paradigm can be used for the opposite limit case, where the distribution of x_{j+1} is assumed to be totally undetermined (i.e. to have reverted to the marginal prior distribution p(x_j | θ_0)), in which case the variational parameters are reset to the prior: θ_{j+1} = θ_0.
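The variance decomposition above is the law of total variance. A minimal Monte Carlo sketch (assuming, purely for illustration, a Gaussian model with an uncertain mean; the parameter values are arbitrary) confirms that the predictive variance equals the expected variance plus the variance of the mean:

```python
import random
import statistics

random.seed(0)

# Illustrative hierarchical model: mu ~ Normal(m0, s_mu), x ~ Normal(mu, s_x).
m0, s_mu, s_x = 0.0, 1.5, 2.0

# Draw from the predictive distribution by first sampling the mean,
# then sampling an observation around it.
samples = []
for _ in range(100_000):
    mu = random.gauss(m0, s_mu)
    samples.append(random.gauss(mu, s_x))

mc_var = statistics.pvariance(samples)

# Law of total variance: Var[x] = E[Var[x|mu]] + Var[E[x|mu]]
analytic = s_x**2 + s_mu**2  # expected variance + variance of the mean
```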
Finally, the mixed approach consists in treating the update of x_{j+1} as a weighted average of these two limit cases, given the learned mixing coefficient w: θ_{j+1} = w θ_j + (1 − w) θ_0. Using this update scheme, the agent erases her memory of the posterior distribution at a rate dictated by the stability of the environment, and θ_j → θ_0 as j → ∞. Together with the decision algorithm presented in Section 2.6, this result shows that, as the posterior distribution broadens and approaches the initial prior, the likelihood of choosing this action also increases, because the expected random noise of the evidence accumulation process increases as well.
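As a sketch (not the paper's exact implementation), assuming an exponential-family posterior so that a geometric mixture of prior and posterior corresponds to linear interpolation of the natural parameters, the forgetting dynamics for a non-selected action can be illustrated as follows; `theta0`, `theta` and `w` are illustrative values:

```python
def mixed_update(theta_post, theta_prior, w):
    """One unobserved-trial update: theta_{j+1} = w * theta_j + (1 - w) * theta_0."""
    return w * theta_post + (1.0 - w) * theta_prior

theta0 = 0.0   # fixed initial prior parameter
theta = 5.0    # current posterior parameter of the non-selected action
w = 0.8        # learned mixing coefficient (belief in environmental stability)

trajectory = [theta]
for _ in range(20):
    theta = mixed_update(theta, theta0, w)
    trajectory.append(theta)

# The parameter decays geometrically toward the prior:
# theta_j = theta0 + w**j * (theta_init - theta0)
```

The closer w is to 1 (a stable environment), the more slowly the memory of the posterior is erased.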

C.3 Delayed approximate posterior updating
Another approach for the agent to account for the evolution of the environment, when selecting an action whose outcome has not been observed for a long time, is to simulate the expected waning of the previous update across the elapsed interval of time.
Let us consider the case where the agent has chosen an action (e.g. left) at trial j, and then the opposite action (right) for n trials. We assume that, during these n trials, she has been updating only the value of the right action, leaving the approximate posterior over the distribution parameters of the value of the left action untouched. When selecting left at trial j + n, she considers that the weight of the posterior component of the prior distribution has decayed exponentially over the n trials. The prior then reads:

p(z) = p(z | θ_j)^{w^n} p(z | θ_0)^{1 − w^n} / Z

In order to compute the NCVMP update of the approximate posterior parameters over w, we then need to compute the expected value of w^n. If q_j(w) is set to be a beta distribution with parameters φ_j = {φ_{1j}, φ_{2j}}, we have

E_{q_j}[w^n] = B(φ_{1j} + n, φ_{2j}) / B(φ_{1j}, φ_{2j}),

where B(a, b) = Γ(a)Γ(b)/Γ(a + b). The expected value of the squared decay w^{2n} (used to compute the variance in the expected value of the second-order Taylor expansion of the log-partition function, see Appendix A) can be computed readily with the same formula by replacing n with 2n.
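The beta-moment formula above can be evaluated in a numerically stable way through log-gamma functions. A short sketch (function names are illustrative, not from the paper):

```python
from math import exp, lgamma

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def expected_decay(phi1, phi2, n):
    """E[w^n] for w ~ Beta(phi1, phi2): B(phi1 + n, phi2) / B(phi1, phi2)."""
    return exp(log_beta(phi1 + n, phi2) - log_beta(phi1, phi2))

# Sanity check: under a uniform prior on w (phi1 = phi2 = 1),
# E[w^n] = 1 / (n + 1), so expected_decay(1, 1, 4) ~ 0.2.
mean_decay = expected_decay(1.0, 1.0, 4)

# The squared decay uses the same formula with n replaced by 2n,
# giving the variance of w^n:
var_decay = expected_decay(1.0, 1.0, 8) - mean_decay**2
```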
Unlike in the previous method, the shape parameter of the gamma distribution over the variance does not need to be greater than 1 here, as we do not need to simulate the variance of the non-selected action value. However, this holds only for the single-stage case: the multi-stage case also bases its value updates on expected squared rewards, and therefore requires α_{σ0} > 1 as well.