Identifiability of Gaussian Bayesian bandit models

The Kalman filter, combined with heuristic choice rules such as softmax, UCB, and Thompson sampling, has been a popular model to identify the role of uncertainty in exploration in human reinforcement learning. Here we show that the Kalman filter combined with a softmax or UCB choice rule is not fully identifiable. By structural identifiability, we mean that with unlimited data, the true parameter values can be determined. Perhaps surprisingly, the Kalman filter with Thompson sampling is fully identifiable.


Introduction
There has been much interest in identifying the role of uncertainty in exploration in human reinforcement learning (Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006; Gershman, 2018; Knox, Otto, Stone, & Love, 2012; Speekenbrink & Konstantinidis, 2015; Wilson, Geana, White, Ludvig, & Cohen, 2014; Wu, Schulz, Speekenbrink, Nelson, & Meder, 2018). Restless multi-armed bandit tasks are a useful paradigm to empirically investigate this (Daw et al., 2006; Speekenbrink & Konstantinidis, 2015), as they require continued exploration long after all options have been tried initially. A prominent learning model for human behaviour in such tasks is the Kalman filter (Daw et al., 2006; Gershman, 2015; Speekenbrink & Konstantinidis, 2015). The Kalman filter provides a principled and computationally efficient way to track both estimated value and the uncertainty in these estimates. Combining the Kalman filter with heuristic choice rules which aim to balance exploration and exploitation, such as the softmax, upper-confidence bound (UCB), or Thompson sampling, offers a powerful and flexible computational framework to assess the role of uncertainty in human reinforcement learning. When the Kalman filter is an adequate descriptive model of how agents learn (expected) rewards, estimating the relevant parameters of the Kalman filter provides a window into their inductive biases, such as how variable they believe rewards are, and how changeable the environment is over time.
This paper addresses to what extent the parameters of models combining the Kalman filter with heuristic choice rules are structurally identifiable, in the sense that with unlimited data, we would be able to determine the true value of their parameters. For identifiable models, parameter estimates can be appropriately compared between or within people and related to neural functioning. While the Kalman filter combined with a softmax or UCB rule is not fully identifiable, perhaps surprisingly, the Kalman filter with Thompson sampling is.

Gaussian restless bandits
We will focus on a simple and canonical version of a restless bandit, where we assume that rewards $R_{t,i}$ at time $t$ for bandit $i$ are continuous and normally distributed around the average reward $\mu_{t,i}$ for that option at that time, while the average rewards vary over time according to a simple random walk:
$$R_{t,i} = \mu_{t,i} + \varepsilon_{t,i}, \qquad \varepsilon_{t,i} \sim \mathcal{N}(0, \sigma^2_\varepsilon) \tag{1}$$
$$\mu_{t,i} = \mu_{t-1,i} + \xi_{t,i}, \qquad \xi_{t,i} \sim \mathcal{N}(0, \sigma^2_\xi) \tag{2}$$
where $\mathcal{N}(m, v)$ denotes a Gaussian (normal) distribution with mean $m$ and variance $v$. We refer to $\sigma^2_\xi$ as the innovation variance and $\sigma^2_\varepsilon$ as the noise variance. Although we focus on an environment like the one above, for which the Kalman filter is optimal, the results hold for any task in which a Kalman filter is assumed to be an (approximate) learning model for average rewards.
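As a minimal sketch of this environment (not code from the paper), the random walk on the average rewards and the noisy rewards around them can be simulated as follows. The function name `simulate_restless_bandit`, the common starting mean `mu0`, and the parameter values in the usage line are illustrative assumptions.

```python
import random


def simulate_restless_bandit(n_trials, n_arms, sigma2_xi, sigma2_eps, mu0=0.0, seed=0):
    """Simulate average rewards (random walk) and noisy rewards for every arm."""
    rng = random.Random(seed)
    mu = [mu0] * n_arms  # assumed common starting value for all average rewards
    means, rewards = [], []
    for _ in range(n_trials):
        # Random walk: mu_t = mu_{t-1} + xi_t, with xi_t ~ N(0, sigma2_xi)
        mu = [m + rng.gauss(0.0, sigma2_xi ** 0.5) for m in mu]
        # Reward: R_t = mu_t + eps_t, with eps_t ~ N(0, sigma2_eps)
        r = [m + rng.gauss(0.0, sigma2_eps ** 0.5) for m in mu]
        means.append(mu)
        rewards.append(r)
    return means, rewards


# Hypothetical settings: 200 trials, 4 bandits.
means, rewards = simulate_restless_bandit(200, 4, sigma2_xi=4.0, sigma2_eps=16.0)
```

Note that `random.gauss` takes a standard deviation, so the variances are converted with a square root.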

Bayesian learning and decision models

Kalman filter
The Kalman filter (Kalman, 1960; Kalman & Bucy, 1961) is an efficient algorithm to compute the posterior distributions of $\mu_{t,i}$ for linear Gaussian dynamical systems such as that defined by Equations 1 and 2. Assuming that the innovation and noise variances are known, and assuming a Gaussian prior for the initial mean, $\mu_{0,i} \sim \mathcal{N}(m_0, v_0)$, the posterior distributions are all Gaussian:
$$p(\mu_{t,i} \mid C_{0:t}, R_{0:t}) = \mathcal{N}(m_{t,i}, v_{t,i})$$
where $C_{0:t} = (C_1, \ldots, C_t)$ and we have taken the liberty to define $C_0 = \emptyset$ and apply the same notation for $R_{0:t}$.
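The Kalman filter provides an efficient way to sequentially calculate the mean $m_{t,i}$ and variance $v_{t,i}$ of these posterior distributions. The Kalman filter update equations are:
$$m_{t,i} = m_{t-1,i} + k_{t,i} (R_{t,i} - m_{t-1,i})$$
and
$$v_{t,i} = (1 - k_{t,i})(v_{t-1,i} + \sigma^2_\xi)$$
with the Kalman gain:
$$k_{t,i} = \begin{cases} \dfrac{v_{t-1,i} + \sigma^2_\xi}{v_{t-1,i} + \sigma^2_\xi + \sigma^2_\varepsilon} & \text{if } C_t = i \\ 0 & \text{otherwise} \end{cases}$$
At time $t$, before making a choice, the Bayesian value of each bandit (its average reward) can be derived from the prior predictive distribution
$$p(\mu_{t,i} \mid C_{0:(t-1)}, R_{0:(t-1)}) = \mathcal{N}(m_{t-1,i}, v_{t-1,i} + \sigma^2_\xi)$$

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Softmax
The softmax rule can be viewed as a stochastic choice rule in which the probability of choosing a bandit depends solely on the estimated mean rewards, i.e. the prior predictive means $m_{t-1,i}$. It can be stated as:
$$P(C_t = i \mid C_{0:(t-1)}, R_{0:(t-1)}) = \frac{\exp(\gamma m_{t-1,i})}{\sum_j \exp(\gamma m_{t-1,j})}$$
where the inverse temperature parameter $\gamma$ allows choices to vary from uniformly random ($\gamma = 0$) to deterministically always choosing the option with the highest estimated mean reward ($\gamma \to \infty$).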
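The learning and choice steps described above can be sketched in a few lines. This is an illustrative implementation under the standard form of the Kalman filter for a random-walk mean (gain equal to prior predictive variance over prior predictive variance plus noise variance, and zero for unchosen bandits); the function names `kalman_update` and `softmax_probs` are assumptions, not names from the paper.

```python
import math


def kalman_update(m, v, choice, reward, sigma2_xi, sigma2_eps):
    """One Kalman filter step for all arms after observing one reward."""
    new_m, new_v = [], []
    for i, (m_i, v_i) in enumerate(zip(m, v)):
        prior_var = v_i + sigma2_xi  # prior predictive variance
        # The gain is zero for unchosen arms: their mean stays put while
        # their variance grows by the innovation variance.
        k = prior_var / (prior_var + sigma2_eps) if i == choice else 0.0
        new_m.append(m_i + k * (reward - m_i))
        new_v.append((1.0 - k) * prior_var)
    return new_m, new_v


def softmax_probs(m, gamma):
    """Softmax choice probabilities computed from estimated means only."""
    top = max(gamma * x for x in m)  # subtract the maximum for numerical stability
    weights = [math.exp(gamma * x - top) for x in m]
    total = sum(weights)
    return [w / total for w in weights]


# Two arms with prior N(0, 10); arm 0 is chosen and pays a reward of 6.
m, v = kalman_update([0.0, 0.0], [10.0, 10.0], 0, 6.0, sigma2_xi=1.0, sigma2_eps=1.0)
probs = softmax_probs(m, gamma=0.5)
```

In this usage example the gain for the chosen arm is 11/12, so its mean moves most of the way towards the observed reward, while the unchosen arm only becomes more uncertain.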

Upper confidence bound (UCB)
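The upper confidence bound (UCB) strategy can be defined as follows:
$$P(C_t = i \mid C_{0:(t-1)}, R_{0:(t-1)}) = \begin{cases} 1 & \text{if } \forall j \neq i : \mathrm{UCB}_{t,i} > \mathrm{UCB}_{t,j} \\ 0 & \text{otherwise} \end{cases}$$
where the upper confidence bound is defined as
$$\mathrm{UCB}_{t,i} = m_{t-1,i} + \beta \sqrt{v_{t-1,i} + \sigma^2_\xi}$$
and the parameter $\beta$ defines the width of the confidence interval, e.g. setting $\beta \approx 1.96$ results in always choosing the bandit with the highest upper bound of the 95% confidence interval.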
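As the UCB rule is deterministic, it is generally not applied in the manner above to human choices. One way to allow for deviations is to use an "epsilon-greedy" style implementation, such that the bandit with the highest UCB is chosen with probability $1 - \epsilon$, while with probability $\epsilon$, a bandit is chosen uniformly at random. Another, more popular, stochastic version of the UCB rule is to use a softmax version (e.g. Daw et al., 2006; Speekenbrink & Konstantinidis, 2015; Wu et al., 2018):
$$P(C_t = i \mid C_{0:(t-1)}, R_{0:(t-1)}) = \frac{\exp\left(\gamma \left(m_{t-1,i} + \beta \sqrt{v_{t-1,i} + \sigma^2_\xi}\right)\right)}{\sum_j \exp\left(\gamma \left(m_{t-1,j} + \beta \sqrt{v_{t-1,j} + \sigma^2_\xi}\right)\right)}$$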
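The three UCB variants discussed above differ only in how the bounds are turned into choice probabilities. The sketch below assumes the bound is the prior predictive mean plus $\beta$ times the prior predictive standard deviation; all function names are illustrative, and ties in the deterministic rule are broken in favour of the first bandit, which is an arbitrary choice.

```python
import math


def ucb_values(m, v, beta, sigma2_xi):
    """Upper confidence bounds from posterior means and variances."""
    return [m_i + beta * math.sqrt(v_i + sigma2_xi) for m_i, v_i in zip(m, v)]


def deterministic_ucb_probs(ucb):
    """Probability 1 for the arm with the highest bound (first arm on ties)."""
    best = max(range(len(ucb)), key=lambda i: ucb[i])
    return [1.0 if i == best else 0.0 for i in range(len(ucb))]


def epsilon_greedy_ucb_probs(ucb, epsilon):
    """Best arm with probability 1 - epsilon, otherwise a uniform random arm."""
    n = len(ucb)
    return [(1.0 - epsilon) * p + epsilon / n for p in deterministic_ucb_probs(ucb)]


def softmax_ucb_probs(ucb, gamma):
    """Softmax applied to the bounds instead of the means."""
    top = max(gamma * u for u in ucb)
    weights = [math.exp(gamma * u - top) for u in ucb]
    total = sum(weights)
    return [w / total for w in weights]


# Arm 0 has the higher mean, arm 1 the higher uncertainty.
ucb = ucb_values(m=[1.0, 0.0], v=[0.0, 8.0], beta=1.0, sigma2_xi=1.0)
```

In the usage example the uncertain arm ends up with the higher bound (3 against 2), so all three rules favour it even though its estimated mean is lower.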

Thompson sampling
Thompson sampling (Thompson, 1933; May, Korda, Lee, & Leslie, 2012), like the UCB rule, depends on both estimated value and the uncertainty in those estimates. In words, it matches the probability of choosing a bandit with the probability that it has the highest expected reward. As a Bayesian decision rule, it is based on the prior predictive distributions $p(\mu_{t,j} \mid C_{0:(t-1)}, R_{0:(t-1)})$:
$$P(C_t = i \mid C_{0:(t-1)}, R_{0:(t-1)}) = P(\forall j \neq i : \tilde{m}_{t,i} > \tilde{m}_{t,j}) \tag{11}$$
where $\tilde{m}_{t,i} \sim \mathcal{N}(m_{t-1,i}, v_{t-1,i} + \sigma^2_\xi)$ is a sample from the prior predictive distribution of the mean $\mu_{t,i}$. In contrast to the softmax and UCB rules, there are no further adjustable parameters; it only needs (sensible) values for the environmental parameters $m_0$, $\sigma^2_\xi$, and $\sigma^2_\varepsilon$.
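Because the Thompson sampling choice probability is the probability that one Gaussian sample exceeds all others, it generally has no simple closed form for more than two bandits. A rough sketch of two ways to evaluate it: a Monte Carlo estimate for any number of bandits, and the exact value for two bandits via the normal distribution function. Function names, the sample size, and the seed are illustrative assumptions.

```python
import math
import random


def thompson_probs_mc(m, v, sigma2_xi, n_samples=20000, seed=0):
    """Monte Carlo estimate of Thompson sampling choice probabilities."""
    rng = random.Random(seed)
    sds = [math.sqrt(v_i + sigma2_xi) for v_i in v]  # prior predictive SDs
    wins = [0] * len(m)
    for _ in range(n_samples):
        draws = [rng.gauss(m_i, sd) for m_i, sd in zip(m, sds)]
        wins[max(range(len(m)), key=lambda i: draws[i])] += 1
    return [w / n_samples for w in wins]


def thompson_prob_two_arms(m, v, sigma2_xi):
    """Exact probability of choosing arm 0 when there are exactly two arms."""
    # The difference of two independent Gaussian samples is Gaussian.
    diff_sd = math.sqrt(v[0] + v[1] + 2.0 * sigma2_xi)
    z = (m[0] - m[1]) / diff_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


exact = thompson_prob_two_arms([1.0, 0.0], [1.0, 2.0], sigma2_xi=0.5)
approx = thompson_probs_mc([1.0, 0.0], [1.0, 2.0], sigma2_xi=0.5)
```

For two bandits the two functions should agree up to sampling error, which gives a simple check on the Monte Carlo version before applying it to more bandits.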

Model identifiability
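Identifiability of a statistical model roughly means that any change in model parameters implies a change in the likelihood. More formally, a model with parameters $\theta \in \Theta$, where $\Theta$ is the parameter space, is identifiable when, for (almost) all possible observations $c \in \mathcal{C}$,
$$p(c \mid \theta) = p(c \mid \theta') \Leftrightarrow \theta = \theta'$$

Identifiability of the KF-SM model
The Kalman filter softmax model (KF-SM), with $\theta_{sm} = (\gamma, m_0, v_0, \sigma^2_\xi, \sigma^2_\varepsilon)$, is not identifiable. The problem here is that we can rescale the variance parameters $v_0$, $\sigma^2_\xi$, and $\sigma^2_\varepsilon$ by a common scaling factor $\alpha > 0$, such that $v'_0 = \alpha v_0$, $\sigma'^2_\xi = \alpha \sigma^2_\xi$, and $\sigma'^2_\varepsilon = \alpha \sigma^2_\varepsilon$, and get the same likelihood for $\theta_{sm}$ and $\theta'_{sm} = (\gamma, m_0, v'_0, \sigma'^2_\xi, \sigma'^2_\varepsilon)$. The likelihood of this model depends on the variance parameters only through the means $m_{t,i}$, and these depend on the variances only through the Kalman gain. For $\theta'_{sm}$, the Kalman gain of the chosen bandit at $t = 1$ is
$$k'_{1,i} = \frac{\alpha v_{0,i} + \alpha \sigma^2_\xi}{\alpha v_{0,i} + \alpha \sigma^2_\xi + \alpha \sigma^2_\varepsilon} = \frac{v_{0,i} + \sigma^2_\xi}{v_{0,i} + \sigma^2_\xi + \sigma^2_\varepsilon} = k_{1,i} \tag{13}$$
where $v_{0,i} = v_0$ for all bandits. Hence, $\theta_{sm}$ and $\theta'_{sm}$ lead to identical Kalman gain $k_{1,i}$ for all bandits $i$. In fact, the Kalman gain is identical at all $t > 1$. For $\theta'_{sm}$, the posterior variance is $v'_{1,i} = (1 - k_{1,i})(v'_{0,i} + \sigma'^2_\xi) = (1 - k_{1,i})(\alpha v_{0,i} + \alpha \sigma^2_\xi) = \alpha v_{1,i}$, from which it follows that $v'_{t,i} = \alpha v_{t,i}$ for all $t > 0$. Hence, we can replace $v_{0,i}$ in Eq 13 by $v_{t,i}$, which shows that $k_{t,i}$ is identical for $\theta_{sm}$ and $\theta'_{sm}$ for all $t > 0$.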
This means that only the relative values of $v_0$, $\sigma^2_\xi$, and $\sigma^2_\varepsilon$ are identifiable in the KF-SM model. When one of the variance parameters is fixed to an arbitrary nonzero value, the remaining parameters are identifiable.
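The scaling argument can also be checked numerically. The following sketch computes the KF-SM log-likelihood of one fixed choice sequence under a parameter set and under a copy with all three variances multiplied by a common factor; the two values should agree up to floating-point error, whereas changing only one variance should not leave the likelihood unchanged. The data are arbitrary synthetic numbers generated only for this comparison, and the function name and parameter values are assumptions.

```python
import math
import random


def kf_sm_loglik(choices, rewards, gamma, m0, v0, sigma2_xi, sigma2_eps, n_arms):
    """Log-likelihood of a choice sequence under Kalman filter + softmax."""
    m, v = [m0] * n_arms, [v0] * n_arms
    loglik = 0.0
    for c, r in zip(choices, rewards):
        # Softmax over the current estimated means (log-sum-exp for stability).
        top = max(gamma * x for x in m)
        log_norm = top + math.log(sum(math.exp(gamma * x - top) for x in m))
        loglik += gamma * m[c] - log_norm
        # Kalman update: every variance grows, only the chosen arm's mean moves.
        for i in range(n_arms):
            prior_var = v[i] + sigma2_xi
            k = prior_var / (prior_var + sigma2_eps) if i == c else 0.0
            m[i] += k * (r - m[i])
            v[i] = (1.0 - k) * prior_var
    return loglik


# Arbitrary synthetic choices and rewards, not data from any experiment.
rng = random.Random(1)
n_arms, n_trials = 4, 100
choices = [rng.randrange(n_arms) for _ in range(n_trials)]
rewards = [rng.gauss(0.0, 10.0) for _ in range(n_trials)]

alpha = 3.0
ll_original = kf_sm_loglik(choices, rewards, 0.2, 0.0, 100.0, 4.0, 16.0, n_arms)
ll_rescaled = kf_sm_loglik(choices, rewards, 0.2, 0.0,
                           alpha * 100.0, alpha * 4.0, alpha * 16.0, n_arms)
ll_one_changed = kf_sm_loglik(choices, rewards, 0.2, 0.0,
                              100.0, 4.0, alpha * 16.0, n_arms)
print(ll_original, ll_rescaled, ll_one_changed)
```

A check of this kind does not prove non-identifiability, but it is a cheap way to catch a mistake in either the derivation or a model implementation.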

Identifiability of the KF-UCB model
The Kalman filter UCB model (KF-UCB), with $\theta_{ucb} = (\beta, m_0, v_0, \sigma^2_\xi, \sigma^2_\varepsilon)$, is not identifiable. Although the likelihood of this model depends on both the means $m_{t,i}$ and variances $v_{t,i}$, rescaling $v_0$, $\sigma^2_\xi$, and $\sigma^2_\varepsilon$ by a common factor $\alpha$, as above, will again provide identical likelihood values. As shown above, the prior predictive variance then becomes $v'_{t,i} + \sigma'^2_\xi = \alpha(v_{t,i} + \sigma^2_\xi)$. By setting $\beta' = \beta / \sqrt{\alpha}$, the width of the confidence bound is unchanged, since $\beta' \sqrt{\alpha(v_{t,i} + \sigma^2_\xi)} = \beta \sqrt{v_{t,i} + \sigma^2_\xi}$, and the likelihood is identical for $\theta_{ucb}$ and $\theta'_{ucb} = (\beta', m_0, v'_0, \sigma'^2_\xi, \sigma'^2_\varepsilon)$. The same will hold for the stochastic versions of the KF-UCB model. Again, one of the variance parameters can be fixed to an arbitrary value $\neq 0$, which will result in the other parameters being identifiable.
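The compensation between the variance scale and $\beta$ can be illustrated with the softmax version of the UCB rule. This sketch assumes the softmax is applied to the bound $m + \beta\sqrt{v + \sigma^2_\xi}$ with inverse temperature $\gamma$; the synthetic data, function name, and parameter values are illustrative assumptions.

```python
import math
import random


def kf_ucb_loglik(choices, rewards, gamma, beta, m0, v0, sigma2_xi, sigma2_eps, n_arms):
    """Log-likelihood of a choice sequence under Kalman filter + softmax-UCB."""
    m, v = [m0] * n_arms, [v0] * n_arms
    loglik = 0.0
    for c, r in zip(choices, rewards):
        # Scaled upper confidence bounds: mean plus beta prior predictive SDs.
        u = [gamma * (m[i] + beta * math.sqrt(v[i] + sigma2_xi)) for i in range(n_arms)]
        top = max(u)
        loglik += u[c] - top - math.log(sum(math.exp(x - top) for x in u))
        # Kalman update, as in the learning model described earlier.
        for i in range(n_arms):
            prior_var = v[i] + sigma2_xi
            k = prior_var / (prior_var + sigma2_eps) if i == c else 0.0
            m[i] += k * (r - m[i])
            v[i] = (1.0 - k) * prior_var
    return loglik


# Arbitrary synthetic choices and rewards, not data from any experiment.
rng = random.Random(2)
n_arms, n_trials = 4, 100
choices = [rng.randrange(n_arms) for _ in range(n_trials)]
rewards = [rng.gauss(0.0, 10.0) for _ in range(n_trials)]

alpha, beta = 3.0, 1.5
ll_original = kf_ucb_loglik(choices, rewards, 0.2, beta, 0.0, 100.0, 4.0, 16.0, n_arms)
# Variances scaled by alpha, beta scaled by 1 / sqrt(alpha): same likelihood expected.
ll_compensated = kf_ucb_loglik(choices, rewards, 0.2, beta / math.sqrt(alpha), 0.0,
                               alpha * 100.0, alpha * 4.0, alpha * 16.0, n_arms)
# Variances scaled by alpha, beta left alone: the likelihood should change.
ll_uncompensated = kf_ucb_loglik(choices, rewards, 0.2, beta, 0.0,
                                 alpha * 100.0, alpha * 4.0, alpha * 16.0, n_arms)
print(ll_original, ll_compensated, ll_uncompensated)
```

The contrast between the last two calls shows why the non-identifiability here involves $\beta$ as well as the variances: rescaling the variances alone does change the likelihood, and only the joint change leaves it intact.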

Identifiability of the KF-TS model
The Kalman filter Thompson sampling (KF-TS) model with $\theta_{ts} = (m_0, v_0, \sigma^2_\xi, \sigma^2_\varepsilon)$ is identifiable. While scaling the variances as above will provide the same prior predictive means, the prior predictive variances will be affected uniquely, and with that the choice probabilities.
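To make this concrete, consider the special case of two bandits (an illustrative assumption; the argument in the text is general). The difference of two independent Gaussian samples is Gaussian, so the choice probability can be written with the standard normal distribution function $\Phi$, and the effect of rescaling all variances by $\alpha$ can be read off directly. The shorthand $\Delta_t$ and $s_t$ is introduced only for this sketch.

```latex
\begin{align*}
P(C_t = 1 \mid C_{0:(t-1)}, R_{0:(t-1)})
  &= P(\tilde{m}_{t,1} > \tilde{m}_{t,2})
   = \Phi\!\left(\frac{\Delta_t}{s_t}\right), \\
\Delta_t &= m_{t-1,1} - m_{t-1,2}, \qquad
s_t^2 = v_{t-1,1} + v_{t-1,2} + 2\sigma^2_\xi . \\
\intertext{Rescaling $v_0$, $\sigma^2_\xi$, and $\sigma^2_\varepsilon$ by $\alpha$ leaves the
Kalman gain and hence the means unchanged, but multiplies every variance by $\alpha$:}
\Delta'_t &= \Delta_t, \qquad s'^2_t = \alpha s_t^2, \\
P'(C_t = 1 \mid C_{0:(t-1)}, R_{0:(t-1)})
  &= \Phi\!\left(\frac{\Delta_t}{\sqrt{\alpha}\, s_t}\right)
  \neq \Phi\!\left(\frac{\Delta_t}{s_t}\right)
  \quad \text{whenever } \alpha \neq 1 \text{ and } \Delta_t \neq 0 .
\end{align*}
```

Unlike in the UCB model, there is no free parameter such as $\beta$ that could absorb the factor $\sqrt{\alpha}$, so in this two-bandit case the rescaled parameters produce different choice probabilities as soon as the estimated means differ.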

Conclusions
We have shown that only the Kalman filter with Thompson sampling is fully identifiable, while the Kalman filter with softmax or UCB is not. For the latter models, one of the variance parameters needs to be fixed to an arbitrary value. While this often has been done (e.g. Speekenbrink & Konstantinidis, 2015), the reason has not been laid out clearly. While identifying all the parameters may not be a primary concern, in many cases researchers are interested in comparing parameter estimates between people, or in correlating these with neural signals. In these cases, it is important to realise what the consequences are of only having access to e.g. relative variances. When the Kalman filter with Thompson sampling is an adequate descriptive model, as all parameters are identifiable, it is possible to compare people according to e.g. the level of their prior uncertainty.
The results about identifiability presented here generalize immediately to the "mean-stable" version of the Kalman filter (where $\sigma^2_\xi$ is fixed to 0). In this case, the other variances ($v_0$ and $\sigma^2_\varepsilon$) are still not identifiable, so one of these needs to be fixed to an arbitrary value $\neq 0$. Generalization to other Gaussian learning models, such as Gaussian Process regression (e.g. Wu et al., 2018; Schulz, Speekenbrink, & Krause, 2018), will also be relatively straightforward.
Parameter identifiability is an important but often overlooked aspect of computational modelling of empirical data. We have focused here on the structural identifiability of relatively simple models, and we could show that particular models were not fully identifiable. For more complex models, structural identifiability may not be as straightforward to determine analytically. In those cases, one may attempt to address the identifiability of a model by more "empirical" methods, such as assessing the "flatness" of the profile likelihood (Raue et al., 2009). Such methods may also be able to detect practical non-identifiability (in the sense that a particular data set is insufficient to estimate the parameters with any precision). As the models become more complex, assessing the structural and practical identifiability of models will become an increasingly difficult but important aspect of computational cognitive neuroscience.