Probabilistic Successor Representations with Kalman Temporal Differences

The effectiveness of Reinforcement Learning (RL) depends on an animal's ability to assign credit for rewards to the appropriate preceding stimuli. One aspect of understanding the neural underpinnings of this process involves understanding what sorts of stimulus representations support generalisation. The Successor Representation (SR), which enforces generalisation over states that predict similar outcomes, has become an increasingly popular model in this line of inquiry. Another dimension of credit assignment involves understanding how animals handle uncertainty about learned associations, using probabilistic methods such as Kalman Temporal Differences (KTD). Combining these approaches, we propose using KTD to estimate a distribution over the SR. KTD-SR captures uncertainty about the estimated SR as well as covariances between different long-term predictions. We show that because of this, KTD-SR exhibits partial transition revaluation, as humans do, without additional replay, unlike the standard TD-SR algorithm. We conclude by discussing future applications of the KTD-SR as a model of the interaction between predictive and probabilistic animal reasoning.


Introduction
An impressive signature of animal behavior is the capacity to flexibly learn relationships between the environment and reward. One approach to understanding this behavior involves investigating how the brain represents different stimuli such that credit for reward is generalised appropriately. Predictive representations, like the Successor Representation (SR) (Dayan, 1993), generalise over stimuli that predict similar futures and can provide a useful balance between efficiency and flexibility (Gershman, 2018). SR learning is faster to adapt to change than model-free (MF) learning, particularly changes in reward location, and supports more efficient state evaluation than model-based (MB) algorithms, which use time-consuming forward simulations to evaluate states. Since this efficiency depends on caching long-term expected state occupancies, however, the SR is worse than MB at handling changes in the environment's transition structure. In neuroscience and psychology, the SR offers a compelling explanation for a range of behavioural and neural findings (Stachenfeld, Botvinick, & Gershman, 2017; Gardner, Schoenbaum, & Gershman, 2018; Garvert, Dolan, & Behrens, 2017).
While the SR offers a solution to some of the shortcomings of model-free learning, existing methods for estimating the SR, such as temporal difference (TD) learning, do not take uncertainty into account. Here, we attempt to rectify this by drawing on the Kalman TD (KTD) method for value learning (Geist & Pietquin, 2010), which explains a range of animal conditioning phenomena that standard TD cannot explain (Gershman, 2015). KTD-SR gives the agent an estimate of its uncertainty in the SR as well as the covariance between different entries of the SR. We show how this augments the SR's capacity to support revaluation following changes in transition structure.

The successor representation
We define an RL environment to be a Markov Decision Process consisting of states \(s\) the agent can occupy, transition probabilities \(T^\pi(s'|s)\) of moving from state \(s\) to state \(s'\) given the agent's policy \(\pi(a|s)\) over actions \(a\), and the reward available at each state, for which \(R(s)\) denotes the expectation.
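An RL agent is tasked with finding a policy that maximises its expected discounted total future reward, or value:
\[ V(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s \right], \tag{1} \]
where \(t\) indexes timestep and \(\gamma\), where \(0 \le \gamma < 1\), is a discount factor that down-weights distal rewards.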
The value function can be decomposed into a product of the reward function \(R\) and the SR matrix \(M\) (Dayan, 1993):
\[ V(s) = \sum_{s'} M(s, s') R(s'). \tag{2} \]
\(M\) is defined such that each entry \(M(s, s')\) gives the expected discounted future number of times the agent will visit \(s'\) from starting state \(s\), under the current policy (Dayan, 1993):
\[ M(s, s') = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \mathbb{I}(s_t = s') \,\middle|\, s_0 = s \right], \tag{3} \]
where \(\mathbb{I}(s_t = s') = 1\) if \(s_t = s'\) and 0 otherwise. Each row \(M(s, :)\) in this matrix constitutes the SR for some state \(s\), thus representing each state as a vector over future "successor states." Factorising value into an SR term and a reward term permits greater flexibility because if one term changes, it can be relearned while the other remains intact (Dayan, 1993; Gershman, 2018). We first consider the SR in a tabular setting with deterministic transitions and a fixed, deterministic policy. This means that there is only one possible state \(s_{t+1}\) following any predecessor state \(s_t\). In this setting, the SR matrix rows of two temporally adjacent states \(s_t, s_{t+1}\) can be recursively related as follows:
\[ M(s_t, :) = \boldsymbol{\phi}(s_t) + \gamma M(s_{t+1}, :), \tag{4} \]
where \(\boldsymbol{\phi}(s)\) is the feature vector (of length \(n\), the number of features) observed by the agent in state \(s\). In this article, we consider problems with discrete state spaces, for which the feature vector \(\boldsymbol{\phi}(s)\) is a one-hot vector with an entry for every state and a 1 only in the \(s\)-th position. Equation 4 is analogous to the Bellman equation for value widely used in RL (Sutton & Barto, 1998), with the vector-valued \(M(s_t, :)\) in lieu of scalar \(V(s_t)\).
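As a concrete check on these definitions, the following is a minimal sketch that computes the SR of a small deterministic chain and recovers value as the product of the SR and the reward vector. It relies on the standard closed form M = (I - γT)^(-1), obtained by summing the discounted powers of the transition matrix. This closed form, the three-state chain, the reward vector and the name `successor_matrix` are illustrative assumptions and do not appear in the text.

```python
import numpy as np

def successor_matrix(T, gamma):
    """Closed-form SR for a fixed policy: M = sum_t gamma^t T^t = (I - gamma T)^-1."""
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

# Hypothetical deterministic chain 0 -> 1 -> 2; state 2 is terminal (no successor).
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
R = np.array([0.0, 0.0, 1.0])   # assumed reward vector: reward only in state 2
gamma = 0.9

M = successor_matrix(T, gamma)  # row s is the SR of state s
V = M @ R                       # value as SR times reward

# Recursive relation between adjacent rows: M[0] = phi(0) + gamma * M[1]
phi0 = np.eye(3)[0]
recursion_holds = np.allclose(M[0], phi0 + gamma * M[1])
print(M, V, recursion_holds)
```

Changing `R` alone changes `V` without recomputing `M`, which is the flexibility that the factorisation is meant to provide.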
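Using this recursion, we can express the estimated current one-hot state vector (based on the estimated SR \(\hat{M}\)) as the difference between two successive SR rows:
\[ \hat{\boldsymbol{\phi}}(s_t) = \hat{M}(s_t, :) - \gamma \hat{M}(s_{t+1}, :). \]
The (vector-valued) successor prediction error, used to update the SR in TD methods, is then given by
\[ \boldsymbol{\delta}_t = \boldsymbol{\phi}(s_t) - \hat{\boldsymbol{\phi}}(s_t) = \boldsymbol{\phi}(s_t) + \gamma \hat{M}(s_{t+1}, :) - \hat{M}(s_t, :). \]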
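The TD update of the SR driven by the successor prediction error can be sketched as follows. This is a minimal illustration and not the authors' code: the learning rate `alpha`, the convention that `s_next=None` marks a terminal transition with no successor, and the three-state chain are all assumptions made for the example.

```python
import numpy as np

def td_sr_update(M, s, s_next, gamma=0.9, alpha=0.1):
    """One TD-SR update after observing the transition s -> s_next.

    Only row s of M changes. s_next=None (an assumed convention) marks a
    terminal transition, for which the bootstrap term is zero.
    """
    n = M.shape[0]
    phi = np.eye(n)[s]                                   # one-hot feature vector
    bootstrap = M[s_next] if s_next is not None else np.zeros(n)
    delta = phi + gamma * bootstrap - M[s]               # successor prediction error
    M_new = M.copy()
    M_new[s] += alpha * delta
    return M_new, delta

# Repeatedly experience the hypothetical chain 0 -> 1 -> 2 -> (end).
M = np.zeros((3, 3))
for _ in range(500):
    for s, s_next in [(0, 1), (1, 2), (2, None)]:
        M, delta = td_sr_update(M, s, s_next)
print(np.round(M, 3))
```

Because each update touches a single row, information about a changed transition reaches upstream states only when those states are visited again.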

Learning a probabilistic SR using a Kalman Filter
The algorithm described above produces a point estimate of the SR. While useful for approximating expected value, it is not capable of expressing certainty in these estimates. In order to derive a probabilistic interpretation of the SR, we assume that the agent has an internal generative model of how sensory data are generated from the SR parameters, which can be learned with KTD (Geist & Pietquin, 2010; Gershman, 2015). This model consists of a prior distribution on the (hidden) parameters, \(p(\mathbf{m}_0)\), where \(\mathbf{m}_t = \mathrm{vec}(M_t^\top)\) is the SR reshaped into a vector; an evolution process on the parameters, \(p(\mathbf{m}_t | \mathbf{m}_{t-1})\); and a distribution of observed (one-hot) feature vectors given the current parameters and the observed transition. As with earlier work on KTD, we assume a Gaussian model:
\[ \mathbf{m}_0 \sim \mathcal{N}(\mathbf{m}_{0|0}, C_{0|0}), \qquad \mathbf{m}_t | \mathbf{m}_{t-1} \sim \mathcal{N}(\mathbf{m}_{t-1}, C_{v_t}), \qquad \boldsymbol{\phi}(s_t) | \mathbf{m}_t \sim \mathcal{N}\big(M_t(s_t, :) - \gamma M_t(s_{t+1}, :),\, C_{n_t}\big), \]
where \(C_{0|0}\) is the prior covariance between SR matrix entries, \(C_{v_t}\) is the process covariance, describing how the evolution of different parameters covaries, and \(C_{n_t}\) is the observation covariance, describing covariance in the observations. \(C_{0|0}\), \(C_{v_t}\) and \(C_{n_t}\) are set by the practitioner (see Table 1).
The purpose of the Kalman Filter is to infer a posterior distribution over the hidden state \(\mathbf{m}_t\) given the observations \(\boldsymbol{\phi}\):
\[ p(\mathbf{m}_t | \boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_t). \]
Under the Gaussian model described above, this posterior distribution is Gaussian with mean \(\mathbf{m}_{t|t}\) and covariance \(C_{t|t}\), parameters which will be estimated by the Kalman Filter. To set up the filter, we specify an evolution equation describing how the hidden parameters (the SR) evolve over time and an observation equation describing how the observation relates to the hidden parameters. These two equations comprise the state-space formulation for KTD-SR:
\[ \mathbf{m}_t = \mathbf{m}_{t-1} + \mathbf{v}_t, \]
\[ \boldsymbol{\phi}(s_t) = \left[ \big(\boldsymbol{\phi}(s_t) - \gamma \boldsymbol{\phi}(s_{t+1})\big)^\top \otimes I \right] \mathbf{m}_t + \mathbf{n}_t, \]
where \(\mathbf{v}_t\) is the process noise and \(\mathbf{n}_t\) the observation noise, \(\otimes\) denotes the Kronecker product and \(I\) the identity matrix. We will start from the assumption that the process noise is white, which yields the following update equations:
\[ K_t = C_{m\phi_t} C_{\phi_t}^{-1}, \qquad \mathbf{m}_{t|t} = \mathbf{m}_{t|t-1} + K_t \boldsymbol{\delta}_t, \qquad C_{t|t} = C_{t|t-1} - K_t C_{\phi_t} K_t^\top, \]
where \(\boldsymbol{\delta}_t\) is the successor prediction error, \(C_{m\phi_t}\) is the covariance between the parameters and the prediction error, and \(C_{\phi_t}\) is the covariance of the prediction error. The notation \(C_{t|t} = \mathbb{E}[C_t | \boldsymbol{\phi}_1 \ldots \boldsymbol{\phi}_t]\) means that the estimate of the parameter covariance is conditioned on all observations until time \(t\) (see Geist & Pietquin, 2010). Importantly, and in contrast to standard TD updates for the SR (Dayan, 1993), the Kalman gain \(K_t\) is stimulus specific, as it is dependent on the ratio between the covariance in the parameters and the covariance in the observations. This allows for a principled weighting of prior knowledge and incoming data. Note also that the Kalman gain is a matrix, meaning that each entry of the successor representation gets updated with a different gain. See Algorithm 1 for a full description of the method, including how these quantities are computed.
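To make the role of the Kalman gain concrete, the sketch below implements one prediction and correction step for the tabular, one-hot case. It is written under stated assumptions and is not a transcription of Algorithm 1: it assumes that in this linear setting the covariances can be computed in closed form as in a standard linear Kalman filter, that process and observation noise are isotropic (`process_var`, `obs_var`), and that `s_next=None` marks a terminal transition. The function name and the three-state chain are hypothetical.

```python
import numpy as np

def ktd_sr_step(m, C, s, s_next, n, gamma=0.9, process_var=1e-3, obs_var=1.0):
    """One Kalman step on m = vec(M^T), i.e. the rows of M stacked (length n*n)."""
    eye = np.eye(n)
    phi = eye[s]
    # h = phi(s) - gamma * phi(s_next); no successor for terminal transitions (assumption)
    h = phi - (gamma * eye[s_next] if s_next is not None else 0.0)
    H = np.kron(h[None, :], eye)             # (n, n*n): H @ m = M[s] - gamma * M[s_next]

    # Prediction step: random-walk evolution leaves the mean unchanged, inflates covariance
    C_pred = C + process_var * np.eye(n * n)

    # Correction step
    delta = phi - H @ m                      # successor prediction error
    C_m_phi = C_pred @ H.T                   # covariance of parameters with prediction error
    C_phi = H @ C_pred @ H.T + obs_var * eye # covariance of the prediction error
    K = C_m_phi @ np.linalg.inv(C_phi)       # Kalman gain, shape (n*n, n)
    m_new = m + K @ delta
    C_new = C_pred - K @ C_phi @ K.T
    return m_new, C_new, delta

# Two observed transitions on a hypothetical chain 0 -> 1 -> 2.
# Process noise is switched off here only so the numbers are easy to check by hand.
n = 3
m, C = np.zeros(n * n), np.eye(n * n)
m, C, _ = ktd_sr_step(m, C, 0, 1, n, process_var=0.0)
m, C, _ = ktd_sr_step(m, C, 1, 2, n, process_var=0.0)
M = m.reshape(n, n)
print(np.round(M, 3))
```

After the second step, the entry `M[0, 1]` is nonzero even though state 0 was not part of the transition 1 → 2. The first step made rows 0 and 1 covary, so the gain for row 0 is nonzero in the second step. This is the kind of non-local update that a row-wise TD rule cannot produce.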
In summary, we have introduced a method of handling uncertainty over SR estimates. This allows for an efficient combination of prior knowledge and incoming information when updating the SR estimates. Furthermore, it allows us to estimate dependencies between different entries in the SR that inform SR updates. This permits non-local updates which, in the case of KTD for value estimation, have proven to better explain animal behaviour than the strictly local updates of vanilla TD (Gershman, 2015). We explore a possible role for non-local updates in the following section.

Partial Transition Revaluation Simulations
A key prediction of standard TD-SR learning is that "reward revaluation" should be supported while "transition revaluation" should not. Momennejad et al. (2017) tested this in humans. In the first phase of their experiment, participants learned two different sequences of states terminating in different reward amounts: 2→4→6→$1 and 1→3→5→$10 (see Figure 1B). In the next stage, half of the participants were exposed to the transition revaluation condition, observing novel 4→5→$10 and 3→6→$1 transitions. The other half experienced "reward revaluation" in the form of novel reward amounts 6→$10 and 5→$1 (Figure 1A). Importantly, the novel experiences start from intermediate states such that transitions from 1 or 2 are not seen following phase 1. While participants were significantly better at reward revaluation than transition revaluation, they were capable of some transition revaluation as well (Figure 1C). Accordingly, the authors proposed a hybrid SR model: an SR-TD agent that is also endowed with capacity for replaying experienced transitions (Figure 1F). This permits updating of the SR vectors of states 1 and 2 through simulated experience.

Algorithm 1: Kalman TD Successor Representation
Initialization: priors m_{0|0} and C_{0|0};
for t ← 1, 2, ... do
    Observe transition (s_t, s_{t+1});
    Correction step;

Here, we simulate this experiment and find that the probabilistic KTD-SR accounts for partial transition revaluation even without replay (Figure 1D). KTD-SR correctly learns the SR matrix after phase 1 (Figure 1E) as well as an estimate of the covariance between all entries in the SR matrix, C_{t|t}. Unlike TD-SR, KTD-SR uses the covariance matrix to estimate the Kalman gain and uses that gain to update the whole matrix. This means that after seeing 3→6, it updates not just M(3, :) but also M(1, 6), because these entries have historically covaried (the same holds for M(4, :) and M(2, 5)) (Figure 1F). To estimate the direct reward r, the agent uses a Rescorla-Wagner rule (Rescorla & Wagner, 1972). Model parameters are listed in Table 1 and experimental parameters are kept the same as in the original experiment.
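The qualitative effect can be illustrated with a small sketch of the SR part of the transition revaluation condition. It is not the simulation reported here: it omits reward learning and any behavioural readout, and the number of repetitions, the noise variances, the learning rate and the treatment of the final state in each sequence as terminal are placeholder assumptions rather than the values in Table 1. All function names are hypothetical.

```python
import numpy as np

N = 6  # task states 1..6 are stored at indices 0..5

def kalman_sr_step(m, C, s, s_next, gamma, process_var, obs_var):
    """Linear-Gaussian Kalman update of m = vec(M^T) for one observed transition."""
    eye = np.eye(N)
    h = eye[s] - (gamma * eye[s_next] if s_next is not None else 0.0)
    H = np.kron(h[None, :], eye)             # H @ m = M[s] - gamma * M[s_next]
    C_pred = C + process_var * np.eye(N * N) # random-walk prediction step
    delta = eye[s] - H @ m                   # successor prediction error
    C_phi = H @ C_pred @ H.T + obs_var * eye
    K = C_pred @ H.T @ np.linalg.inv(C_phi)  # Kalman gain
    return m + K @ delta, C_pred - K @ C_phi @ K.T

def td_sr_step(M, s, s_next, gamma, alpha):
    """Standard TD-SR update: only the row of the visited state changes (in place)."""
    bootstrap = M[s_next] if s_next is not None else 0.0
    M[s] += alpha * (np.eye(N)[s] + gamma * bootstrap - M[s])
    return M

def run(paths, repeats, m, C, M_td, gamma=0.9, process_var=1e-3, obs_var=1.0, alpha=0.1):
    """Feed every transition of every path to both learners, `repeats` times."""
    for _ in range(repeats):
        for path in paths:
            for s, s_next in zip(path, path[1:] + [None]):
                m, C = kalman_sr_step(m, C, s, s_next, gamma, process_var, obs_var)
                M_td = td_sr_step(M_td, s, s_next, gamma, alpha)
    return m, C, M_td

phase1 = [[0, 2, 4], [1, 3, 5]]  # 1->3->5 and 2->4->6
phase2 = [[2, 5], [3, 4]]        # novel 3->6 and 4->5; states 1 and 2 are never revisited

m, C, M_td = np.zeros(N * N), np.eye(N * N), np.zeros((N, N))
m, C, M_td = run(phase1, 20, m, C, M_td)
M_ktd_phase1 = m.reshape(N, N).copy()
m, C, M_td = run(phase2, 20, m, C, M_td)
M_ktd_phase2 = m.reshape(N, N)

# Entry M(1, 6): prediction of state 6 from state 1
print(M_ktd_phase1[0, 5], M_ktd_phase2[0, 5], M_td[0, 5])
```

In this sketch the TD-SR estimate of M(1, 6) stays at zero after phase 2, because row 1 is never updated, whereas the Kalman estimate becomes positive through the learned covariance between rows 1 and 3. The magnitude of the shift depends on the assumed noise variances and trial counts.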

Discussion
The SR constitutes a middle ground between model-based and model-free RL algorithms by separating reward representations from cached long-run state predictions. Here we learn a probabilistic SR model using KTD that supports principled handling of uncertainty about state predictions and interdependencies between these predictions. We exploit this feature to show that, unlike standard TD-SR, KTD-SR can perform partial transition revaluation. In later work, we plan to test our model on other tasks that could benefit from KTD-SR in a similar way, such as policy revaluation (a well-known weak spot of TD-SR; Barreto, Munos, Schaul, & Silver, 2016).
We note the relative strengths and weaknesses of KTD-SR when compared to a hybrid-MB-SR approach. Replay requires a buffer to store experienced episodes and a sufficient number of replays for information to propagate throughout the SR model. While KTD-SR can incorporate information about long-range dependencies in a single update, it must learn and store a large \(n^2 \times n^2\) matrix (although dimensionality reduction can reduce this burden; Fisher, 1998). There is compelling evidence in favor of both replay (Carr, Jadhav, & Frank, 2011; Olafsdóttir, Bush, & Barry, 2018) and probabilistic representations (Ma, Beck, Latham, & Pouget, 2006) driving behavior. Future work will consider how the relative tradeoffs of these approaches constrain hypotheses.
We make several assumptions in order to make this model tractable. The Gaussian assumption is clearly violated in the case of one-hot state vectors (i.e. neither φ nor M should have negative entries). However, the model is sufficiently expressive that a good approximation can still be found, and a "successor feature" model could be applied over arbitrary features for which the Gaussian assumption might hold. The random walk process noise is useful for capturing slow changes in the environment, but might be ill-suited for step changes or sub-optimal when the dynamics are predictable. While we assume deterministic transitions and linear function approximation in this work, it is straightforward to extend the model to stochastic transitions and nonlinear function approximation with a "coloured noise" approach (Geist & Pietquin, 2010).