Continual Reinforcement Learning with Multi-Timescale Successor Features

Learning and memory consolidation in the brain occur over multiple timescales. Inspired by this observation, it has been shown that catastrophic forgetting in reinforcement learning (RL) agents can be mitigated by consolidating Q-value function parameters at multiple timescales. In this work, we combine this approach with successor features, and show that by consolidating successor features and preferences learned over multiple timescales we can further mitigate catastrophic forgetting. In particular, we show that agents trained with this approach rapidly recall previously rewarding sites in large environments, whereas those trained without this decomposition and consolidation mechanism do not. These results therefore contribute to our understanding of the functional role of synaptic plasticity and memory systems operating at multiple timescales, and demonstrate that RL can be improved by capturing features of biological memory with greater fidelity.


Introduction
Biological agents, such as primates, are capable of learning and adapting over the course of their lifetime, perhaps driven by preferences or rewards. For example, when they first discover a new location of food, they are able to bring the food back home to feed their families and to return to the food site in the coming days. This efficient foraging behaviour is vital for survival and requires them to quickly remember previously well-traversed routes.
Arguably, another key neural substrate that allows an agent to perform these tasks successfully is place cells. Several theories about the encoding of place cells have been proposed, one of which suggests that place cells encode a predictive representation of future states given the current state (Stachenfeld, Botvinick, & Gershman, 2017) in the form of successor representations (SRs) (Dayan, 1993).
In RL terms, this allows the value function of a given state s to be decomposed into the inner product of the SRs and the reward function: v(s) = ∑_{s′} M(s, s′) R(s′), where M(s, s′) denotes the SR, which encodes the expected occupancy of state s′ from state s along a trajectory, and R(s′) is the reward of state s′. We can further extend the idea of SRs to consider both state and action spaces, which gives rise to successor features (SFs) ψ(s, a) (Barreto et al., 2017). The Q-value of a state-action pair can then be computed from the SFs and a preference (prefs) vector w: Q(s, a) = ψ(s, a)·w (Barreto, Hou, Borsa, Silver, & Precup, 2020), and the optimal value function of a given state s is simply V*(s) = max_a Q(s, a). One can think of the SFs as the discounted sum of future features observed and the preference vector w as an indication of which state or task the agent should pursue.
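As a concrete illustration, the following minimal sketch (in Python with NumPy; all array names and random values are purely illustrative) evaluates these decompositions in a small tabular setting:

```python
import numpy as np

n_states, n_actions, n_features = 4, 2, 4

# Successor representation M(s, s') and per-state reward R(s')
M = np.random.rand(n_states, n_states)   # expected discounted occupancy of s' from s
R = np.random.rand(n_states)             # reward of each state s'
v = M @ R                                # v(s) = sum_{s'} M(s, s') R(s')

# Successor features psi(s, a) and a preference vector w
psi = np.random.rand(n_states, n_actions, n_features)
w = np.random.rand(n_features)
Q = psi @ w                              # Q(s, a) = psi(s, a) . w
V_star = Q.max(axis=1)                   # V*(s) = max_a Q(s, a)
```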

Method
The main inspiration behind our model is a biologically plausible synaptic plasticity model (Benna & Fusi, 2016) (Fig. 1 (a & b)), which we adapt for learning successor features and preferences (Fig. 1(c)). At an abstract level, the model initially stores memories in a fast variable (u_1) and then progressively transfers them to slower variables u_2 and u_3, allowing it to preserve memories over a longer period of time. The dynamical variables u_1, u_2, u_3 (beakers in Fig. 1(a)) represent different biochemical processes occurring over a large range of timescales, with u_1 representing the strength of the synaptic weight. These dynamical variables can communicate only with their two nearest neighbours, with the exception of the first and last variables (u_1 and u_3). The terms g_{1,2}, g_{2,3}, g_{3,4} (tubes connecting the beakers in Fig. 1(a)) define the strength of the bidirectional communication between the dynamical variables. The terms C_1, C_2, C_3 correspond to the sizes of the beakers and, together with g_{1,2}, g_{2,3}, g_{3,4}, they determine the timescales on which the processes operate. The dynamics of each variable u_k are defined as C_k du_k/dt = g_{k−1,k}(u_{k−1} − u_k) + g_{k,k+1}(u_{k+1} − u_k). For the last beaker u_3, the term u_4 is set to 0.
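A minimal sketch of one discretised update of these dynamics is given below, assuming a simple Euler step; the function and variable names are ours and are only meant to illustrate the chain-of-beakers structure, not to reproduce the exact implementation.

```python
import numpy as np

def consolidate(u, g, C, external_input=0.0, dt=1.0):
    """One Euler step of the chain-of-beakers dynamics (illustrative sketch).

    u: beaker levels [u_1, ..., u_K], with u_1 the visible synaptic weight
    g: couplings [g_{1,2}, ..., g_{K,K+1}] between neighbouring beakers
    C: beaker sizes [C_1, ..., C_K]; together with g they set the timescales
    external_input: learning signal driving the first (fastest) beaker
    """
    u_ext = np.append(np.asarray(u, dtype=float), 0.0)  # virtual beaker u_{K+1} = 0
    du = np.zeros(len(u))
    for k in range(len(u)):
        inflow = g[k - 1] * (u_ext[k - 1] - u_ext[k]) if k > 0 else external_input
        outflow = g[k] * (u_ext[k + 1] - u_ext[k])
        du[k] = dt * (inflow + outflow) / C[k]
    return u_ext[:-1] + du
```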
A previous approach showed some success in applying this consolidation mechanism to Q-values (Kaplanis, Shanahan, & Clopath, 2018); we refer to this model as Q-learning with consolidation in our experiments and compare against it, along with the other baselines and ablations listed in the caption of Fig. 2.
However, in our model, we attempt to capture the features of biological memory with greater fidelity by applying this consolidation approach to learn the successor features SF_{u_k} and preferences w_{u_k} (Fig. 1(c)). We allow the dynamics of SF_{u_k} and w_{u_k} to evolve in a similar fashion. To compute the Q-value for a given state-action pair, we evaluate only the information from the shallowest beakers, SF_{u_1} and w_{u_1}: Q(s, a) = SF_{u_1}(s, a)·w_{u_1}. An ε-greedy policy is used during training, with ε set to 0.05 to prioritise exploitation.
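The action-selection step can be sketched as follows, again under the assumption that only the first-beaker quantities enter the readout; the array shapes and names are illustrative rather than taken from our implementation.

```python
import numpy as np

def act_epsilon_greedy(SF_u1, w_u1, state, epsilon=0.05, rng=None):
    """Choose an action from the shallowest-beaker SFs and preferences (sketch).

    SF_u1: array of shape [n_states, n_actions, n_features] (first beaker)
    w_u1:  preference vector of shape [n_features] (first beaker)
    """
    rng = rng or np.random.default_rng()
    q = SF_u1[state] @ w_u1                 # Q(s, a) = SF_{u_1}(s, a) . w_{u_1}
    if rng.random() < epsilon:              # small epsilon -> mostly exploitation
        return int(rng.integers(q.shape[0]))
    return int(np.argmax(q))
```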

Experimental Results
We tested our model in a 10 × 10 gridworld with different tasks and environment dynamics (see Fig. 2). The actions consist of moving up, down, left or right. For each environment, the agent is trained using 5 random seeds. An episode terminates when the goal location is reached or when the maximum number of steps has been taken. The agent receives a reward of +1 when it reaches the goal location and 0 otherwise. To obtain a continual RL setting, the task switches every 1 million steps (one epoch). For each environment, the agent is trained for a total of 12 epochs, and we track its performance using the total return per epoch. An ideal continual RL agent should relearn previously seen goal locations quickly, obtaining high returns despite the task switches.
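For concreteness, the task-switching schedule described above can be sketched as follows; the goal coordinates are hypothetical placeholders, not the exact locations used in our environments.

```python
STEPS_PER_EPOCH = 1_000_000   # the task switches every 1M steps (one epoch)
N_EPOCHS = 12                 # total training length per environment
GOALS = [(0, 0), (9, 9)]      # e.g. alternating corners in a 10 x 10 grid (illustrative)

def goal_for_step(step):
    """Return the goal that is active at a given environment step."""
    epoch = step // STEPS_PER_EPOCH
    return GOALS[epoch % len(GOALS)]
```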
In our experiments, the agent that applies the consolidation mechanism to both the successor features and the preferences exhibits this capability of rapid relearning better than agents that either do not use the consolidation mechanism at all (blue curves in Fig. 2) or apply the consolidation mechanism only to Q-values (orange curves in Fig. 2), in both tabular gridworld settings (Fig. 2).

Discussion and Future work
Currently, we are evaluating our models on more challenging environments which use pixel images as observations. To learn these high-dimensional representations, we can use artificial neural networks to approximate the successor features ψ(s, a; θ), where θ denotes the parameters of the network. The preliminary results are promising, and we hope to present them at the conference.
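As a rough illustration, one possible form of such an approximator, assuming a PyTorch implementation with purely illustrative layer sizes, is sketched below; it is a generic convolutional encoder rather than a description of the architecture used in our preliminary experiments.

```python
import torch
import torch.nn as nn

class SFNetwork(nn.Module):
    """Sketch of psi(s, a; theta): a convolutional encoder followed by a head
    that outputs one feature vector per action (all sizes are illustrative)."""

    def __init__(self, n_actions, n_features, in_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        self.head = nn.Linear(256, n_actions * n_features)
        self.n_actions, self.n_features = n_actions, n_features

    def forward(self, obs):
        # obs: [batch, channels, height, width] pixel observation
        h = self.encoder(obs)
        psi = self.head(h).view(-1, self.n_actions, self.n_features)
        return psi  # Q(s, a) = psi(s, a) . w for a learned preference vector w
```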

Figure 1: Schematic of the synaptic plasticity model adapted from (Benna & Fusi, 2016) (a & b). The beakers u_k represent biochemical processes, with communication (liquid) g_{k−1,k} flowing bidirectionally between them to simulate synaptic plasticity at different timescales C_k. (c) The model adapted to learn successor features (SFs) and preferences w at different timescales.

Figure 2: Results from two different tabular gridworld environments, Env 1 and Env 2. We perform two different tasks in each environment (Env). Top: In Env 1, the environment has no obstacles (walls), and the tasks involve alternating the goal location (red square) between the top left corner and the bottom right corner at the end of each epoch. Bottom: In Env 2, the environment contains walls (black squares); the tasks involve alternating the goal location between the left and the right side of the environment, where the goals are now surrounded by walls. For each environment, we consider the following models as baselines: Q-learning (blue curve), Q-learning with consolidation (orange curve), and successor features (SF) without consolidation but with given preferences (prefs) (dashed curve). Our model uses the consolidation mechanism for both the SF and the prefs (red curve). As an ablation study, we also consider a model in which neither the SF nor the prefs use the consolidation mechanism (green curve) and a model in which only the SF uses the consolidation mechanism (brown curve). Our results show that applying the consolidation mechanism to both the SF and the prefs leads to better overall performance of the RL agent.
Acknowledgements

Learning in Machines & Brains Fellowship). Doina Precup is supported by CIFAR (Canada AI Chair; Learning in Machines & Brains Fellowship). Computations for the experiments were made on the supercomputers Béluga and Graham, managed by Calcul Québec and the Digital Research Alliance of Canada. The operation of these supercomputers is funded by the Canada Foundation for Innovation (CFI), the Ministère de l'Économie et de l'Innovation du Québec (MEI) and le Fonds de recherche du Québec (FRQ).