Temporal Difference Learning for Recurrent Neural Networks

Truncated back-propagation through time (TBPTT) is one of the most common methods for training artificial recurrent neural networks (RNNs) to perform temporal credit assignment (TCA). There have been various proposed theories on how neural circuits in the brain might approximate the back-propagation algorithm to solve the credit assignment problem in feed-forward neural networks, but it remains unclear how an equivalent of TBPTT could be implemented in the brain. Temporal difference (TD) learning with eligibility traces is a key, general-purpose approach in reinforcement learning for multi-step value prediction problems (a form of TCA). In this work, we apply TD learning with eligibility traces to train RNNs; the approach is biologically plausible in that it encodes errors locally, does not need a separate backward computation, and hence avoids the weight symmetry problem.


Background
Predictive coding (Rao & Ballard, 1999; Friston & Kiebel, 2009), or the brain as a prediction machine, is one of the dominant models in cognitive neuroscience: neuronal dynamics and connectivity are optimized to minimize prediction error. Within this framework, the brain constructs internal models of the world and performs constant inference to predict what happens next; this prediction error is a core signal, a global objective with respect to which synaptic weight modifications are performed. These internal models are essentially dynamical systems that allow the brain to perform multi-step predictions rather than just single-step predictions. RNNs are among the simplest connectionist models of dynamical systems with feedback loops and are potentially closer to biological neuronal systems than feed-forward networks. Hence, in this work, we use RNNs to learn internal world models, which would also enable deliberate planning and model-based reinforcement learning (RL).

Credit assignment in the brain:
The back-propagation of error algorithm (BP) is the fundamental credit assignment mechanism in artificial neural networks, but it cannot be implemented as-is in a real brain. There are, however, many BP-inspired proposals (Lee, Zhang, Fischer, & Bengio, 2015), re-circulation, and related methods such as contrastive divergence for understanding how the brain might approximate BP, including a scalability evaluation of such approaches (Bartunov et al., 2018). (Whittington & Bogacz, 2019) summarize many of the recently proposed theories on how neural circuits in the brain might approximate BP; they focus mainly on temporal error (difference) models such as equilibrium propagation (Scellier & Bengio, 2017), which encode error (locally) in the difference between activity phases and account for spike-timing-dependent plasticity (STDP). Another class of models are explicit error models, which include predictive coding and dendritic error models (Richards & Lillicrap, 2019) inspired by the properties of pyramidal neurons. Most of these approaches, however, focus on energy-based models that compute gradients numerically, unlike back-propagation, which computes analytical gradients.
Apart from equilibrium propagation, there has been little work on biologically plausible TBPTT, and it is unclear how the brain might implement it. (Lillicrap & Santoro, 2019) point out that, compared to BP, there is even less conviction about whether and how BPTT could be implemented in the brain; they focus on the broader TCA problem and suggest the use of attention- and memory-based architectures for TCA.
Though we believe memory is a necessary component of the overall system (as suggested by (Kumaran, Hassabis, & McClelland, 2016)), we hold that much of the prefrontal cortex (PFC) is a hierarchical recurrent neural network with attention, effectively making it a complex dynamical system, and hence credit assignment, and TCA in particular, is needed for this network.
Dopamine for sensory prediction errors: Dopamine neurons have long been thought to report reward prediction errors (RPE), a hypothesis with several lines of evidence, but there is also evidence that dopamine neurons respond to novel stimuli. (Gardner, Schoenbaum, & Gershman, 2018) propose extending dopaminergic modulation to sensory prediction errors (SPE), and it has been pointed out that it is unclear exactly what dopamine contributes to model-based learning, because prediction errors do not require TD errors. They apply TD learning to the successor representation in RL, which carries more information than model-free RL but cannot support the deliberate planning that is possible in the model-based framework. In this work, we learn internal predictive models of the world in recurrent neural networks and apply TD learning with eligibility traces for TCA; this supports the hypothesis that dopamine neurons may contribute to SPEs and model-based RL, and it is also consistent with the three-factor learning rule.
Temporal difference learning: (Sutton & Barto, 2018; Sutton, 1988) Temporal error models encode errors in differences in neural activity across time, and that is exactly how temporal difference (TD) methods work as well. Conventional (explicit-error) prediction learning methods assign credit by means of the difference between predicted and actual observations; TD methods instead assign credit by means of the difference between temporally successive predictions, which allows incremental and also local learning. In multi-step prediction problems, partial information relevant to the correctness of a prediction is revealed at each step, and TD uses this information. The key idea of TD(0) is to use the immediately following prediction as the target for the current prediction.
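As a minimal illustration of this distinction (our own sketch, not taken from the original TD formulation), the update below moves each prediction toward its temporal successor rather than toward the final outcome; the learning rate and prediction values are arbitrary:

```python
import numpy as np

# TD(0)-style update for a multi-step prediction problem: each prediction is
# moved toward the *next* prediction (its TD target) instead of waiting for
# the final observed outcome, so the error signal is local in time.
alpha = 0.1                                   # learning rate (assumed)
predictions = np.array([0.2, 0.5, 0.4, 0.9])  # p_1 .. p_T (illustrative values)
outcome = 1.0                                 # revealed only at the end

targets = np.append(predictions[1:], outcome)   # TD target = successor prediction
predictions += alpha * (targets - predictions)  # incremental, local update
```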

Multi-step prediction setup
Let us consider a multi-step prediction problem in which experience comes in observation sequences of the form s_1, s_2, …, s_T. For any fixed 'k' (k > n), for each observation the RNN model has corresponding predictions of future states ŝ^j_{t+k}, where 'k' is the future time-step at which we are predicting the observation, 'j' is the number of prediction-dependent transitions (see below; for true observations j = 0), and 'n' is the number of steps involved in the temporal error calculation (the eligibility trace). So any observation trajectory is represented by {s_t}, t = 1, …, T, and the corresponding predictions for any time-step by Ŝ_t. A prediction-dependent transition is ŝ^{j+1}_{t+1} = f(ŝ^j_t), and an observation-dependent transition is ŝ^1_{t+1} = f(s_t).

Multi-step prediction or TD(1) Loss: This differs slightly from the traditional Monte Carlo or TD(1) update applied in RL, where we only have access to the target at the end of each episode; here, similar to (Venkatraman, Hebert, & Bagnell, 2015) and (Bengio, Vinyals, Jaitly, & Shazeer, 2015), we have access to the target at each step of the multi-step prediction.
This can be considered as the case where λ = 1 in the TD(λ) setting.
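A minimal PyTorch sketch of this multi-step loss is given below; the architecture and names (cell, readout, k) are our assumptions for illustration, not the exact implementation:

```python
import torch
import torch.nn as nn

# Multi-step / TD(1) loss sketch: roll the RNN forward on its own predictions
# (prediction-dependent transitions) and compare every intermediate prediction
# against the true observation, which is available at each step here.
cell, readout = nn.RNNCell(2, 300), nn.Linear(300, 2)

def td1_loss(obs, h, k):
    """obs: (T, 2) true observations; h: (1, 300) hidden state; k: rollout length."""
    loss, s_hat = 0.0, obs[0].unsqueeze(0)
    for j in range(1, k + 1):
        h = cell(s_hat, h)              # prediction-dependent transition
        s_hat = readout(h)
        loss = loss + (s_hat - obs[j]).pow(2).sum()  # target available every step
    return loss / k
```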
Temporal Difference or TD(0) Loss: The temporal error model or TD(0) setup is exactly analogous to the traditional setup in RL, except that instead of value prediction we predict the next or future observations given the current observation. This can be considered as the case where λ = 0 in the TD(λ) setting.
The red arrow in fig(3) represents a TD error in which a 3-step rollout acts as the TD target for a 4-step rollout; the difference between h^3_{t+5} and h^4_{t+5} is the temporal difference (TD), or temporal error, for each hidden state (neuron) in the RNN, and hence this is a local learning rule.
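In code, this local temporal error can be sketched as follows; h_long and h_short stand for the hidden states of the 4-step and 3-step rollouts at the same time-step (our naming), and detach() prevents gradients from flowing into the bootstrapped target, as is standard in TD methods:

```python
import torch

# TD(0)-style loss between two rollouts of different depth at the same
# time-step: the shorter (more observation-grounded) rollout is treated as a
# fixed TD target for the longer one. The error is per hidden unit, so the
# resulting learning rule is local to each neuron.
def td0_loss(h_long: torch.Tensor, h_short: torch.Tensor) -> torch.Tensor:
    return (h_long - h_short.detach()).pow(2).sum()
```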
Training objective: Combining TD(0) and TD(1). For simplicity, we consider a special case of combining returns (TD targets) in the TD(λ) setting where we average over λ ∈ (0, 1), and we do not consider other values of 'n' in the n-step TD(λ) setting; this can also be interpreted as a one-step TD regularizer added to the traditional multi-step objective. γ is the discount factor for recency. Please note that the value of 'n' is the length of the eligibility trace, which enables us to solve TCA.
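Concretely, with λ = 0.5 (the value used in our experiments below) the combined objective can be sketched as a λ-weighted mixture of the two losses, with γ discounting the more distant steps of the trace; how γ enters the sum and the per-step loss lists are our assumptions for illustration:

```python
# Combined objective sketch: lam weights the multi-step TD(1) terms against
# the one-step TD(0) regularizer, and gamma discounts later steps of the
# n-step eligibility trace for recency.
lam, gamma = 0.5, 0.9  # gamma value is an assumption

def combined_loss(td1_terms, td0_terms):
    """td1_terms[j], td0_terms[j]: losses at rollout step j = 0..n-1."""
    l1 = sum(gamma ** j * t for j, t in enumerate(td1_terms))
    l0 = sum(gamma ** j * t for j, t in enumerate(td0_terms))
    return lam * l1 + (1 - lam) * l0
```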

Experiments
We consider a simple task in which a single-layer RNN with 300 nodes is trained to learn the dynamics of a simple pendulum (2-dimensional): the input to the RNN is the position and velocity of the pendulum, and each trajectory/episode is 1500 time-steps long. In all our experiments, during both training and evaluation, we use a warm-start period of 50 time-steps in which the RNN has access to true observations at each time-step and learning is not applied.

An important point to note is that we use back-propagation to train the RNN from input to output per time-step; that is, although we do not perform BPTT and instead use TD learning as the framework for TCA, we still use BP to update the parameters of the RNN with respect to the overall objective. We allow the gradient to back-propagate only one time-step by detaching all previous variables in the PyTorch framework (as illustrated by the red line in fig(3)); a sketch of this training loop is given below. The blue and green arrows in fig(3) are explicit (supervised) errors for each hidden node (in our experiments we directly use the gradient back-propagated from the output layer to account for explicit errors). We use λ = 0.5 for all experiments.

Evaluations reported in Table(1): We evaluate the RNN in two different multi-step prediction settings. The first is the traditional multi-step setting, where the RNN has access to all observations up to the current state and we then evaluate prediction accuracy for different step-lengths, varying from the traditional one-step prediction to 50-step prediction; these are reported under 'length of prediction' in Table(1). The second set of evaluations increases the complexity of the state estimation problem: the RNN has access to the true observation only at some frequency, so if 'input sampling frequency' = 1 it is traditional one-step prediction, but if the sampling frequency is 5 the RNN sees a true observation only once every 5 steps and hence makes 1-step, 2-step, 3-step, 4-step, and 5-step predictions. This set of evaluations can be considered the geometric mean over different step-lengths in multi-step prediction. We vary the sampling frequency from 1 to 100; in the extreme case of 100, the RNN has access to only 15 observations in the entire 1500 time-step trajectory after warm-start. 'k1' in Table(1) for TBPTT denotes the number of time-steps the gradient is back-propagated. We would like to point out that not much hyper-parameter tuning was done with respect to λ, γ, etc.
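The following sketch shows the one-step gradient mechanism described above (warm start, per-time-step BP, detached hidden state); the optimizer, loss terms, and shapes are our assumptions, and the TD terms from the previous section would be added to the per-step loss:

```python
import torch
import torch.nn as nn

# Training-loop sketch: the hidden state is detached at every step, so BP
# covers only one time-step (no BPTT); TCA is handled by the TD losses.
cell, readout = nn.RNNCell(2, 300), nn.Linear(300, 2)
opt = torch.optim.Adam(list(cell.parameters()) + list(readout.parameters()))

def train_episode(trajectory, warm_start=50):
    """trajectory: (1500, 2) tensor of pendulum position and velocity."""
    h = torch.zeros(1, 300)
    for t in range(len(trajectory) - 1):
        h = cell(trajectory[t].unsqueeze(0), h.detach())  # one-step gradient only
        s_hat = readout(h)
        if t >= warm_start:  # no learning during the warm-start period
            loss = (s_hat - trajectory[t + 1]).pow(2).sum()  # + TD terms
            opt.zero_grad()
            loss.backward()
            opt.step()
```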

Summary of results:
Here we focus on long-term prediction, where the sampling frequency is low; there also seems to be an inherent trade-off between short-term and long-term prediction accuracy.
• n-step (n = 5) performs better than n = 1 for both the combined loss and TD(1), providing evidence that the eligibility trace improves TCA.
• TBPTT with scheduled sampling is the best-performing model, and is also the gold standard, but for very long-term prediction or very low sampling frequency, TD(0)+TD(1) works better, potentially because of the self-consistency and variance-reduction properties of TD methods.
• The proposed approach obtains an analytical gradient rather than a numerical gradient, unlike most previous work, which was based on energy-based models.

Discussion and Future work
In this work, we applied TD learning for TCA in a generic prediction problem with a single-layer RNN on a simple domain. Because the proposed approach encodes error locally and does not require a backward pass in time, it can support stochastic communication. We would like to extend this framework to hierarchical RNNs and evaluate it on more complex domains to assess the scalability of the method. It is also important to understand and analyze the computational and theoretical implications of the proposed approach. An important problem with learned computational models of the world is poor long-term prediction, and we believe this could be a useful step in that direction.

There is a need for energy-efficient hardware for deep learning, and neuromorphic computing with spiking neural networks is one of the promising directions, but currently we cannot implement BP on those platforms because BP needs end-to-end gradient computation; research on biologically plausible approximations of BP may therefore enable ubiquitous use of such hardware.