Neural Networks

Volume 16, Issue 1, January 2003, Pages 5-9

Neural Networks letter
Meta-learning in Reinforcement Learning

https://doi.org/10.1016/S0893-6080(02)00228-9

Abstract

Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and to the animal's performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values and controls their time course in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.

Introduction

Reinforcement learning is a particularly attractive model of animal learning, as dopamine neurons have the formal characteristics of the teaching signal known as the temporal difference (TD) error (Schultz, 1998). However, crucial to successful reinforcement learning is the careful setting of several meta-parameters. We earlier proposed that these meta-parameters are set and adjusted by neuromodulator neurons (Doya, 2002). We introduce here a simple, yet robust, algorithm that not only finds appropriate meta-parameters, but also controls the time course of these meta-parameters in a dynamic, adaptive manner.

The main issue in the theory of reinforcement learning (RL) is to maximize the long-term cumulative reward. Thus, central to reinforcement learning is the estimation of the value function
V(x(t)) = E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k+1)\right],
where r(t), r(t+1), r(t+2), … are the rewards acquired by following a certain action policy starting from x(t), and γ is a discount factor such that 0 < γ < 1. The value function for the states before and after a transition should satisfy the consistency equation
V(x(t-1)) = E\left[r(t) + \gamma V(x(t))\right].
Therefore, any deviation from the consistency equation, expressed as
\delta(t) = r(t) + \gamma V(x(t)) - V(x(t-1)),
should be zero on average. This signal is the TD error and is used as the teaching signal to learn the value function:
\Delta V(x(t-1)) = \alpha\,\delta(t),
where α is the learning rate.
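For concreteness, a minimal tabular TD(0) sketch of these update equations is given below (our own illustration in Python; the function name, state encoding, and parameter values are assumptions, not taken from the article):

```python
import numpy as np

def td0_update(V, x_prev, x, r, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update of the value function V (a 1-D numpy array).

    x_prev, x : integer indices of the previous and current states
    r         : reward r(t) received on the transition
    alpha     : learning rate
    gamma     : discount factor, 0 < gamma < 1
    """
    # TD error: delta(t) = r(t) + gamma * V(x(t)) - V(x(t-1))
    delta = r + gamma * V[x] - V[x_prev]
    # Value update: Delta V(x(t-1)) = alpha * delta(t)
    V[x_prev] += alpha * delta
    return delta
```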

The policy is usually defined via the action value function Q(x(t),a), which represents how much future reward the agent would get by taking action a at state x(t) and following the current policy in subsequent steps. One common way of stochastic action selection that allows exploration is to compute the probability of taking an action by the soft-max function
P(a \mid x(t)) = \frac{e^{\beta Q(x(t),a)}}{\sum_{a'=1}^{M} e^{\beta Q(x(t),a')}},
where the meta-parameter β is called the inverse temperature.
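A minimal sketch of soft-max action selection, assuming the action values for the current state are stored in a 1-D array (the function names and the numerical-stability shift are our own, not from the article):

```python
import numpy as np

def softmax_policy(Q_x, beta):
    """Soft-max probabilities P(a|x) = exp(beta*Q(x,a)) / sum_a' exp(beta*Q(x,a')).

    Q_x  : 1-D numpy array of action values Q(x, a) for the current state x
    beta : inverse temperature; small beta gives near-uniform (exploratory)
           choices, large beta gives near-greedy (exploitative) choices
    """
    z = beta * Q_x
    z = z - z.max()          # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_action(Q_x, beta, rng=np.random.default_rng()):
    """Sample an action index from the soft-max distribution."""
    return rng.choice(len(Q_x), p=softmax_policy(Q_x, beta))
```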

Crucial to successful reinforcement learning is the careful setting of the three meta-parameters α,β and γ.

  • The learning rate α is key to maximizing the speed of learning, as small learning rates induce slow learning, and large learning rates induce oscillations.

  • The inverse temperature β controls the exploitation–exploration trade-off. Ideally, β should initially be low to allow large exploration when the agent does not have a good mapping of which actions will be rewarding, and gradually increase as the agent reaps higher and higher rewards.

  • The discount factor γ specifies how far in the future rewards should be taken into account. If γ is small, the agent learns to behave only for short-term rewards. Although setting γ close to 1 encourages the agent to learn to act for long-term rewards, there are several reasons why γ should not be set too large. First, any real learning agent, either artificial or biological, has a limited lifetime: a discounted value function is equivalent to a non-discounted value function for an agent with a constant death rate of 1−γ (see the sketch after this list). Second, an agent has to acquire some rewards in time; for instance, an animal must find food before it starves, and a robot must recharge its battery before it is exhausted. Third, if the environmental dynamics is highly stochastic or non-stationary, long-term prediction is doomed to be unreliable. Finally, the complexity of learning a value function increases with 1/(1−γ) (Littman et al., 1995).
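As a brief worked illustration of the first point above (our own step, assuming the agent's random lifetime T is independent of the reward sequence): if the agent survives each step with probability γ, so that it is still alive at step k with probability γ^k, its expected undiscounted return over its lifetime equals the discounted value function:

```latex
% Discounting with factor gamma is equivalent to no discounting for an agent
% that dies at each step with probability 1 - gamma, i.e. Pr(alive at step k)
% = Pr(T > k) = gamma^k, with lifetime T independent of the rewards.
\begin{align*}
E\!\left[\sum_{k=0}^{T-1} r(t+k+1)\right]
  &= \sum_{k=0}^{\infty} \Pr(T > k)\, E\!\left[r(t+k+1)\right] \\
  &= \sum_{k=0}^{\infty} \gamma^{k}\, E\!\left[r(t+k+1)\right]
   = V(x(t)).
\end{align*}
```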

In many applications, the meta-parameters are hand-tuned by the experimenter, and heuristics are often devised to schedule their values as learning progresses. Typically, α is set to decay as the inverse of time, and β to increase linearly with time. However, such fixed schedules cannot deal with non-stationary environments.
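For illustration, such hand-tuned schedules might look like the following sketch (the functional forms follow the text; the constants and function names are arbitrary assumptions):

```python
def alpha_schedule(t, alpha0=1.0):
    """Typical heuristic: the learning rate decays as the inverse of time."""
    return alpha0 / (1.0 + t)

def beta_schedule(t, beta0=0.1, slope=0.01):
    """Typical heuristic: the inverse temperature increases linearly with time."""
    return beta0 + slope * t
```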

Several robust methods to dynamically adjust meta-parameters have been proposed. We (Schweighofer & Arbib, 1998) proposed a biological implementation of the IDBD (incremental delta bar delta) algorithm (Sutton, 1992) to tune α. The model improves learning performance by automatically setting near-optimal learning rates for each synapse. To direct exploration, Thrun (1992) used a competence network trained to estimate the utility of exploration. Ishii et al. (2002) proposed to tune β directly as a function of the inverse of the variance of the action value function. However, although these algorithms are effective in balancing exploration and exploitation, they are specific to individual meta-parameters and often impose severe computation and memory requirements.

Section snippets

Learning meta-parameters with stochastic gradients

Gullapalli (1990) introduced the Stochastic Real Value Units (SRV) algorithm. An SRV unit produces its output by adding, to the weighted sum of its input pattern, a small perturbation that provides the unit with the variability necessary to explore its activity space. When the action increases the reward, the unit's weights are adjusted such that the output moves in the direction in which it was perturbed.
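A minimal sketch of this core SRV idea, under our own simplifying assumptions (Gaussian exploration noise, an externally supplied reward baseline, and illustrative step sizes; this is not Gullapalli's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def srv_output(w, x, sigma=0.1):
    """Emit a perturbed output: weighted sum of the inputs plus exploration noise."""
    noise = sigma * rng.standard_normal()
    return w @ x + noise, noise

def srv_style_update(w, x, noise, reward, baseline, eta=0.05, sigma=0.1):
    """If the perturbed output earned more reward than the baseline, move the
    weights so that the mean output shifts in the direction of that perturbation."""
    if reward > baseline:
        w = w + eta * (reward - baseline) * (noise / sigma) * x
    return w
```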

Here, we expand the idea of SRV units and take it one level higher: that is, we propose

Discrete space and time

We tested our algorithm on a simulation of a Markov decision problem (MDP) task. In the experiment (Tanaka et al., 2002), on each step, one of four visual stimuli was presented on the screen and the subject was asked to press one of two buttons—one leading to a positive reward, the other to a negative reward (a loss). The subject's button press determined both the immediate reward and the next stimulus. The state transition and the amount of reward were such that a button with less reward had

Discussion

We proposed a simple, yet robust and generic, algorithm that not only finds near-optimal meta-parameters, but also controls their time course in a dynamic, adaptive manner.

Since we introduced five extra parameters, namely two global parameters, τ1 and τ2, and three parameters specific to each neuromodulator neuron, μ, ν, and n, do we need meta-meta-learning algorithms, or even higher-order learning mechanisms, to tune these parameters? Note that we have actually

Acknowledgements

We thank Etienne Burdet, Stephan Schaal, and Fredrik Bissmarck for comments on a first draft. This work was supported by CREST, JST.


1 Current address: ATR Human Information Science Laboratories.
