Neural Networks letter

Meta-learning in Reinforcement Learning
Introduction
Reinforcement learning is a particularly attractive model of animal learning, as dopamine neurons have the formal characteristics of the teaching signal known as the temporal difference (TD) error (Schultz, 1998). However, crucial to successful reinforcement learning is the careful setting of several meta-parameters. We earlier proposed that these meta-parameters are set and adjusted by neuromodulator neurons (Doya, 2002). We introduce here a simple, yet robust, algorithm that not only finds appropriate meta-parameters, but also controls the time course of these meta-parameters in a dynamic, adaptive manner.
The main issue in the theory of reinforcement learning (RL) is to maximize the long-term cumulative reward. Thus, central to reinforcement learning is the estimation of the value function

V(x(t)) = E[ r(t) + γ r(t+1) + γ² r(t+2) + ⋯ ],

where r(t), r(t+1), r(t+2), … are the rewards acquired by following a certain action policy x → a starting from x(t), and γ is a discount factor such that 0 < γ < 1. The value function for the states before and after a transition should satisfy the consistency equation

V(x(t)) = E[ r(t) + γ V(x(t+1)) ].

Therefore, any deviation from the consistency equation, expressed as

δ(t) = r(t) + γ V(x(t+1)) − V(x(t)),

should be zero on average. This signal is the TD error and is used as the teaching signal to learn the value function:

V(x(t)) ← V(x(t)) + α δ(t),

where α is a learning rate.
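As a minimal illustration of these update rules (our own sketch in Python, not code from the paper), a tabular TD(0) learner computes the TD error δ(t) and nudges the value table toward the bootstrapped target:

```python
import numpy as np

def td_update(V, x, x_next, r, alpha=0.1, gamma=0.9):
    """One TD(0) update of the value table V (a numpy array indexed by state)."""
    delta = r + gamma * V[x_next] - V[x]   # deviation from the consistency equation
    V[x] += alpha * delta                  # move V(x) toward the bootstrapped target
    return delta

V = np.zeros(3)
# Repeatedly observing a reward of 1 on the transition 0 -> 1 drives V[0] toward 1.
for _ in range(100):
    td_update(V, 0, 1, 1.0)
```

Averaged over experience, the update drives the TD error toward zero, which is exactly the consistency condition above.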
The policy is usually defined via the action value function Q(x(t),a), which represents how much future reward the agent would get by taking the action a at state x(t) and following the current policy in subsequent steps. One common way of stochastic action selection that balances exploration and exploitation is to compute the probability of taking an action by the soft-max function

P(a | x(t)) = exp(β Q(x(t),a)) / Σ_b exp(β Q(x(t),b)),

where the meta-parameter β is called the inverse temperature.
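A soft-max policy of this form can be sketched as follows (an illustrative implementation; the function name and the max-subtraction trick for numerical stability are ours):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Return action probabilities P(a) proportional to exp(beta * Q(a))."""
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

p_explore = softmax_policy([1.0, 2.0], beta=0.1)   # nearly uniform choice
p_exploit = softmax_policy([1.0, 2.0], beta=10.0)  # nearly greedy choice
```

A small β yields near-uniform (exploratory) action choice, while a large β concentrates almost all probability on the greedy action.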
Crucial to successful reinforcement learning is the careful setting of the three meta-parameters α, β, and γ.
- The learning rate α is key to maximizing the speed of learning: small learning rates induce slow learning, and large learning rates induce oscillations.
- The inverse temperature β controls the exploitation–exploration trade-off. Ideally, β should initially be low to allow broad exploration while the agent does not yet have a good mapping of which actions will be rewarding, and gradually increase as the agent reaps higher and higher rewards.
- The discount factor γ specifies how far in the future rewards should be taken into account. If γ is small, the agent learns to behave only for short-term rewards. Although setting γ close to 1 encourages the agent to learn to act for long-term rewards, there are several reasons γ should not be set too large. First, any real learning agent, either artificial or biological, has a limited lifetime: a discounted value function is equivalent to a non-discounted value function for an agent with a constant death rate of 1−γ. Second, an agent has to acquire some rewards in time; for instance, an animal must find food before it starves, and a robot must recharge its battery before it is exhausted. Third, if the environmental dynamics is highly stochastic or non-stationary, long-term prediction is doomed to be unreliable. Finally, the complexity of learning a value function increases with 1/(1−γ) (Littman et al., 1995).
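A quick calculation illustrates the effective horizon set by γ: a constant reward r per step has a discounted value of r/(1−γ), so γ = 0.9 corresponds to planning roughly 10 steps ahead (illustrative arithmetic, not code from the paper):

```python
def discounted_value(r, gamma, horizon):
    """Discounted sum of a constant reward r over `horizon` steps."""
    return sum(r * gamma**t for t in range(horizon))

# With gamma = 0.9, the infinite-horizon value of r = 1 per step is 1/(1-0.9) = 10,
# and a few dozen steps already capture almost all of it.
v_short = discounted_value(1.0, 0.9, 30)   # ~9.58
v_limit = 1.0 / (1.0 - 0.9)                # ~10.0
```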
In many applications, the meta-parameters are hand-tuned by the experimenter, and heuristics are often devised to schedule the value of these meta-parameters as learning progresses. Typically, α is set to decay as the inverse of time, and β to increase linearly with time. However, such fixed schedules cannot deal with non-stationary environments.
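These conventional schedules can be sketched as follows (hypothetical parameter values; the point is that they are open-loop and cannot react to a change in the environment mid-learning):

```python
def alpha_schedule(t, alpha0=1.0):
    """Learning rate decaying as the inverse of time."""
    return alpha0 / (1.0 + t)

def beta_schedule(t, beta0=0.1, slope=0.01):
    """Inverse temperature increasing linearly with time."""
    return beta0 + slope * t
```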
Several robust methods to dynamically adjust meta-parameters have been proposed. We (Schweighofer & Arbib, 1998) proposed a biological implementation of the IDBD (incremental delta bar delta) algorithm (Sutton, 1992) to tune α. The model improves learning performance by automatically setting near-optimal synaptic learning rates for each synapse. To direct exploration, Thrun (1992) used a competence network that is trained to estimate the utility of exploration. Ishii et al. (2002) proposed to directly tune β as a function of the inverse of the variance of the action value function. However, although these algorithms are effective in balancing exploration and exploitation, they are specific to each meta-parameter and often impose severe computation and memory requirements.
Learning meta-parameters with stochastic gradients
Gullapalli (1990) introduced the Stochastic Real Value Units (SRV) algorithm. An SRV unit output is produced by adding to the weighted sum of its input pattern a small perturbation that provides the unit with the variability necessary to explore its activity space. When the action increases the reward, the unit's weights are adjusted such that the output moves in the direction in which it was perturbed.
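The SRV idea can be sketched for a single scalar parameter as follows (our own illustration of perturbation-based updating, not the paper's algorithm; the function names, step sizes, and toy reward function are assumptions):

```python
import random

def srv_step(theta, reward_fn, baseline, sigma=0.1, lr=0.5):
    """One stochastic-perturbation update of the scalar parameter theta.

    reward_fn: maps a parameter value to an observed reward.
    baseline:  running estimate of the average reward, used for comparison.
    """
    noise = random.gauss(0.0, sigma)          # exploratory perturbation
    r = reward_fn(theta + noise)              # reward under the perturbed value
    theta += lr * (r - baseline) * noise      # move in the perturbed direction if it paid off
    new_baseline = 0.9 * baseline + 0.1 * r   # update the reward baseline
    return theta, new_baseline

# Climb toward the maximum of a toy reward peaked at theta = 2.
random.seed(0)
theta, baseline = 0.0, -4.0   # start the baseline at the initial reward
for _ in range(2000):
    theta, baseline = srv_step(theta, lambda th: -(th - 2.0) ** 2, baseline)
```

On average the update follows the reward gradient scaled by the perturbation variance, so theta drifts toward the reward peak while the perturbations keep probing the neighborhood.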
Here, we expand the idea of SRV units and take it on to a level higher—that is, we propose
Discrete space and time
We tested our algorithm on a simulation of a Markov decision problem (MDP) task. In the experiment (Tanaka et al., 2002), on each step, one of four visual stimuli was presented on the screen and the subject was asked to press one of two buttons—one leading to a positive reward, the other to a negative reward (a loss). The subject's button press determined both the immediate reward and the next stimulus. The state transition and the amount of reward were such that a button with less reward had
Discussion
We proposed a simple, yet robust and generic, algorithm that not only finds near-optimal meta-parameters, but also controls the time course of the meta-parameters in a dynamic, adaptive manner.
Since we introduced five extra parameters, i.e. two global parameters, τ1, τ2, and three parameters specific to each neuromodulator neuron, μ, ν, and n, do we need meta-meta-learning algorithms, or even higher order learning mechanisms, to tune these parameters? Let's notice that we have actually
Acknowledgements
We thank Etienne Burdet, Stephan Schaal, and Fredrik Bissmarck for comments on a first draft. This work was supported by CREST, JST.
References (17)
- et al. Opponent interactions between serotonin and dopamine. Neural Networks, 2002.
- et al. Dopaminergic regulation of cortical acetylcholine release: effects of dopamine receptor agonists. Neuroscience, 1993.
- Metalearning and neuromodulation. Neural Networks, 2002.
- A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 1990.
- et al. Control of exploitation–exploration meta-parameters in reinforcement learning. Neural Networks, 2002.
- et al. 5-HT and motor control: a hypothesis. Trends in Neurosciences, 1993.
- et al. Monoaminergic interaction in the central nervous system: a morphological analysis in the locus coeruleus of the rat. Comparative Biochemistry and Physiology C, 1991.
- et al. Activity of norepinephrine-containing locus coeruleus neurons in behaving rats anticipates fluctuations in the sleep–waking cycle. Journal of Neuroscience, 1981.
1. Current address: ATR Human Information Science Laboratories.