Neural Networks

Volume 16, Issue 1, January 2003, Pages 5-9

Neural Networks letter
Meta-learning in Reinforcement Learning

https://doi.org/10.1016/S0893-6080(02)00228-9

Abstract

Meta-parameters in reinforcement learning should be tuned to the environmental dynamics and to the animal's performance. Here, we propose a biologically plausible meta-reinforcement learning algorithm for tuning these meta-parameters in a dynamic, adaptive manner. We tested our algorithm in both a simulation of a Markov decision task and a non-linear control task. Our results show that the algorithm robustly finds appropriate meta-parameter values and controls their time course in both static and dynamic environments. We suggest that the phasic and tonic components of dopamine neuron firing can encode the signal required for meta-learning of reinforcement learning.

Introduction

Reinforcement learning is a particularly attractive model of animal learning, as dopamine neurons have the formal characteristics of the teaching signal known as the temporal difference (TD) error (Schultz, 1998). However, crucial to successful reinforcement learning is the careful setting of several meta-parameters. We earlier proposed that these meta-parameters are set and adjusted by neuromodulator neurons (Doya, 2002). We introduce here a simple, yet robust, algorithm that not only finds appropriate meta-parameters, but also controls the time course of these meta-parameters in a dynamic, adaptive manner.

The main issue in the theory of reinforcement learning (RL) is to maximize the long-term cumulative reward. Thus, central to reinforcement learning is the estimation of the value function
V(x(t)) = E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(t+k+1)\right],
where r(t), r(t+1), r(t+2), … are the rewards acquired by following a certain action policy starting from x(t), and γ is a discount factor such that 0 < γ < 1. The value function for the states before and after a transition should satisfy the consistency equation
V(x(t-1)) = E\left[r(t) + \gamma V(x(t))\right].
Therefore, any deviation from the consistency equation, expressed as
\delta(t) = r(t) + \gamma V(x(t)) - V(x(t-1)),
should be zero on average. This signal is the TD error and is used as the teaching signal to learn the value function:
\Delta V(x(t-1)) = \alpha\,\delta(t),
where α is the learning rate.
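For concreteness, a minimal tabular TD(0) sketch of these update equations is given below (our own illustration in Python; the function name, state encoding, and parameter values are assumptions, not taken from the article):

```python
import numpy as np

def td0_update(V, x_prev, x, r, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update of the value function V (a 1-D numpy array).

    x_prev, x : integer indices of the previous and current states
    r         : reward r(t) received on the transition
    alpha     : learning rate
    gamma     : discount factor, 0 < gamma < 1
    """
    # TD error: delta(t) = r(t) + gamma * V(x(t)) - V(x(t-1))
    delta = r + gamma * V[x] - V[x_prev]
    # Value update: Delta V(x(t-1)) = alpha * delta(t)
    V[x_prev] += alpha * delta
    return delta
```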

The policy is usually defined via the action value function Q(x(t),a), which represents how much future reward the agent would get by taking action a at state x(t) and following the current policy in subsequent steps. One common way of stochastic action selection that allows exploration is to compute the probability of taking an action by the soft-max function
P(a \mid x(t)) = \frac{e^{\beta Q(x(t),a)}}{\sum_{a'=1}^{M} e^{\beta Q(x(t),a')}},
where the meta-parameter β is called the inverse temperature.
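A minimal sketch of soft-max action selection, assuming the action values for the current state are stored in a 1-D array (the function names and the numerical-stability shift are our own, not from the article):

```python
import numpy as np

def softmax_policy(Q_x, beta):
    """Soft-max probabilities P(a|x) = exp(beta*Q(x,a)) / sum_a' exp(beta*Q(x,a')).

    Q_x  : 1-D numpy array of action values Q(x, a) for the current state x
    beta : inverse temperature; small beta gives near-uniform (exploratory)
           choices, large beta gives near-greedy (exploitative) choices
    """
    z = beta * Q_x
    z = z - z.max()          # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_action(Q_x, beta, rng=np.random.default_rng()):
    """Sample an action index from the soft-max distribution."""
    return rng.choice(len(Q_x), p=softmax_policy(Q_x, beta))
```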

Crucial to successful reinforcement learning is the careful setting of the three meta-parameters α,β and γ.

  • The learning rate α is key to maximizing the speed of learning, as small learning rates induce slow learning, and large learning rates induce oscillations.

  • The inverse temperature β controls the exploitation–exploration trade-off. Ideally, β should initially be low to allow large exploration when the agent does not have a good mapping of which actions will be rewarding, and gradually increase as the agent reaps higher and higher rewards.

  • The discount factor γ specifies how far in the future rewards should be taken into account. If γ is small, the agent learns to behave only for short-term rewards. Although setting γ close to 1 encourages the agent to learn to act for long-term rewards, there are several reasons why γ should not be set too large. First, any real learning agent, either artificial or biological, has a limited lifetime: a discounted value function is equivalent to a non-discounted value function for an agent with a constant death rate of 1−γ (see the sketch after this list). Second, an agent has to acquire some rewards in time; for instance, an animal must find food before it starves, and a robot must recharge its battery before it is exhausted. Third, if the environmental dynamics is highly stochastic or non-stationary, long-term prediction is doomed to be unreliable. Finally, the complexity of learning a value function increases with 1/(1−γ) (Littman et al., 1995).
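As a brief worked illustration of the first point above (our own step, assuming the agent's random lifetime T is independent of the reward sequence): if the agent survives each step with probability γ, so that it is still alive at step k with probability γ^k, its expected undiscounted return over its lifetime equals the discounted value function:

```latex
% Discounting with factor gamma is equivalent to no discounting for an agent
% that dies at each step with probability 1 - gamma, i.e. Pr(alive at step k)
% = Pr(T > k) = gamma^k, with lifetime T independent of the rewards.
\begin{align*}
E\!\left[\sum_{k=0}^{T-1} r(t+k+1)\right]
  &= \sum_{k=0}^{\infty} \Pr(T > k)\, E\!\left[r(t+k+1)\right] \\
  &= \sum_{k=0}^{\infty} \gamma^{k}\, E\!\left[r(t+k+1)\right]
   = V(x(t)).
\end{align*}
```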

In many applications, the meta-parameters are hand-tuned by the experimenter, and heuristics are often devised to schedule their values as learning progresses. Typically, α is set to decay as the inverse of time, and β to increase linearly with time. However, such fixed schedules cannot deal with non-stationary environments.
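For illustration, such hand-tuned schedules might look like the following sketch (the functional forms follow the text; the constants and function names are arbitrary assumptions):

```python
def alpha_schedule(t, alpha0=1.0):
    """Typical heuristic: the learning rate decays as the inverse of time."""
    return alpha0 / (1.0 + t)

def beta_schedule(t, beta0=0.1, slope=0.01):
    """Typical heuristic: the inverse temperature increases linearly with time."""
    return beta0 + slope * t
```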

Several robust methods to dynamically adjust meta-parameters have been proposed. We (Schweighofer & Arbib, 1998) proposed a biological implementation of the IDBD (incremental delta bar delta) algorithm (Sutton, 1992) to tune α. The model improves learning performance by automatically setting near-optimal learning rates for each synapse. To direct exploration, Thrun (1992) used a competence network trained to estimate the utility of exploration. Ishii et al. (2002) proposed to tune β directly as a function of the inverse of the variance of the action value function. However, although these algorithms are effective in balancing exploration and exploitation, they are specific to individual meta-parameters and often impose severe computation and memory requirements.

Section snippets

Learning meta-parameters with stochastic gradients

Gullapalli (1990) introduced the Stochastic Real Value Units (SRV) algorithm. An SRV unit produces its output by adding, to the weighted sum of its input pattern, a small perturbation that provides the unit with the variability necessary to explore its activity space. When the action increases the reward, the unit's weights are adjusted such that the output moves in the direction in which it was perturbed.
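A minimal sketch of this core SRV idea, under our own simplifying assumptions (Gaussian exploration noise, an externally supplied reward baseline, and illustrative step sizes; this is not Gullapalli's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def srv_output(w, x, sigma=0.1):
    """Emit a perturbed output: weighted sum of the inputs plus exploration noise."""
    noise = sigma * rng.standard_normal()
    return w @ x + noise, noise

def srv_style_update(w, x, noise, reward, baseline, eta=0.05, sigma=0.1):
    """If the perturbed output earned more reward than the baseline, move the
    weights so that the mean output shifts in the direction of that perturbation."""
    if reward > baseline:
        w = w + eta * (reward - baseline) * (noise / sigma) * x
    return w
```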

Here, we expand the idea of SRV units and take it one level higher: that is, we propose

Discrete space and time

We tested our algorithm on a simulation of a Markov decision problem (MDP) task. In the experiment (Tanaka et al., 2002), on each step, one of four visual stimuli was presented on the screen and the subject was asked to press one of two buttons—one leading to a positive reward, the other to a negative reward (a loss). The subject's button press determined both the immediate reward and the next stimulus. The state transition and the amount of reward were such that a button with less reward had

Discussion

We proposed a simple, yet robust and generic, algorithm that not only finds near-optimal meta-parameters, but also controls their time course in a dynamic, adaptive manner.

Since we introduced five extra parameters, namely two global parameters, τ1 and τ2, and three parameters specific to each neuromodulator neuron, μ, ν, and n, do we need meta-meta-learning algorithms, or even higher-order learning mechanisms, to tune these parameters? Note that we have actually

Acknowledgements

We thank Etienne Burdet, Stephan Schaal, and Fredrik Bissmarck for comments on a first draft. This work was supported by CREST, JST.


1 Current address: ATR Human Information Science Laboratories.
