Research report
Theory meets pigeons: The influence of reward-magnitude on discrimination-learning

https://doi.org/10.1016/j.bbr.2008.10.038

Abstract

Modern theoretical accounts of reward-based learning are commonly based on reinforcement learning algorithms. The most prominent in this context is the temporal-difference (TD) algorithm, in which the difference between predicted and obtained reward, the prediction-error, serves as a learning signal. Consequently, larger rewards cause larger prediction-errors and lead to faster learning than smaller rewards. Therefore, if animals employ a neural implementation of TD learning, reward-magnitude should affect learning accordingly.

Here we test this prediction by training pigeons on a simple color-discrimination task with two pairs of colors. In each pair, correct discrimination is rewarded: in pair one with a large reward, in pair two with a small reward. Pigeons acquired the ‘large-reward’ discrimination faster than the ‘small-reward’ discrimination. Animal behavior and an implementation of the TD-algorithm yielded comparable results with respect to the difference between the learning curves in the large-reward and small-reward conditions. We conclude that the influence of reward-magnitude on the acquisition of a simple discrimination paradigm is accurately captured by a TD implementation of reinforcement learning.

Introduction

Successful behavior depends on establishing reliable predictions about future events. To select appropriate actions, humans and other animals need to learn which sensory events predict dangers or benefits and which actions improve or worsen the animal's situation. This learning often relies on positive (reward) or negative (punishment) feedback. The neural basis of feedback-based learning is highly conserved across species, and much of the basic neural organization is similar across vertebrate species [38], [12]. Much research has been dedicated to understanding the computational principles mediating feedback-based learning, and numerous models have been devised to describe these principles mathematically [36], [8]. Modern theoretical accounts of feedback-based learning are mostly centered on reinforcement learning algorithms; the most prominent of these is the temporal-difference (TD) algorithm [36], [37], which has been used successfully as a model of behavioral and neural responses during reward-based learning [21], [31]. TD learning is an extension of the Rescorla–Wagner (or, equivalently, the Widrow–Hoff) learning rule with a more detailed representation of time [36], [37]. We used the TD model in this study because it is widely used in computational neuroscience and because it is well integrated into machine-learning theory, including action selection in decision making.

In TD-algorithms, time is typically divided into discrete steps, and for each time step the amount of predicted future reward is determined on the basis of sensory stimuli. A comparison of predicted and obtained reward yields a prediction error signal with three basic characteristics: (1) an unexpected reward generates a positive prediction error, indicating that more reward was obtained than predicted; (2) omission of a predicted reward generates a negative prediction error, indicating that less reward was obtained than predicted; and (3) a fully predicted reward generates no prediction error. This prediction error signal is in turn used to update the reward prediction of the sensory stimuli that preceded the reward: a positive prediction error leads to an increase in reward prediction, a negative prediction error to a decrease [31], [33]. Through this mechanism, TD learning can associate a stimulus with a reward (as in classical conditioning) [25], associate an action with a reward (as in operant conditioning) [22], [1], or cause extinction of a previously formed association [26].
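These three characteristics can be made concrete with a minimal TD(0) sketch in Python. This is our illustration of the standard algorithm, not the implementation used in this study; the parameter names alpha (learning rate) and gamma (temporal discounting) follow textbook conventions [36].

```python
import numpy as np

def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta = r + gamma*V(s') - V(s); V(s) += alpha*delta."""
    delta = r + gamma * V[s_next] - V[s]   # prediction error
    V[s] += alpha * delta                  # update reward prediction
    return delta

V = np.zeros(2)  # value estimates: state 0 = stimulus, state 1 = terminal

# (1) Unexpected reward -> positive prediction error, prediction increases.
print(td_update(V, 0, 1, r=1.0))   # delta = +1.0

# Train until the reward is well predicted ...
for _ in range(200):
    td_update(V, 0, 1, r=1.0)

# (3) Fully predicted reward -> prediction error near zero.
print(td_update(V, 0, 1, r=1.0))   # delta ~ 0

# (2) Omission of the predicted reward -> negative prediction error.
print(td_update(V, 0, 1, r=0.0))   # delta ~ -1.0
```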

The TD-algorithm gained popularity because the activity of dopaminergic neurons located in the ventral tegmentum and substantia nigra pars compacta of mammals resembles the TD prediction error signal. The dopaminergic system is frequently termed the ‘reward-system’ of the brain, and numerous theories have been devised about its exact role in reward. The most prominent include reinforcement [35], incentive salience [2], and habit formation [10]. Despite the ongoing discussion of the behavioral role of dopamine, there is clear evidence that the activity of dopaminergic neurons bears a striking resemblance to the TD error signal. The responses of dopaminergic neurons show positive and negative prediction errors [21], [31], [25] and comply with several assumptions of learning theory [40]. One important prediction of the TD-algorithm is that the error signal depends on the size of the reward: a large unexpected reward generates a bigger error signal than a small unexpected reward. Hence, bigger rewards should lead to faster learning than smaller rewards.
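This magnitude-dependence follows directly from the update rule sketched above: for a novel stimulus with zero reward prediction, the first prediction error simply equals the reward received. A minimal, self-contained check (the reward values 1 and 4 are our illustrative choices, not values from the paper):

```python
def first_prediction_error(reward, v_stimulus=0.0, v_next=0.0, gamma=0.9):
    """Prediction error on the first pairing of a novel stimulus with reward."""
    return reward + gamma * v_next - v_stimulus

print(first_prediction_error(1.0))  # small unexpected reward -> delta = 1.0
print(first_prediction_error(4.0))  # large unexpected reward -> delta = 4.0
```

Because each value update is proportional to this error, the larger reward produces a proportionally larger step toward asymptote on every early trial.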

The influence of reward-magnitude on animal behavior has previously been investigated with regard to several questions, for example reward-discriminability [7], [14], [15], [17], [24], motivation [4], [5], [6], [9], [19], [43], and choice behavior [18]. In addition, it has been evaluated in the light of response-rates during acquisition [4], [7], [13], [20], [43] and reversal [19]. However, whether the influence of reward-magnitude on learning-rate complies with the predictions of the TD-model has not yet been directly investigated. Such a test requires the use of error-rates rather than measures of response-strength, in order to avoid confounding learning with overall performance differences due to motivation [5], [6]. Here we test whether the acquisition of a color-discrimination is modulated by the magnitude of the contingent reward and relate our findings to an implementation of the TD-model.
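As a preview of the kind of model comparison reported below, such a test can be sketched in silico with a Rescorla–Wagner-style simulation of the two conditions. Everything in this sketch is an assumption on our part: the softmax choice rule with inverse temperature beta, the reward magnitudes 4 and 1, the learning rate 0.1, and the 80% criterion are illustrative choices, not the parameters of the paper's TD implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(reward_magnitude, n_trials=200, alpha=0.1, beta=2.0):
    """One simulated subject on a two-choice color discrimination.

    V[0] is the value of the correct color, V[1] of the incorrect color.
    Choice is softmax in the values; only the correct choice is rewarded.
    Returns a 0/1 array coding correct responses per trial.
    """
    V = np.zeros(2)
    correct = np.zeros(n_trials)
    for t in range(n_trials):
        p = np.exp(beta * V)
        p /= p.sum()                           # softmax choice probabilities
        choice = rng.choice(2, p=p)
        r = reward_magnitude if choice == 0 else 0.0
        V[choice] += alpha * (r - V[choice])   # prediction-error update
        correct[t] = (choice == 0)
    return correct

def trials_to_criterion(curve, criterion=0.8):
    """First trial at which the average learning curve crosses criterion."""
    hits = np.nonzero(curve >= criterion)[0]
    return int(hits[0]) if hits.size else None

# Average learning curves over simulated subjects, large vs. small reward.
n_subjects = 100
large = np.mean([simulate(4.0) for _ in range(n_subjects)], axis=0)
small = np.mean([simulate(1.0) for _ in range(n_subjects)], axis=0)

print("large reward:", trials_to_criterion(large), "trials to 80% correct")
print("small reward:", trials_to_criterion(small), "trials to 80% correct")
```

Under these assumptions the large-reward condition reaches criterion in considerably fewer trials than the small-reward condition; the exact numbers depend on the assumed parameters, but the ordering is precisely the prediction tested in this study.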

Section snippets

Subjects

Twelve naive homing pigeons (Columba livia) with body weights ranging from 330 g to 490 g served as subjects. The animals were housed individually in wire-mesh cages inside a colony room and had free access to water and grit; during experiments they were maintained at 80% of their free-feeding body weight. The colony room provided a 12 h dark–light cycle with lights on at 8:00 and lights off at 20:00. The experiment and all experimental procedures were in accordance with the National Institute of

Behavior

Of the 12 animals in training, ten reached criterion on the reward-discrimination (three consecutive days with over 80% choice of the large reward) and went on to be tested on the color-discrimination. For these ten animals, the high level of reward-discrimination was maintained throughout all subsequent sessions (Fig. 2). Training of the remaining two animals was discontinued and they were omitted from the analysis.

All animals learned the color-discrimination task within 10 days of training, the

Discussion

The aim of the present study was to test a prediction of reinforcement learning models: that learning-rates depend on the magnitude of the reward delivered after correct responses. To assess this prediction, pigeons were trained on a color-discrimination task with different reward-magnitudes. In line with reinforcement learning models, a large reward led to fast acquisition of the task, whereas a small reward led to slow acquisition. As an additional measure, the

Acknowledgement

This work was supported by the BMBF grant ‘reward-based learning’.

References (43)

  • G. Collier et al., Changes in performance as a function of shifts in the magnitude of reinforcement, J Exp Psychol (1959)
  • L.P. Crespi, Quantitative variation in incentive and performance in the white rat, Am J Psychol (1942)
  • L.P. Crespi, Amount of reinforcement and level of performance, Psychol Rev (1944)
  • M.R. Denny et al., Differential response learning on the basis of differential size of reward, J Genet Psychol (1955)
  • K. Doya, Modulators of decision making, Nat Neurosci (2008)
  • R.H. Dufort et al., Changes in response strength with changes in the amount of reinforcement, J Exp Psychol (1956)
  • B.J. Everitt et al., Neural systems of reinforcement for drug addiction: from actions to habits to compulsion, Nat Neurosci (2005)
  • N. Guttman, Operant conditioning, extinction, and periodic reinforcement in relation to concentration of sucrose used as reinforcing agent, J Exp Psychol (1953)
  • N. Guttman, Equal-reinforcement values for sucrose and glucose solutions compared with equal-sweetness values, J Comp Physiol Psychol (1954)
  • P.J. Hutt, Rate of bar pressing as a function of quality and quantity of food reward, J Comp Physiol Psychol (1954)
  • R. Ito et al., Dopamine release in the dorsal striatum during cocaine-seeking behavior under the control of a drug-associated cue, J Neurosci (2002)