Multiple representations and algorithms for reinforcement learning in the cortico-basal ganglia circuit

https://doi.org/10.1016/j.conb.2011.04.001

Accumulating evidence shows that the neural network of the cerebral cortex and the basal ganglia is critically involved in reinforcement learning. Recent studies have found functional heterogeneity within the cortico-basal ganglia circuit, especially along its ventromedial to dorsolateral axis. Here we review computational issues in reinforcement learning and propose a working hypothesis on how multiple reinforcement learning algorithms are implemented in the cortico-basal ganglia circuit using different representations of states, values, and actions.

Highlights

► We review computational issues and possible algorithms for decision making.

► We review recent findings on the neural correlates of the variables in those algorithms.

► Then we propose a hypothesis about parallel and hierarchical modules in the striatum.

Introduction

The loop network composed of the cerebral cortex and the basal ganglia is now recognized as the major site for decision making and reinforcement learning [1, 2]. The theory of reinforcement learning [3] prescribes a number of steps that are required for decision making: 1) recognize the present state of the environment by disambiguating sensory inputs; 2) evaluate the candidate actions in terms of expected future rewards (action values); 3) select the most advantageous action; and 4) update the action values based on the discrepancy between the predicted and the actual rewards. Simplistic models of reinforcement learning in the basal ganglia (e.g. [4]) proposed that the cerebral cortex represents the present state and the striatal neurons compute action values [5]. An action is selected downstream, in the globus pallidus, and the dopamine neurons signal the reward prediction error [6], which enables learning by dopamine-dependent synaptic plasticity in the striatum [7]. Recent studies, however, have shown that the reality may be more complex. Discriminating the environmental state behind noisy observations is in itself a hard problem, known as perceptual decision making [8, 9]. Activity related to action values is found not only in the striatum, but also in the pallidum [10, 11•] and the cortex [12••]. Different parts of the striatum, especially along its ventromedial to dorsolateral axis, have different roles in goal-directed and habitual behaviors [13]. Action selection may be performed not in just one locus of the brain but by competition and agreement among distributed decision networks [14]. Finally, a subset of midbrain dopamine neurons located in the dorsolateral part signals not only rewarding but also aversive events [15••].
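For concreteness, the reward prediction error invoked in step 4 is standardly formalized as the temporal-difference (TD) error of reinforcement learning theory [3]. The notation below is ours, a textbook sketch rather than the article's own formulation, with γ a discount factor and α a learning rate.

```latex
% Temporal-difference (TD) reward prediction error (standard textbook form, cf. [3]):
%   positive \delta_t = outcome better than predicted, negative \delta_t = worse.
\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)
% Value update with learning rate \alpha:
V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t
```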

Based primarily on primate studies, Samejima and Doya [16] proposed that different cortico-basal ganglia subloops realize decisions in the motivational, context-based, spatial, and motor domains. In this article, we consider how different decision-making algorithms, such as model-based and hierarchical reinforcement learning, can be implemented in the cortico-basal ganglia circuit, with a focus on the ventromedial to dorsolateral axis of the rodent striatum.

Section snippets

Computational axes in action learning

In examining the computational mechanisms of decision making and reinforcement learning, several axes are useful for sorting out the process.

Model-free reinforcement learning algorithms

In the basic theory of reinforcement learning, the learning agent does not initially know how its actions affect the environmental state or how much reward is given in which state. Action value-based algorithms, including Q-learning and SARSA, use actual experience of states, actions, and rewards to estimate the action value function Q(state, action), which evaluates how much future reward is expected from taking a particular action in a given state. An action can be selected greedily or …
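As a minimal illustration of the value-based algorithms named above, the sketch below implements tabular Q-learning and SARSA updates together with a softmax action-selection rule. The function names, parameter values, and the specific softmax choice rule are our own illustrative assumptions, not details taken from the article.

```python
import numpy as np

def softmax_action(q_row, beta=3.0):
    """Stochastic action selection; larger beta approaches greedy choice."""
    p = np.exp(beta * (q_row - q_row.max()))
    p /= p.sum()
    return np.random.choice(len(q_row), p=p)

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstraps from the best available next action."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # reward prediction error
    Q[s, a] += alpha * delta
    return delta

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstraps from the action actually chosen next."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

# Q is a (n_states, n_actions) table, e.g. Q = np.zeros((5, 2)).
```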

Model-based analysis of learner's variables

In order to describe how an animal's choices change dynamically depending on its reward experience, a straightforward way is to take a Markov model in which the conditional probability of an action choice given the previous state, action, and reward is computed. Such a non-parametric, hypothesis-neutral description is helpful in measuring the goodness of more elaborate model-based explanations [11]. Recent use of normative models, especially those based on reinforcement learning algorithms, has turned out to be …
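To make the model-based analysis concrete, the sketch below fits a simple Q-learning model to trial-by-trial choice and reward data by maximum likelihood and reports AIC for model comparison. The restriction to a two-choice, stateless (bandit-style) task and all function and variable names are our own simplifying assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_actions=2):
    """Negative log-likelihood of observed choices under a simple Q-learning model."""
    alpha, beta = params
    Q = np.zeros(n_actions)
    nll = 0.0
    for a, r in zip(choices, rewards):
        p = np.exp(beta * (Q - Q.max()))
        p /= p.sum()
        nll -= np.log(p[a] + 1e-12)        # likelihood of the choice actually made
        Q[a] += alpha * (r - Q[a])         # value update from the observed reward
    return nll

def fit_q_model(choices, rewards):
    """Maximum-likelihood parameter estimates and AIC (= 2k + 2*NLL)."""
    res = minimize(neg_log_likelihood, x0=np.array([0.3, 2.0]),
                   args=(choices, rewards),
                   bounds=[(1e-3, 1.0), (1e-3, 20.0)])
    aic = 2 * len(res.x) + 2 * res.fun
    return res.x, aic
```

Comparing AIC (or cross-validated likelihood) across candidate models is one way the fitted trial-by-trial variables, such as action values, can be validated against the hypothesis-neutral Markov description mentioned above.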

Possible implementation in the cortico-basal ganglia network

Based on the computational requirements and possible reinforcement learning algorithms, we now review neural recording and brain imaging results, many of them obtained through the model-based analysis described above, that shed light on how these algorithms could be implemented in the cortico-basal ganglia network.

Hierarchical reinforcement learning in the cortico-basal ganglia loops

Anatomically and neurophysiologically, the dorsal striatum (DS) and the ventral striatum (VS) have the same basic structure and there is no clear boundary between them [69], suggesting the possibility that DS and VS work by the same mechanism. In contrast, input from the cortex shows a dorsolateral–ventromedial gradient in modality: the more dorsolateral striatum receives sensorimotor-related information, whereas the more ventromedial part receives associative and motivational information [69]. These striatal subdivisions send their output through …
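To illustrate what a hierarchical arrangement of decision modules could look like computationally, the sketch below follows the options framework for temporal abstraction (Sutton et al., cited in the references): a higher-level module selects a temporally extended option and a lower-level module issues primitive actions until the option terminates. This is a generic sketch under our own assumptions, including the hypothetical `env` interface with `reset`/`step` methods and integer state indices; it is not the circuit model proposed in the article.

```python
import numpy as np

class Option:
    """A temporally extended action: its own sub-policy plus a termination test."""
    def __init__(self, policy, is_done):
        self.policy = policy      # maps state -> primitive action
        self.is_done = is_done    # maps state -> True when the option should end

def run_hierarchical_episode(env, option_values, options, beta=3.0, max_steps=1000):
    """Higher level picks options by softmax over their values; lower level executes them."""
    s = env.reset()
    for _ in range(max_steps):
        q = option_values[s]                               # values of options in state s
        p = np.exp(beta * (q - q.max())); p /= p.sum()
        opt = options[np.random.choice(len(options), p=p)]
        while not opt.is_done(s):                          # lower module runs the option
            s, r, done = env.step(opt.policy(s))
            if done:
                return s
    return s
```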

Conclusion

We reviewed computational issues and possible algorithms for decision making and reinforcement learning, and recent findings on the neural correlates of the variables in those algorithms. Then we proposed a working hypothesis: the dorsolateral, the dorsomedial, and the ventral striatum comprise parallel and hierarchical reinforcement learning modules that are in charge of actions at different physical and temporal scales. The parallelism of the decision modules has also been suggested in the …

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

References (73)

  • K. Doya. Modulators of decision making. Nat Neurosci (2008)
  • R.S. Sutton et al. Reinforcement Learning. (1998)
  • K. Samejima et al. Representation of action-specific reward values in the striatum. Science (2005)
  • W. Schultz et al. A neural substrate of prediction and reward. Science (1997)
  • J.N. Reynolds et al. A cellular mechanism of reward-related learning. Nature (2001)
  • R. Kiani et al. Representation of confidence associated with a decision by neurons in the parietal cortex. Science (2009)
  • R.P. Rao. Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Front Comput Neurosci (2010)
  • B. Pasquereau et al. Shaping of motor responses by incentive values through the basal ganglia. J Neurosci (2007)
  • M. Ito et al. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J Neurosci (2009)
  • K. Wunderlich et al. Neural computations underlying action-based decision making in the human brain. Proc Natl Acad Sci USA (2009)
  • C.M. Pennartz et al. Corticostriatal interactions during learning, memory processing, and decision making. J Neurosci (2009)
  • P. Cisek. Cortical mechanisms of action selection: the affordance competition hypothesis. Philos Trans R Soc Lond B Biol Sci (2007)
  • M. Matsumoto et al. Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature (2009)
  • K. Samejima et al. Multiple representations of belief states and action values in corticobasal ganglia loops. Ann N Y Acad Sci (2007)
  • N.D. Daw et al. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci (2005)
  • R.S. Sutton et al. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artif Intell (1999)
  • T.G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J Artif Intell Res (1999)
  • P. Dayan et al. Feudal reinforcement learning
  • N.D. Daw et al. The computational neurobiology of learning and reward. Curr Opin Neurobiol (2006)
  • G. Corrado et al. Understanding neural coding through the model-based analysis of decision making. J Neurosci (2007)
  • D.J. Barraclough et al. Prefrontal cortex and decision making in a mixed-strategy game. Nat Neurosci (2004)
  • L.P. Sugrue et al. Matching behavior and the representation of value in the parietal cortex. Science (2004)
  • B. Lau et al. Dynamic response-by-response models of matching behavior in rhesus monkeys. J Exp Anal Behav (2005)
  • N.D. Daw et al. Cortical substrates for exploratory decisions in humans. Nature (2006)
  • H. Akaike. A new look at the statistical model identification. IEEE Trans Autom Control (1974)
  • H. Kim et al. Role of striatum in updating values of chosen actions. J Neurosci (2009)