A bio-inspired reinforcement learning model that accounts for fast adaptation after punishment

Humans and animals can quickly learn a new strategy when a previously-rewarding strategy is punished. It is difficult to model this with reinforcement learning methods, because they tend to perseverate on previously-learned strategies, a hallmark of impaired response to punishment. Past work has addressed this by augmenting conventional reinforcement learning equations with ad hoc parameters or parallel learning systems. This produces reinforcement learning models that account for reversal learning, but are more abstract, complex, and somewhat detached from neural substrates. Here we use a different approach: we generalize a recently-discovered neuron-level learning rule, on the assumption that it captures a basic principle of learning that may reappear at the whole-brain level. Surprisingly, this gives a new reinforcement learning rule that accounts for adaptation and lose-shift behavior, and uses only the same parameters as conventional reinforcement learning equations. In the new rule, the reward prediction errors that normally drive reinforcement learning are scaled by the probability the agent assigned to the action that triggered a reward or punishment. The new rule demonstrates quick adaptation in card sorting and variable Iowa gambling tasks, and also exhibits a human-like paradox-of-choice effect. It will be useful for experimental researchers modeling learning and behavior.


Introduction
Humans can efficiently change strategies when a strategy that used to be rewarding is punished. This ability is an essential part of rewarding behavior generally (Izquierdo et al., 2017), and its impairment can have dire consequences: imagine a deer who perseverates at a favorite watering hole even after an alligator has moved into it, or a gambling addict who continues to use a betting system even after substantial loss. Indeed, impaired "reversal learning" is associated with gambling disorders (Jara-Rizzo et al., 2020; Perandrés-Gómez et al., 2021; Wiehler et al., 2021) and sometimes with drug addiction (Izquierdo & Jentsch, 2012; Zhukovsky et al., 2019) and other psychiatric disorders (Kovalchik & Allman, 2006).
This kind of adaptation is sometimes tested using the Wisconsin Card Sorting Test, a cognitive test in which participants must match cards based on one of several features (color, shape, number), and must infer the correct matching criteria from positive and negative feedback (Berg, 1948). Because the test requires participants to identify changes in the criteria and switch matching strategies appropriately, it has historically been used to identify the impaired flexibility associated with brain damage and neurodegenerative disease (Milner, 1963). A more recently proposed test is the Variable Iowa Gambling Task (Kovalchik & Allman, 2006), in which participants repeatedly select from several options, each of which is rewarded with a certain (but unknown to the participant) probability. Through experimentation the participants can identify the high-probability reward option, but the probabilities change periodically, forcing the participants to adapt.
Researchers have often relied on reinforcement learning to model adaptation and reversal learning. Reinforcement learning is a field that provides mathematical models of learning and decision making, thought to correspond to learning mechanisms in the dopaminergic and striatal systems (Dayan, 2009; Montague et al., 1996; Schultz et al., 1997; Starkweather & Uchida, 2021; Sutton & Barto, 2018). These models have been used by psychologists and neuroscientists to analyze and understand animal behavior for many years (N. Daw, 2012; Neftci & Averbeck, 2019; Sutton & Barto, 2018).
However, while conventional reinforcement learning models can very effectively learn a particular task or environment (recent reinforcement-learning-based computer algorithms can play Go, Chess, and StarCraft at superhuman levels (Schrittwieser et al., 2020; Vinyals et al., 2019)), they can be slow to learn (requiring large numbers of training examples) (Botvinick et al., 2019), and quite incapable of fast strategy-switching (Chalmers et al., 2016). This causes them to perseverate in previously-rewarding behaviors, which could make them good models of impaired reversal learning, and perhaps of reversal learning in some animals (Bari et al., 2022), but poor models of healthy human reversal learning. Researchers have tried to compensate for this by adding various ad hoc parameters to the standard learning equations. These new parameters have directly created perseveration or "stickiness" behaviors, modified behavior based on uncertainty, or created separate learning processes for positive and negative feedback (Metha et al., 2020; Wiehler et al., 2021; Zhukovsky et al., 2019). Other researchers have proposed multiple learning systems acting in parallel: Worthy et al. proposed that the Iowa Gambling Task invokes a reinforcement learning process in some individuals, but a win-stay/lose-shift strategy in others (Worthy et al., 2013; Worthy & Maddox, 2014), and Steinke et al. proposed that humans solve card sorting tasks through a combination of model-based systems (which "understand" the task and need only learn which matching criteria currently applies) and model-free systems (which learn stimulus-response associations) (Steinke et al., 2019, 2020).
These modifications to the conventional reinforcement learning equations are all valid, and produce the intended reversal-learning behaviors. But they also add significant complexity to the conventional equations, increasing their risk of overfitting. Here we draw inspiration from a neuron-level learning rule to propose a new reinforcement learning rule that might account for the range of possible adaptation behaviors without additional parameters.

Methods and experiments
We describe a candidate reversal-learning mechanism in the form of a new reinforcement learning equation. But rather than creating this equation through the addition of ad hoc parameters as past work has done, we instead generalize a recently-discovered neuron-level learning rule. This provides a more principled route, on the assumption that the neuron-level rule captures some basic principle of learning that may reappear at the whole-brain level. Before describing the proposed new reinforcement learning rule, we will first review a classical formulation of reinforcement learning, and describe the neuron-level learning rule.

A formulation of reinforcement learning
Suppose an agent perceives a state or stimulus s, and can choose from a set of actions A. If the agent has an estimate of the value of each action a, V(s,a), then a softmax equation can be used to compute the probability π(s,a) of selecting action a given state s:

$$\pi(s,a) = \frac{e^{V(s,a)/\tau}}{\sum_{a' \in A} e^{V(s,a')/\tau}} \tag{1}$$

Here τ is a "temperature" parameter that sets the strength of the agent's preference for exploiting the action with the highest value over exploring the other actions (for example, for high temperature values, the agent's choice of action will be closer to random). The agent may receive a reward r after executing an action, and can calculate the difference δ between its value estimate and this reward. This difference is the reward prediction error or "temporal difference (TD) error" (Sutton & Barto, 2018). In single-decision-making tasks like the card sorting or Iowa gambling tasks, this error can be expressed:

$$\delta = r - V(s,a) \tag{2}$$

Note that r can be zero (no reward) or negative (a punishment). This experience can then be used to update the value estimate, according to a learning rate α:

$$V(s,a) \leftarrow V(s,a) + \alpha\,\delta \tag{3}$$

It is believed that the phasic activity of dopamine neurons in the midbrain signals the reward prediction error δ. These phasic activity fluctuations are thought to influence plasticity in the striatum (Schultz et al., 1997) (and possibly the hippocampus (Mehrotra & Dubé, 2023) and prefrontal cortex (N.D. Daw et al., 2005, 2006)), information from which is likely used to compute the reward prediction error (Starkweather & Uchida, 2021), in a way that optimizes the organism's expected or perceived value of particular actions, allowing it to learn rewarding behavior.
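For concreteness, the following is a minimal Python sketch of this conventional formulation (softmax action selection plus the tabular TD update). The function and variable names are ours, not taken from any published implementation, and V is assumed to be a NumPy array of value estimates indexed by state and action.

```python
import numpy as np

def softmax_policy(V, s, tau):
    """Equation (1): probability of each action in state s, with temperature tau."""
    prefs = V[s] / tau
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def td_update(V, s, a, r, alpha):
    """Equations (2) and (3): reward prediction error and conventional value update."""
    delta = r - V[s, a]                  # single-decision TD error
    V[s, a] = V[s, a] + alpha * delta
    return delta
```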

A new reinforcement learning rule, generalized from a neuron-level rule
Changing behavior in response to punishment or reward has been observed even in single-cell organisms (Armus et al., 2006; Dussutour, 2021). Similarly, single cells like neurons can adapt their behavior to maximize energy intake (reward) and avoid starvation (punishment). The energy supplied to a neuron comes from local blood vessels controlled by the combined activity of local neurons. However, a neuron's electrical activity consumes a lot of energy. Thus, to maximize metabolic energy, a neuron needs to maximize local blood supply by activating other neurons, while minimizing its own activity. Interestingly, it was shown that a single neuron best maximizes its metabolic energy by applying a predictive learning rule (Luczak et al., 2022), in which the neuron adjusts its synaptic weights (w) to minimize surprise, i.e., the difference between its actual activity (x_post) and its predicted activity (x̃_post):

$$\Delta w \propto x_{pre}\,\bigl(x_{post} - \tilde{x}_{post}\bigr) \tag{4}$$

where pre and post refer to pre- and postsynaptic neurons, and the tilde indicates the neuron's prediction of its own future activity. This rule is also consistent with experimental findings in awake animals (Luczak et al., 2022), and may even hint at an explanation for consciousness (Luczak & Kubo, 2021), suggesting that it may encapsulate a basic principle of learning, from the single-cell to the whole-organism level.
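Schematically, and under the simplifying assumption of a rate-coded neuron, the weight change of equation (4) can be written as a one-line update; this is our paraphrase, not the published implementation.

```python
def predictive_weight_update(w, x_pre, x_post, x_post_predicted, lr):
    """Equation (4), schematically: the prediction error (actual minus predicted
    postsynaptic activity) is scaled by the presynaptic activity."""
    return w + lr * x_pre * (x_post - x_post_predicted)
```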
Examining this predictive rule, we see that it consists of a prediction-error term, modulated or scaled by a presynaptic activation. We then generalize each of these components to create an analogous reinforcement learning rule, similar to the rule we have proposed previously (Chalmers & Luczak, 2023). The prediction-error component is easy to place in a reinforcement learning context: it is the reward prediction error δ. We then need a scaling factor analogous to the presynaptic activation in the neuronal rule. Since presynaptic activation is the input to the postsynaptic neuron and the cause of its resulting activity, a natural reinforcement learning analog could be π(s,a): the input to the agent's environment and the cause of the resulting experience. We can then formulate a reinforcement learning rule as a scaling of δ by π(s,a):

$$\Delta V(s,a) \propto \pi(s,a)\,\delta \tag{5}$$

or, in a form more similar to equation (3):

$$V(s,a) \leftarrow V(s,a) + \alpha\,\pi(s,a)\,\delta \tag{6}$$

The analogy between the neuronal learning rule and the new RL rule is illustrated in Fig. 1.
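As a sketch (again with hypothetical function and variable names), a single decision-and-update step of an agent using the new rule could then look like this:

```python
import numpy as np

def step_with_new_rule(V, s, reward_fn, alpha=0.1, tau=0.3, rng=None):
    """One decision with the pi-scaled value update of equation (6).
    reward_fn(a) should return the reward or punishment for taking action a."""
    rng = rng or np.random.default_rng()
    prefs = V[s] / tau
    pi = np.exp(prefs - prefs.max())
    pi = pi / pi.sum()                    # equation (1): softmax action probabilities
    a = rng.choice(len(pi), p=pi)         # sample an action from the policy
    r = reward_fn(a)
    delta = r - V[s, a]                   # equation (2): reward prediction error
    V[s, a] += alpha * pi[a] * delta      # equation (6): error scaled by pi(s, a)
    return a, r
```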

Simulations
We test our new learning rule in two simulated tasks: the variable Iowa gambling task and a card sorting task. Each task is performed by artificial learning agents, which learn from feedback in the task. In the gambling task the agent repeatedly selects from several options. Each option has a hidden reward probability: when selected, the agent is rewarded (+1 reward) with that probability and punished (−1 reward) otherwise. The task was designed to include one high-reward option with a reward probability of 0.9 and one always-punished arm with a reward probability of 0. The other options had reward probabilities randomly selected between 0.25 and 0.75. Every 100 steps the reward probabilities were rotated such that all reward probabilities change, and the arm that was previously high-reward becomes always-punished (see Supplementary Data for an example). Note that in our experiments the agents are not capable of detecting the pattern to these changes.
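A minimal sketch of this environment follows the description above: the 0.9 arm, the always-punished arm, and the 100-step rotation are taken from the text, while the remaining details (such as the number of arms) are placeholder assumptions.

```python
import numpy as np

class VariableGamblingTask:
    """Variable Iowa gambling task: one 0.9 arm, one always-punished arm, and
    the rest drawn uniformly from [0.25, 0.75]. Every `period` pulls the
    probabilities rotate, so the previously best arm becomes always-punished."""
    def __init__(self, n_arms=4, period=100, rng=None):
        self.rng = rng or np.random.default_rng()
        middle = self.rng.uniform(0.25, 0.75, size=n_arms - 2)
        self.p = np.concatenate(([0.9], middle, [0.0]))
        self.period = period
        self.t = 0

    def pull(self, a):
        """Return +1 with probability p[a], otherwise -1; rotate periodically."""
        r = 1 if self.rng.random() < self.p[a] else -1
        self.t += 1
        if self.t % self.period == 0:
            self.p = np.roll(self.p, 1)   # the previously high-reward arm now has probability 0
        return r
```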
In the card matching task the agent is given the features of a card to be matched, and must select one of several possible matching cards based on those features. The correct matching criteria are hidden and must be inferred from rewards (+1 reward is delivered when a match is made correctly) and punishments (−1 reward is delivered when a match is incorrect). In both tasks the correct strategy periodically changes. The standard (Wisconsin) card sorting task uses cards with 3 features (color, shape, number) and presents 4 options for matching. Here we use a numerical version of the task in which cards' feature values are represented by a set of numbers indicating the corresponding matching option for that feature (see Fig. 2). This allows us to create generalizations of the standard card sorting task with arbitrary numbers of features and matching options. Thus we can gradually increase the difficulty of the card sorting task and observe the effect on reversal learning.
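A minimal sketch of the numerical card sorting task as we read it is given below. The feature encoding and reward values follow the text; the schedule on which the hidden criterion changes is a placeholder assumption here.

```python
import numpy as np

class NumericalCardSort:
    """Numerical card sorting task: each card is a vector of feature values, and
    feature i's value names the matching option that feature points to. One
    hidden feature is the correct criterion; here it changes every `period`
    trials (the actual schedule is an assumption for illustration)."""
    def __init__(self, n_features=3, n_options=4, period=20, rng=None):
        self.rng = rng or np.random.default_rng()
        self.n_features, self.n_options, self.period = n_features, n_options, period
        self.criterion = self.rng.integers(self.n_features)
        self.t = 0

    def deal(self):
        """Deal a card: one matching-option index per feature."""
        self.card = self.rng.integers(self.n_options, size=self.n_features)
        return self.card

    def match(self, choice):
        """Reward +1 if the choice matches the hidden criterion, else -1."""
        r = 1 if choice == self.card[self.criterion] else -1
        self.t += 1
        if self.t % self.period == 0:
            self.criterion = self.rng.integers(self.n_features)   # forced reversal
        return r
```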
For the gambling task we embed our new learning rule in an artificial agent that maintains value estimates for each action, selects actions according to equation (1), and updates its value estimates according to equation (6) (our new rule). We compare this agent's performance to 3 alternatives. The first alternative is a classical temporal-difference (TD) reinforcement learning algorithm (Sutton & Barto, 2018), which updates its value estimates according to equation (3). The second alternative is the gradient bandit algorithm (Sutton & Barto, 2018), another conventional approach to the gambling task, which computes the gradient of the expected reward with respect to each action preference, and adjusts the preferences to maximize reward. The third alternative is the upper confidence bound (UCB) algorithm (Auer, 2002), which is designed specifically to solve the gambling task by balancing exploitation of high perceived values with exploration of other actions in an optimal way. In our experiments, the UCB algorithm also has the advantage of being automatically reset when the reward probabilities in the task change; this is information the other learning agents do not have, so the UCB algorithm here represents a theoretical upper limit on performance that the other agents are not expected to reach. For each agent we measure the average reward per choice, the win-stay percentage (the chance that the agent will repeat an action after being rewarded), and the lose-shift percentage (the chance that the agent will change actions after being punished). The gradient bandit and UCB algorithms used the Python implementation by Byron Galbraith (Galbraith, 2024).
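The win-stay and lose-shift measurements can be computed directly from the recorded choices and outcomes; a simple sketch of such a helper (our own, not from the cited implementation) is:

```python
import numpy as np

def win_stay_lose_shift(actions, rewards):
    """Return (win-stay, lose-shift) fractions: the chance of repeating an action
    after a reward, and of switching actions after a punishment."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    stayed = actions[1:] == actions[:-1]
    wins = rewards[:-1] > 0
    losses = rewards[:-1] < 0
    win_stay = stayed[wins].mean() if wins.any() else float("nan")
    lose_shift = (~stayed)[losses].mean() if losses.any() else float("nan")
    return win_stay, lose_shift
```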
For experiments in the card sorting task we use Deep Reinforcement Learning: a recent branch of RL which incorporates artificial neural networks. The networks are embedded within and control the actions of an artificial agent, and they perform the core operations of representation, value estimation, and decision making. Reward prediction errors experienced as the agent interacts with its environment are used to adjust the networks in a way that maximizes expected reward for the agent. Because of the close analogy between Deep RL and biological reward-based learning, Deep RL is starting to be used in significant neuroscientific modeling and hypothesis creation (Botvinick et al., 2020).
Here we use an artificial neural network that accepts the features of the current card to be matched as inputs, and outputs estimated values of each match option. Our new learning rule is used to update the network after each reward or punishment, making its estimates more accurate. For comparison we also test a deep reinforcement learning network based on the conventional learning rule. These networks are essentially learning stimulus-response associations (and must re-learn them when the matching criteria change). We also compare to an interesting parallel model-based/model-free learning system proposed by Steinke et al. (Steinke et al., 2019, 2020), in which a model-based learning process "understands" the task and need only learn which matching criteria currently applies, while a separate model-free process learns the stimulus-response associations described above, and the outputs are combined to produce each action. However, in Steinke's implementation, the model-based system's understanding of the task is assumed, and the effort to acquire it is unaccounted for. This makes it an unfit comparator for our other deep learning algorithms, which must learn rewarding behavior from scratch. Therefore our implementation of Steinke's idea encapsulates both the model-based and model-free systems in a single artificial neural network, which outputs estimated values both for individual matching actions (stimulus-response or model-free learning) and for matching based on specific criteria (Steinke's model-based learning). In this way we stay true to Steinke's parallel-system idea, but require the model-based system to be learned too.
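To make the network-based setup concrete, here is a deliberately small sketch: a one-hot encoding of the card and a single linear layer trained with the π-scaled error of equation (6). The actual architectures and hyperparameters used in our experiments are not reproduced here; everything below is illustrative only. A conventional deep RL baseline would simply replace pi[a] with 1 in the final update line.

```python
import numpy as np

class TinyCardSortNet:
    """Linear value network for the numerical card sorting task: one-hot card
    features in, one value per matching option out, trained with equation (6)."""
    def __init__(self, n_features, n_options, alpha=0.05, tau=0.3, rng=None):
        self.rng = rng or np.random.default_rng()
        self.n_options = n_options
        self.W = np.zeros((n_options, n_features * n_options))
        self.alpha, self.tau = alpha, tau

    def encode(self, card):
        """One-hot encode each feature's value."""
        x = np.zeros(self.W.shape[1])
        for i, v in enumerate(card):
            x[i * self.n_options + v] = 1.0
        return x

    def act_and_learn(self, card, reward_fn):
        x = self.encode(card)
        values = self.W @ x
        pi = np.exp((values - values.max()) / self.tau)
        pi = pi / pi.sum()                             # softmax over match options
        a = self.rng.choice(self.n_options, p=pi)
        r = reward_fn(a)
        delta = r - values[a]                          # reward prediction error
        self.W[a] += self.alpha * pi[a] * delta * x    # pi-scaled error drives the weight update
        return a, r
```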

Code
Learning algorithms and simulations were written in the Python language.

Results in variable Iowa gambling task
Measurements from the simulated gambling task are shown in Fig. 3. The new rule demonstrates much higher lose-shift behavior (i.e. less perseveration) than the conventional reinforcement learning model. Its average reward per step approaches the theoretical limit provided by the UCB algorithm, indicating that it exhibits reversal learning much better than conventional reinforcement learning models. Differences in win-stay behavior between the three algorithms were statistically significant, but less dramatic (see statistical significance test results in supplementary data).
Periodic changes in the task reward structure cause the visible ripples in the cumulative reward curves in Fig. 3. Mean rewards-per-step for the theoretical limit, the new rule, TD learning, and the gradient bandit algorithm were 0.56 (standard deviation 0.01), 0.53 (0.02), 0.47 (0.03), and 0.04 (0.02) respectively. Win-stay percentages were 75.3 (1.2), 77.3 (1.4), 78.0 (1.2), and 65.3 (1.2) respectively. Lose-shift percentages were 29.7 (1.0), 17.0 (1.9), 6.1 (0.9), and 0.2 (0.5) respectively. All differences between models were statistically significant at the p < 0.01 level, by Welch's t-test (see supplementary data for complete statistical significance test results).

Fig 2. Illustration of the simulated tasks used in our experiments. Left: the gambling task presents the agent with several choices, each with a different, hidden reward/punishment probability that the agent must discover through trial and error. Reward probabilities change periodically, forcing reversals. Right: the card sorting task presents the agent with a card that must be matched to one of several others. The correct matching criteria periodically change, forcing reversals. Here we use a numerical version of the task, in which each feature is represented by a number indicating which match that feature suggests. This numerical representation allows computational learning algorithms to be applied easily, and also allows us to increase the difficulty (number of features) arbitrarily.

Results in card sorting task
The average reward per step for the card sorting task is illustrated in Fig. 4. The new rule outperforms the parallel model-based/model-free learning system, and significantly outperforms the conventional deep reinforcement learning network. Thus, the new rule effectively models adaptation to changing environmental reward and punishment contingencies in an artificial neural network.
Mean reward-per-step for the new rule, TD learning, and the parallel system were 0.45 (standard deviation 0.05), 0.04 (0.03), and 0.35 (0.04) respectively. All differences were statistically significant at the p < 0.01 level by Welch's t-test.

Card sorting performance compared to humans
To validate our new rule and put all results in the context of human performance, we compare the models to human performance in the card sorting task. Kim et al. (Kim et al., 2011) report card sorting performance for 21 healthy humans over 128 card matches, so we compare the performance of our new rule, the parallel model-based/model-free system of Steinke et al. (Steinke et al., 2020), and conventional TD learning to this human benchmark. For this comparison the models do not use artificial neural networks: we implement the parallel model-based/model-free algorithm of Steinke et al. as in their paper. As in Steinke et al., we implement conventional TD learning and our new rule at the level of card-sorting strategy: at each step, each model selects from the three matching criteria to use (color, shape, or number).
Experiments consisted of each model performing 128 card-matching decisions, and the experiments were repeated 20 times. TD learning committed an average of 96.8 (standard deviation 3.6) errors (instances of selecting a card according to the wrong matching criteria) out of 128 matches. The parallel model-based/model-free method committed 36.8 (5.6) total errors, and the new rule committed 31.5 (6.2) total errors, compared to 16.76 (16.71) for humans (Kim et al., 2011). All differences were statistically significant at the p < 0.01 level by Welch's t-test. These results are illustrated in Fig. 5: the new rule comes closest to human performance in terms of total errors.
It is also common to measure perseverative errors specifically (errors in which a card is selected according to the previous matching criteria after a change). TD learning committed an average of 32.1 (standard deviation 4.4) perseverative errors, the parallel model-based/model-free method committed 13.2 (2.4) perseverative errors, and the new rule committed 21.1 (3.2) perseverative errors, compared to 11.19 (11.1) for humans (Kim et al., 2011). All differences were statistically significant at the p < 0.01 level by Welch's t-test. Though the new rule commits fewer errors in total, the parallel model-based/model-free system achieves fewer perseverative errors, possibly because it is more complex, with more free parameters to optimize (6 free parameters, compared to 2 for the new rule).

The new rule models both healthy and impaired reversal learning
Figs. 3 and 4 show that the new rule models effective adaptation to changing reward and punishment contingencies, but at first glance it appears that conventional reinforcement learning is still the best model of the perseveration that comes with impaired reversal learning. However, the new rule can produce this behavior as well. It can be interpreted as a special case of the conventional reinforcement learning rule: if separate temperature parameters are used in evaluating equation (1) (action selection) and equation (6) (value updates), the latter temperature parameter can control the degree to which the new rule behaves like a conventional rule. This is because for very large τ, π(s,a) becomes a constant that can be absorbed into the learning rate α, making equation (6) identical to equation (3) for large τ. This effect is illustrated in Fig. 6.
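The large-τ limit can be written out explicitly (in our notation):

```latex
\lim_{\tau \to \infty} \pi(s,a)
   = \lim_{\tau \to \infty}
     \frac{e^{V(s,a)/\tau}}{\sum_{a' \in A} e^{V(s,a')/\tau}}
   = \frac{1}{|A|},
\qquad \text{so equation (6) becomes} \qquad
V(s,a) \leftarrow V(s,a) + \frac{\alpha}{|A|}\,\delta ,
```

which is equation (3) with an effective learning rate of α/|A|.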

A paradox-of-choice effect
Healthy humans have an impressive capacity for adaptation: we can change our decision making strategy very quickly after a punishment is received. But this ability to effortlessly adapt to changing reward and punishment affordances degrades as the complexity of the decision making task increases. Psychologist Barry Schwartz calls this "the paradox of choice" (Schwartz & Kliban, 2014): as the number of choices increases, our ability to select a satisfying option decreases and our preferences become weaker (Chernev, 2003).
We can observe a similar effect in our card sorting task, illustrated in Fig. 7. The new rule enjoys a significant advantage over conventional TD learning in the standard card sorting task (cards have 3 features, and there are 4 options for matching). But the rule suffers a disadvantage when the decision making task becomes more complex. Differences at each number of features illustrated in Fig. 7 are statistically significant at the p < 0.01 level by Welch's t-test.

Discussion
The literature contains several disparate strategies for, and models of, adaptation to changing reward and punishment contingencies. Some of these are unified in the proposed learning rule. For example, Worthy et al. found that quick adaptation after punishment in the card sorting task is partly explained by reinforcement learning and partly by a win-stay/lose-shift strategy (Worthy et al., 2013; Worthy & Maddox, 2014). The new rule retains the basic mechanism of reinforcement learning, but produces greater lose-shift behavior, and so in a way captures both ideas. Similarly, the behaviors that have been added to conventional reinforcement learning through the addition of perseveration, "stickiness", or negative learning rate parameters (Metha et al., 2020; Wiehler et al., 2021; Zhukovsky et al., 2019) are to some degree created by the new rule without the addition of such parameters. Furthermore, Fig. 6 illustrates that by varying the temperature parameter, the new rule can account for a spectrum of behavior ranging from the quick adaptation of healthy reversal learning, to the perseveration of pathological reversal learning. The new rule depends only on the reward/punishment r and probability π for each individual experience; it can operate continuously through changes in environmental rewards or punishments, or both.
The form of the new rule is closely related to that of both TD learning and the gradient bandit algorithm. The gradient bandit algorithm uses a learning rule derived by computing the mathematical gradient of the expected reward with respect to each action preference (see Sutton & Barto for details (Sutton & Barto, 2018)). Interestingly, the learning rule thus derived involves a scale factor of (1 − π). This is in contrast to, and produces the opposite effect of, the scale factor of π which we have arrived at by generalizing a neuronal learning rule. By using a scale factor of (1 − π), the gradient bandit algorithm slows its learning as it becomes more confident in an action, "zeroing in" on optimal behavior. This is valuable in a static environment. But our results suggest that in a highly dynamic environment, the opposite effect is desirable: our scale factor of π results in bigger updates when unexpected punishments are introduced, allowing the agent to adapt more quickly.
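For reference, the gradient bandit preference update given by Sutton & Barto takes the following form for the selected action a and every other action a′ (with r̄ the running average reward):

```latex
H(a)  \leftarrow H(a)  + \alpha \,(r - \bar{r})\,\bigl(1 - \pi(a)\bigr), \qquad
H(a') \leftarrow H(a') - \alpha \,(r - \bar{r})\,\pi(a') ,
```

so its step for the chosen action shrinks as π(a) grows, whereas the update in equation (6) grows with π(s,a).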
Healthy learning allows quick response to punishment in simple decision making tasks. But for more complex decision making tasks, human decision making becomes less effective. Decision making time increases with the number of options in a relationship known as Hick's law (Hick, 1952), and selections made from large assortments can lead to weaker preferences (Chernev, 2003). Fig. 7 illustrates a similar effect, in which the new rule adapts to punishment more effectively than a conventional reinforcement learning rule when the number of options is small, but less effectively when the number of options is large. In addition, the new rule's strong reactions to unexpected punishments could increase response variance and reduce performance under stable, static conditions. Thus the rule may account for both some advantages and trade-offs of biological learning. These particular trade-offs likely work in favor of biological agents in the natural world, who rarely select between many attractive options, but must continuously adapt to simple environmental changes and punishments.
The new rule is inspired by a neuron-level rule that explains how neurons may learn in an energy-efficient manner.Here we have drawn a link between the neuron-level and the higher-level reinforcement learning situation: this is an analogy across a wide gulf of scale.However it is interesting that the principles in that neuron-level rule seem to generalize to the reinforcement learning situation, giving a learning rule that creates more realistic adaptation behavior than other reinforcement learning models without adding complexity.Since it produces realistic effects in both neuronal and reinforcement learning settings, we wonder if the rule may capture a basic principle of learning that applies at all levels of life: from neurons, to brains, to societies.
A major limitation of this work is that the experiments are almost entirely done in simulation, with comparison to human data only at the level of averages. Comparison of the new rule's decisions with trial-by-trial, decision-by-decision data from humans and animals is the next step in validating the new rule as a model of biological learning and choice. While such trial-by-trial data was not available for the present study, future work must perform this level of detailed analysis to discover how well and under what conditions the new rule predicts biological decision making. Until then, this work remains theoretical.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig 1 .
Fig 1. Our new rule is a generalization of the neuron-level learning rule of Luczak et al., which modifies the connection weight between two neurons after the presynaptic neuron's activation has caused the postsynaptic neuron to activate. That rule effects a synaptic weight change proportional to the error in predicted postsynaptic activation (error in predicted effect), scaled by the presynaptic activation (the cause). An analogous reinforcement learning rule could scale the reward prediction error (error in predicted effects) by the action probability π (the cause).

Fig 3 .
Fig 3. Performance comparison in the variable Iowa Gambling Task, for four learning rules: our new rule, conventional TD learning, the gradient bandit algorithm, and a theoretical upper limit on performance (see methods for details). a) Cumulative reward over repeated choices in the task. b) Average reward-per-step in the task. c) Observed probability of lose-shift behavior. d) Observed probability of win-stay behavior. Conventional TD learning exhibits perseverative behavior (low lose-shift and lower overall reward) indicative of impaired reversal learning. The new rule's performance is closer to optimal. Shaded regions and error bars show 95% confidence intervals for the mean over 20 repetitions.

Fig 4 .
Fig 4. Performance comparison in the Card Sorting Task, for three learning models based on artificial neural networks: a network incorporating our new rule, a conventional TD learning-based neural network, and a system involving parallel model-based and model-free learning (similar to the system proposed by Steinke et al.). The new rule is capable of better performance than the parallel model-based/model-free approach in the neural networks, and the inadequacy of a standard (temporal-difference) RL rule is clearly seen. Shaded regions and error bars show 95% confidence intervals for the mean.

Fig 5 .
Fig 5. Number of total errors (a) and perseverative errors (b) over 128 card sorting matches for three learning models, and human performance reported by Kim et al. (Kim et al., 2011). Of the three models, our new rule comes closest to human performance in terms of total errors, and has fewer free parameters (2, compared to 6 for the parallel model-based/model-free system). Bar height shows the mean, and error bars show standard deviation.

Fig 6 .
Fig 6. Average reward per action achieved by the new learning rule in the gambling task, using various values of τ in evaluating equation (6). For large τ the new rule becomes similar to conventional reinforcement learning, with its greater perseveration and impaired reversal learning. Thus the new rule is capable of expressing a range of reversal learning behavior, from healthy to impaired. Shaded regions show 95% confidence intervals of the mean.

Fig 7 .
Fig 7. Reward per-step obtained as the card sorting task becomes more difficult. For each number of card features shown, the number of matching options was 1 greater. Thus, the standard card sorting task is represented at the left (3 features, 4 options) and the (hypothetical) card sorting tasks become increasingly complex as we move right. The new rule has a significant reversal-learning advantage over the conventional reinforcement learning model for simple tasks, but a disadvantage in complex tasks. This is similar to the "paradox of choice" effect observed in humans. Shaded regions show 95% confidence intervals for the mean.