Test-retest reliability of canonical reinforcement learning models

Reinforcement learning (RL) paradigms are commonly used in cognitive science research on human learning, often in combination with computational models that estimate individual differences in learning parameters. Recently, it has been proposed that such parameter estimates can be used to better understand psychiatric conditions (Montague, Dolan, Friston, & Dayan, 2012). For that purpose, however, the test-retest reliability of these paradigms and computational models must first be established. The present study seeks to close this gap by investigating the test-retest reliability of standard RL models in the context of two canonical paradigms: a probabilistic RL task with gain and loss feedback and a reversal learning task (Cools, Clark, Owen, & Robbins, 2002; Frank, Seeberger, & O'Reilly, 2004). Test results were obtained from n=150 participants for each task via the online testing platform Amazon Mechanical Turk, with a between-test interval of five weeks. Several standard versions of the Rescorla-Wagner model were fitted to the choice data in R, and test-retest reliability was assessed for both behavioral measures and model parameter estimates.


Introduction
Psychology is experiencing a replicability crisis. One potential cause may be a lack of test-retest reliability of canonical test methods (Leppink & Pérez-Fuster, 2017). Cognitive science and cognitive neuroscience increasingly use reinforcement learning tasks, together with computational modelling of choices, to infer underlying patterns of learning. Concerns have been raised about the robustness of such modelling results, with test-retest reliability problems shown in relation to oversimplified or overparameterized computational models (Collins & Frank, 2012), misalignment of the model and the experimental design (Spektor & Kellen, 2018), and assumptions about dynamic versus fixed parameters (Nassar & Gold, 2013). Some of these problems could be prevented by first establishing model identifiability and recovery (Palminteri, Wyart, & Koechlin, 2017). However, the participants or the task design may also affect test-retest reliability (e.g. due to strong path dependency in dynamic tasks).
No prior research has reported an empirical test of the reliability of parameter estimates across different sets of computational models. The present study seeks to close this gap by testing the test-retest reliability of two canonical RL paradigms: a probabilistic RL task with gain and loss feedback (Pessiglione, Seymour, Flandin, Dolan, & Frith, 2006) and a reversal learning task (Cools et al., 2002).

Participants
Participants located in the United States completed each task via the online testing platform Amazon Mechanical Turk with a between-test interval of five weeks. Participants were allowed to take part in each task once, and all participants included in the first session (T1) were invited to retake the task five weeks later (T2). The probabilistic gain/loss (PGL) task was completed by n=142 participants at T1 and n=93 at T2. The reversal learning (RVL) task was completed by n=154 participants at T1 and n=64 at T2. Behavioral analysis and computational modelling included only 'returners', i.e. participants whose performance met inclusion criteria at both T1 and T2: n=69 (m/f: 44/25, age=35(11)) in the PGL task and n=47 (m/f: 23/23, age=39(12)) in the RVL task. Exclusion criteria included failing to provide a valid MTurk ID, timing out on >20% of trials, and post-task comments that indicated a misunderstanding of the task. Participants were also excluded when overall accuracy fell below 50% (PGL task) or 55% (RVL task), or when they chose stimuli on the same side >20 (PGL) or >30 (RVL) times in a row.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Tasks
Two canonical decision-making tasks were tested: a probabilistic RL task with gain and loss feedback (PGL; Pessiglione et al., 2006) and a reversal learning task (RVL; Schlagenhauf et al., 2014; Cools et al., 2002). All aspects of task design were kept as in the cited studies, except for slightly adapted stimuli: choice stimuli were geometric shapes in the RVL task and images of everyday objects in the PGL task. Feedback stimuli were an image of a $1 bill with the headline "Gain", an image of a crossed-out $1 bill with the headline "Loss", and an image of a neutral grey box with the headline "Neutral".

Behavioral analysis
Three behavioral measures were analyzed: accuracy, win stay, and lose shift. Accuracy is defined as the number of trials on which the stimulus with the higher reward probability was chosen, divided by the total number of trials. Win stay is the number of repeated choices on trials following positive feedback, divided by the number of trials following positive feedback. Lose shift is the number of switched choices on trials following negative feedback, divided by the number of trials following negative feedback. Timed-out trials and trials with 50-50 reward probability (which occurred only in the RVL task) were excluded. Test-retest reliability of each measure was assessed across all returners between T1 and T2 using Pearson's correlation and the intraclass correlation coefficient, ICC(3,k). ICC(3,k) scores were interpreted following Koo and Li (2016), with r < .5 indicating 'poor', .5 < r < .75 'moderate', .75 < r < .9 'good', and r > .9 'excellent' reliability.
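The three measures can be illustrated with a minimal Python sketch. The function name, the argument coding (a boolean flag for positive feedback, None for timed-out and 50-50 trials), and the choice to pair each valid trial with the next valid trial after exclusions are assumptions made for illustration; they are not taken from the study's analysis code.

```python
def behavioral_measures(choices, positive, better):
    """Return (accuracy, win_stay, lose_shift) for one participant.

    choices  : chosen option per trial (e.g. 0 or 1); None marks a timed-out trial
    positive : True if the trial's feedback was positive, False if negative
    better   : option with the higher reward probability; None marks a 50-50 trial
    """
    # Exclude timed-out trials and trials with 50-50 reward probability.
    valid = [(c, p, b) for c, p, b in zip(choices, positive, better)
             if c is not None and b is not None]
    accuracy = sum(c == b for c, _, b in valid) / len(valid)

    wins = stays = losses = shifts = 0
    # Assumption: each valid trial is compared with the next valid trial.
    for (c_prev, p_prev, _), (c_next, _, _) in zip(valid, valid[1:]):
        if p_prev:
            wins += 1
            stays += (c_next == c_prev)
        else:
            losses += 1
            shifts += (c_next != c_prev)

    win_stay = stays / wins if wins else float("nan")
    lose_shift = shifts / losses if losses else float("nan")
    return accuracy, win_stay, lose_shift
```

For example, with choices [0, 0, 1, None, 0, 0], feedback [True, False, True, False, False, True], and option 0 always being the better option, the timed-out fourth trial is dropped and the function returns an accuracy of 0.8, a win stay rate of 0.5, and a lose shift rate of 0.5.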

Computational modelling
Reinforcement learning algorithms were fitted to participants' choice behavior to infer underlying parameter values (Sutton & Barto, 1998). Specifically, different adaptations of the Rescorla-Wagner model were used (Rescorla & Wagner, 1972). In this model, choices result from a trial-by-trial calculation of the anticipated outcome of each choice, which is updated by the prediction error weighted by the learning rate.
The prediction error constitutes the trial-by-trial mismatch between an anticipated outcome and the observed outcome.
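In its standard single-learning-rate form, the Rescorla-Wagner update can be written as below. The notation is assumed for illustration rather than taken from the original: Q_t(c) is the anticipated outcome of the chosen option c on trial t, r_t is the observed outcome, \delta_t is the prediction error, and \alpha is the learning rate.

```latex
\begin{align}
\delta_t   &= r_t - Q_t(c) \\
Q_{t+1}(c) &= Q_t(c) + \alpha \, \delta_t
\end{align}
```

With \alpha close to 0 the anticipated outcome changes slowly, whereas with \alpha close to 1 it tracks the most recent outcome almost exactly.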
A first modification used two learning rates to differentiate between learning from positive and negative prediction errors. A second modification added a parameter that weighs the extent to which participants use feedback to infer value updates for the unchosen stimulus. Three variations of this 'double-updating' parameter were tested, each with one and with two learning rates: first, the parameter was fixed at 0 to model the absence of such inference; second, it was fixed at 1 to model full updating, assuming anticorrelated reward for the two stimuli; and third, it was a free parameter between 0 and 1 for individually weighted, trial-by-trial updating of the unchosen stimulus. The third modification added a free parameter γ moderating the decay of the learning rate(s) over the course of the task, tested with one and with two learning rates. All models were fitted in the programming language R. In total, 8 models (six double-updating variants and two decay variants) were fitted to choices in each task for T1 and T2, resulting in 32 model fittings. A softmax function was used to generate a trial-by-trial probability of the observed choice behavior, given the modelled value estimates and accounting for decision noise in the free parameter θ.
Free parameters were initialized at random values for each participant: between 0 and 1 for the learning rate(s), the double-updating parameter, and γ, and between 0 and 10 for θ. During fitting, parameters were constrained to these boundaries, except for the decay parameter, which was constrained to 0 < γ < 4. All models were fitted using a general-purpose optimization algorithm based on the Nelder-Mead method (Nelder & Mead, 1965). Each model was fitted to each participant from 20 random sets of initial parameter values to avoid getting stuck in local minima. The best-fitting parameter estimates, as indicated by the lowest AIC value, were stored for each participant.
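To make the model family concrete, the sketch below computes the negative log-likelihood of a choice sequence for a two-option model with separate learning rates and a double-updating weight. It is a minimal illustration in Python rather than the R code used in the study, and several details are assumptions: the parameter names (alpha_pos, alpha_neg, kappa), the initial values of 0, the use of θ as a multiplier on the value difference in the softmax, and the form of the double update (the unchosen option is treated as if it had received the opposite outcome). The decay parameter γ is omitted because its functional form is not specified here.

```python
import math


def negative_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of one participant's choices (two options).

    params  : (alpha_pos, alpha_neg, kappa, theta); names are illustrative
    choices : chosen option (0 or 1) on each trial
    rewards : observed outcome on each trial (e.g. +1, 0, -1)
    """
    alpha_pos, alpha_neg, kappa, theta = params
    q = [0.0, 0.0]  # anticipated outcome of each option (assumed start at 0)
    nll = 0.0
    for c, r in zip(choices, rewards):
        u = 1 - c  # index of the unchosen option
        # Two-option softmax: probability of the observed choice.
        p_choice = 1.0 / (1.0 + math.exp(-theta * (q[c] - q[u])))
        nll -= math.log(max(p_choice, 1e-12))
        # Prediction error for the chosen option; its sign selects the learning rate.
        delta = r - q[c]
        q[c] += (alpha_pos if delta >= 0 else alpha_neg) * delta
        # Assumed double update: the unchosen option is updated toward the
        # opposite outcome, weighted by kappa (0 = no update, 1 = full update).
        delta_u = -r - q[u]
        q[u] += kappa * (alpha_pos if delta_u >= 0 else alpha_neg) * delta_u
    return nll
```

Under these assumptions, setting alpha_pos equal to alpha_neg gives the single-learning-rate variants, and fixing kappa at 0 or 1 gives the variants without double updating and with full double updating.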
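The multi-start fitting procedure can be sketched as follows. This is a simplified, standard-library Python illustration and not the R routine used in the study: the Nelder-Mead variant is minimal (inside contraction only), bounds are enforced by returning infinity outside them, and starting values are drawn uniformly within the bounds. The function names and defaults are hypothetical.

```python
import math
import random


def nelder_mead(f, x0, step=0.1, max_iter=500, tol=1e-8):
    """Minimal Nelder-Mead simplex search; returns (best_point, best_value)."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x0[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    values = [f(x) for x in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        worst = simplex[-1]
        centroid = [sum(x[j] for x in simplex[:-1]) / n for j in range(n)]
        reflect = [2.0 * centroid[j] - worst[j] for j in range(n)]
        f_reflect = f(reflect)
        if f_reflect < values[0]:
            # Reflection is the new best point: try expanding further.
            expand = [3.0 * centroid[j] - 2.0 * worst[j] for j in range(n)]
            f_expand = f(expand)
            if f_expand < f_reflect:
                simplex[-1], values[-1] = expand, f_expand
            else:
                simplex[-1], values[-1] = reflect, f_reflect
        elif f_reflect < values[-2]:
            simplex[-1], values[-1] = reflect, f_reflect
        else:
            # Contract the worst point toward the centroid.
            contract = [0.5 * (centroid[j] + worst[j]) for j in range(n)]
            f_contract = f(contract)
            if f_contract < values[-1]:
                simplex[-1], values[-1] = contract, f_contract
            else:
                # Shrink all points toward the best point.
                best = simplex[0]
                simplex = [best] + [
                    [0.5 * (best[j] + x[j]) for j in range(n)] for x in simplex[1:]
                ]
                values = [values[0]] + [f(x) for x in simplex[1:]]
    i_best = min(range(n + 1), key=lambda i: values[i])
    return simplex[i_best], values[i_best]


def fit_multistart(nll, bounds, n_starts=20, seed=0):
    """Fit from several random starts; return (params, nll, AIC) of the best run."""
    rng = random.Random(seed)

    def bounded(x):
        # Points outside the open parameter bounds are rejected.
        inside = all(lo < v < hi for v, (lo, hi) in zip(x, bounds))
        return nll(x) if inside else math.inf

    best_x, best_nll = None, math.inf
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        x, value = nelder_mead(bounded, x0)
        if value < best_nll:
            best_x, best_nll = x, value
    aic = 2 * len(bounds) + 2 * best_nll  # AIC = 2k + 2 * negative log-likelihood
    return best_x, best_nll, aic
```

For a given model, the number of free parameters is fixed, so the start with the lowest AIC is also the start with the lowest negative log-likelihood; AIC becomes informative when models with different numbers of parameters are compared.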

Computational modelling results
In the PGL task, best model fit during T1 and T2 was achieved by the model with two learning rates and no double updating (T1: AIC = 139.98, T2: AIC = 141.32).
In the RVL task, the best model fit at T1 and T2 was achieved by the model with two learning rates and individually weighted double updating (T1: AIC = 211.23, T2: AIC = 212.51). Adding a decay parameter did not improve model fit in either task. In the PGL task, no parameter estimated by the best-fitting model was significantly correlated between T1 and T2, although a trend emerged for θ, r(69) = .23, p = .057. The ICC for θ was significant but 'poor', ICC(3,k) = .37, p = .029. No other model yielded more than one significantly correlated parameter estimate between T1 and T2.
In the RVL task, estimates of the negative learning rate from the best-fitting model were correlated between T1 and T2, r(47) = .37, p = .01, mirrored by a 'moderate' ICC, ICC(3,k) = .54, p = .005. In all other models with two learning rates, estimates of the negative learning rate were significantly correlated between T1 and T2 as well. No model yielded more than two significantly correlated parameter estimates between T1 and T2. In both tasks, and at both T1 and T2, the learning rates for positive and negative prediction errors were correlated with win stay and lose shift behavior, respectively (with two exceptions; Table 1).
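The ICC(3,k) values reported above can be understood through a two-way ANOVA decomposition. The sketch below is a minimal Python illustration, assuming the Shrout and Fleiss definition of ICC(3,k) (two-way mixed effects, consistency, average of k measurements) and the Koo and Li (2016) cut-offs; the function names are hypothetical and the R implementation used in the study may differ in detail.

```python
def icc_3k(scores):
    """ICC(3,k) for a table with one row per participant and one column per session."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares for participants (rows), sessions (columns), and total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def reliability_label(icc):
    """Verbal label for an ICC value, following the Koo and Li (2016) cut-offs."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"
```

Because ICC(3,k) measures consistency rather than absolute agreement, a constant shift between sessions does not lower it: scores of [[1, 2], [2, 3], [3, 4]] yield an ICC of 1. Applied to the values above, reliability_label(0.37) gives 'poor' and reliability_label(0.54) gives 'moderate'.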

Discussion
Most behavioral measures, such as win stay and lose shift, showed substantial individual differences (Fig. 1cd) and moderate to good test-retest reliability in both tasks over a time span of five weeks. This suggests that both tasks capture robust individual differences in learning. Furthermore, the correlation between win stay and lose shift behavior and the learning rates for positive and negative prediction errors, respectively, suggests that these RL models capture crucial behavioral phenotypes. However, in most models, including the best-fitting model, only one parameter was correlated between T1 and T2. For the PGL task, no specific pattern emerged, whereas for the RVL task, the negative learning rate appears to be a crucial and robust factor in determining individual differences.
Our results suggest that more work is needed to ensure reliable parameter estimates from reinforcement learning models. First, considering experimental paradigms, one advantage of RVL over PGL may be that it requires more steady learning, potentially leading to a better fit of learning models. Although more dynamic learning tasks introduce more variability in behavior, it is possible that in this case they result in more robust parameter estimates. Second, we have not explored all possible RL models. Future work will compare a larger model space, including Bayesian models and additional parameters (e.g. 'stickiness'). Lastly, work is planned to compare model performance between fitting procedures (log likelihood vs. hierarchical Bayesian). For instance, Bayesian approaches implemented in tools such as Stan produce parameter estimates for each participant as probability distributions instead of point estimates, which helps mitigate parameter-identifiability problems.

Conclusion
Test-retest reliability of research methods is critical for generating robust findings and making inferences about individual differences. We found that behavioral measures of canonical reinforcement learning paradigms show moderate to good reliability between test sessions with a five-week interval. However, the parameters estimated through standard computational RL models did not (yet) show such robust results. Our results urge caution when interpreting estimated parameter values as individual differences in latent processes underlying learning. Further work is needed to investigate ways of improving the test-retest reliability of parameter estimates.