Enhancing the Psychometric Properties of the Iowa Gambling Task Using Full Generative Modeling

Poor psychometrics, particularly low test-retest reliability, pose a major challenge for using behavioral tasks in individual differences research. Here, we demonstrate that full generative modeling of the Iowa Gambling Task (IGT) substantially improves test-retest reliability and may also enhance the IGT's validity for characterizing internalizing pathology, compared to the traditional analytic approach. IGT data (n = 50) were collected across two sessions, one month apart. Our full generative model incorporated (1) the Outcome Representation Learning (ORL) computational model at the person level and (2) a group-level model that explicitly modeled test-retest reliability, along with other group-level effects. Compared to the traditional 'summary score' (proportion good decks selected), the ORL model provides a theoretically rich set of performance metrics (Reward Learning Rate (A+), Punishment Learning Rate (A−), Win Frequency Sensitivity (βf), Perseveration Tendency (βp), Memory Decay (K)), capturing distinct psychological processes. While test-retest reliability for the traditional summary score was only moderate (r = .37, BCa 95% CI [.04, .63]), test-retest reliabilities for ORL performance metrics produced by the full generative model were substantially improved, with test-retest correlations ranging between r = .64–.82. Further, while summary scores showed no substantial associations with internalizing symptoms, ORL parameters were significantly associated with internalizing symptoms. Specifically, Punishment Learning Rate was associated with higher self-reported depression and Perseveration Tendency was associated with lower self-reported anhedonia. Generative modeling offers promise for advancing individual differences research using the IGT, and behavioral tasks more generally, by enhancing task psychometrics.


Supplemental Background
Studies using computational modeling with the IGT

We are only aware of two studies that have used reinforcement learning models with the IGT to examine internalizing pathology. Byrne and colleagues (2016) found depression was positively associated with greater loss aversion in an undergraduate sample. Alacreu-Crespo and colleagues (2020) found that depressed patients had lower perseveration (choice consistency) than healthy controls, and that depressed patients with a previous suicide attempt had a higher learning rate (higher learning rate was also associated with worse overall performance) and lower loss aversion compared to healthy controls.

Model 4 Specification
Model 4 assumed that person-level parameters across sessions followed from group-level multivariate normal distributions, where each separate parameter had its own multivariate normal distribution in which the priors for locations (i.e., means) and scales (i.e., standard deviations) were assigned normal distributions. Here, for conceptual clarity, we present the centered parameterization of Model 4 parameters (and below, in the subsection labeled 'Group-level Re-parameterizations to Increase Efficiency', we present the mathematically equivalent non-centered parameterization, which was used to make MCMC sampling more efficient). Using Punishment Learning Rate as an example, formal specification of the bounded parameters followed the form:

(A−′_{i,1}, A−′_{i,2}) ∼ MVNormal((μ_{A−,1}, μ_{A−,2}), Σ_{A−})
A−_{i,t} = Φ(A−′_{i,t})

where μ_{A−,t} and σ_{A−,t} are the location and scale parameters for the group-level distributions for each session t, A−′_{·,t} is a vector of individual-level parameters for session t on the unconstrained space, and A−_{·,t} is a vector of the same individual-level parameters after they have been transformed to the appropriate bounds. This parameterization both ensures that the hyper-prior distribution over the subject-level parameters is (near-)uniform between the parameter bounds and allows us to use multivariate normal distributions to model the correlation between bounded parameters. For example, A−_{i,1} and A−_{i,2} are the Punishment Learning Rates for person i at sessions 1 and 2, respectively. Φ(·) is the cumulative distribution function for the standard normal distribution, which is used because the person-level learning rates must fall between 0 and 1. Because Φ(·) transforms from ℝ → (0, 1), it allows us to use the multivariate normal group-level distribution to capture the test-retest correlation despite the multivariate normal distribution itself having support outside of (0, 1).
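The claim that the probit transform yields a (near-)uniform implied prior over the bounded parameters can be checked numerically. The sketch below is our illustration (not the paper's Stan code), assuming group-level location 0 and scale 1; it relies on the probability integral transform, under which Φ applied to a standard normal draw is exactly uniform on (0, 1).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Unconstrained person-level draws, assuming group-level mu = 0, sigma = 1
raw = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Probit transform: Phi maps the real line to (0, 1), giving bounded learning rates
bounded = norm.cdf(raw)

# If the implied prior is uniform on (0, 1), mean ~= 0.5 and variance ~= 1/12
print(round(bounded.mean(), 3), round(bounded.var(), 3))
```

When the group-level location shifts away from 0 or the scale departs from 1, the implied prior is no longer exactly uniform, which is why the text describes it as "(near-)uniform".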
For parameters bounded on (0, U) (e.g., K), we used the same parameterization as above but scaled to the upper bound accordingly:

K_{i,t} = U · Φ(K′_{i,t})
For unbounded parameters on (−∞, ∞) (e.g., βf, βp), we used the same parameterization outlined above except we set the hyper standard deviations (e.g., σ_{βf,t}) to a half-Cauchy(0, 1) prior. Because these parameters are unbounded, they follow directly from a multivariate normal distribution and do not require transformations.
For brevity, we can alternatively write out the probabilistic model with inverse parameter transformations on the left-hand side of the equation. Doing so, the multivariate normal distribution for each parameter was as follows (using Reward Learning Rate as an example):

(Φ⁻¹(A+_{i,1}), Φ⁻¹(A+_{i,2})) ∼ MVNormal((μ_{A+,1}, μ_{A+,2}), Σ_{A+})

Above, Φ⁻¹(·) is the inverse of the cumulative distribution function for the standard normal distribution (i.e., the quantile function), which transforms from (0, 1) → ℝ.
Each parameter's multivariate normal distribution contains a covariance matrix. Again using Punishment Learning Rate as an example, Σ_{A−} is a covariance matrix that captures the correlation between the person-level parameters across sessions. Specifically, Σ_{A−} can be decomposed into the group-level standard deviations at each session (σ_{A−,1} and σ_{A−,2}) and the 2×2 correlation matrix of interest, R_{A−}:

Σ_{A−} = diag(σ_{A−,1}, σ_{A−,2}) · R_{A−} · diag(σ_{A−,1}, σ_{A−,2})

Here, R_{A−} is a correlation matrix with one free parameter (r_{A−}) on the off-diagonal:

R_{A−} = [[1, r_{A−}], [r_{A−}, 1]]

where the free parameter r_{A−} indicates the test-retest correlation of interest, the value that we present throughout the text. Finally, we assume the following LKJ prior on each correlation matrix:

R ∼ LKJcorr(1)

With only a single free parameter, this prior equates to a uniform distribution between −1 and 1 on r_{A−}, meaning that all possible values for the test-retest correlation were assumed to be equally likely a priori.
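The decomposition of the covariance matrix into session standard deviations and a correlation matrix can be illustrated numerically. The sketch below is our illustration with made-up values (not the paper's estimates): it builds the covariance matrix from the two session SDs and a test-retest correlation r, samples correlated unconstrained parameters, and recovers r empirically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical group-level values (for illustration only)
mu = np.array([0.0, 0.2])   # session 1 and session 2 locations
sd = np.array([0.8, 1.0])   # session 1 and session 2 scales
r = 0.7                     # test-retest correlation of interest

R = np.array([[1.0, r], [r, 1.0]])       # 2x2 correlation matrix
Sigma = np.diag(sd) @ R @ np.diag(sd)    # Sigma = diag(sd) * R * diag(sd)

# Person-level unconstrained parameters across the two sessions
raw = rng.multivariate_normal(mu, Sigma, size=50_000)

# The empirical correlation across sessions recovers r
r_hat = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]
print(round(r_hat, 2))
```

Because the correlation is a free parameter of the group-level distribution, its posterior directly quantifies test-retest reliability rather than requiring a separate two-step correlation of point estimates.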

Group-level Re-parameterizations to Increase Efficiency
Here, we present re-parameterizations employed to make the MCMC sampling more efficient (see Stan code available on OSF: https://osf.io/b3kwz/?view_only=fea448b0bcff48a2bc0304e7b4448720). These re-parameterizations are mathematically equivalent to the centered parameterizations presented above and included non-centered parameterizations for the hierarchical (group-level) components of the model, along with Cholesky decompositions to re-parameterize the test-retest correlation matrices, which makes sampling from the multivariate normal distribution more efficient. For individual-level parameters drawn from multivariate normal distributions, which we used to estimate test-retest correlations, we used a Cholesky decomposition to employ non-centered parameterizations. Specifically, a covariance matrix Σ can be decomposed into its Cholesky factor L_Σ, where Σ = L_Σ L_Σᵀ. Then, individual-level parameters drawn from independent, standard normal distributions can be correlated using the Cholesky factor L_Σ, as shown below. The Cholesky factor L_Σ of the covariance matrix is equal to the diagonal matrix of the group-level SDs multiplied by the Cholesky factor L_R of the correlation matrix R:

L_Σ = diag(σ_1, σ_2) · L_R

Therefore, as opposed to sampling directly from a multivariate normal distribution to estimate a correlation matrix R, we can sample the group-level means, the standard deviations, the Cholesky factor of the correlation matrix (L_R), and individual-level standard normal deviates (e.g., z_{i,1} and z_{i,2}) independently, and then reconstruct the correlation matrix afterward as R = L_R L_Rᵀ.
Using Punishment Learning Rate as an example, the non-centered parameterization was as follows:

z_{A−,i,t} ∼ Normal(0, 1)
A−_{i,t} = Φ( μ_{A−,t} + [ diag(σ_{A−,1}, σ_{A−,2}) · L_{R_{A−}} · z_{A−,i} ]_t )
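The equivalence of the two parameterizations can be checked numerically. This sketch (our illustration, with made-up σ and r values) correlates independent standard-normal draws with the Cholesky factor diag(σ)·L_R and verifies that the reconstructed covariance matches Σ = diag(σ)·R·diag(σ).

```python
import numpy as np

rng = np.random.default_rng(2)

sd = np.array([0.8, 1.0])            # group-level SDs (illustrative values)
r = 0.7                              # test-retest correlation (illustrative value)
R = np.array([[1.0, r], [r, 1.0]])

L_R = np.linalg.cholesky(R)          # Cholesky factor of the correlation matrix
L_Sigma = np.diag(sd) @ L_R          # Cholesky factor of the covariance matrix

# Independent standard-normal deviates, one pair per simulated person
z = rng.standard_normal(size=(2, 50_000))

# Non-centered parameterization: correlate z via the Cholesky factor
raw = L_Sigma @ z

# The empirical covariance matches Sigma = diag(sd) R diag(sd)
Sigma_hat = np.cov(raw)
print(np.round(Sigma_hat, 2))
```

Sampling the independent deviates z and the Cholesky factors separately avoids the strong posterior dependencies between group-level scales and person-level parameters that make centered hierarchical models difficult for Hamiltonian Monte Carlo to explore.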

Model validation
Parameter Recovery. The ORL model was developed by Haines and colleagues (2018) for making inferences about individual differences related to clinical phenomena. As part of model development, parameter recovery was conducted for the ORL model and competing computational models, and the ORL was found to have good recovery of both parameter means and of the full posteriors compared to competing models (Haines et al., 2018). We also performed parameter recovery in the current sample. Data were simulated using parameters set to known values, and the ORL model was subsequently fit to the simulated data to "recover" the parameter values. Correlations between known and recovered parameters, or "recovery statistics", represent model performance and the reliability of parameters, with values closer to r = 1 indicating more precise person-level measures. When the ORL model was fit to data from the original IGT (single administration) for 200 simulated subjects, recovery statistics were acceptable across all parameters (A+ r = .81; A− r = .77; K r = .52; βf r = .84; βp r = .95).
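The logic of a recovery study can be sketched with a much simpler stand-in model than the ORL. The toy example below (our illustration, not the paper's procedure) simulates subjects with known learning rates in a two-armed reversal bandit, recovers each learning rate by grid-search maximum likelihood, and computes the recovery statistic as the correlation between generating and recovered values.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(alpha, beta=3.0, n_trials=200):
    """Simulate a two-armed reversal bandit learner with delta-rule updating."""
    q = np.zeros(2)
    choices, rewards = [], []
    for t in range(n_trials):
        # Reward probabilities reverse every 50 trials, aiding identifiability
        p_reward = (0.8, 0.2) if (t // 50) % 2 == 0 else (0.2, 0.8)
        p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice rule
        c = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[c])
        q[c] += alpha * (r - q[c])                     # prediction-error update
        choices.append(c)
        rewards.append(r)
    return choices, rewards

def log_lik(alpha, choices, rewards, beta=3.0):
    """Log-likelihood of the observed choices under a candidate learning rate."""
    q, ll = np.zeros(2), 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q) / np.exp(beta * q).sum()
        ll += np.log(p[c])
        q[c] += alpha * (r - q[c])
    return ll

grid = np.linspace(0.05, 0.95, 19)
true_alphas = rng.uniform(0.1, 0.9, size=40)
recovered = []
for alpha in true_alphas:
    choices, rewards = simulate(alpha)
    lls = [log_lik(a, choices, rewards) for a in grid]
    recovered.append(grid[int(np.argmax(lls))])

# "Recovery statistic": correlation between generating and recovered values
r_recovery = np.corrcoef(true_alphas, recovered)[0, 1]
print(round(r_recovery, 2))
```

The actual recovery study used the full ORL likelihood and Bayesian estimation rather than grid-search maximum likelihood, but the workflow is the same: simulate from known values, re-fit, and correlate.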
Posterior Predictive Checks. To ensure that the models fit the data well, we performed posterior predictive checks, in which a fitted model is used to simulate data and the simulated data are compared to the observed data. Specifically, fitted model parameters were used to simulate data, and the simulated data were plotted against the observed data as a qualitative check of model fit.
Supplemental Figure 1. Posterior Predictive Check for Model 2. Simulated data using Model 2 parameter estimates reproduced the Model 1 observed summary score at the person level for both Session 1 and Session 2. Note that four participants from Session 1 were not present for Session 2, which we indicate in the Session 2 panel with points at 0%. Despite not having Session 2 data on these participants, the model predicts variation between these missing participants based on their Session 1 data.

Supplemental Figure 2. Posterior Predictive Check for Model 3. Simulated data using Model 3 estimates of the five free ORL parameters reproduced observed group-averaged, trial-level choice behavior for both Session 1 and Session 2. Note: a random selection strategy would produce roughly a 25% group-averaged selection rate (marked by the horizontal red line) for each deck on any given trial. 50% and 95% highest density intervals are demarcated by dark and light orange areas, respectively.

Performance across testing sessions
Using both 'summary score models', performance means were similar across sessions, while the standard deviation in performance was greater at session 2. Supplemental Figure 3A shows performance across the two sessions for the observed 'percent good deck selected' (Model 1). For Model 1, there was no significant difference in mean performance between sessions (session 1 M = 55.78; session 2 M = 58.63; Welch's t(73) = -1.19, 95% CI [-8.49, 2.15]), but the standard deviation in performance was significantly greater at session 2 (session 1 SD = 9.54; session 2 SD = 15.64; Levene's F(1, 94) = 7.74, p = .007). For the Model 2 metric (the estimated probability of a good deck selection), the HDI plots in Supplemental Figure 3B show the difference in performance between the session 1 and session 2 posterior distributions for the mean (μ) and standard deviation (σ) of this metric, respectively. For session 2 μ − session 1 μ, the 95% HDI indicates there is small-to-no difference in mean performance across sessions, with a posterior mean of .03 (95% CI [-.02, .08]) for the distribution representing the difference in mean performance between sessions. However, the Model 2 difference in the standard deviation of performance across sessions had a posterior mean of .07 (95% CI [.03, .11]) with a 95% HDI that does not include 0, indicating a larger standard deviation in session 2 performance. See Supplemental Table 4 for Model 3 and 4 parameter estimate posterior means. As shown in Supplemental Figure 4, there was no strong evidence of group-level between-session differences for the Model 3 parameters.

Supplemental Table 9. Post-hoc Correlations between Session 1 Self-Report Data and Session 1 and Session 2 IGT Metrics for Models 3 and 4.
Session 1 correlations (session 1 IGT metrics and session 1 self-report) are the same as those reported in Table 3 of the main text and are replicated here for comparison with session 2 correlations (session 2 IGT metrics and session 1 self-report). Focusing on Model 4, results were fairly consistent with the associations between same-day ORL parameters and self-report, with moderate positive associations between session 2 Punishment Learning Rate and the MASQ General Depressive subscale (r = .36, BCa 95% CI [.09, .56]) and between session 2 Perseveration Tendency and SHAPS Anhedonia (r = .31, BCa 95% CI [.02, .50]). Unlike the session 1 results, session 2 Reward Learning Rate showed positive associations with state-level Negative Affect (r = .29, BCa 95% CI [.01, .60]), with MASQ Anxious Arousal (r = .31, BCa 95% CI [.01, .60]), and with the MASQ General Depressive subscale (r = .39, BCa 95% CI [.10, .60]).

ORL model decomposes task behavior into distinct psychological processes
While the IGT summary score confounds the different psychological mechanisms driving gross IGT performance (Ahn et al., 2016; Almy et al., 2018; Cauffman et al., 2012; Lin et al., 2007), the ORL (Haines et al., 2018) decomposes performance into a theoretically rich set of process-level behavioral metrics, including Reward Learning Rate, Punishment Learning Rate, Win Frequency Sensitivity, Perseveration Tendency, and Memory Decay. Learning Rate, rather than referring to the amount or quality of learning across the whole task, reflects the degree to which an individual "weights" prediction errors for given outcomes on a trial-by-trial basis to update value estimates. Thus, higher Reward and Punishment Learning Rates can be thought of as more volatile trial-level updating in the reward or loss domain, respectively. The optimal learning rate depends on task structure (Bishop & Gagne, 2018; Nussenbaum & Hartley, 2019); thus, higher learning rates are not necessarily conducive to better performance. Indeed, we found better IGT performance was negatively associated with Reward Learning Rate but positively associated with Punishment Learning Rate. Another ORL parameter, Win Frequency Sensitivity, represents the subjective value of reward frequency, rather than magnitude, and is a type of reward sensitivity commonly exhibited by healthy participants on the IGT (Haines et al., 2018; Steingroever et al., 2013). The ORL also provides metrics for Perseveration Tendency, or the consistency of a participant's choices irrespective of outcomes, and Memory Decay, an index of the participant's memory for which choices they selected in the past. Together, these five parameters allow a much more nuanced and fine-grained understanding of individual behavior on the IGT.
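To make these roles concrete, the following minimal sketch (our illustration, not the paper's fitted Stan model) shows how the five parameters enter trial-level updating. It follows the general structure described above but omits the full ORL's fictive updating of unchosen decks and other details (see Haines et al., 2018).

```python
import numpy as np

def orl_style_update(ev, ef, ps, choice, outcome, A_plus, A_minus, K):
    """One simplified trial update illustrating what each ORL parameter governs.

    ev: expected value per deck; ef: expected win frequency per deck;
    ps: perseveration signal per deck. Omits the full model's fictive
    updating of unchosen decks (Haines et al., 2018).
    """
    ev, ef, ps = ev.copy(), ef.copy(), ps.copy()
    A = A_plus if outcome >= 0 else A_minus            # valence-dependent learning rate
    ev[choice] += A * (outcome - ev[choice])           # value prediction-error update
    ef[choice] += A * (np.sign(outcome) - ef[choice])  # win-frequency prediction error
    ps[choice] = 1.0                                   # tag the just-chosen deck
    ps = ps / (1.0 + K)                                # Memory Decay of past-choice signal
    return ev, ef, ps

def choice_probs(ev, ef, ps, beta_f, beta_p):
    """Combine the three signals into deck values; softmax gives choice probabilities."""
    v = ev + beta_f * ef + beta_p * ps
    return np.exp(v - v.max()) / np.exp(v - v.max()).sum()
```

In this scheme, βf scales how strongly win frequency (rather than magnitude) drives choice, βp scales the pull toward recently chosen decks, and K controls how quickly that past-choice signal fades.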

Potential model adaptations to address outstanding issues
One potential model adaptation would examine change in learning parameters across engagement with the IGT within a single session (i.e., learning rate may change from the beginning to the end of the task). Another traditional scoring method for the IGT (not explored in the current paper) examines change in deck selection across early versus later blocks of the game. Future modeling work might compare this alternative traditional scoring approach to a generative model that allows learning parameters to vary across the task.
Future research should also examine learning parameters longitudinally across multiple task administrations.In considering how to model data collected across multiple task administrations, it may prove useful to explicitly model potential carry-over learning effects as participants engage with the IGT multiple times.For example, it is possible that trial-and-error learning may be faster on a second task encounter, which could be modeled by assuming larger mean priors on learning rates for a second testing session.Alternatively, longitudinal estimation of learning parameters may be improved by assuming different priors on reward versus punishment learning rates, based on the parameter estimates from an individual's initial task performance.
Another possibility is that different types of learning (e.g., continuous trial-and-error versus one-shot or category learning) may guide task performance once people learn the "rules" of the task. For example, the IGT has a valence structure such that two choices lead to long-term gains while the other two choices lead to long-term losses. If participants learn this rule, they may develop an abstract representation of the task wherein their goal is to determine which category (good or bad) a deck belongs to, and then make a choice based on the learned category rather than information about expected value or win frequency. If this is the case, test-retest reliability would naturally be low. A potential solution could be to frame the IGT as a category learning task, wherein the different monetary outcomes shown on each trial are used as stimulus features to categorize the deck as either "good" or "bad" (see Turner et al., 2018). Whether a deck is "good" or "bad" could then itself be determined with expected value information, as in the ORL model. If reliance on trial-and-error versus category learning is then treated as a mixture model, it is possible that learning model parameters could themselves be consistent across time, and that changes in behavior across sessions could be due to an increased reliance on (i.e., a higher mixture probability for) category learning. We anticipate that future work would benefit greatly from exploring these possibilities.
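The mixture idea can be sketched in a few lines. Below is our hypothetical illustration (not a fitted model): given the per-trial probability each strategy assigns to the choices a participant actually made, the mixture weight w (reliance on category learning) can be estimated by maximizing the mixture likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)

def mixture_loglik(w, p_rl, p_cat):
    """Log-likelihood of a choice sequence under a w-weighted mixture of a
    trial-and-error (RL) strategy and a category-learning strategy."""
    return np.sum(np.log(w * p_cat + (1.0 - w) * p_rl))

# Hypothetical per-trial probabilities each strategy assigns to the observed
# choices: early trials explained better by RL, later trials by category use.
p_rl = np.concatenate([rng.uniform(0.7, 0.9, 50), rng.uniform(0.2, 0.3, 50)])
p_cat = np.concatenate([rng.uniform(0.2, 0.3, 50), rng.uniform(0.7, 0.9, 50)])

# Grid search for w, the estimated reliance on category learning
grid = np.linspace(0.0, 1.0, 101)
w_hat = grid[np.argmax([mixture_loglik(w, p_rl, p_cat) for w in grid])]
print(round(w_hat, 2))
```

In a full treatment, w itself could be modeled hierarchically and allowed to increase from session 1 to session 2, formalizing the hypothesis that re-test behavior reflects greater reliance on category learning rather than changed learning parameters.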
Another issue (raised above) was that some associations between behavioral estimates and self-report were inconsistent across testing sessions. It is important to reiterate here that two-step correlations between measures are attenuated by any lack of precision in the individual measures (Spearman, 1904). As self-report measures were not incorporated into the current hierarchical model (and thus self-report measurement uncertainty was not modeled), associations between self-report and behavioral measures are attenuated in proportion to this lack of precision, which decreases power and could produce inconsistent associations. Future work should address this issue by incorporating self-report measures of theoretical importance directly into the hierarchical models (e.g., Kildahl et al., 2020; Kvam et al., 2021; Vandekerckhove, 2014), and/or by including in the hierarchical model multiple self-report assessments across time along with modeling the covariance between these measures (as we have done here with the behavioral data). These modeling strategies would likely enhance the reliability of associations between behavior and self-report measures.
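Spearman's (1904) attenuation argument can be stated as a one-line formula: the true-score correlation is the observed correlation divided by the square root of the product of the two reliabilities. The sketch below uses hypothetical numbers for illustration (not values from this study).

```python
# Spearman's (1904) correction for attenuation: imperfect reliability in either
# measure shrinks the observed correlation relative to the true-score correlation.
def disattenuate(r_observed, rel_x, rel_y):
    """Estimate the true-score correlation from an observed correlation and
    the reliabilities of the two measures."""
    return r_observed / (rel_x * rel_y) ** 0.5

# Hypothetical values: observed r = .25, task reliability = .40, survey = .85
print(round(disattenuate(0.25, 0.40, 0.85), 2))  # prints 0.43
```

The example makes the practical point: with a task reliability as low as the traditional summary score's, a modest observed correlation can understate a substantially larger true-score association, which is one motivation for improving task psychometrics or modeling measurement error jointly.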

Supplemental Figure 3. Summary Score Performance Metrics across Sessions. (A) Model 1: Scatterplots showing observed 'percentage good deck selected' across sessions. (B) Model 2: Highest Density Interval (HDI) plots showing between-session differences for the mean (μ) and standard deviation (σ) of the estimated 'probability of a good deck selection'. Horizontal red lines indicate the 95% highest density intervals for estimates within the difference distributions (session 2 distribution − session 1 distribution), and vertical red lines indicate the posterior means for these difference estimates.

Supplemental Table 2. In-Sample Reliabilities for Self-Report Measures.
Supplemental Table 3. Descriptive statistics for self-report measures.

Supplemental Table 5. Associations between ORL Parameters and the Model 1 Summary Score, Excluding Outlier.
In the manuscript, the scatterplots in Figure 4B revealed an outlier on the βp parameter at Session 2. All correlations between the Model 3 and Model 4 ORL parameters and the Model 1 Summary Score are reported here with the outlier removed. Removal of the outlier did not substantially change correlation results for any parameter. CIs represent 95% bias-corrected and accelerated bootstrapped confidence intervals (95% BCa CIs).

Supplemental Table 8. Post-hoc Correlations between Session 1 Self-Report Data and Session 1 and Session 2 IGT Metrics for Models 1 and 2.
Session 1 correlations (session 1 IGT metrics and session 1 self-report) are the same as those reported in Table 2 of the main text and are replicated here for comparison with session 2 correlations (session 2 IGT metrics and session 1 self-report). Interestingly, there was a moderate negative association between the session 1 MASQ General Depressive subscale and the session 2 summary score for Model 2 (r = -.24, BCa 95% CI [-.44, -.01]).