The statistical structures of reinforcement learning with asymmetric value updates

Reinforcement learning (RL) models have been broadly used in modeling the choice behavior of humans and other animals. In standard RL models, the action values are assumed to be updated according to the reward prediction error (RPE), i.e., the difference between the obtained reward and the expected reward. Numerous studies have noted that the magnitude of the update is biased depending on the sign of the RPE. This bias is represented in RL models by differential learning rates for positive and negative RPEs. However, it is not well understood which aspect of behavioral data the estimated differential learning rates reflect. In this study, we investigate how differential learning rates influence the statistical properties of choice behavior (i.e., the relation between past experiences and the current choice) based on theoretical considerations and numerical simulations. We clarify that when the learning rates differ, the impact of a past outcome depends on the subsequent outcomes, in contrast to standard RL models with symmetric value updates. Based on this result, we propose a model-neutral statistical test to validate the hypothesis that value updates are asymmetric. The asymmetry in the value updates induces an autocorrelation of choice (i.e., a tendency to repeat the same choice or to switch choices irrespective of past rewards). Conversely, if an RL model without an intrinsic autocorrelation factor is fitted to data that possess an intrinsic autocorrelation, a statistical bias arises that overestimates the difference in learning rates. We demonstrate that this bias can cause a statistical artifact in RL-model fitting, leading to a ''pseudo-positivity bias'' and a ''pseudo-confirmation bias.''

© 2018 The Author. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Reinforcement learning (RL) models have been broadly used in modeling action selection in various fields such as psychology, neuroscience, and psychiatry (Corrado & Doya, 2007; Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Maia & Frank, 2011; Yechiam, Busemeyer, Stout, & Bechara, 2005). In standard RL models, the action values are assumed to be updated according to the reward prediction error (RPE), i.e., the difference between the actual reward and the expected reward. Numerous studies have noted that the magnitude of an update is biased depending on the sign of the RPE. This bias is represented in RL models by differential learning rates for positive and negative RPEs (Frank, Moustafa, Haughey, Curran, & Hutchison, 2007; Gershman, 2015, 2016). Such asymmetric RL models have been used in various contexts. For example, differential learning rates have been introduced to incorporate the neuroscientific knowledge that positive feedback and negative feedback are processed by separate neural substrates (Frank et al., 2007). However, it is not well understood which aspect of behavioral data the estimated differential learning rates reflect. Can a researcher claim that asymmetric value updates underlie subjects' behaviors by merely observing the choice sequence, without fitting RL models to the data? Most previous studies examining asymmetric value updates have relied solely on the parameter estimates of RL-model fits. The availability of statistical testing without computational model fitting is desirable because computational model-based analysis can suffer from the influence of confounding variables, as we show in this study.
The steady-state, asymptotic properties of asymmetric RL models have been clarified in Cazé and van der Meer (2013). However, parameter estimates obtained from model fitting reflect not only the steady-state but also transient trial-by-trial dynamics of behaviors (cf. Katahira, Yuki, & Okanoya, 2017). We thus especially focus on the transient properties of the behavioral consequences of asymmetric value updates. To achieve this, we perform theoretical analysis and numerical simulations that examine how the differential learning rates influence the impact of past experiences on current choices, following Katahira (2015). Specifically, we compare the RL models and logistic regression models that have been used for analyzing the impact of past events on future choices. Because regression models can explicitly represent how the reward and choice histories influence future choices, investigating the relation between RL models and regression models would be helpful to clarify the transient, statistical structure of asymmetric RL models (Katahira, 2015).
The results indicate the following. In asymmetric RL models, the extent to which past outcomes influence the present choice also depends on the outcomes of intermediate choices, unlike standard (symmetric) RL models. Based on this effect, we propose a statistical method for testing the existence of an asymmetry in value updates without relying on the RL model fit. The asymmetry in the value updates induces the autocorrelation of choice (i.e., the tendency to repeat the same choice or to switch to another choice irrespective of past outcomes). Consequently, if the RL model without intrinsic autocorrelation is fitted to data that possess intrinsic autocorrelation, a statistical bias that overestimates the difference in learning rates will arise. By simulation, we demonstrate that this can cause a statistical artifact leading to ''pseudo-positivity bias'' and ''pseudo-confirmation bias'' as a conclusion of RL-model fitting.
The remainder of this article is organized as follows. First, we introduce asymmetric RL models and a logistic regression model (Section 2). Second, we discuss the property of asymmetric RL models by investigating the relation between the RL models and the logistic regression model (Section 3). Third, we examine the statistical artifact that can arise from the basic properties of asymmetric RL models (Section 4). Fourth, we describe the proposed statistical method for testing the asymmetry in value updates without RL-model fitting (Section 5). Finally, we discuss several implications of our findings (Section 6).

Reinforcement learning models with differential learning rates
Here, we introduce an RL model with differential learning rates for positive and negative RPEs. Hereafter, we refer to this model as the asymmetric RL model. Throughout this paper, we basically consider cases with only two options (i = 1 or 2).¹ The model assigns each option i an action value Q_i(t), where t is the index of the trial. In common settings, the initial action values are set to zero (i.e., Q_1(1) = Q_2(1) = 0). The model updates the action values depending on the outcome of the action. The outcome (reward) at trial t is denoted by R(t). Unless stated otherwise, we consider a binary outcome case, setting R(t) = 1 if a reward is given and R(t) = 0 if no reward is given. In the asymmetric RL model, the action value for the chosen option i is updated as follows:

$$Q_i(t+1) = \begin{cases} Q_i(t) + \alpha_L^{+}\,\delta(t) & \text{if } \delta(t) > 0 \\ Q_i(t) + \alpha_L^{-}\,\delta(t) & \text{if } \delta(t) \le 0, \end{cases}$$

where δ(t) = R(t) − Q_i(t) is the RPE, and α_L^+ and α_L^− are the learning rates that determine how much the model updates the action value depending on the RPE.

¹ Our results might apply to multiple-option cases where there are more than two choices, but we leave the investigation of such cases to future work.
Although it is generally assumed that the applied learning rate depends on the sign of the RPE, in the binary outcome case the sign of the RPE coincides with whether the reward is given (R(t) = 1) or omitted (R(t) = 0). For the unchosen option j (j ≠ i), the action value is updated as follows:

$$Q_j(t+1) = (1 - \alpha_F)\, Q_j(t),$$

where α_F is the forgetting rate. The forgetting rate is typically set to zero (α_F = 0; i.e., the action value of the unchosen option is not updated). However, many studies have reported that non-zero forgetting rates improve the goodness of fit to behavioral data (e.g., Ito & Doya, 2009; Toyama, Katahira, & Ohira, 2017).
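As a concrete illustration, the update rules above can be sketched in a few lines of Python (a minimal sketch; the function and variable names are ours, not from the original article):

```python
import numpy as np

def update_values(Q, choice, reward, alpha_pos, alpha_neg, alpha_f):
    """One trial of the asymmetric value update.

    Q         : array of action values, one entry per option
    choice    : index of the chosen option
    reward    : obtained outcome R(t) (0 or 1 in the binary case)
    alpha_pos : learning rate for positive RPEs (alpha_L^+)
    alpha_neg : learning rate for negative RPEs (alpha_L^-)
    alpha_f   : forgetting rate alpha_F for unchosen options
    """
    Q = Q.astype(float).copy()
    delta = reward - Q[choice]                 # reward prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    Q[choice] += alpha * delta                 # asymmetric update of chosen value
    for j in range(len(Q)):
        if j != choice:
            Q[j] *= 1.0 - alpha_f              # unchosen values decay toward zero
    return Q
```

With `alpha_f = 0` the unchosen value is left untouched, reproducing the common convention noted above.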
Here, we denote the option chosen at trial t by a(t) (= 1 or 2). Based on the set of action values, the model assigns the probability of choosing option 1 using the softmax function:

$$P(a(t) = 1) = \frac{\exp(\beta Q_1(t))}{\exp(\beta Q_1(t)) + \exp(\beta Q_2(t))},$$

where β is the inverse temperature parameter, which determines the sensitivity of the choice probabilities to differences in action values. It has been shown that humans tend to repeat recent choices, a tendency called ''choice perseverance'', ''choice stickiness'', or ''decision inertia''. Such a tendency is often explicitly modeled by a choice-autocorrelation factor that is added to the action value as follows (Akaishi, Umeda, Nagase, & Sakai, 2014; Gershman, Pesaran, & Daw, 2009):

$$P(a(t) = 1) = \frac{\exp\big(\beta Q_1(t) + \phi\, C_1(t)\big)}{\exp\big(\beta Q_1(t) + \phi\, C_1(t)\big) + \exp\big(\beta Q_2(t) + \phi\, C_2(t)\big)},$$

where C_i(t) represents the choice trace, which quantifies how frequently option i was chosen recently. The choice trace weight φ is a parameter that controls the tendency to repeat (when positive) or avoid (when negative) recently chosen options. The choice trace is computed using the following update rule (Akaishi et al., 2014):

$$C_i(t+1) = C_i(t) + \tau\,\big(I(a(t) = i) - C_i(t)\big),$$

where the indicator function I(·) takes a value of 1 if the statement is true and 0 if it is false. The parameter τ is the decay rate of the choice trace. The initial values are set to zero, i.e., C_1(1) = C_2(1) = 0. Many RL models that represent choice perseverance or choice stickiness incorporate only the influence of the immediately preceding trial (e.g., Gillan, Kosinski, Whelan, Phelps, & Daw, 2016; Huys, Moutoussis, & Williams, 2011). Such a convention can be represented by setting τ = 1.
When the choice is binary, the above choice-autocorrelation factor is equivalently implemented with the softmax function

$$P(a(t) = 1) = \frac{1}{1 + \exp\big(-\beta\,(Q_1(t) - Q_2(t)) - \phi\,(C_1(t) - C_2(t))\big)}.$$
We implemented the model with this expression in the following simulations. Examples of the model's behavior are shown in Fig. 1. In Fig. 1A, the update is pessimistic (α_L^+ < α_L^−): the magnitude of the decrease in the action value when no reward is obtained is larger than that of the increase when a reward is obtained (bottom panel). In this case, the model changes its choice probability markedly and switches choices frequently, because the large decrease in the action value after an unrewarded choice lowers the probability of repeating that choice. In contrast, in the optimistic case (Fig. 1B; α_L^+ > α_L^−), the omission of the reward has a smaller effect on the action value, which makes the model likely to repeat the same choice.
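The choice rule with the choice-trace term, and the trace update itself, can be sketched as follows (again a minimal sketch with our own names; setting phi = 0 recovers the plain softmax):

```python
import numpy as np

def choice_prob_option1(Q, C, beta, phi):
    """Probability of choosing option 1 under the binary softmax rule with a
    choice-autocorrelation term phi * C_i added to beta * Q_i."""
    x = beta * (Q[0] - Q[1]) + phi * (C[0] - C[1])
    return 1.0 / (1.0 + np.exp(-x))

def update_choice_trace(C, choice, tau):
    """Move the choice trace toward the indicator of the current choice;
    tau = 1 keeps only the immediately preceding choice."""
    indicator = np.array([1.0 if i == choice else 0.0 for i in range(len(C))])
    return C + tau * (indicator - C)
```

A positive trace for a recently chosen option raises its choice probability when phi > 0, which is the perseverance tendency described above.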

Logistic regression
Next, we introduce a regression model that predicts the current choice from the histories of rewards and choices in previous trials (e.g., Corrado, Sugrue, Seung, & Newsome, 2005; Kovach et al., 2012; Lau & Glimcher, 2005; Seymour, Daw, Roiser, Dayan, & Dolan, 2012). Following the convention of Corrado et al. (2005) and Lau and Glimcher (2005), we represent the reward history r(t) as follows:

$$r(t) = \begin{cases} +1 & \text{if option 1 is chosen and a reward is given at trial } t \\ -1 & \text{if option 2 is chosen and a reward is given at trial } t \\ 0 & \text{if no reward is given at trial } t. \end{cases}$$
We represent the choice history c(t) as follows:

$$c(t) = \begin{cases} +1 & \text{if option 1 is chosen at trial } t \\ -1 & \text{if option 2 is chosen at trial } t. \end{cases}$$

With these history variables, the logistic regression model is constructed as

$$\log \frac{P(a(t) = 1)}{P(a(t) = 2)} = b_0 + \sum_{m=1}^{M_r} b_r^{(m)}\, r(t-m) + \sum_{m=1}^{M_c} b_c^{(m)}\, c(t-m),$$

where b_r^{(m)} and b_c^{(m)} are the regression coefficients for the events m trials ago. The constants M_r and M_c are the history lengths for the reward history and the choice history, respectively. We estimate the regression coefficients using the maximum likelihood method.
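The history variables and the lagged design matrix of this regression model can be constructed as follows (a sketch with our own names; in practice the coefficients would then be fitted by maximum likelihood with any standard logistic-regression routine):

```python
import numpy as np

def history_regressors(choices, rewards):
    """Signed histories: `choices` is an array of 1s and 2s, `rewards` of 0/1.
    r(t) = +1 / -1 when option 1 / 2 is chosen and rewarded, else 0;
    c(t) = +1 / -1 when option 1 / 2 is chosen."""
    sign = np.where(choices == 1, 1.0, -1.0)
    return sign * rewards, sign

def design_matrix(r, c, M_r, M_c):
    """Rows of [intercept, r(t-1..t-M_r), c(t-1..t-M_c)] for predicting
    the choice at trial t."""
    start = max(M_r, M_c)
    rows = []
    for t in range(start, len(r)):
        row = [1.0]
        row += [r[t - m] for m in range(1, M_r + 1)]
        row += [c[t - m] for m in range(1, M_c + 1)]
        rows.append(row)
    return np.array(rows)
```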

Theoretical consideration
Following Katahira (2015), we discuss the relation between the parameters of the asymmetric RL model and the regression coefficients of the logistic regression model. Let R_i(t) denote the reward value obtained from choosing option i at trial t; if option i is not chosen at trial t, R_i(t) = 0. Then, the update rules (Eqs. (2) and (4)) can be summarized as

$$Q_i(t+1) = (1 - \alpha^{*}_{t,i})\, Q_i(t) + \alpha^{*}_{t,i}\, R_i(t),$$

where

$$\alpha^{*}_{t,i} = \begin{cases} \alpha_L^{+} & \text{if option } i \text{ is chosen at trial } t \text{ and } \delta(t) > 0 \\ \alpha_L^{-} & \text{if option } i \text{ is chosen at trial } t \text{ and } \delta(t) \le 0 \\ \alpha_F & \text{if option } i \text{ is not chosen at trial } t. \end{cases}$$

Let us consider the influence of the outcome obtained after choosing option i at trial t − k (i.e., R_i(t − k)) on the choice at trial t + 1. Expanding Eq. (13) back into the past (see Appendix A; Katahira, 2015), we can confirm that the term R_i(t − k) is multiplied by α*_{t−k,i} at trial t − k, by (1 − α*_{t−k+1,i}) at the next trial t − k + 1, by (1 − α*_{t−k+2,i}) at trial t − k + 2, and so on. A specific example is shown in Fig. 2A. In this case, the impact of the outcome given at trial t − 3 is α_L^+ at trial t − 2 because δ(t − 3) > 0. At the next trial, t − 2, the reward was absent (R_i(t − 2) = 0, and thus δ(t − 2) < 0); therefore, the impact of R_i(t − 3) is multiplied by (1 − α_L^−) at trial t − 1. Because the option is unchosen at trial t, the impact is further multiplied by (1 − α_F) for the next trial, t + 1.
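The multiplicative structure of this expansion can be verified numerically: feeding the recursion a single reward reproduces the product of the applied rate and the subsequent (1 − α*) factors. In the sketch below (names are ours), the α* sequence is supplied exogenously for illustration:

```python
def impact_of_past_reward(rates):
    """Impact of R_i(t-k) carried forward: the rate applied when the outcome
    arrived, times (1 - rate) for each subsequent trial.
    rates = [alpha*_{t-k,i}, alpha*_{t-k+1,i}, ..., alpha*_{t,i}]."""
    impact = rates[0]
    for a in rates[1:]:
        impact *= 1.0 - a
    return impact

def value_from_recursion(rates, rewards):
    """Explicit recursion Q <- (1 - alpha*) Q + alpha* R, starting from 0."""
    Q = 0.0
    for a, R in zip(rates, rewards):
        Q = (1.0 - a) * Q + a * R
    return Q
```

For a Fig. 2A-style sequence of rates [0.5, 0.2, 0.4] (rewarded update, then an unrewarded update, then forgetting), both routes give 0.5 × 0.8 × 0.6 = 0.24.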
Let n_+, n_−, and n_F be the numbers of trials of each type after trial t − k and before trial t + 1 (n_+: trials with a positive RPE in which option i is chosen; n_−: trials with a negative RPE in which option i is chosen; n_F: trials in which option i is not chosen).
By construction, we have the relationship n_F = k − 1 − n_+ − n_−. The impact of the reward provided at trial t − k on the choice at trial t + 1 is given as a function of n_+, n_−, and n_F:

$$\alpha^{*}_{t-k,i}\,(1 - \alpha_L^{+})^{n_{+}}\,(1 - \alpha_L^{-})^{n_{-}}\,(1 - \alpha_F)^{n_F}.$$

In the special case wherein α_L^+ = α_L^− = α_F (≡ α), this reduces to α(1 − α)^{k−1}, which is a constant that is independent of subsequent outcomes. From this fact, one can find that if the history length of the regression model is sufficiently large, this RL model is equivalent to a logistic regression model whose coefficients are written with the parameters of the RL model as b_r^{(m)} = βα(1 − α)^{m−1} and b_c^{(m)} = 0, where β is the inverse temperature of the RL model (Katahira, 2015). Eq. (15) indicates that if the learning rates differ depending on the outcome (α_L^+ ≠ α_L^−), the influence of the reward (at trial t − k) on the future choice (at trial t + 1) depends on the outcomes of the subsequent choices at trials t − k + 1, t − k + 2, . . . , t. An example of the impact of past rewards is shown in Fig. 2B. This property violates the assumption of the logistic regression model that the influence of a past event on the current choice is independent of the events at other trials. We can utilize this property to test the asymmetry in value updates, as we discuss later.
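Eq. (15) is easy to check numerically: with symmetric rates the impact depends only on the total lag, whereas with asymmetric rates it depends on how the intermediate trials split into the three types (a sketch; names are ours):

```python
def impact(first_rate, a_pos, a_neg, a_f, n_pos, n_neg, n_f):
    """Impact of the outcome at trial t-k on the choice at trial t+1,
    given the counts of subsequent trial types (Eq. (15))."""
    return (first_rate
            * (1.0 - a_pos) ** n_pos
            * (1.0 - a_neg) ** n_neg
            * (1.0 - a_f) ** n_f)

# Symmetric case: any split of the k-1 intermediate trials gives a(1-a)^(k-1).
# Asymmetric case: the same splits give different impacts.
```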

Simulation results
To understand the properties of asymmetric value updates, we conducted simulations that examine the relation between the asymmetric RL models and the logistic regression model. In the simulations, we first generated choice data from RL models performing a hypothetical decision-making task and then fitted the logistic regression model to the data (the detailed procedure is given in Appendix B). Here, the RL models were set to have no intrinsic choice autocorrelation (φ = 0).
The first case varied the balance of the learning rates (α_L^+ and α_L^−) while the sum of the learning rates was fixed at α_L^+ + α_L^− = 0.8 and the forgetting rate was fixed at α_F = 0.4 (Fig. 3). Fig. 3A compares the predictions of the RL models and the regression models. Each dot represents the prediction for one trial; if the two models' predictions agreed perfectly, the samples would lie on the diagonal line. As the theory predicts, when the learning rates and the forgetting rate are identical (α_L^+ = α_L^− = α_F), the two models exhibited perfect agreement. In contrast, when there is an asymmetry (α_L^+ ≠ α_L^−), mismatches between the predictions of the two models (discrepancies from the diagonal line) were observed. This is because the impact of past outcomes on future choices changes depending on the outcomes of the intervening choices. This dynamic change cannot be represented by a regression model that assumes that the influence of a reward is independent of subsequent events (without incorporating an interaction term, as we discuss in Section 5).

[Displaced figure caption: For both cases, the first 100 trials of one hypothetical subject are depicted. The probability of choosing option 1 was calculated from the ''true'' RL model (bold, gray lines) and the logistic regression model (thin, black lines). The corresponding action values (Q-values) of the RL model are shown in each bottom panel. The other common parameter settings were α_F = 0.4, φ = 0.0 (no choice autocorrelation), and β = 3.0. In the first 50 trials, option 1 was optimal (the probability of reward was 0.7 for option 1 and 0.3 for option 2); at the 51st trial, the reward probabilities were reversed.]
Although the logistic regression models are not statistically equivalent to the asymmetric RL models, the corresponding regression coefficients are helpful for understanding the statistical properties of the RL models (Katahira, 2015). Fig. 3B shows the regression coefficients of the logistic regression models obtained by fitting the simulated data. When the learning rates for the positive and negative RPEs of the RL model differed, a dependence on the choice history arose (i.e., b_c ≠ 0) even though the RL model had no intrinsic autocorrelation factor (φ = 0). More specifically, when α_L^+ is larger than α_L^− (optimistic updates), a positive dependence on the choice history is observed (b_c > 0). In contrast, when α_L^− is larger than α_L^+, a negative dependence is observed (b_c < 0). The dependence on the choice history arises through the following mechanism. Consider the effect of the outcome of trial t − 1. If α_L^+ > α_L^−, the value of the action chosen at trial t − 1 can gain more from a reward than it can lose from a non-reward, which makes repetition of the same choice at trial t more likely than when the learning rates are identical (α_L^+ = α_L^−). This tendency can be captured by a regression model with a positive regression coefficient for the choice at trial t − 1. On the other hand, if α_L^+ < α_L^−, the value of the chosen action can lose more than it can gain, which makes switching the choice at trial t more likely and leads to the opposite consequence. Next, we examined the effects of the forgetting rate when the balance of the learning rates is optimistic (α_L^+ > α_L^−; Fig. 4A). In this case, a lower forgetting rate results in a smaller impact of the choice history (Fig. 4A, right panel). When the forgetting rate was very small (α_F < 0.05), the coefficients for the choice history were negative. We consider the following reason for this.
When the forgetting rate is close to zero, the action value of an unchosen option remains almost unchanged (see Fig. 4B). Thus, in the case of optimistic updates, where the omission of the reward causes a decay by (1 − α_L^−), the action value of the chosen option easily falls below that of the other option, and the model switches choices more readily than the regression model, which has no mechanism for retaining the action value of the unchosen option. In contrast, when the forgetting rate is large (e.g., Fig. 4C), the action value of the unchosen option quickly decays to zero, and thus the action value of the chosen option tends to remain superior to that of the other option. Because of this property, the model shows less alternation. Fig. 5 shows the results when the forgetting rate is zero and the balance of the learning rates is varied. It should be noted that a negative impact of the choice history due to the smaller forgetting rate also occurs when the learning rates are identical but larger than the forgetting rate (see the result with α_L^+ = α_L^− = 0.4 in Fig. 5); the mechanism is discussed in Katahira (2015). The differential learning rates basically weaken this negative correlation (Fig. 5); i.e., they shift the coefficient for the choice history in the positive direction. In the case of optimistic updates (α_L^+ > α_L^−), the same mechanism as in Fig. 3B can operate. In the case of pessimistic updates (α_L^+ < α_L^−), this effect occurs because the pessimistic updates decrease the overall action values (cf. the bottom panel of Fig. 1A), which in turn diminishes the effect of the small forgetting rate. This leads to a diminished impact of the choice history.
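The induced autocorrelation can be seen directly in simulation: with no intrinsic autocorrelation factor (φ = 0), optimistic updates produce more choice repetition than pessimistic ones. The sketch below assumes the Fig. 3 parameter regime (α_F = 0.4, β = 3.0); the equal reward probabilities of 0.5 and all names are our own illustrative choices:

```python
import numpy as np

def stay_probability(a_pos, a_neg, a_f=0.4, beta=3.0, p_reward=(0.5, 0.5),
                     n_trials=20000, seed=0):
    """Simulate the asymmetric RL model with phi = 0 and measure how often
    consecutive choices repeat."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)
    prev = None
    stays = trials = 0
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (Q[0] - Q[1])))
        ch = 0 if rng.random() < p1 else 1
        r = 1.0 if rng.random() < p_reward[ch] else 0.0
        delta = r - Q[ch]
        Q[ch] += (a_pos if delta > 0 else a_neg) * delta   # asymmetric update
        Q[1 - ch] *= 1.0 - a_f                             # forgetting
        if prev is not None:
            trials += 1
            stays += ch == prev
        prev = ch
    return stays / trials
```

In this regime, `stay_probability(0.6, 0.2)` (optimistic) comes out larger than `stay_probability(0.2, 0.6)` (pessimistic), mirroring the positive and negative choice-history coefficients in Fig. 3B.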

Biases in parameter estimates of asymmetric updating models
The results in the previous section suggest that differential learning rates for positive and negative RPEs cause an apparent choice autocorrelation (i.e., a tendency to repeat the same option or to switch choices), even when the true model (the data-generating process) has no explicit mechanism for generating such an autocorrelation. This property raises important questions. If the true model has an intrinsic choice autocorrelation and a model without the autocorrelation factor but with differential learning rates is fitted to the data, can the estimated learning rates be biased such that they represent the choice autocorrelation? In addition, are the true choice autocorrelation and the effects of the differential learning rates statistically distinguishable? To address these issues, we performed simulations of RL-model fitting and statistical model selection. Specifically, we simulated the two experiments reported in Palminteri et al. (2017). In Experiment 1 of Palminteri et al. (2017), the authors reported a ''positivity bias'', in which the estimated learning rate for positive RPEs was larger than that for negative RPEs (α_L^+ > α_L^−); in Experiment 2, they reported a ''confirmation bias'', which we explain later. We examine whether these estimates can be reproduced under the assumption that the true model (the data-generating process) has an intrinsic choice-autocorrelation factor but no asymmetry in the value updates.

Pseudo-positivity bias
First, we examined whether the ''pseudo-positivity bias'' (α_L^+ > α_L^−) can arise even when the true process has no such bias but has a positive choice autocorrelation (i.e., choice perseverance). In Cases 1 and 2, we generated hypothetical data from true models with identical learning rates (α_L^+ = α_L^− = 0.5) and intrinsic choice-autocorrelation factors (Fig. 6). The models performed a task with the same structure as Experiment 1 of Palminteri et al. (2017) (see Appendix B for details). In Case 1, the true model has an impulsive autocorrelation (a tendency to repeat the same choice as in the previous trial; τ = 1.0, φ = 2.5). In Case 2, the true model has a gradual autocorrelation (τ = 0.4, φ = 2.5). The other true parameters were β = 3.0 and α_F = 0.0 (no forgetting). We chose these parameter values so that the model can approximately reproduce the learning curve of Experiment 1 in Palminteri et al. (2017).² We fitted six models (Table 1). Models 1-3 were allowed to have differential learning rates depending on the valence of the RPE, whereas Models 4-6 were not. Models 1 and 4 had no choice-autocorrelation factor, Models 2 and 5 could have an impulsive autocorrelation (τ fixed to 1), and Models 3 and 6 could have a gradual choice autocorrelation (τ a free parameter; Table 1). For all candidate models, the model parameters were fitted to each hypothetical subject's data using the maximum likelihood method, as described in Appendix B.
The candidate models were evaluated based on Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both criteria introduce a penalty for the number of parameters in the model (see B.2), and a lower value indicates a better model. More specifically, AIC is a measure of the predictive ability of a model for future data: a smaller AIC value indicates better predictive ability. BIC approximates the model evidence, which quantifies how likely the model is: a smaller BIC value indicates that the model is more likely. Thus, to examine which model is likely to be the true model, BIC would be more appropriate. However, given that both criteria are commonly used in RL-model fitting, and that BIC generally penalizes models with more parameters more strongly than AIC does, we report the values of both criteria.

Table 1 shows the AIC and BIC values (mean across subjects) for each model. For both cases, both criteria selected models with identical learning rates. For Case 1, Model 5, which has an impulsive choice-autocorrelation factor, as does the true model, yielded the smallest mean AIC and BIC values, indicating that it was the best model for the simulated data.³ However, if the candidate models are restricted to those with no choice-autocorrelation factor (Model 1 and Model 4), Model 1, which possesses differential learning rates, was selected as the best model in both cases, despite the fact that the true learning rates did not differ. This is because the asymmetric RL models can represent a positive choice autocorrelation, as discussed in the previous section. Fig. 6 shows the resulting estimates of the learning rates in Models 1, 2, and 3 for both cases. For Case 1, the learning rate for the positive outcome, α_L^+, was significantly higher than that for the negative outcome, α_L^−, for Model 1, which has no choice autocorrelation (t(19) = 10.43, p < .001, paired t-test), and also for Model 2, which has an impulsive choice autocorrelation (t(19) = 2.10, p = .049, paired t-test).

² We did this by a visual inspection of the resulting learning curve obtained by simulation using the same task design as Palminteri et al. (2017). Here, we intended to demonstrate a possible scenario of what could happen in the analysis reported in Palminteri et al. (2017), rather than to reproduce their situation precisely.
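For reference, the two criteria are simple functions of the maximized log-likelihood ln L, the number of free parameters k, and (for BIC) the number of observations n. The sketch below (our own helper names) also makes the relative penalties explicit: BIC's per-parameter penalty of ln n exceeds AIC's penalty of 2 whenever n ≥ 8.

```python
import numpy as np

def aic(log_lik, n_params):
    """Akaike's Information Criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: BIC = k ln n - 2 ln L."""
    return n_params * np.log(n_obs) - 2 * log_lik
```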
For Case 2, in which the true model had a gradual choice autocorrelation, Model 6, which has the same structure as the true model, was selected by both AIC and BIC (Table 1). If the candidate models were restricted to those that cannot have a gradual choice-autocorrelation factor (Models 1, 2, 4, and 5), both criteria selected Model 2, which has differential learning rates and an impulsive autocorrelation. This is because the asymmetric RL models can represent the impact of choices made more than one trial ago, which cannot be represented by the impulsive choice-autocorrelation factor. These results suggest that including a model with only an impulsive autocorrelation factor is not sufficient when there is gradual perseverance in the true model; asymmetric RL models achieve a better fit to the actual gradual autocorrelation than the symmetric RL model with an impulsive autocorrelation. The learning rate for the positive outcome, α_L^+, was significantly higher than that for the negative outcome, α_L^−, for Model 1 and Model 2, which do not have a gradual choice-autocorrelation factor (t(19) = 7.55, p < .001 and t(19) = 2.63, p = .017, respectively; Fig. 6). The difference in the estimates of the learning rates was suppressed for Model 3, which includes the gradual autocorrelation factor (t(19) = 0.38, p = .705).
In summary, an overestimation of the difference in the learning rates can occur when the fitted model lacks a choice-autocorrelation factor that can represent the intrinsic choice autocorrelation of the true model. This effect can be regarded as a ''pseudo-positivity bias''.
We conducted an additional simulation (Case 3), in which we generated hypothetical data from a true model whose learning rates were indeed different (optimistic; α_L^+ = 0.5, α_L^− = 0.2) but which had no choice-autocorrelation factor (φ = 0).
In this case, both criteria correctly selected Model 1, which has the same structure as the true model (Table 1). This result indicates that the effects of asymmetric value updates and those of intrinsic choice perseverance are similar but statistically distinguishable based on the goodness of fit to the data.

Pseudo-confirmation bias
Next, we consider a further example of statistical artifacts arising from the similarity between asymmetric value updates and choice perseverance. Palminteri et al. (2017) reported that when the forgone outcomes (i.e., the outcomes of the unchosen option) are given, people tend to update the value of the unchosen option more when a negative prediction error is given than when a positive prediction error is given. In this setting, the action value of the chosen option i is updated depending on the obtained outcome R_i as

$$Q_i(t+1) = \begin{cases} Q_i(t) + \alpha_C^{+}\,\delta_i(t) & \text{if } \delta_i(t) > 0 \\ Q_i(t) + \alpha_C^{-}\,\delta_i(t) & \text{if } \delta_i(t) \le 0, \end{cases}$$

while that of the unchosen option j is updated depending on the forgone outcome R_j as

$$Q_j(t+1) = \begin{cases} Q_j(t) + \alpha_U^{+}\,\delta_j(t) & \text{if } \delta_j(t) > 0 \\ Q_j(t) + \alpha_U^{-}\,\delta_j(t) & \text{if } \delta_j(t) \le 0, \end{cases}$$

where δ_i(t) = R_i(t) − Q_i(t) and δ_j(t) = R_j(t) − Q_j(t). The confirmation bias is represented by differential learning rates: α_C^+ > α_C^− and α_U^+ < α_U^−. Our question here is whether such a relation can appear in the parameter estimates even when the true learning rates are identical (α_C^+ = α_C^− = α_U^+ = α_U^−). It is expected that the confirmation bias in the value updates of the unchosen option (α_U^+ < α_U^−) tends to strengthen the tendency to avoid choosing that option, because a larger impact of the negative outcome (i.e., the omission of a reward) compared to a positive outcome leads to a smaller action value for the unchosen option. This can increase the choice autocorrelation in a similar manner as a positivity bias in the value updates of the chosen option.

³ Overall, our results indicated that the RL model with a structure identical to the true model yielded the smallest AIC/BIC values. One cannot expect that such results will always be obtained. For example, when sufficient information is not available from the given data (e.g., due to a small sample size), a model simpler than the true model could give the smallest AIC/BIC values, even if the parameter estimation is conducted appropriately.
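The four-learning-rate update with forgone outcomes can be sketched as follows (a minimal sketch with our own names; `d_chosen` and `d_forgone` are the prediction errors of the chosen and unchosen options):

```python
def update_with_forgone(Q, chosen, r_chosen, r_forgone,
                        aC_pos, aC_neg, aU_pos, aU_neg):
    """One trial of value updating when the forgone outcome is also shown.

    The chosen option is updated with aC_pos / aC_neg depending on the sign
    of its RPE; the unchosen option is updated with aU_pos / aU_neg using
    the forgone outcome. A confirmation bias corresponds to
    aC_pos > aC_neg together with aU_pos < aU_neg."""
    Q = list(Q)
    unchosen = 1 - chosen
    d_chosen = r_chosen - Q[chosen]
    d_forgone = r_forgone - Q[unchosen]
    Q[chosen] += (aC_pos if d_chosen > 0 else aC_neg) * d_chosen
    Q[unchosen] += (aU_pos if d_forgone > 0 else aU_neg) * d_forgone
    return Q
```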
We generated hypothetical data from the model with identical learning rates (α_C^+ = α_C^− = α_U^+ = α_U^−). Similar to the simulation of the pseudo-positivity bias, we considered two cases regarding the choice-autocorrelation factor: with an impulsive autocorrelation (Case 1; τ = 1.0, φ = 3.5) and with a gradual autocorrelation (Case 2; τ = 0.4, φ = 3.5). The other true parameters were β = 1.5 and α_F = 0.0 (no forgetting). We again chose these parameter values so that the true model can approximately reproduce the learning curve for Experiment 2 in Palminteri et al. (2017). We fitted 10 models, which consisted of the 6 models used in Palminteri et al. (2017) and an additional 4 models with choice-autocorrelation factors (Table 2). In the 'Valence' model (Model 5), the learning rates can vary according to the valence (positive or negative) of the RPE, resulting in the restrictions α_C^+ = α_U^+ and α_C^− = α_U^−. In the 'Confirmation' models (Model 6 and Model 7), the learning rates can differ depending on whether the outcomes are ''confirmatory'' (positive obtained or negative forgone) or ''disconfirmatory'' (negative obtained or positive forgone). This corresponds to the following restrictions: α_C^+ = α_U^− (≡ α_CON) and α_C^− = α_U^+ (≡ α_DIS).
For this variant, Palminteri et al. (2017) also included a version with an impulsive choice-autocorrelation factor (Model 7). The estimated learning rates in the models whose four learning rates were allowed to take different values (Models 1-3) are shown in Fig. 7. As expected, when the model with no choice-autocorrelation factor (Model 1) was fitted to the generated data, a strong bias in the estimates of the learning rates appeared. The resulting estimates correspond to a ''confirmation bias''. Even when the fitted model had impulsive choice perseverance (Model 2), the confirmation bias appeared if the true model possessed gradual choice perseverance (Case 2, Fig. 7). When the true model had only impulsive choice perseverance, the confirmation bias was suppressed if the fitted model could also have impulsive perseverance (Case 1, Fig. 7).
In Case 1, in which the true model had impulsive perseverance, both model selection criteria selected Model 9, which has the same structure as the true model (Table 2). However, if the candidate models are restricted to those with no choice-autocorrelation factor (Models 1, 4, 5, 6, and 8), the 'Confirmation' model with no autocorrelation factor (Model 6) was selected by both AIC and BIC. The parameter estimates of the 'Confirmation' models are also shown in Fig. 7. The learning rate for confirmatory outcomes, α_CON, in Model 6 was significantly higher than that for disconfirmatory outcomes, α_DIS (t(19) = 26.77, p < 10⁻¹⁵). In contrast, there was no significant difference between the learning rates for Model 7, which has impulsive perseverance (t(19) = 1.05, p = .306).
In Case 2, in which the true model had gradual perseverance, Model 10, which has the same structure as the true model, showed the smallest AIC value, whereas the 'Confirmation' model with no autocorrelation factor (Model 6) showed the smallest BIC value. This is because Model 6 has one parameter fewer than Model 10; as mentioned above, BIC tends to penalize models with more parameters heavily. When the models with a gradual choice-autocorrelation factor were excluded from the candidates (i.e., Models 3 and 10), AIC also selected Model 6. In this case, for both Model 6 (with no perseverance) and Model 7 (with impulsive perseverance), the two learning rates of the 'Confirmation' models (α_CON and α_DIS) showed a significant difference (t(19) = 11.59, p < 10⁻⁹ and t(19) = 6.46, p < 10⁻⁵, respectively).
Taken together, a statistical artifact causes an apparent confirmation bias to arise when the true model has an intrinsic choice-autocorrelation factor but the fitted model lacks this factor. If a model that can represent the true structure of the choice perseverance is included among the candidates, that model is selected by the model selection criteria. However, if none of the candidate models can represent the underlying choice perseverance, models with differential learning rates can be selected instead.

Model-neutral analysis of asymmetry in learning
We have demonstrated that confounding factors can systematically bias parameter estimation. As a confounding factor, we have considered exponentially decaying choice-autocorrelation factors. There might be other types of confounding factors characterized by different functional forms. In practice, it is generally difficult for a researcher to include all possible confounding factors in their candidate models. Thus, a non-parametric, model-neutral analysis is helpful for strengthening the conclusions. Here, based on the theoretical considerations given in Section 3.1 (as illustrated in Fig. 2), we propose a model-neutral method to test for asymmetry in value updates.

Proposed analysis
Here, we again focus on factual learning, in which only the outcome of the chosen option is presented. 4 As we have discussed, one behavioral consequence of differential learning rates is that the impact of a past outcome depends on the subsequent outcomes. This indicates that there is an interaction between the outcome at trial t and that at trial t − 1 (R(t) and R(t − 1)) as a factor that influences the current choice, a(t + 1). Thus, to claim an asymmetry in the value updates, we need to determine whether such an interaction exists. Here, we denote the three successive trials by t + 1, t, and t − 1. To simplify the analysis, we select the trials in which the subject chose the same option at trial t and at trial t − 1 (trials t and t − 1 such that a(t) = a(t − 1)). Next, we construct a logistic regression model to predict the probability that the subject chooses the same option at trial t + 1 (i.e., a(t + 1) = a(t) = a(t − 1)), denoted by p(stay(t + 1)). The regressors are R(t) and R(t − 1), together with their interaction term, R(t) × R(t − 1). This constitutes the logistic regression model

log[p(stay(t + 1))/p(switch(t + 1))] = b_0 + b_1 R(t) + b_2 R(t − 1) + b_12 R(t) R(t − 1),  (21)

where p(switch(t + 1)) is the probability that the subject switches the choice at trial t + 1 (i.e., a(t + 1) ≠ a(t) = a(t − 1)). The intercept b_0 represents the overall tendency to repeat the same choice, which may absorb the effects of the choice autocorrelation and the influence of the rewards provided before trial t − 1.
4 The proposed method might apply to the counter-factual learning task if one includes the outcomes from the unchosen option as regressors, in a similar way as done for the chosen option. A future study would examine this extension.

In the following demonstration, we assumed the regression coefficients in Eq. (21) to be constant across subjects; that is, we treated the coefficients as fixed effects. The hypothesis test with the null hypothesis b_12 = 0 is conducted by using a standard method implemented with the ''glm'' function in R. The specific procedure is as follows. Among all the data, three successive trials in which a subject chose the same option in the first two trials are selected. The logistic regression model given in Eq. (21) is specified in R syntax by

Stay ~ R1 * R2
as an argument of the ''glm'' function. Here, the variable Stay is set to 1 if the last choice in the three trials is the same as the two previous choices, and 0 otherwise. The variables R1 and R2 represent the outcome one trial ago and two trials ago, respectively (coded as reward: 1; non-reward: 0).
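The trial-selection and regression steps above can be sketched in Python as well (the original analysis uses R's ''glm''; the function names `build_design` and `fit_logistic`, and the plain Newton-Raphson fitting routine, are our own illustrative choices, not the paper's code):

```python
import numpy as np

def build_design(choices, rewards):
    """Select triples of successive trials in which the same option was
    chosen on the first two trials, and code Stay, R1 (outcome one trial
    ago), R2 (outcome two trials ago), and their interaction."""
    rows = []
    for t in range(2, len(choices)):
        if choices[t - 1] == choices[t - 2]:           # a(t) = a(t-1)
            stay = float(choices[t] == choices[t - 1]) # a(t+1) = a(t)?
            r1, r2 = rewards[t - 1], rewards[t - 2]    # R(t), R(t-1)
            rows.append((stay, 1.0, r1, r2, r1 * r2))
    rows = np.array(rows)
    return rows[:, 0], rows[:, 1:]  # y, X = [intercept, R1, R2, R1*R2]

def fit_logistic(y, X, n_iter=50):
    """Plain Newton-Raphson fit of an unregularized logistic regression."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)                   # score vector
        H = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        b = b + np.linalg.solve(H, grad)
    return b  # [b0, b1, b2, b12]
```

The fourth element of the returned vector is the interaction estimate b_12, whose sign and significance are the quantities of interest in the proposed test.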

Theoretical consideration
If we ignore the events before trial t − 1, the asymmetric RL model yields the stay probability

p(stay(t + 1)) = 1/(1 + exp(−βh(t + 1))),  (22)

Here, we consider the condition whereby R(t) takes a binary value (R(t) = 1 if a reward is given and R(t) = 0 if no reward is given); thus, δ(t) > 0 if R(t) = 1 and δ(t) < 0 if R(t) = 0 (given that the action values are not exactly 0 or 1). Under this condition, Eq. (23) can be rewritten in the form of Eq. (24). A comparison of Eqs. (21) and (24) indicates that the relations in Eqs. (25)-(27) hold if we can neglect the effects of rewards given more than two trials ago. This indicates that the interaction coefficient b_12 represents the degree of the difference in the learning rates. If b_12 significantly differs from zero, one can claim an asymmetry in the value updates. A negative b_12 indicates optimistic updates (the learning rate for a positive outcome, α_L^+, is larger than that for a negative outcome, α_L^−), whereas a positive b_12 indicates pessimistic updates (the learning rate for a positive outcome, α_L^+, is smaller than that for a negative outcome, α_L^−).
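The logic behind this sign relation can be sketched as follows. This is our own reconstruction under two simplifying assumptions — the chosen option's value is approximately zero before trial t − 1, and the unchosen option's value is constant — so it should be read as an illustration of the mechanism, not as the paper's exact Eqs. (23)-(27):

```latex
% Value of the repeatedly chosen option after trials t-1 and t, with
% \alpha(R) = \alpha_L^+ if R = 1 and \alpha_L^- if R = 0
% (assumption: the value before trial t-1 is approximately zero):
Q(t+1) = \alpha(R(t))\,R(t)
       + \bigl(1 - \alpha(R(t))\bigr)\,\alpha(R(t-1))\,R(t-1)
% Evaluating the four (R(t), R(t-1)) cells and matching the stay
% logit \beta Q(t+1) (plus a constant) to Eq. (21) gives
b_1 \approx \beta\,\alpha_L^+ , \qquad
b_2 \approx \beta\,(1 - \alpha_L^-)\,\alpha_L^+ , \qquad
b_{12} \approx -\beta\,\alpha_L^+\,(\alpha_L^+ - \alpha_L^-)
```

Note the sign: under these assumptions b_12 < 0 exactly when α_L^+ > α_L^−, matching the statement that a negative interaction indicates optimistic updates, and b_12 = 0 when the learning rates are equal.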

Simulation
The proposed analysis relies on the assumption that one can neglect the events from more than two trials ago. Next, we examine whether this assumption causes an undesirable statistical bias. To this end, and to demonstrate how the proposed method works, we conducted simulations based on data generated from various RL models performing the hypothetical learning task (see Appendix B.3 for details). For each simulation, the choice datasets of 20 hypothetical subjects, each consisting of 200 trials, were generated from an RL model, and the hypothesis test for the interaction term was conducted with a significance level of α = 0.05. Fig. 8 shows the estimated regression coefficients. Broken horizontal lines represent the theoretical values given by Eqs. (25)-(27), which ignore the impact of events from more than two trials ago. When the true learning rates were identical (α_L^+ = α_L^−), the theory predicts that the coefficient of the interaction term, R(t)R(t − 1), is zero (i.e., b_12 = 0, see Eq. (27)). The mean values over 1,000 simulations (indicated by bars) almost agreed with this prediction (panel A: case without choice-autocorrelation factor; panel B: case with gradual choice-autocorrelation factor). Accordingly, the false positive rates, i.e., the fraction of simulations in which the null hypothesis (b_12 = 0) was rejected, were 0.045 and 0.060, respectively. These false positive rates were close to the intended type I error rate (α = 0.05), validating the proposed analysis. In contrast, when the true learning rates differed between positive and negative outcomes, the estimated interaction coefficients deviated from zero in the theoretically predicted direction. In these simulations, we simply assumed that all the subjects had the same parameters in both the true models and the fitted regression models (i.e., all the regression coefficients were treated as fixed effects). In a real-world case, the parameters should vary across subjects.
To address individual differences in the parameters, it would be better to use mixed-effects models (e.g., implemented with the lme4 package; Bates, Mächler, Bolker, & Walker, 2015), which can treat the main effects of the outcomes and their interaction as random effects, instead of logistic regression models with only fixed effects (cf. Gillan et al., 2016).
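As a self-contained illustration of why the test works, the following Python sketch simulates an asymmetric Q-learner on a static two-armed bandit (a simpler setting than the reversal task used in the simulations; all parameter values and function names here are our own) and computes an empirical analogue of b_12 as the double difference of stay log-odds across the four (R(t), R(t − 1)) cells:

```python
import numpy as np

def simulate_agent(n_trials, alpha_pos, alpha_neg, beta=3.0,
                   p_reward=(0.7, 0.3), seed=0):
    """Asymmetric Q-learner on a static two-armed bandit.
    alpha_pos > alpha_neg corresponds to optimistic updates."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    choices, rewards = [], []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))  # softmax, 2 arms
        a = int(rng.random() < p1)
        r = float(rng.random() < p_reward[a])
        delta = r - q[a]                                  # reward prediction error
        q[a] += (alpha_pos if delta > 0 else alpha_neg) * delta
        choices.append(a)
        rewards.append(r)
    return choices, rewards

def interaction_contrast(choices, rewards):
    """Empirical analogue of b12: among trials with a(t) = a(t-1),
    compute logit(stay) per (R(t), R(t-1)) cell, then take the
    double difference logit[1,1] - logit[1,0] - logit[0,1] + logit[0,0]."""
    stays = np.zeros((2, 2))
    totals = np.zeros((2, 2))
    for t in range(2, len(choices)):
        if choices[t - 1] == choices[t - 2]:
            i, j = int(rewards[t - 1]), int(rewards[t - 2])
            totals[i, j] += 1
            stays[i, j] += choices[t] == choices[t - 1]
    p = (stays + 0.5) / (totals + 1.0)  # add-0.5 smoothing avoids log(0)
    logit = np.log(p / (1.0 - p))
    return logit[1, 1] - logit[1, 0] - logit[0, 1] + logit[0, 0]
```

With optimistic updates (alpha_pos > alpha_neg) the contrast tends to be negative, consistent with the sign convention described above, whereas with symmetric updates it fluctuates around zero. For a saturated binary design, this double difference coincides with the interaction coefficient of the logistic regression.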

Discussion
In computational model fitting based on likelihoods, the computational model (e.g., an RL model) is regarded as a statistical (probabilistic) model. Understanding the statistical structure of a model (e.g., the dependence of the choice probabilities on past events) is crucial for understanding which properties of the data the results reflect. In this study, we have investigated the statistical structure of RL models with asymmetric (biased) value updates. Although the steady-state behavior of asymmetric RL models has been discussed in a prior study (Cazé & van der Meer, 2013), the transient properties of the choice behavior (i.e., how the current choice depends on recent rewards and choices), which are critical for model fitting, had not yet been addressed.
We have demonstrated that, in asymmetric RL models, the influence of a choice outcome on future choices also depends on the outcomes of intermediate choices. This property is not directly represented in conventional regression models, which assume that the impacts of past events are independent. If such a logistic regression model is fitted to behavioral data generated with asymmetric value updates, an apparent choice history dependence is observed. This suggests that asymmetric value updates induce a choice autocorrelation. Conversely, our simulation results suggest that this effect leads to a statistical artifact when asymmetric RL models are fitted to data with an intrinsic choice autocorrelation (perseverance). The choice autocorrelation can thus be a critical confounding factor in estimating differential learning rates. The results of the present study demonstrate a problem of model misspecification that leads to an incorrect interpretation. Although this type of pitfall of computational-model-based analysis has been discussed for other types of model misspecification (Katahira, Bai, & Nakao, 2017; Nassar & Gold, 2013), the effect presented in this paper is perhaps the most striking example of the pitfall.
Of course, the existence of such an artifact in parameter fitting does not necessarily imply that previous reports of asymmetries in learning are due to this artifact. However, we should be careful when interpreting such results. In addition, re-analysis of data with RL models that include a choice-autocorrelation factor may be desirable. The simulations in this study clarified that including only an impulsive choice-autocorrelation factor (a tendency to repeat the choice made in the previous trial), as many studies performing RL-model fitting have done, might not be sufficient; including models with a choice-autocorrelation factor that can account for the autocorrelation between distant trials is desirable.
The proposed model-neutral statistical test, which employs a logistic regression with an interaction term between the outcomes of the past two trials, would be useful for claiming that asymmetric value updates indeed occur, as discussed in Section 5. However, the statistical power of the test is relatively weak, as the simulation results showed. Thus, a large data set, which may be obtained through online experiments, is required (e.g., Gershman, 2015, 2016; Gillan et al., 2016). Combining the model-neutral analysis of such large data sets with computational models fitted to smaller but carefully collected data sets would be effective. Also, the proposed analysis only addresses the influence of the two most recent trials. The use of (parametric) computational models is useful for checking the long-term history dependence of action selection. Thus, model-neutral analysis and model-based analysis have their own advantages and can be used in a complementary fashion.
In the simulation of RL-model fitting presented in Section 4, we considered the maximum likelihood method for parameter estimation. One might wonder whether the results presented there also apply to Bayesian approaches, including hierarchical Bayesian methods, which have also been used for fitting RL models (Ahn, Haines, & Zhang, 2017; Ahn, Krawitz, Kim, Busemeyer, & Brown, 2011). While hierarchical Bayesian parameter estimation tends to produce more stable results by utilizing both group-level and individual-level information (Katahira, 2016), there is no explicit mechanism to reduce the statistical biases caused by a confounding factor that is not included in the fitted models. Therefore, we expect that the systematic bias reported in this paper can also occur in such hierarchical Bayesian approaches. We leave this issue to a future study.
The temporal modulation of learning rates is a core component of various learning models. For example, a notable property of Bayesian update models (e.g., Kalman filters) is the modulation of the learning rates depending on the uncertainty about the task structure (e.g., the reward probability) (Behrens, Woolrich, Walton, & Rushworth, 2007; Daw, O'Doherty, Dayan, & Seymour, 2006; Mathys, Daunizeau, Friston, & Stephan, 2011). In addition, modulating value updates using a ''model'' (knowledge) of the task structure can account for the ''model-based'' control of choice (Toyama et al., 2017). The asymmetric RL models discussed in this study may be the simplest case of this wide range of learning models with learning-rate modulation. Nevertheless, the results presented here have several implications for such models; for example, they indicate how modulating the learning rates changes the history dependence of choice. This study is thus a starting point for exploring various phenomena in learning and decision making.

B.2. Model-based analysis for factual and counter-factual learning task
The details of the simulations examining the pseudo-positivity bias (in factual learning) and the pseudo-confirmation bias (in counter-factual learning) are as follows. We generated hypothetical data based on tasks with the same design as Palminteri et al. (2017). Specifically, the model performed two sessions, each consisting of four types of choice contexts. Each context had different options (cues) from the other contexts and sessions. In each context, 24 choice trials were presented. In total, the model performed 192 choice trials (2 sessions × 4 choice contexts × 24 trials). In each trial, the model chose one option, winning (R = 1) or losing (R = −1) a point according to the outcome probability associated with each option. The associated winning probabilities (option 1, option 2) were as follows: ''Symmetric condition'' - (.5, .5) for 24 trials (the entire context); ''Asymmetric condition 1'' - (.75, .25) for 24 trials; ''Asymmetric condition 2'' - (.25, .75) for 24 trials; and ''Reversal condition'' - (.83, .17) for the first 12 trials and (.17, .83) for the last 12 trials. In the simulation of factual learning, only the outcome of the chosen option was presented to the subject. In the simulation of counter-factual learning, the outcomes of both the chosen and unchosen options were presented. Because different cue stimuli were used for each choice context, the initial action values in the RL models were reset at the beginning of each choice context as Q_1(1) = Q_2(1) = 0. For the factual learning task, we considered three RL models as the true models for generating data. In Cases 1 and 2, the model had identical learning rates for positive and negative RPEs (α_L^+ = α_L^− = 0.5). In Case 1, the model had impulsive autocorrelation (τ = 1.0, ϕ = 2.5), while in Case 2, the model had gradual autocorrelation (τ = 0.4, ϕ = 2.5).
In Case 3, the true model had different learning rates (α_L^+ = 0.5, α_L^− = 0.2) but no autocorrelation factor (ϕ = 0). Common true parameters for all cases were β = 3.0 and α_F = 0.0 (no forgetting). For the counter-factual learning task, the true models were RL models of two types. In Cases 1 and 2, the true model possessed identical learning rates for all four conditions (α_C^+ = α_C^− = α_U^+ = α_U^− = 0.5). In Case 1, the true model possessed impulsive autocorrelation (τ = 1.0, ϕ = 3.5), while in Case 2, it possessed gradual autocorrelation (τ = 0.4, ϕ = 3.5). The other true parameters were β = 1.5 and α_F = 0.0 (no forgetting). In each case of each task, the simulations were run 20 times. The resulting data were regarded as those from 20 hypothetical subjects.
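The factual-learning data generation described above can be sketched as follows. This is a minimal Python version covering the structure of Case 3 (asymmetric learning rates with no choice-autocorrelation factor and no forgetting); the perseverance terms (ϕ, τ) are omitted for brevity, and the function names are ours:

```python
import numpy as np

def make_schedule():
    """Winning probabilities (option 1, option 2) for the four contexts
    of one session, following the task design described above."""
    sym = [(0.5, 0.5)] * 24
    asym1 = [(0.75, 0.25)] * 24
    asym2 = [(0.25, 0.75)] * 24
    rev = [(0.83, 0.17)] * 12 + [(0.17, 0.83)] * 12
    return [sym, asym1, asym2, rev]

def simulate_factual(alpha_pos, alpha_neg, beta=3.0, n_sessions=2, seed=0):
    """Factual-learning data generation: action values are reset to zero
    at the start of each context; outcomes are +1 (win) or -1 (loss)."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_sessions):
        for context in make_schedule():
            q = np.zeros(2)  # values reset for each new cue pair
            for p_win in context:
                p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
                a = int(rng.random() < p1)
                r = 1.0 if rng.random() < p_win[a] else -1.0
                delta = r - q[a]
                # asymmetric update: learning rate depends on RPE sign
                q[a] += (alpha_pos if delta > 0 else alpha_neg) * delta
                data.append((a, r))
    return data  # 2 sessions x 4 contexts x 24 trials = 192 (a, R) pairs
```

Extending this sketch to the counter-factual task would add an update for the unchosen option's value from its presented outcome, with its own pair of learning rates.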
To estimate the model parameters, maximum likelihood estimation (MLE) was performed separately for the data from each hypothetical subject. MLE searches for a single parameter set that maximizes the log-likelihood of the model over all trials. This maximization was performed using the rsolnp 1.15 package, which implements the augmented Lagrange multiplier method with an SQP interior algorithm (Ghalanos & Theussl, 2011).
To evaluate the fitted models, we used two standard model selection criteria: Akaike's Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978). They are defined as

AIC = −2L + 2k,
BIC = −2L + k log T,

where L is the maximum value of the log-likelihood function for the model, k is the number of estimated parameters in the model, and T is the number of trials (here, T = 192). Both criteria introduce a penalty for the number of parameters in the model. For both criteria, a smaller value indicates a better model.
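In code, the two criteria amount to one line each (a sketch; `log_lik` stands for the maximized log-likelihood L above):

```python
import numpy as np

def aic(log_lik, k):
    """Akaike's Information Criterion: -2L + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n_trials):
    """Bayesian Information Criterion: -2L + k log(T)."""
    return -2.0 * log_lik + k * np.log(n_trials)
```

Since log(192) ≈ 5.26 > 2, BIC charges each extra parameter more than AIC in this task, which is why a model with one fewer parameter (such as Model 6) can win by BIC while losing by AIC.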

B.3. Model-neutral hypothesis test
The simulation that demonstrates the model-neutral hypothesis test is as follows. The data generation process was basically the same as in B.1; the data were generated by RL models performing a probabilistic reversal learning task. The exception was that, for each simulation, the choice data sets of 20 hypothetical subjects, each consisting of 200 trials, were generated from an RL model. The simulation was performed 1,000 times under each condition. The data from each simulation run were submitted to the proposed hypothesis test, as described in the main text.