Sticky me: Self-relevance slows reinforcement learning

A prominent facet of social-cognitive functioning is that self-relevant information is prioritized in perception, attention, and memory. What is not yet understood, however, is whether similar effects arise during learning. In particular, is information about the self acquired more rapidly than information about other people (e.g., a best friend)? To explore this matter, here we used a probabilistic selection task in combination with computational modeling (i.e., Reinforcement Learning Drift Diffusion Model analysis) to establish how self-relevance influences learning under conditions of uncertainty (i.e., choices are based on the perceived likelihood of positive and negative outcomes). Across two experiments, a consistent pattern of effects was observed. First, learning rates for both positive and negative prediction errors were slower for self-relevant compared to friend-relevant associations. Second, self-relevant (vs. friend-relevant) learning was characterized by the exploitation (vs. exploration) of choice selections. That is, in a complex (i.e., probabilistic) decision-making environment, previously rewarded self-related outcomes were selected more often than novel (but potentially riskier) alternatives. The implications of these findings for accounts of self-function are considered.


Introduction
The self is an indispensable psychological construct, providing coherence and continuity to the narrative that underpins a personal sense of being (Baars, 1988; Baumeister, 1998; Conway, 2005; Conway & Pleydell-Pearce, 2000; Gallagher, 2000; James, 1890; Markus & Nurius, 1986; Markus & Wurf, 1987; Oakley & Halligan, 2017). As Markus and Wurf (1987, pp. 299-300) reported, "the self-concept…interprets and organizes self-relevant actions and experiences, it has motivational consequences, providing the incentives, standards, plans, rules, and scripts for behavior; and it adjusts in response to challenges from the social environment." In other words, the self serves as a basic processing hub around which social-cognitive functioning unfolds.
Despite an extensive literature cataloguing the effects of self-relevance on core components of social cognition, important issues nevertheless remain. In particular, aside from a few notable exceptions, research has largely overlooked the extent to which the personal significance of stimuli influences a fundamental and crucial facet of daily life: the rate at which information is learned (Liao, Huang, & Luo, 2021; Lockwood et al., 2018). That is, just as self-relevance facilitates the detection, appraisal, and memorability of stimuli (Symons & Johnson, 1997), so too it may enhance how rapidly this material is acquired. In one of the few studies to explore this matter, Lockwood et al. (2018) adopted a deterministic associative-learning task in which participants had to learn, from a pool of fractals (i.e., abstract, unfamiliar stimuli), which items belonged to various social targets (Brovelli, Laksiri, Nazarian, Meunier, & Boussaoud, 2008; Schultz, Dayan, & Montague, 1997). Specifically, a single fractal appeared on each experimental trial and participants had to report (i.e., learn) whether the stimulus was owned by the self, a friend, or a stranger. Feedback was then provided indicating if the response was correct or incorrect, and the task was deterministic in that participants were told each target always possessed the same fractals. To establish the respective target-related learning rates, data were submitted to an associative learning (AL) algorithm (Sutton & Barto, 1998). Lockwood et al.'s (2018) findings were revealing. Reflecting the operation of an egocentric decisional strategy (Epley & Gilovich, 2004; Golubickis et al., 2018, 2019), participants tended to report that the fractal presented on the first trial belonged to them, when in reality it was just as likely to be owned by either of the other targets. 
In addition, responses were faster and more accurate when learning about fractals owned-by-self compared to those that belonged to others. Finally, learning rates were higher when acquiring knowledge about the self, although this effect was only significant when stranger comprised the target of comparison; learning rates for self and friend were comparable. The absence of a reliable difference in learning rates between self and friend is interesting because, although a self-advantage has frequently been reported when the target of comparison is best friend (e.g., Ma & Han, 2010; Sui et al., 2012; Sui & Han, 2007; Sui, Rotshtein, & Humphreys, 2013; Zhu, Zhang, Fan, & Han, 2007), some research has indicated that the benefits of personal-relevance can be attenuated, or even eliminated, when the self is compared with an intimate (i.e., highly familiar) other (Bower & Gilligan, 1979; Kuiper & Rogers, 1979; Symons & Johnson, 1997). Notwithstanding this observation, Lockwood et al. (2018) provided initial evidence for the biasing effects of self-relevance on aspects of associative learning. Building upon and extending prior research, here we also explored the extent to which the personal relevance (or otherwise) of material impacts learning. Our overarching objectives were to probe the characteristics of self-learning effects in a different task context (i.e., learning environment) and to establish the pathway through which these effects arise. In so doing, rather than adopting a deterministic learning paradigm, a probabilistic selection task (PST) was employed (Frank, Moustafa, Haughey, Curran, & Hutchinson, 2007; Frank, Seeberger, & O'Reilly, 2004). We used this task for two reasons. First, the PST explores reinforcement learning (RL) in uncertain (vs. certain) task environments (cf. Lockwood et al., 2018), and thus examines the impact of self-relevance when knowledge is acquired under demanding decision-making conditions. 
It is possible, for example, that basic components of self-representation and self-function may prompt learning effects to diverge when studied in uncertain (i.e., probabilistic) compared to certain (i.e., deterministic) task settings (Gershman & Daw, 2017). Second, in combination with recent developments in computational modeling (i.e., Reinforcement Learning Drift Diffusion Model (RL-DDM) analysis), adoption of the PST enables identification of the latent psychological processes that underpin RL (Fontanesi, Gluth, Spektor, & Rieskamp, 2019;Pedersen & Frank, 2020;Pedersen, Frank, & Biele, 2017).
In the current PST, participants were presented with three different stimulus pairs (i.e., AB, CD, EF) comprising symbols (i.e., Japanese Hiragana characters; see Frank et al., 2007), with an item in each pairing (i.e., A, C, E) representing either the self or a friend, and they were required to learn, following a series of choice selections, which of the symbols was most likely to denote each target based on feedback that was provided (see Fig. 1). Critically, the feedback was probabilistic and varied for each stimulus pair (i.e., AB = 80%-20%, CD = 70%-30%, EF = 60%-40%). For example, in AB trials, a choice of stimulus A led to positive feedback on 80% of the trials, whereas selecting stimulus B resulted in positive reinforcement on 20% of the trials. Thus, in this PST, learning was accomplished via choice-related feedback. Over numerous choice selections, participants learned which item in each pairing was more likely to be correct (i.e., represent self or friend; A, C, E rather than B, D, F) and the task was completed when sufficient levels of accuracy were achieved for each stimulus pair (Frank et al., 2004; Frank et al., 2007). To identify the mechanisms underpinning learning, computational modeling was undertaken on the data. Specifically, based on recent developments, a Reinforcement Learning Drift Diffusion Model (RL-DDM) analysis was adopted (Fontanesi et al., 2019; Pedersen et al., 2017; Pedersen & Frank, 2020). Integrating sequential sampling and RL models, the RL-DDM pinpoints the psychological operations that underpin decision-making (i.e., choice selection) and how these are adjusted as learning progresses (Miletić, Boag, & Forstmann, 2020; Pedersen & Frank, 2020; Ratcliff, Smith, Brown, & McKoon, 2016). This is realized through the simultaneous hierarchical Bayesian modeling of response time (RT) and choice data. 
A drift rate scaling parameter (v_scaling) measures sensitivity to feedback and the exploration-exploitation trade-off (Cohen, McClure, & Yu, 2007), such that higher values indicate more confident learning based on current knowledge (Pedersen et al., 2017). A learning rate parameter (η), ranging from zero to one, quantifies how quickly individuals learn, with larger values indicating utilization of current feedback (i.e., fast learning), and smaller values reflecting reduced updating from recently experienced outcomes (i.e., slow learning). In this respect, either a single learning rate (η) that captures all learning, or separate learning rates for negative and positive prediction errors (η− and η+, respectively) can be estimated (Miletić et al., 2020; Pedersen et al., 2017). Finally, the model also establishes how much evidence is needed to make a decision (i.e., threshold separation, a) and the efficiency of non-decisional processes (e.g., stimulus encoding, response execution, t0).
Central to the current inquiry is the classic exploration-exploitation trade-off that underlies learning (Cohen et al., 2007;Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006;Sutton & Barto, 1998). Confronted with a decision-making dilemma, learning can entail either the exploitation of options that have been optimal in the past or the exploration of alternatives that, in the long run, may prove to be more rewarding (Cohen et al., 2007). That is, one can either stick with existing knowledge or try something new. Critically, whereas exploration generally facilitates the acquisition of information, exploitation yields immediate decisional rewards, but it may impair learning (Sutton & Barto, 1998). As such, whether self-relevance enhances or reduces learning relative to a target of comparison (e.g., friend) should be reflected in decisions to explore or exploit the choice selections during RL. In this regard, an interesting possibility is that, in complex (i.e., probabilistic) task settings, people may prefer to stick (i.e., exploit) rather than switch (i.e., explore) when to-be-learned material is self-relevant, thereby prompting a slower learning rate for information pertaining to the self compared to others (cf. Lockwood et al., 2018). Several strands of evidence suggest such an outcome.
According to one account, via enhanced binding, self-reference serves as a form of associative glue for perception, attention, and memory (Cunningham, Turk, Macdonald, & Macrae, 2008; Rogers et al., 1977; Sui et al., 2012; Wang et al., 2016). Forming (and probing) target-object associations through ownership is a common methodology to explore self-prioritization (Constable et al., 2019; Constable, Kritikos, & Bayliss, 2011; Constable, Kritikos, Lipp, & Bayliss, 2014; Falbén et al., 2019, 2020; Golubickis et al., 2018, 2019). While generally facilitating information processing and response selection, these potent self-object associations can also impede performance in certain task contexts. For example, participants find it difficult to overcome prior self-shape (vs. friend-shape) associations when given the task of forming new relations (Wang et al., 2016) and display a stubborn preference for self-relevant (vs. other-relevant) items during decision-making (Constable et al., 2019; Golubickis et al., 2018, 2019; Lockwood et al., 2018). Although such sticky learning undoubtedly supports the maintenance of a stable self-concept, an essential component of social-cognitive functioning (Greenwald, 1980; Markus, 1977), it also suggests that exploitation rather than exploration may be the preferred strategy when acquiring information pertaining to the self in uncertain (i.e., probabilistic) learning environments. That is, previously rewarded self-object associations may be selected more often than novel (but riskier) options, thereby reducing the learning rate for the acquisition of personally meaningful material. Accordingly, using a PST in conjunction with computational modeling, here we explored the possibility that self-relevance may slow RL relative to an optimal target of comparison (e.g., best friend).

Participants and design
Fifty participants (33 females, 17 males, 3 others; M age = 23.04, SD = 3.06), with normal or corrected-to-normal visual acuity, took part in the research. Data collection was conducted online using Prolific Academic (http://www.prolific.co), with each participant receiving compensation at the rate of £7.50 (~$10) per hour. Informed consent was obtained from participants prior to the commencement of the experiment and the protocol was reviewed and approved by the Ethics Committee at the School of Psychology, University of Plymouth. The experiment had a single factor (Correct Symbol: self or friend) repeated-measures design. To detect a significant effect, a sample of fifty participants afforded 92% power for a large effect size (i.e., d = 0.80; PANGEA, v 0.0.2).

Stimulus materials and procedure
Participants performed two versions of a PST (Frank et al., 2004; Frank et al., 2007), with each comprising a learning phase in which three pairs of symbols (denoted as AB, CD, and EF, see Fig. 1) were presented. Participants were instructed to learn, based on feedback provided, which symbol in each pair was most likely to represent them (i.e., self) or their best friend. Following previous research, prior to the task, participants were requested to bring their best friend (i.e., target of comparison) to mind. Participants were informed that, after each choice selection, onscreen information would indicate whether their response was correct or incorrect. Half of the participants were randomly assigned to perform a version of the PST in which self-related symbols were more likely to be correct, followed by another version of the task in which friend-related items were more likely to comprise the correct response. That is, trial type (i.e., learning) was blocked by target. The order of the PSTs was reversed for the remaining participants.
The probabilities indicating which symbol was more likely to be correct followed the standard version of the PST (Frank et al., 2004; Frank et al., 2007). Specifically, for the AB pair, A was 80% likely to be correct (20% for B), for the CD pair, C was 70% likely to be correct (30% for D), and finally, for the EF pair, E was 60% likely to be correct (40% for F). Over numerous choice selections, participants learned which item in each pairing was more likely to be correct (i.e., A, C, E rather than B, D, F) based on the feedback provided. The task was completed when participants reached sufficient levels of accuracy for each pairing (i.e., AB, 60% or above; CD, 55% or above; EF, 50% or above; Frank et al., 2004; Frank et al., 2007). Each trial began with the presentation of a pair of symbols that remained on the screen until the participant made a response. After the participant selected one of the symbols, feedback (i.e., the word 'Correct' in green or 'Incorrect' in red) was presented for 1000 ms, followed by a blank screen for 500 ms, after which the next trial commenced. Participants had to select a symbol by pressing the appropriate button on the keyboard (i.e., A for the symbol on the left side of the screen, L for the symbol on the right side of the screen). The symbols in each pair were equally likely to be presented on the left or right side of the screen. The experiment was conducted using Inquisit Web. Participants completed blocks of 60 trials in which each of the three stimulus pairs appeared randomly, equally often, until accuracy reached a satisfactory level. The maximum number of learning blocks was set to six (i.e., 360 trials in total) if the participant did not reach satisfactory levels of accuracy earlier in the task (Frank et al., 2007). On completion of the experiment, participants were debriefed and thanked.
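The feedback schedule and stopping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Inquisit script used in the experiment: the function names are hypothetical, and evaluating the accuracy criteria once per 60-trial block is an assumption.

```python
import random

# Probability that choosing each symbol yields "Correct" feedback (from the text).
REWARD_PROB = {"A": 0.80, "B": 0.20, "C": 0.70, "D": 0.30, "E": 0.60, "F": 0.40}
PAIRS = [("A", "B"), ("C", "D"), ("E", "F")]
# Minimum accuracy per pair for the learning phase to end (from the text).
CRITERIA = {("A", "B"): 0.60, ("C", "D"): 0.55, ("E", "F"): 0.50}


def feedback(choice, rng):
    """Return True ('Correct') with the probability attached to the chosen symbol."""
    return rng.random() < REWARD_PROB[choice]


def make_block(rng, trials_per_pair=20):
    """Build one 60-trial block: each pair appears equally often, in random order."""
    block = [pair for pair in PAIRS for _ in range(trials_per_pair)]
    rng.shuffle(block)
    return block


def criterion_met(choices_by_pair):
    """choices_by_pair maps each pair to the symbols chosen for it in a block.

    Accuracy is the share of choices of the more frequently rewarded symbol
    (the first element of each pair). Checking per block is an assumption.
    """
    for pair, chosen in choices_by_pair.items():
        accuracy = sum(c == pair[0] for c in chosen) / len(chosen)
        if accuracy < CRITERIA[pair]:
            return False
    return True
```

In use, a simulated learner would work through `make_block(rng)`, receive `feedback(choice, rng)` after each choice, and stop once `criterion_met` returns True or six blocks have elapsed.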

Behavioral analysis
The mean latency and accuracy of choice selections were submitted to a paired-sample (Correct Symbol: self or friend) t-test (two-tailed). No significant difference emerged on either dependent measure (i.e., decision time: M self = 1203 ms vs. M friend = 1148 ms; learning performance: M self = 68% vs. M friend = 66%).

Modeling analysis
To identify the processes underpinning learning, data were submitted to an RL-DDM analysis (Frank et al., 2015; Pedersen & Frank, 2020; Pedersen et al., 2017). This analysis combines the strengths of RL and sequential-sampling models (SSMs) to elucidate the operations that support task performance. Specifically, although RL models account for changes in the relative proportion of choice probabilities over the course of learning, they do not speak to concurrent differences in response latencies, a fundamental and important dimension of the available data (e.g., as learning takes place, decision times decrease). In this respect, SSMs (e.g., drift diffusion model; Ratcliff et al., 2016; Smith & Ratcliff, 2004) are useful as they provide a mechanistic account of binary decision-making by explaining how choice accuracy and response latencies collectively arise from a common set of latent cognitive processes (e.g., rate of evidence accumulation, response caution). Thus, crucially, the RL-DDM extends standard RL models by explicating the processes through which learning unfolds over time (Fontanesi et al., 2019; Miletić et al., 2020; Pedersen & Frank, 2020; Pedersen et al., 2017). Two significant modifications characterize the RL-DDM. First, the typical choice rule for reinforcement learning (i.e., softmax) is replaced by the drift diffusion model (i.e., Wiener process, see Miletić et al., 2020; Pedersen et al., 2017). This change is important as it affords the possibility to model choice and RT data simultaneously. Second, the algorithm that captures the learning of subjective expectation values from stimuli and actions (i.e., value-based approach) is integrated into the process of evidence accumulation (i.e., drift rate). 
Thus, applying the delta learning rule, the model initially describes the updating of the expected Q-value for a chosen option (e.g., positively reinforced symbol A) based on the reward prediction error (i.e., the difference between observed and expected feedback) in the previous trial, scaled by the learning rate (α; Rescorla & Wagner, 1972; Watkins & Dayan, 1992; see Eq. (1)). Subsequently, the RL-DDM formulates the drift rate (v) during reinforced decisions based on the difference between the expected value of positively (Q_positively-reinforced) and negatively (Q_negatively-reinforced) reinforced choices. To accommodate the manner in which this knowledge is used, the RL-DDM allows an additional free scaling parameter to be estimated (i.e., drift rate scaling, v_scaling). This scaling parameter is similar to inverse temperature in the softmax choice rule and reflects the level of exploration/exploitation during learning (Pedersen & Frank, 2020), such that larger values reflect stronger exploitation of the option with the highest expected value (see Eq. (2)).
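As a sketch, Eqs. (1) and (2) can be written as follows from the verbal description above. The trial index i and option subscript o are notational assumptions, and α is taken here to be the same learning rate that is reported as η elsewhere in the text.

```latex
\begin{align}
Q_{o,i} &= Q_{o,i-1} + \alpha \,\bigl(\mathrm{Reward}_{o,i-1} - Q_{o,i-1}\bigr) \tag{1}\\
v_{i} &= \bigl(Q_{\text{positively-reinforced},\,i} - Q_{\text{negatively-reinforced},\,i}\bigr) \times v_{\text{scaling}} \tag{2}
\end{align}
```

Under this form, the bracketed term in Eq. (1) is the reward prediction error, and Eq. (2) shows why a larger v_scaling converts the same difference in expected values into a steeper drift toward the higher-valued option.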
Thus, in essence, the RL-DDM assumes that evidence is gathered for each choice option (e.g., symbol A vs. symbol B) until a critical evidential threshold is reached, at which point a response is made. This response threshold is captured by the boundary separation (a) parameter, and it reflects speed-accuracy trade-offs during decision-making. For example, if a conservative (vs. liberal) decision-making style (i.e., higher evidential requirements) is adopted, this would yield slower but more accurate responses. At the start of the PST, participants make slow guesses as the stimuli have not yet been reinforced, thus the difference in expected values between symbol pairings is extremely low (i.e., slow evidence accumulation due to high uncertainty). As participants start to receive feedback, via application of the delta learning rule (Rescorla & Wagner, 1972), the subjective Q-values of positively/negatively reinforced stimuli increase/decrease. The speed at which participants update the expected values is described by the learning rate (η) parameter. On a trial-by-trial basis, this knowledge (i.e., learning which symbol is correct, Q-value) is integrated into the drift rate such that over time the difference in expected values between reinforced options (ACE vs. BDF symbol pairings) increases. The larger the difference between positively and negatively reinforced options, the easier (i.e., faster and more accurate) choice selection becomes (i.e., fast information sampling).
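The trial-by-trial process described above can be illustrated with a small simulation. This is a minimal sketch and not the HDDM implementation used for the analysis: the function names, the neutral starting values of 0.5, the unbiased starting point, the unit noise, and the Euler step size are all assumptions made for illustration.

```python
import random


def update_q(q, reward, eta_pos, eta_neg):
    """Delta rule with separate learning rates for positive and negative prediction errors."""
    prediction_error = reward - q
    eta = eta_pos if prediction_error >= 0 else eta_neg
    return q + eta * prediction_error


def simulate_trial(q_upper, q_lower, v_scaling, a, t0, rng, dt=0.001, noise=1.0):
    """Simulate one diffusion decision; returns (chose_upper, response_time_in_seconds)."""
    drift = (q_upper - q_lower) * v_scaling  # drift rate from the value difference
    evidence = a / 2.0  # unbiased starting point (assumed)
    elapsed = 0.0
    while 0.0 < evidence < a:
        evidence += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        elapsed += dt
    return evidence >= a, t0 + elapsed


def simulate_pair(n_trials, p_upper, p_lower, eta_pos, eta_neg, v_scaling, a, t0, seed=0):
    """Simulate learning for one stimulus pair (e.g., AB with p_upper=0.8, p_lower=0.2)."""
    rng = random.Random(seed)
    q_upper, q_lower = 0.5, 0.5  # neutral initial expectations (assumed)
    history = []
    for _ in range(n_trials):
        chose_upper, rt = simulate_trial(q_upper, q_lower, v_scaling, a, t0, rng)
        p_reward = p_upper if chose_upper else p_lower
        reward = 1.0 if rng.random() < p_reward else 0.0
        # Only the chosen option's expected value is updated.
        if chose_upper:
            q_upper = update_q(q_upper, reward, eta_pos, eta_neg)
        else:
            q_lower = update_q(q_lower, reward, eta_pos, eta_neg)
        history.append((chose_upper, rt, reward))
    return history, q_upper, q_lower
```

Early in a simulated run the two Q-values are equal, so the drift is zero and choices are slow guesses; as the values separate, the drift grows and responses become faster and more accurate, which is the qualitative pattern the paragraph describes.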
To estimate model parameters, an extension of the Bayesian hierarchical drift diffusion toolbox was adopted (Wiecki, Sofer, & Frank, 2013). Models were response-coded, such that the upper threshold corresponded to responses to stimuli that were positively reinforced (i.e., symbols corresponding to the letters A, C, & E) and the lower threshold to stimuli that were negatively reinforced (i.e., symbols corresponding to the letters B, D, & F; Pedersen & Frank, 2020). Bayesian posterior distributions were modeled using a Markov chain Monte Carlo (MCMC) with 10,000 samples (including 1,000 burn-in samples), with outliers (5% of the trials) removed by the HDDM software (Ratcliff & Tuerlinckx, 2002; Wiecki et al., 2013). Two RL-DDM models were estimated for comparison (i.e., single vs. dual learning rate model). In the first model, only a single learning rate (η) was allowed to vary across Correct Symbol (i.e., self vs. friend). This model examined whether there were differences in the speed of learning across the experimental conditions without taking the potential influence of different types of prediction error into consideration. In contrast, in the second model, learning rates for negative and positive prediction errors (η− and η+, respectively) were allowed to vary by Correct Symbol. As such, this model considered whether learning self-related or friend-related stimuli was accelerated following negative or positive prediction errors. In both models, drift rate scaling (v_scaling) and boundary separation (a) varied across Correct Symbol.
Model comparison was performed using the Deviance Information Criterion (DIC) as this approach is routinely adopted when comparing hierarchical Bayesian models (Spiegelhalter, Best, Carlin, & van der, 1998). Lower DIC values favor models with the highest likelihood and fewest parameters. This revealed better fit for the dual (DIC: 60999) compared to the single (DIC: 61059) learning rate model. Examination of the posterior distributions (see Fig. 2) revealed differences in learning rates for negative and positive prediction errors (η− and η+), drift rate scaling (v_scaling), and threshold separation (a). Specifically, comparisons yielded very strong evidence that learning rates were faster for friend compared to self, both for negative (p_Bayes(self < friend) = 0.032, BF10 = 30) and positive (p_Bayes(self < friend) < 0.001, BF10 > 1000) prediction errors. In addition, participants integrated information more efficiently from negative than positive prediction errors, an effect that was larger for self (p_Bayes(η+ < η−) = 0.008, BF10 = 125) than friend (p_Bayes(η+ < η−) = 0.162, BF10 = 6). There was also very strong evidence that drift rate scaling (v_scaling) was larger for self-related than friend-related symbols (p_Bayes(self > friend) = 0.019, BF10 = 52). Finally, for boundary separation (a), there was extremely strong evidence that more decisional information was required when selecting self-related compared to friend-related responses (p_Bayes(self > friend) < 0.001, BF10 > 1000).
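The reported Bayes factors appear to track the posterior odds implied by the accompanying probabilities, that is, approximately (1 − p) / p. The sketch below illustrates that relationship under two assumptions that the text does not state explicitly: that p_Bayes is the proportion of paired posterior draws contradicting the stated direction, and that BF10 is computed as the corresponding odds. The function names are hypothetical.

```python
def tail_proportion(draws_a, draws_b):
    """Share of paired posterior draws that contradict the hypothesis a < b."""
    if len(draws_a) != len(draws_b) or not draws_a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a >= b for a, b in zip(draws_a, draws_b)) / len(draws_a)


def posterior_odds(p):
    """Odds in favor of the hypothesis given the contradicting proportion p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (1.0 - p) / p


# Reported probabilities from the text, compared with the reported Bayes factors.
for p, reported in [(0.032, 30), (0.019, 52)]:
    print(p, round(posterior_odds(p)), reported)
```

Because the published probabilities are rounded, the odds recomputed from them will not always match the reported Bayes factors exactly (e.g., 0.008 gives about 124 rather than 125).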
These findings reveal that, in a probabilistic task context (Frank et al., 2004, 2007), self-relevance (vs. friend-relevance) reduced the rate of learning. In addition, the RL-DDM analysis also indicated a difference in the balance between the strategies that drive learning: exploration and exploitation (Cohen et al., 2007; Sutton & Barto, 1998). Specifically, as indexed by the drift rate scaling parameter (v_scaling), self-relevant (vs. friend-relevant) trials were characterized by the tendency to exploit previously rewarded outcomes rather than explore new alternatives. In other words, self-relevance elicited a greater sensitivity to current outcomes (i.e., existing knowledge) during learning (Pedersen et al., 2017).
To probe the reproducibility of these effects, in our next experiment we also explored how self-relevance influenced learning in a PST (Frank et al., 2004; Frank et al., 2007), but with an important methodological modification. Rather than blocking the PST (i.e., learning) by target, participants simultaneously learned about self and friend in an intermixed design, as previous research has demonstrated that self-relevance exerts a greater influence on decisional processing under these conditions. Replicating Experiment 1, we expected self-relevance (vs. friend-relevance) to reduce the rate of learning and favor exploitation (vs. exploration) of the choice selections.

Participants and design
Thirty-four participants (22 females, 10 males, 2 others; M age = 22.97, SD = 2.62), with normal or corrected-to-normal visual acuity, took part in the research. Data collection was conducted online using Prolific Academic (http://www.prolific.co), with each participant receiving compensation at the rate of £7.50 (~$10) per hour. Informed consent was obtained from participants prior to the commencement of the experiment and the protocol was reviewed and approved by the Ethics Committee at the School of Psychology, University of Plymouth. The experiment had a single factor (Correct Symbol: self or friend) repeated-measures design. To detect a significant effect, a sample of thirty-four participants afforded 80% power for a large effect size (i.e., d = 0.80; PANGEA, v 0.0.2).

Stimulus materials and procedure
A modified version of the PST from Experiment 1 was adopted. Specifically, on a trial-by-trial basis, participants were required to learn which symbol in each pairing was more likely to represent self or best friend. Before the presentation of each stimulus pair, a cue (i.e., the labels "YOU" or "FRIEND") appeared on the screen indicating the target to which the symbols pertained (see Fig. 3). The cue appeared 500 ms before the symbols and remained on the screen, above the stimuli, until a response was made. Participants completed blocks of 120 trials (i.e., 60 self and 60 friend) in which each stimulus pair appeared randomly, equally often, until accuracy reached a satisfactory level. The maximum number of learning blocks was set to three (i.e., 360 trials in total) if the participant did not reach satisfactory levels of accuracy earlier in the task (Frank et al., 2007). In all other respects, the procedure was identical to Experiment 1.

Behavioral analysis
Four participants (3 females) failed to learn the probabilities associated with the symbols, thus were excluded from the analyses. The mean latency and accuracy of choice selections were submitted to a paired-sample (Correct Symbol: self or friend) t-test (two-tailed). The analysis of choice latencies revealed faster responses to self-related compared to friend-related symbols (t(29) = 2.77, p = .010, d = 0.51; respective Ms: 1546 ms vs. 1689 ms). In addition, accuracy was greater for self-related than friend-related stimuli (t(29) = 3.39, p = .002, d = 0.62; respective Ms: 70% vs. 63%).
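The reported effect sizes are consistent with the standardized mean difference for paired designs, d = t / sqrt(n), with n = 30 (34 participants minus the 4 excluded). The text does not state which formula was used, so the short check below treats that convention as an assumption; the function name is hypothetical.

```python
import math


def cohens_dz(t_value, n_pairs):
    """Effect size for a paired-samples t-test: d_z = t / sqrt(n)."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return t_value / math.sqrt(n_pairs)


n = 29 + 1  # degrees of freedom plus one
print(round(cohens_dz(2.77, n), 2))  # choice latency
print(round(cohens_dz(3.39, n), 2))  # accuracy
```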

Modeling analysis
To identify the processes underpinning learning, data were submitted to an RL-DDM analysis following the same modeling procedure as Experiment 1. As previously, fit was better for the dual (DIC: 43524) compared to the single (DIC: 43541) learning rate model. Examination of the posterior distributions (see Fig. 4) revealed differences in learning rates for negative and positive prediction errors (η− and η+), drift rate scaling (v_scaling), and threshold separation (a). Specifically, comparisons yielded very strong evidence that learning rates were faster for friend compared to self, both for negative (p_Bayes(self < friend) = 0.011, BF10 = 90) and positive (p_Bayes(self < friend) = 0.005, BF10 = 199) prediction errors. As in Experiment 1, participants integrated information more efficiently from negative than positive prediction errors, an effect that was larger for self (p_Bayes(η+ < η−) = 0.03, BF10 = 33) than friend (p_Bayes(η+ < η−) = 0.10, BF10 = 10). There was also extremely strong evidence that drift rate scaling (v_scaling) was larger for self-related than friend-related symbols (p_Bayes(self > friend) < 0.001, BF10 > 1000). Finally, for boundary separation (a), there was extremely strong evidence that more decisional information was required when selecting friend-related compared to self-related responses (p_Bayes(self < friend) < 0.001, BF10 > 1000).
Using a different experimental design, these findings replicated the effects observed in Experiment 1. First, for both negative and positive prediction errors, learning rates were slower for self-related compared to friend-related symbols. Second, reflecting a greater reliance on existing knowledge (i.e., sensitivity to current outcomes), self-relevant (vs. friend-relevant) trials were characterized by the tendency to exploit previously rewarded outcomes rather than explore new choice selections (Pedersen et al., 2017). Interestingly, unlike Experiment 1 in which response caution was greater for self-relevant compared to friend-relevant symbols, this effect was reversed in the current experiment. This reversal can likely be traced to task-specific differences in the presentation of the stimulus trials during the PST (i.e., Expt. 1: blocked by target; Expt. 2: intermixed).

General discussion
Notwithstanding the acknowledged benefits that self-relevance exerts on information processing and response selection (Symons & Johnson, 1997), here we demonstrated a quite different effect. In the context of a PST, self-relevance (vs. friend-relevance) reduced the rate at which information was acquired. Specifically, whether stimuli were blocked by target (Expt. 1) or intermixed (Expt. 2), learning rates were slower for self-related compared to friend-related associations. In addition, self-relevant (vs. friend-relevant) learning was characterized by the tendency to exploit rather than explore the choice selections during the task (Cohen et al., 2007; Sutton & Barto, 1998). This indicates that, in a complex (i.e., probabilistic) decision-making setting, previously rewarded self-related outcomes were chosen more often than novel (but potentially riskier) choice selections. In other words, when learning about the self (vs. friend), participants tended to rely on their existing knowledge, thereby trading enhanced future learning for guaranteed current rewards (Pedersen et al., 2017).
Fig. 3. Examples of the experimental trials.
That self-relevance has the capacity to impair performance in certain task contexts is unsurprising. Forging immediate and powerful target-object associations in working memory, personal-relevance (vs. friend-relevance) yields substantial processing benefits when responding is driven by the enhanced accessibility of these relations. That is, highly accessible self-object associations, even when the stimuli in question are unfamiliar and trivial, give rise to rapid and accurate responses (e.g., Golubickis et al., 2017; Golubickis et al., 2020; Schäfer et al., 2016; Schäfer, Wentura, & Frings, 2017; Stein, Siebold, & van Zoest, 2016; Sui et al., 2012, 2013; Woźniak & Knoblich, 2019). The strength of these sticky associations, however, can also hinder performance, particularly when participants must override previous learning experiences and acquire new target-object relations (Constable & Knoblich, 2020; Wang et al., 2016). For example, Wang et al. (2016) reported that, once self-shape associations were formed, participants found it difficult to break (i.e., undo) these relations and associate the shapes with a new target (e.g., friend). As they reported (p. 255), "…self-association can either enhance or disrupt processing, depending on whether new associations are assessed or whether old associations have to be discarded." By enhancing the binding of target-object relations, self-relevance has obvious implications for decision-making and learning, at least in settings in which these associations are a task-relevant component of the methodology (Caughey et al., 2021; Constable et al., 2019; Falbén et al., 2019; Woźniak & Knoblich, 2021). As demonstrated here, in a PST (Frank et al., 2004; Frank et al., 2007), learning rates were slower when material was self-relevant (vs. friend-relevant). Several factors probably contributed to the emergence of this effect. 
Most notably, by shifting the balance toward exploitation rather than exploration during RL, choice selections served both to bolster the stability of the self-concept and optimize response-related rewards. A basic component of social-cognitive functioning is the possession (and maintenance) of a stable self-concept (Greenwald, 1980; Markus, 1977). In this respect, favoring choice selections that previously were (correctly) associated with the self would unquestionably serve this objective.
In addition, the reward value of self-relevant (vs. friend-relevant) outcomes would similarly encourage exploitation over exploration (Cohen et al., 2007). According to Northoff and Hayes (2011), self-referential processing is underpinned by the intrinsic reward-related properties of self-relevant stimuli. Given the pivotal role of reward value during learning (Dayan & Balleine, 2002; Schultz, 1998; Sutton & Barto, 1998), exploiting formerly successful self-related outcomes would be particularly appealing (i.e., dopamine uptake), much more so than comparable friend-related responses or the exploration of novel choice selections. As such, although the precise relationship between self and reward remains a matter of continued scrutiny and debate (Stolte, Humphreys, Yankouskaya, & Sui, 2017), during probabilistic learning this connection is likely intimate. Interestingly, in each of the reported experiments, learning was more effective following negative than positive prediction errors, an effect that was most pronounced for the self (vs. friend). It is possible that the tendency to exploit rather than explore choice selections during self-related learning (i.e., sticky self-symbol associations) underpins this asymmetry. Future research should explore this possibility.
Although, in the current investigation, the rate of learning was slower for self-relevant compared to friend-relevant stimuli, it is unlikely this effect is immutable. Indeed, as noted earlier, Lockwood et al. (2018) reported that, during deterministic learning, personal (vs. other) associations were formed most rapidly, albeit only when a stranger comprised the target of comparison. For a familiar target of comparison (i.e., friend), self-other learning rates did not differ significantly. These inconsistent findings potentially derive from differences in self-function across probabilistic and deterministic learning environments (Gershman & Daw, 2017). In a fully certain (i.e., deterministic) world, exploration is not a viable strategy as pursuing new choice selections following positive feedback would impair performance. In contrast, in probabilistic settings (e.g., PSTs) feedback is accompanied by uncertainty (Frank et al., 2004; Frank et al., 2007), thereby moderating the balance between the competing strategies that drive choice selections (i.e., exploration-exploitation trade-off). As was observed in the current experiments, self-relevant (vs. friend-relevant) learning was characterized by the tendency to exploit rather than explore the response-related outcomes, such that potentially enhanced knowledge acquisition was traded for the certainty of immediate rewards (Cohen et al., 2007). This suggests that, depending on the characteristics of the learning environment (i.e., deterministic vs. probabilistic), self-relevance can exert quite different effects on RL.
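The exploration-exploitation trade-off described above can be illustrated with a brief sketch. In a drift diffusion process with an unbiased starting point and unit diffusion noise (both simplifying assumptions made here for illustration), the probability of reaching the boundary for the higher-valued option is a logistic function of boundary separation multiplied by drift rate. The function name and all parameter values below are hypothetical and are not estimates from the current experiments.

```python
import math


def p_choose_better(q_better, q_worse, scaling, boundary):
    """Probability of selecting the higher-valued option in a diffusion
    process with an unbiased start point and unit noise (assumed)."""
    # Drift rate is the scaled difference in expected value.
    drift = scaling * (q_better - q_worse)
    # Closed-form hitting probability for the upper boundary.
    return 1.0 / (1.0 + math.exp(-boundary * drift))


# Hypothetical expected values for two options and a fixed boundary separation.
q_better, q_worse, boundary = 0.7, 0.4, 1.5

for scaling in (0.0, 2.0, 6.0):
    p = p_choose_better(q_better, q_worse, scaling, boundary)
    print(f"scaling={scaling}: P(choose higher-valued option)={p:.3f}")
```

With a scaling of zero, choices are insensitive to expected value (pure exploration). As the scaling increases, the currently higher-valued option is selected more consistently, which corresponds to exploitation.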
Operating in this flexible way, learning mirrors the other domains in which the effects of self-relevance have been explored (e.g., attention, memory, decision-making). Inspection of a rapidly developing literature reveals the inherent malleability of self-prioritization and the divergent cognitive origins of self-bias. Specifically, whether self-prioritization facilitates or impedes performance, or indeed arises at all, is highly contingent upon the way in which self-object associations are operationalized, established, and probed (Caughey et al., 2021; Constable et al., 2019; Falbén et al., 2019, 2020; Golubickis et al., 2020; Macrae, Visokomogilski, Golubickis, & Sahraie, 2018; Siebold, Weaver, Donk, & van Zoest, 2015; Stein et al., 2016; Svensson et al., 2021; Wang et al., 2016; Woźniak & Knoblich, 2021). Moreover, whereas in some task contexts self-relevance influences the efficiency of stimulus processing (Golubickis et al., 2020), in others it impacts response-related operations (Constable et al., 2019; Falbén et al., 2020; Golubickis et al., 2018, 2019). A useful task for future research will therefore be to establish how this contextual dependency modulates the acquisition of self-knowledge across learning environments that vary in important ways, including the identity and number of targets of comparison, the characteristics of the to-be-learned material, and the distribution of rewards (Haruno & Kawato, 2006; Knowlton, Squire, & Gluck, 1994; Lockwood et al., 2018).
Attention should also be directed to the task context in which information pertaining to the self and others is encountered. Here differences in response caution were observed across two instrumental learning experiments that differed in task structure. Specifically, whereas response caution was greater on self-relevant compared to friend-relevant trials when stimuli were blocked by target (i.e., Experiment 1), this effect was reversed when the trial types were intermixed (i.e., Experiment 2). Relatedly, both  and Desebrock et al. (in press) have similarly demonstrated the sensitivity of self-referential processing to the characteristics of the task environment. For example, using a shape-label matching task,  observed a reduction in self-prioritization when stimuli were intermixed compared to blocked by target. Extending this finding, again in a shape-label matching task but using unisensory and multisensory stimuli, Desebrock et al. (in press) found that self-prioritization was greatest when trials were blocked by sensory modality. Collectively, these findings highlight the contextual dependence of self-bias, a factor that has largely been overlooked in research to date.
Consideration should also be given to the neural mechanisms that support the learning of material pertaining to the self and others. For example, is the acquisition of person-related knowledge underpinned by the same associative operations that drive reward-based learning in nonsocial contexts? Given the established role of the prefrontal cortex (PFC) during self-referential processing (Kelley et al., 2002; Mitchell, Heatherton, & Macrae, 2002; Mitchell, Macrae, & Banaji, 2006; Sui et al., 2013), it is interesting to note that resolution of the exploration-exploitation dilemma is also associated with activation in this region (Blanchard & Gershman, 2018; Domenech, Rheims, & Koechlin, 2020). Specifically, whereas activity in the ventromedial PFC (vmPFC) indexes the subjective value of outcomes given the action plan that is currently in place, modulation in dorsomedial PFC (dmPFC) reflects a reduction in these values and the generation of new response-related strategies (Donoso, Collins, & Koechlin, 2014). In their investigation of the neural correlates of self-learning, Lockwood et al. (2018) reported that no brain area tracked exclusively with self-bias (i.e., self-ownership effect) during a deterministic learning task. Nevertheless, vmPFC responded more strongly to self-related compared to stranger-related (but not friend-related) associations. As the current experiments yielded differences in both learning rates and the drift-rate scaling parameter (i.e., exploration-exploitation trade-off) for self and friend, it would therefore be interesting to explore the neural mechanisms that underlie self/other learning during a PST. In such a task setting, distinct patterns of activation may emerge in the mPFC and other cortical regions that support learning (e.g., anterior cingulate cortex [ACC]; Kennerley, Walton, Behrens, Buckley, & Rushworth, 2006; Holroyd & McClure, 2015).

Conclusion
Using a PST in combination with an RL-DDM analysis, here we considered how self-relevance influences instrumental learning. Across two experiments, learning rates were slower for self-related compared to friend-related associations, and self-relevant (vs. friend-relevant) learning was characterized by exploitation (vs. exploration) of the choice selections. Together with related research (Lockwood et al., 2018), these findings affirm the utility of computational approaches in the investigation of core social-cognitive topics (Hackel & Amodio, 2018; Lockwood & Klein-Flügge, 2020). Continuing in this way, further research should clarify exactly when, how, and for whom self-relevance influences associative learning.