Testosterone eliminates strategic prosocial behavior through impacting choice consistency in healthy males

Humans are strategically more prosocial when their actions are being watched by others than when they act alone. Using a psychopharmacogenetic approach, we investigated the endocrinological and computational mechanisms of such audience-driven prosociality. One hundred and ninety-two male participants received either a single dose of testosterone (150 mg) or a placebo and performed a prosocial and self-benefitting reinforcement learning task. Crucially, the task was performed either in private or while being watched. Rival theories suggest that the hormone might either diminish or strengthen audience-dependent prosociality. We show that exogenous testosterone fully eliminated strategic, i.e., feigned, prosociality and thus decreased submission to audience expectations. We next performed reinforcement-learning drift-diffusion computational modeling to elucidate which latent aspects of decision-making testosterone acted on. The modeling revealed that testosterone, compared to placebo, did not impair reinforcement learning per se. Rather, when participants were being watched, the hormone altered the degree to which the learned information on choice value translated into action selection. Taken together, our study provides novel evidence of testosterone's effects on implicit reward processing, through which it counteracts conformity and deceptive reputation strategies.

Abnormally high baseline testosterone levels were observed only in sessions in which participants received testosterone treatment, not in the placebo group. Previous research [10,11] described similarly abnormal testosterone levels and attributed them to testosterone contamination of common surfaces (e.g., doorknobs, keyboards), excluding the option of physiological contamination. Based on their recommendation, we implemented a cleaning protocol that included the wearing of disposable sterile gloves and the cleaning of keyboards, computer mice, tables, and doorknobs with an alcohol-based solution after each session. Although these precautions successfully prevented between-session contamination, we suspect that they still did not reliably impede within-session contamination of the saliva containers. For future studies, we therefore recommend even stricter sanitizing protocols and more careful handling of the saliva collection tubes and boxes before, during, and after sample collection.
In our sample, the abnormally high values were present only in the sessions where testosterone was administered. Notwithstanding this, the testosterone group showed a reliable testosterone increase after drug administration in comparison to the placebo group. We therefore decided to retain the participants with contaminated baseline samples for the behavioral analyses, except for the analysis that includes baseline testosterone levels.

The effect of salivary testosterone levels on the correct choice.
To examine whether salivary testosterone levels measured in the saliva samples taken before the start of the experimental task (i.e., 2 hours after drug administration) predicted participants' behavior, we included the mean-centered log-transformed testosterone levels as a predictor in interaction with the factors recipient and visibility in the GzLMM of the correct choice. The analysis revealed neither a main effect of testosterone levels (OR = 0.99, CI = [0.96, 1.03], p = .624) nor a significant interaction effect on correct choice (recipient x drug treatment x testosterone levels: OR = 1.02, CI = [0.99, 1.05], p = .162). The absence of a significant association between salivary testosterone concentrations and behavior is in line with studies pointing out that although salivary testosterone measurements correlate with the hormone concentration in serum, they do not precisely track the availability of free serum testosterone after transdermal application [11,12]. Thus, although the between-groups comparison of salivary testosterone provides a manipulation check of topical drug administration, salivary testosterone measures may presently lack the sensitivity needed to relate post-administration hormone levels to behavior at the individual level.
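For illustration, an analysis of this style can be sketched in Python with statsmodels; the column names below are hypothetical, and the participant-level random effects of the full GzLMM are omitted for brevity, so this is a simplified stand-in rather than the study's analysis code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data; all column names are illustrative only.
# correct: 0/1 choice accuracy; testo_c: mean-centered log testosterone;
# recipient, drug, visibility: categorical factors.
df = pd.read_csv("trial_data.csv")

# Fixed-effects-only approximation of the GzLMM (logit link); the published
# analysis additionally modeled participant-level random effects.
fit = smf.logit(
    "correct ~ testo_c * C(recipient) * C(drug) * C(visibility)",
    data=df,
).fit()
print(fit.summary())
```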

Interaction of salivary cortisol levels with testosterone effects on the correct choice and RLDDM parameters.
To examine whether cortisol levels interacted with testosterone's effect on correct choice and on the reinforcement learning drift diffusion model (RLDDM) parameters, we separately added baseline cortisol levels and cortisol reactivity as predictors to our main analysis. Cortisol reactivity was defined as the difference between cortisol levels detected in the saliva sample taken 20 minutes after the end of the visibility manipulation and cortisol levels from the sample taken immediately before the start of the paradigm. The values were mean-centered and entered as a predictor in interaction with the other factors (recipient, drug

Rescorla-Wagner (RW) model.
We started with the simple Rescorla-Wagner [13] model as our baseline model. On each trial, the value V_{c,t} of the chosen option was updated with the reward prediction error (RPE):

V_{c,t} = V_{c,t−1} + α · RPE_{t−1}, with RPE_{t−1} = O_{t−1} − V_{c,t−1}, (1)

where O_{t−1} was the received outcome, and α (0 < α < 1) denoted the learning rate.
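As an illustration of Equation (1), the RW update can be written in a few lines of Python; this is a minimal sketch with variable names of our own choosing, not code from the study.

```python
def rw_update(v_chosen: float, outcome: float, alpha: float) -> float:
    """Single Rescorla-Wagner update of the chosen option's value.

    v_chosen : current value V_{c,t-1} of the chosen option
    outcome  : received outcome O_{t-1} (e.g., 1 = reward, 0 = no reward)
    alpha    : learning rate, 0 < alpha < 1
    """
    rpe = outcome - v_chosen          # reward prediction error
    return v_chosen + alpha * rpe     # updated value V_{c,t}
```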

Dual learning rates (DLR) model.
Previous studies have reported differences in learning following positive and negative PEs [14,15], and these differences have been linked to the distinctive roles of striatal D1 and D2 dopamine receptors in segregated cortico-striatal pathways [16]. Furthermore, analyses showing that genetic polymorphisms of DARPP-32 predict choice of options associated with positive outcomes, whereas the DRD2 gene predicts avoidance of choices associated with negative outcomes, support the notion that independent dopaminergic mechanisms contribute to learning from positive and negative feedback [17]. RL models with dual learning rates have been successful in capturing this asymmetric learning effect [18,19], including studies in social neuroscience examining prosocial behavior [20].
Hence, we tested a dual learning rates model on top of the RW model:

V_{c,t} = V_{c,t−1} + α_{posPE} · RPE_{t−1} if RPE_{t−1} > 0, and V_{c,t} = V_{c,t−1} + α_{negPE} · RPE_{t−1} otherwise, (2)

where α_{posPE} and α_{negPE} were the learning rates for positive and negative RPEs, respectively.
In both the RW and DLR models, action values were converted to action probabilities using the softmax function. Let A and B be the choice symbols per trial; the probability of choosing A was computed from the difference between V(A) and V(B):

p_t(A) = 1 / (1 + e^{−β · (V_t(A) − V_t(B))}), (3)

where β (β > 0) was the inverse temperature that represented choice consistency. A higher β indicated that an individual's choices were more consistent with their value computation, whereas a lower β indicated that the individual behaved more randomly. The action probability was then used to model participants' choice data with a categorical distribution:

choice_t ∼ Categorical(p_t(A), 1 − p_t(A)). (4)

(A toy implementation of this update-and-choose loop is sketched at the end of this subsection.)

It is worth noting that our winning model with differential learning rates for positive and negative PEs is in line with previous studies on learning and decision-making that report asymmetric learning effects [14,15,18,19]. The theoretical interpretation of these differences, however, is mixed in the literature. While some studies reported enhanced learning after positive compared to negative PEs and related this feature to an optimism bias [21], others report a higher learning rate for negative than for positive PEs, interpreted as possibly reflecting risk aversion [14]. Our findings may contribute to this debate by showing that the learning update is much quicker after receiving positive feedback, especially when the feedback is associated with appetitive stimuli (e.g., monetary reward).
This effect likely generalizes to situations with relatively positive feedback, rather than actual positive feedback per se. For instance, in aversive learning, receiving no feedback (e.g., a neutral outcome) is, on a relative scale, more positive than actual negative feedback (e.g., an electric shock).
Learning rates for such no-feedback events have been shown to be higher than those for stimuli with negative feedback, including learning under social contexts [20].
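To make the preceding components concrete, the sketch below combines the dual-learning-rate update (Equation 2) with the softmax choice rule (Equation 3) in a single illustrative trial. It is a toy Python sketch with variable names and parameter values of our own choosing, not the study's Stan code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dlr_update(v, outcome, alpha_pos, alpha_neg):
    """Dual-learning-rate update: separate alphas for positive/negative RPEs."""
    rpe = outcome - v
    return v + (alpha_pos if rpe > 0 else alpha_neg) * rpe

def p_choose_a(v_a, v_b, beta):
    """Softmax probability of choosing A; beta is the inverse temperature."""
    return 1.0 / (1.0 + np.exp(-beta * (v_a - v_b)))

# One illustrative trial: draw a choice, observe feedback, update the value.
v = {"A": 0.5, "B": 0.5}
p_a = p_choose_a(v["A"], v["B"], beta=3.0)
choice = "A" if rng.random() < p_a else "B"          # categorical draw
reward_prob = 0.75 if choice == "A" else 0.25        # 75%/25% contingency
outcome = float(rng.random() < reward_prob)
v[choice] = dlr_update(v[choice], outcome, alpha_pos=0.4, alpha_neg=0.1)
```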

Drift diffusion model (DDM).
The drift diffusion model (a.k.a. diffusion decision model [22]) is a widely used computational framework to model individuals' response times (RTs). In its canonical expression, the DDM contains four parameters, namely, the drift rate (v; v > 0), the initial bias (z; z > 0), the non-decision time (T; 0 < T < min(RT)), and the decision threshold (a). For simplicity, in learning tasks with abstract symbols the initial bias z was fixed at 0.5. Trial-by-trial RTs were distributed according to the Wiener first passage time (WFPT [23]) distribution:

RT_t ∼ WFPT(a, T, z, v). (5)
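Although the analyses fit RTs with the analytic WFPT density, the generative process underlying the DDM is easy to sketch: noisy evidence accumulates at drift rate v between two boundaries, and the non-decision time is added to the first-passage time. The Euler-style simulation below is purely illustrative, with parameter values of our own choosing.

```python
import numpy as np

def simulate_ddm_trial(v, a, t_nd, z=0.5, dt=0.001, sigma=1.0, rng=None):
    """Simulate one DDM trial; returns (response, RT in seconds).

    v    : drift rate; a : boundary separation; t_nd : non-decision time;
    z    : starting point as a fraction of a (0.5 = unbiased).
    """
    rng = rng or np.random.default_rng()
    x, t = z * a, 0.0
    while 0.0 < x < a:                        # accumulate until a bound is hit
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return ("upper" if x >= a else "lower"), t + t_nd

# Example: resp, rt = simulate_ddm_trial(v=1.0, a=1.5, t_nd=0.3)
```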

Reinforcement learning drift diffusion model (RLDDM).
In value-based decision-making, individuals' RTs may vary as a function of trial-by-trial valuation, such that the larger the value difference between choice alternatives, the faster the RT. Therefore, a joint reinforcement learning drift diffusion model (RLDDM) framework has been proposed [15,24], bridging RL and the DDM. This approach provides more granularity than using RL or the DDM alone [24].
In essence, the drift rate in the DDM was characterized by the accuracy-coded value differences computed from the RL counterpart. This way, the drift rate was no longer a constant parameter throughout the entire experiment; instead, it varied across trials (i.e., v_t instead of v) according to the values computed from the RL updates (in the present study, RW or DLR). In the simplest RLDDM, trial-by-trial drift rates were constructed via a linear function of the value difference:

v_t = v_{scaling} · (V_{correct,t} − V_{incorrect,t}), (6)

where v_{scaling} (v_{scaling} > 0) was the scaling parameter that quantified the impact of the value difference. Note that we employed stimulus coding in our RLDDM, so that in Equation (6) the drift rate was always a function of the value difference between the correct option (i.e., more rewarding, 75% reward probability) and the incorrect option (i.e., less rewarding, 25% reward probability), rather than between the chosen and unchosen options.

Reinforcement learning drift diffusion model with non-linear transformation (RLDDM-nonlin).
There is evidence that a non-linear mapping between the value difference and the drift rate captures individuals' RTs better than a linear transformation [24], likely because non-linear functions provide more sensitivity, akin to the softmax function in choice models. We thus implemented an RLDDM-nonlin following:

v_t = S(v_{scaling} · (V_{correct,t} − V_{incorrect,t})), (7)

with

S(x) = 2 · v_{max} / (1 + e^{−x}) − v_{max}, (8)

where S(x) was a non-linear sigmoid function centered at 0 that converted x to lie between −v_{max} and v_{max} (v_{max} > 0). It is worth noting that v_{max} only affected the maximum value of the drift rate, whereas v_{scaling}, as in Equation (6), established the trial-by-trial mapping between the value difference and the drift rate.
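A compact sketch of this mapping is given below; with v_max set to None, the function falls back to the linear mapping of Equation (6). Again, this is an illustration under our own naming, not the study's code.

```python
import numpy as np

def drift_rate(value_diff, v_scaling, v_max=None):
    """Map a trial's value difference onto the drift rate.

    Linear RLDDM (Eq. 6):   v_t = v_scaling * value_diff      (v_max=None)
    RLDDM-nonlin (Eq. 7-8): v_t = S(v_scaling * value_diff),
    where S squashes its input into the interval (-v_max, v_max).
    """
    x = v_scaling * value_diff
    if v_max is None:
        return x
    return 2.0 * v_max / (1.0 + np.exp(-x)) - v_max   # sigmoid centered at 0
```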
In both the RLDDM and the RLDDM-nonlin, all other DDM parameters (i.e., a, T, z) were identical to the canonical DDM, and RTs were distributed according to the WFPT using the trial-by-trial drift rate v_t:

RT_t ∼ WFPT(a, T, z, v_t). (9)

Note that, in all candidate models (Table S3), we introduced differential parameters for the within-subject condition of our experiment; namely, all parameters were modeled separately for the "self" and the "other" conditions.
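Putting the pieces together, a full generative RLDDM trial loop (values updated by RL, RTs produced by the DDM with a trial-wise drift rate) might look like the following sketch. It reuses the drift_rate and simulate_ddm_trial helpers from the sketches above, and all parameter values are illustrative rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
v_corr, v_incorr = 0.0, 0.0            # initial option values
a, t_nd, alpha = 1.5, 0.3, 0.3         # illustrative DDM/RL parameters

for trial in range(16):                # 16 trials, as in the main experiment
    vt = drift_rate(v_corr - v_incorr, v_scaling=3.0, v_max=4.0)
    response, rt = simulate_ddm_trial(vt, a, t_nd, rng=rng)
    chose_correct = response == "upper"            # stimulus coding
    outcome = float(rng.random() < (0.75 if chose_correct else 0.25))
    if chose_correct:                  # RW-style update of the chosen option
        v_corr += alpha * (outcome - v_corr)
    else:
        v_incorr += alpha * (outcome - v_incorr)
    print(f"trial {trial:2d}: drift {vt:+.2f}, RT {rt:.3f}s")
```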

Model estimation.
The model estimation and model selection procedures were largely similar to [25]. Hence, we echo these procedures from [25] below, to enhance reproducibility, with modifications specific to the current study.
In all models, we simultaneously modeled participants' choices and RTs, separately for each between-subject condition (i.e., placebo vs. testosterone; observed vs. private). Model estimations of all candidate models were performed with hierarchical Bayesian analysis (HBA) [26] using the statistical computing language Stan [7] in R. Stan utilizes a Hamiltonian Monte Carlo (HMC; an efficient Markov chain Monte Carlo, MCMC) sampling scheme to perform full Bayesian inference and obtain the actual posterior distribution. We performed HBA rather than maximum likelihood estimation (MLE) because HBA provides much more stable and accurate estimates than MLE [24]. Following the approach of the "hBayesDM" package [8] for using Stan in the field of reinforcement learning, we assumed, for instance, that a generic individual-level parameter θ was drawn from a group-level normal distribution, namely, θ ∼ Normal(μ, σ), with μ and σ being the group-level mean and standard deviation, respectively. Both these group-level parameters were specified with weakly-informative priors [26]: μ ∼ Normal(0, 1) and σ ∼ half-Cauchy(0, 1). This was to ensure that the MCMC sampler traveled over a sufficiently wide range to sample the entire parameter space.
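The study implemented this hierarchy in Stan. Purely as an illustration of the same prior structure, a minimal sketch in Python with PyMC (our stand-in, not the study's code) could read:

```python
import pymc as pm

N_SUBJECTS = 96  # illustrative group size, e.g., one drug group

with pm.Model():
    # Weakly-informative group-level priors, as described in the text.
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    sigma = pm.HalfCauchy("sigma", beta=1.0)
    # Individual-level parameters drawn from the group-level distribution.
    theta = pm.Normal("theta", mu=mu, sigma=sigma, shape=N_SUBJECTS)
    # A likelihood term relating theta to choices/RTs would be added here;
    # sampling then runs via NUTS (an HMC variant), e.g.:
    # idata = pm.sample(draws=1000, tune=1000, chains=4)
```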
In HBA, all group-level and individual-level parameters were estimated simultaneously through Bayes' rule by incorporating the behavioral data. We fit each candidate model with four independent MCMC chains using 1,000 sampling iterations per chain after 1,000 warm-up iterations for the initial algorithm adaptation, which resulted in 4,000 valid posterior samples. The convergence of the MCMC chains was assessed both visually (from the trace plots) and through the Gelman-Rubin R̂ statistic [27] (R̂ values of all parameters were smaller than 1.05 in the current study), which indicated adequate convergence.

Model selection and validation.
For model comparison and model selection, we computed the leave-one-out information criterion (LOOIC) score per candidate model [28]. The LOOIC score provides a point-wise estimate (using the entire posterior distribution) of out-of-sample predictive accuracy in a fully Bayesian way, which is more reliable than information criteria based on point estimates (e.g., the Akaike information criterion, AIC, or the deviance information criterion, DIC). By convention, a lower LOOIC score indicates better out-of-sample prediction accuracy of the candidate model. We selected the model with the lowest LOOIC as the winning model. We additionally performed Bayesian model averaging (BMA) with the Bayesian bootstrap [29] to compute the probability of each candidate model being the best model.
Conventionally, a BMA probability of 0.8 (or higher) is considered a decisive indication.
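For readers working in Python, the analogous PSIS-LOO computation is available in ArviZ; the snippet below uses two example posteriors bundled with the package as stand-ins for the candidate models (the study itself computed LOOIC from its Stan fits).

```python
import arviz as az

# Two example posteriors shipped with ArviZ (each includes the pointwise
# log-likelihoods that PSIS-LOO requires); stand-ins for candidate models.
idata_a = az.load_arviz_data("centered_eight")
idata_b = az.load_arviz_data("non_centered_eight")

print(az.loo(idata_a))                        # PSIS-LOO for a single model
print(az.compare({"centered": idata_a,        # ranked model comparison
                  "non_centered": idata_b}, ic="loo"))
```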
Moreover, given that model comparison provides merely the relative performance among candidate models [28], we then tested how well our winning model's posterior predictions were able to replicate the key features of the observed data (a.k.a. posterior predictive checks, PPCs). Using the posterior samples, we generated new data per trial per participant, and we analyzed the generated data the same way as we did the observed data. We then assessed whether these analyses could reproduce the behavioral pattern found in our behavioral analyses (Figure 4B, 4D in the main text).

Simulations of optimal learning rates.
To better understand and interpret the magnitude of the posterior learning rates, we performed simulations with grid approximation to obtain "optimal learning rates" and then compared the estimated posterior learning rates to these optimal parameters (Figure 4A, 4C in the main text). Because there were two learning rates (α_{posPE} and α_{negPE}), to reduce complexity we fixed the inverse temperature parameter to the corresponding group-level posterior mean in each condition.
For each simulation, we took a grid with a small step size per parameter (from 0 to 1 in steps of 0.01) and computed the choice accuracy across 16 trials (identical to the main experiment) for each combination of the parameters. Each simulation was repeated 1,000 times to obtain stable results. We then took the parameters that yielded the highest choice accuracy as the optimal learning rates.
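A minimal version of this grid simulation might look as follows. To keep the illustration's runtime short, the sketch uses a coarser grid (step 0.05) and fewer repetitions than the study's 0.01 step and 1,000 repetitions; all other choices (16 trials, 75%/25% contingencies, fixed β) follow the description above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def run_block(a_pos, a_neg, beta, n_trials=16):
    """Simulate one 16-trial block; return the proportion of correct choices."""
    v = np.zeros(2)                    # option 0 = correct (75% reward)
    correct = 0
    for _ in range(n_trials):
        p0 = 1.0 / (1.0 + np.exp(-beta * (v[0] - v[1])))   # softmax
        c = 0 if rng.random() < p0 else 1
        correct += (c == 0)
        out = float(rng.random() < (0.75 if c == 0 else 0.25))
        rpe = out - v[c]
        v[c] += (a_pos if rpe > 0 else a_neg) * rpe        # DLR update
    return correct / n_trials

grid = np.arange(0.0, 1.01, 0.05)      # coarser than the study's 0.01 step
beta = 3.0                             # fixed, e.g., a group-level mean
acc = {(ap, an): np.mean([run_block(ap, an, beta) for _ in range(200)])
       for ap, an in itertools.product(grid, grid)}
best = max(acc, key=acc.get)
print(f"optimal (alpha_pos, alpha_neg) ~= {best}, accuracy = {acc[best]:.3f}")
```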

Analysis of the RLDDM parameters and their association with prosocial behavior.
Next, we examined whether the behavioral pattern found in the analysis of the correct choice would be associated with differences in the individual model parameters.
As a first step, we tested the parameters of our validated winning model for the 3-way interaction effect of drug treatment, visibility, and type of recipient. There was no significant 3-way interaction in any of the parameters.
As a second step, we examined whether the RLDDM parameters that were impacted by testosterone administration predicted behavioral prosociality, measured by the difference between correct choices made for other and self across the whole sample. Out of the five parameters, choice consistency (B =

Analysis of the drift-scaling parameter and response times.
As specified in Equation (7), on each trial t the drift rate v_t was defined with a drift-scaling parameter v_{scaling} that scaled the value difference between the correct and the incorrect symbol. The drift-scaling parameter affects the curvature of the mapping: smaller values lead to a more linear mapping between the value difference and the drift rate, and therefore to less sensitivity to value differences.
Because drift scaling is conceptually linked to the speed of evidence integration and hence to response times [22], we tested whether the drift-scaling parameter predicted response times and found a significant association (B

Supplementary information on the analysis of genetic data
Previous research suggested that testosterone may influence behavior through dopaminergic pathways [31]. In humans, testosterone administration enhanced activation of the ventral striatum to monetary rewards [32], and the enhancing effects of exogenous testosterone on competitive status-seeking were more pronounced among individuals with a 9/10R compared to a 10/10R genotype of the dopamine transporter (DAT) [33]. The expression of DAT, which regulates striatal dopamine, is linked to a 40 base-pair variable number tandem repeat polymorphism of the DAT1 gene [34]. Homozygous 10/10-repeat carriers of this polymorphism have higher DAT expression (i.e., lower striatal dopamine) than heterozygous 9-repeat variant carriers [34].
Testosterone's effects on status-seeking behavior have likewise been shown to be enhanced among individuals with fewer CAG repeats in exon 1 of the androgen receptor (AR) gene [33,36]. In-vitro experimental work suggests that increasing the number of CAG repeats within the AR gene reduces the receptor's transcriptional potential [37]. In other words, the efficiency of the androgen receptor is negatively related to the CAG repeat number [38].
We, therefore, tested whether testosterone's effects on strategic prosociality depended on individual differences in striatal dopamine, assessed by DAT1 polymorphism, and the efficiency of ARs, assessed by the CAG repeat polymorphism.

Genotyping of AR CAG repeat and DAT1 polymorphisms.
DNA was extracted from buccal swabs and isolated using a resin-based method with Chelex®100.

Interaction of DAT1 polymorphism with testosterone effects on the correct choice and RLDDM parameters.
There were no significant differences in the distribution of the genotype among our experimental groups.

Interaction of AR CAG repeat polymorphism with testosterone effects on the correct choice and RLDDM parameters.
Mean

Interaction of trait dominance with testosterone effects on RLDDM parameters.
Mean-centered dominance scores [39] were

Supplementary information on the questionnaire data
Post-task questionnaire.
To estimate the subjective perception of being watched, a post-task questionnaire was administered immediately after the end of the reinforcement-learning paradigm. The participants were asked the question: "Did you feel that you were being watched while performing the task?" The answers were

Portrait Values Questionnaire.
We conducted exploratory analyses to examine what motivational constructs may have interacted with testosterone's effects. To this end, we analyzed data from the Portrait Values Questionnaire [40,41], which had been administered as part of the study's extensive questionnaire battery. Note that these are exploratory analyses and that this specific questionnaire was originally intended to be used for analyses related to other tasks that were part of the overall study. They should therefore be interpreted with caution.

Observers' salience survey.
To estimate the salience of the observers introduced as NGO representatives in our paradigm, we conducted an additional online survey.