Hedging Your Bets by Learning Reward Correlations in the Human Brain

Summary

Human subjects are proficient at tracking the mean and variance of rewards and updating these via prediction errors. Here, we addressed whether humans can also learn about higher-order relationships between distinct environmental outcomes, a defining ecological feature of contexts where multiple sources of rewards are available. By manipulating the degree to which distinct outcomes are correlated, we show that subjects implemented an explicit model-based strategy to learn the associated outcome correlations and were adept at using that information to dynamically adjust their choices in a task that required a minimization of outcome variance. Importantly, the experimentally generated outcome correlations were explicitly represented neuronally in right midinsula, with a learning prediction error signal expressed in rostral anterior cingulate cortex. Thus, our data show that the human brain represents higher-order correlation structures between rewards, a core adaptive ability whose immediate benefit is optimized sampling.


Table S1. Individual Subjects' Performance
The payout bonus benchmarks the portfolio fluctuation realized by the subject against the fluctuation that would result from an optimal strategy (bonus = SD_optimal / SD_subject). The performance index benchmarks subjects' actual responses against the optimal responses of an omniscient agent (normalized between 0 = random choice and 1 = perfect choice).

Supplemental Experimental Procedures
In addition to the correlation learning model described in the main text, we created the following alternative models that do not require learning of covariance information.

Model-free RL learning
The most basic model in reinforcement learning terms is a model-free RL learner that acquires Q-values for actions (moving the slider rightward or leftward, corresponding to increasing or decreasing weights) based on the portfolio outcome. A model-free learner starts somewhere on the slider (in our modeling we used the center) and then makes an action a_t (a move either left or right). After observing the portfolio outcome, it computes a prediction error as the absolute deviation of that outcome V_p,t from the target outcome M (the grand mean of the portfolio outcomes), i.e., delta_t = |V_p,t - M|. The Q-value for that action is then updated by the RL prediction error in the current trial.
If the subject experiences a large deviation from the target, the current move is penalized more than if the deviation is small. The Q-value for moving in the opposite direction was 1 - Q_a.
Because the values of the two available actions were directly linked, each outcome provided equal information about both actions. Consistent with the other models, we used greedy action selection to determine how weights are updated on every trial (Sutton and Barto, 1998). An additional parameter allowed the step size of the resulting weight changes to vary across subjects.
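A minimal sketch of such a model-free learner follows. The parameter values (ALPHA, STEP, the target M) and the mapping from deviation to reinforcement are illustrative assumptions, not the fitted quantities from the study:

```python
# Sketch of the model-free RL learner: greedy action selection over two
# coupled Q-values, penalizing moves that lead to large deviations from
# the target outcome M. Parameter values are illustrative only.
ALPHA = 0.3   # learning rate (assumed value)
STEP = 0.1    # step size of a weight change per move (assumed value)
M = 50.0      # target outcome: grand mean of portfolio outcomes (assumed)

def simulate(outcomes, q_right=0.5):
    """Run the learner over a sequence of observed portfolio outcomes."""
    w = 0.5  # start at the center of the slider
    for v_p in outcomes:
        # greedy selection; Q(left) is defined as 1 - Q(right)
        move_right = q_right >= 1.0 - q_right
        w += STEP if move_right else -STEP
        # prediction error: absolute deviation from the target outcome
        delta = abs(v_p - M)
        # larger deviation -> smaller reinforcement of the chosen move
        target = 1.0 - delta / M
        if move_right:
            q_right += ALPHA * (target - q_right)
        else:
            # updating Q(left) toward `target` preserves Q(left) = 1 - Q(right)
            q_right += ALPHA * ((1.0 - target) - q_right)
    return w, q_right
```

Because the two Q-values sum to one, a single scalar suffices to represent both actions, which is why each outcome is equally informative about both moves.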

Heuristic based on coincidence detection (outcome-outcome associations)
Instead of learning the relation between two outcomes via their covariation (a statistical measure related to risk and variance), individuals could form simple associations between one outcome and the other. A subject performing such associative learning would learn an outcome-outcome association and update the strength of this relation with a trial-by-trial prediction error. This concept is easiest to describe for probabilistic outcomes of constant magnitude. If outcome O1 is present, its predictive strength A is updated depending on the presence of O2, A(t+1) = A(t) + alpha * (1_O2 - A(t)), where 1_O2 equals 1 if O2 occurred and 0 otherwise. If O1 is absent, the strength is inversely updated, A(t+1) = A(t) + alpha * ((1 - 1_O2) - A(t)).
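This coincidence detector can be sketched as a delta-rule update; the learning rate and the exact update form used here are plausible illustrative choices rather than the study's fitted model:

```python
# Sketch of the outcome-outcome coincidence detector (delta-rule form).
# ALPHA is an illustrative free parameter, not a fitted value.
ALPHA = 0.2

def update_association(strength, o1_present, o2_present):
    """Update the predictive strength of O1 for O2 after one trial."""
    if o1_present:
        # O1 present: move strength toward 1 if O2 co-occurred, else toward 0
        return strength + ALPHA * (float(o2_present) - strength)
    # O1 absent: inversely updated -- co-absence of O2 counts as evidence
    # that the two outcomes go together
    return strength + ALPHA * (float(not o2_present) - strength)

# Repeated co-occurrence drives the association strength toward 1:
s = 0.5
for _ in range(20):
    s = update_association(s, True, True)
```

Note that this heuristic tracks contingency (how reliably the outcomes co-occur) rather than covariance, so it carries no information about the magnitude of joint fluctuations.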

Sliding window model
A common approach to estimating the local temporal correlation of a time series is to calculate the correlation coefficient over the past n trials. We did this by using a sliding window of size n and calculating the Pearson correlation coefficient over the trials within that window; this is how local correlations are typically computed in applied mathematics or finance. We therefore include data from this model fit for comparison with the reinforcement learning approach.
The best-fitting window size parameter n in a fit of this model to subjects' behavior allows us to relate the learning rate of the RL algorithm (which uses an exponential kernel) to the span of observed trials over which a normatively calculated correlation would be based.
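The windowed estimator is simply the Pearson correlation over the most recent n paired outcomes; a self-contained sketch:

```python
from math import sqrt

def windowed_correlation(xs, ys, n):
    """Pearson correlation of the most recent n paired outcomes.

    The window size n is the model's only free parameter; older trials
    contribute nothing (a hard cutoff, unlike the RL model's
    exponential kernel).
    """
    x, y = xs[-n:], ys[-n:]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Fitting n then amounts to finding the window length whose correlation estimates best predict the subject's trial-by-trial weight settings.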

1/N Heuristic
This simple heuristic from behavioral finance keeps the weights constant and equal across all N assets (N = 2 in our case): w_sun = w_wind = 1/N = 0.5. Our task was not a fair test of whether subjects would use the 1/N rule in an uncertain environment, because our adaptive design punished the use of constant weights. We nevertheless include parameters from this strategy in our summary table for comparison. The gain from using an optimal strategy over this heuristic decreases with larger N, or when the decision maker has very imprecise information about correlations.

Random choice
A random weight within the range [-1, 2] is chosen on every trial.

Supplemental Text
Variance-minimizing strategies in a portfolio
Modern portfolio theory (MPT) is a theory of investment which normatively maximizes expected return for a given amount of risk, or equivalently minimizes risk for a given level of expected return, by solving for the optimal proportions of various assets. It is therefore ideally suited to describe the best possible strategy for setting portfolio weights that minimize overall fluctuation.
The optimal mixing of the two assets depends on their individual variances and their correlation, but to simplify our task we kept the mean return of both assets and the standard deviations sigma_1 and sigma_2 constant, with the relationship either sigma_1 = 2*sigma_2 or sigma_1 = 0.5*sigma_2. Hence, for the purpose of our experiment, the optimal portfolio weight w_1 could be described as a fixed (nonlinear) function of the correlation coefficient. In the following we explain how the portfolio variance can be minimized in three cases: (1) positive correlation, (2) negative correlation, and (3) zero correlation.
1) Single-trial outcomes from two highly correlated assets tend to deviate from the mean in the same direction, and the asset with the larger variance will on average deviate more. The investor therefore buys the asset with the lesser risk and short sells a smaller amount of the asset with the higher risk. This is realized by a large positive portfolio weight for the outcome that has the smaller variance and a negative weight for the other outcome. Large deviations of the lower-risk asset are thereby offset by subtraction of the deviation of the higher-risk asset (Fig. S2A).
2) If the two assets are negatively correlated, they tend to deviate from their means in opposite directions. The portfolio variance is minimal with some form of averaging over both assets (Fig. S2B); i.e., in the optimal solution the weights for both assets are positive.
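Both cases follow from the standard two-asset minimum-variance result of portfolio theory, w_1 = (sigma_2^2 - rho*sigma_1*sigma_2) / (sigma_1^2 + sigma_2^2 - 2*rho*sigma_1*sigma_2), with w_2 = 1 - w_1. A sketch with illustrative numbers (sigma_1 = 2, sigma_2 = 1, as in the sigma_1 = 2*sigma_2 condition):

```python
from math import sqrt

def min_variance_weight(s1, s2, rho):
    """Weight on asset 1 that minimizes two-asset portfolio variance
    (standard closed-form result from portfolio theory)."""
    return (s2**2 - rho * s1 * s2) / (s1**2 + s2**2 - 2 * rho * s1 * s2)

def portfolio_sd(w1, s1, s2, rho):
    """Standard deviation of a portfolio with weights (w1, 1 - w1)."""
    w2 = 1.0 - w1
    return sqrt(w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * rho * s1 * s2)

# Case 1: strong positive correlation -> short the high-risk asset
w_pos = min_variance_weight(2.0, 1.0, 0.8)   # negative weight on asset 1

# Case 2: strong negative correlation -> positive weights on both assets
w_neg = min_variance_weight(2.0, 1.0, -0.8)  # both weights in (0, 1)
```

Evaluating w_1 as rho varies traces out exactly the fixed nonlinear function of the correlation coefficient described above, and comparing portfolio_sd at the optimal weight with the 1/N weight of 0.5 quantifies the cost of ignoring the correlation.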