Dissociating different forms of random exploration

The arbitration between making the most of current knowledge (exploitation) and gathering new knowledge (exploration) is central to decision making. It has been proposed that humans use two distinct strategies for exploration: directed exploration, which targets options that convey a lot of information, and random exploration, which weights options relative to their value estimates. Here we suggest that humans use a third strategy, 'tabula rasa' exploration, which, in contrast to traditional 'random' exploration, ignores all prior knowledge about the world. We tested this hypothesis using a novel three-bandit task in which the expected values, the prior information and the time horizon are manipulated. Using computational modelling, we found evidence for tabula rasa exploration in addition to directed and random exploration.

When seeking to optimise rewards over time in a finite decision space, an agent inevitably faces the explore-exploit dilemma. She needs to decide whether to go for the known option with the highest expected reward (exploitation) or for lesser-known options (exploration), to make sure she does not miss out on possibly even higher rewards. Although humans solve this dilemma on a daily basis, the mechanisms involved are still not fully understood. Exploration-exploitation decisions can be studied using the multi-armed bandit problem, in which a gambler has to choose which slot machines to play in order to maximise her reward (Sutton & Barto, 1998). When the time horizon (i.e. the number of choices that can still be made) is varied, humans modulate their exploration behaviour using two distinct strategies: directed and random exploration (Wilson, Geana, White, Ludvig, & Cohen, 2014). Directed exploration is the idea that humans are biased towards highly informative options; this is commonly implemented by adding an 'information' bonus to the expected reward of an option, for example in the Upper Confidence Bound (UCB) algorithm. Random exploration is based on the assumption that humans inject stochasticity into their decisions (Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006), choosing options relative to their value estimates, e.g. via a softmax (Sutton & Barto, 1998) or Thompson sampling (Thompson, 1933). In addition, recent findings suggest that a hybrid model combining a stochastic UCB model (UCB + softmax) with Thompson sampling best accounts for human exploration behaviour (Gershman, 2018). Both of these strategies take some knowledge of the world into account. Here we suggest that humans can also use a third type of exploration strategy ('tabula rasa' exploration), which is completely agnostic to the environment. This mirrors what is known in reinforcement learning as the ε-greedy algorithm (Sutton & Barto, 1998).
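The two established strategies can be sketched as follows (a minimal illustration with toy posterior values; the function names and numbers are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb_values(q, sigma, gamma):
    """Directed exploration (UCB): add an information bonus, gamma times
    the posterior uncertainty, to each option's expected reward."""
    return q + gamma * sigma

def softmax_policy(values, tau):
    """Random exploration via softmax: a higher temperature tau injects
    more choice stochasticity."""
    z = (values - values.max()) / tau   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def thompson_choice(q, sigma):
    """Random exploration via Thompson sampling: draw one sample per
    option from its posterior and pick the best sample."""
    return int(np.argmax(rng.normal(q, sigma)))

q = np.array([5.0, 4.5, 4.0])       # posterior means
sigma = np.array([0.2, 0.8, 1.5])   # posterior standard deviations
# With gamma = 1, the UCB bonus makes the most uncertain option
# the most attractive here, despite its lower mean.
v = ucb_values(q, sigma, gamma=1.0)
p = softmax_policy(v, tau=0.5)
```

Note that both rules depend on the learned estimates q and sigma, i.e. on knowledge of the world.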
Tabula rasa exploration thus predicts a constant probability of exploration, independent of the state, by occasionally substituting the greedy action with a random one. To test this hypothesis, we developed a novel task and used computational modelling to investigate these three forms of exploration.
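Tabula rasa exploration corresponds to the ε-greedy choice rule; a minimal sketch (toy values and our own function name):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q, epsilon):
    """Tabula rasa exploration: with probability epsilon, ignore all
    value estimates and pick an option uniformly at random; otherwise
    take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # value-blind random action
    return int(np.argmax(q))               # greedy action

# Even a clearly inferior option keeps a constant choice probability of
# epsilon / 3, independent of its value estimate:
q = np.array([5.0, 4.5, 2.0])
choices = [epsilon_greedy(q, epsilon=0.3) for _ in range(10_000)]
```

This constant, state-independent exploration probability is exactly what distinguishes tabula rasa exploration from softmax- or Thompson-style random exploration.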

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Methods
Sixty healthy volunteers (30 female, 30 male; mean age = 23.30) were recruited. All participants provided written informed consent and the study was approved by the University College London research ethics committee.
Figure 1: Maggie's Farm task. Left: Long horizon. In this trial red is tree D, green is tree A and yellow is tree B. Right: Short horizon. In this trial red is tree A, green is tree B and yellow is tree C.
Participants had to choose between trees that produced apples of different sizes, in two different horizon conditions (Figure 1). They were instructed to collect the biggest apples and told that they would receive a cash bonus according to their performance. To distinguish between different types of exploration (Wilson et al., 2014), we manipulated the horizon (i.e. the number of apples to be picked: 1 in the short horizon, 6 in the long horizon) and, within games, the mean reward µ (i.e. apple size) and the initial information I (i.e. the number of apples shown at the beginning of the trial) of each option. Trees were generated from 4 different generative groups:

Tree A: µ_A ∼ N(5.5, 1.4), I_A = 3
Tree B: µ_B = µ_A ± 1 or 2, I_B = 1
Tree C: µ_C = (µ_A or µ_B) ± 1 or 2, I_C = 0

For each tree of type x, apples were sampled from N(µ_x, 0.8), bounded to [2, 10], and rounded to the closest integer. On each trial, three trees from different groups were available to choose from. Trees A and B are a reproduction of the 'Horizon task' (Wilson et al., 2014) and allow us to distinguish between random and directed exploration. We added Tree C to examine how directed exploration applies in a completely naive scenario, and a very low-valued option, Tree D, to measure tabula rasa exploration. There were 50 trials of each tree category combination in both the short and the long horizon, resulting in a total of 400 trials.
We then developed a set of Bayesian generative models, where each model assumed that different characteristics accounted for participants' behaviour. The binary coefficients w_1, w_2, w_3, w_4, w_5 indicate which components were included in each model. Similarly to previous studies (Gershman, 2018), the mean Q_t(x) and variance σ²_t(x) of each tree x are tracked using Kalman filtering (Bishop, 2006), and γ is a horizon-dependent weighting factor on the uncertainty bonus.
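The generative structure of the task can be sketched as follows (Tree D's parameters are not specified in this abstract, so it is omitted; the ± direction choices are our reading of the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tree_means():
    """Sample the mean apple size for one trial's trees, following the
    generative groups described above (Tree D omitted: its parameters
    are not given here)."""
    mu_a = rng.normal(5.5, 1.4)
    mu_b = mu_a + rng.choice([-2, -1, 1, 2])               # mu_A +/- 1 or 2
    mu_c = rng.choice([mu_a, mu_b]) + rng.choice([-2, -1, 1, 2])
    return {"A": mu_a, "B": mu_b, "C": mu_c}

def draw_apple(mu):
    """Apple sizes ~ N(mu, 0.8), bounded to [2, 10], rounded to the
    closest integer."""
    return int(round(np.clip(rng.normal(mu, 0.8), 2, 10)))
```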
For each tree x, we computed Λ_t(x) as the sum of the tree's estimated value (first term), the directed exploration component of UCB (second term) and the random exploration component of Thompson sampling (third term). Additionally, as gaining information about an unknown stimulus can be intrinsically rewarding (Dubey & Griffiths, 2017), we added a novelty bonus η (non-zero for Tree C only). Therefore, for each tree x we have:

Λ_t(x) = w_1 Q_t(x) + w_2 γ σ_t(x) + w_3 ζ_t(x) + w_4 η 1[x = Tree C],

where ζ_t(x) ∼ N(0, σ²_t(x)) is a zero-mean sample scaled by the tree's posterior uncertainty, so that with w_1 = w_3 = 1 the value term is a Thompson sample from the posterior. The choice policy was a softmax with a horizon-dependent temperature τ controlling the choice stochasticity. To measure tabula rasa exploration, we extended it with an ε-greedy component using a horizon-dependent ε. The probability of choosing tree x is:

P(x) = (1 − w_5 ε) · exp(Λ_t(x)/τ) / Σ_x' exp(Λ_t(x')/τ) + w_5 ε/3.
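The Kalman-filter belief tracking used in these models can be sketched in a few lines (the observation-noise value is our assumption, chosen to match the generative standard deviation of 0.8; a stationary bandit with no innovation noise is also assumed):

```python
def kalman_update(q, var, reward, obs_var=0.8 ** 2):
    """One Kalman-filter update of a tree's posterior mean q and
    variance var after observing an apple of size `reward`."""
    k = var / (var + obs_var)   # Kalman gain: trust the new sample more
                                # when our own uncertainty is high
    q_new = q + k * (reward - q)
    var_new = (1 - k) * var
    return q_new, var_new
```

With each observed apple the posterior variance shrinks, which in turn reduces both the UCB bonus and the Thompson noise for that tree.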

Results
The winning model (i.e. the model with the lowest BIC) was the one with [w_1, w_2, w_3, w_4, w_5] = [1, 1, 0, 1, 1], that is, a stochastic UCB model with both a novelty bonus and an ε-greedy component. The value of each tree x therefore reduces to:

Λ_t(x) = Q_t(x) + γ σ_t(x) + η 1[x = Tree C],

and the probability of choosing tree x is:

P(x) = (1 − ε) · exp(Λ_t(x)/τ) / Σ_x' exp(Λ_t(x')/τ) + ε/3.

As shown in previous studies (Wilson et al., 2014), γ and τ reflect directed and random exploration respectively. Following our hypothesis, ε reflects tabula rasa exploration. The novelty of Tree C is captured by the novelty bonus η.
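A minimal sketch of this winning policy (our own function names and toy values; Tree C's novelty is flagged with an indicator vector):

```python
import numpy as np

def choice_probabilities(q, sigma, novel, gamma, eta, tau, epsilon):
    """Winning model: value + directed-exploration bonus (gamma * sigma)
    + novelty bonus (eta for the novel tree), passed through a softmax
    with temperature tau, then mixed with a value-blind uniform
    component (epsilon-greedy)."""
    lam = q + gamma * sigma + eta * novel   # Lambda_t(x)
    z = (lam - lam.max()) / tau             # numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum()
    return (1 - epsilon) * soft + epsilon / len(q)

p = choice_probabilities(
    q=np.array([5.0, 4.5, 4.0]),
    sigma=np.array([0.2, 0.8, 1.5]),
    novel=np.array([0.0, 0.0, 1.0]),   # Tree C starts with no samples
    gamma=1.0, eta=0.5, tau=0.5, epsilon=0.1,
)
```

The ε-greedy mixture guarantees that every tree, however low-valued, is chosen with probability at least ε/3, which is what identifies the tabula rasa component.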
We then averaged the fitted model parameters within each participant and compared them between horizons (Figure 2). We found strong evidence that γ is modulated by the horizon (p < 0.001), with marginally significant horizon effects on τ (p = 0.054) and ε (p = 0.052).
Figure 2: Fitted model parameters per horizon. The parameters γ, τ and ε reflect directed, random and tabula rasa exploration respectively and show (marginally) significant horizon effects (***: p < 0.01; +: p ≈ 0.05).

Conclusion
In this study we examined different forms of exploration. Our results reproduce previous findings dissociating directed and random exploration, and suggest that participants also engage in another, more agnostic, form of random exploration: tabula rasa exploration.