Searching for rewards in graph-structured spaces

How do people generalize and explore structured spaces? We study human behavior on a multi-armed bandit task, where rewards are influenced by the connectivity structure of a graph. A detailed predictive model comparison shows that a Gaussian Process regression model using a diffusion kernel is able to best describe participant choices, and also predict judgments about expected reward and confidence. This model unifies psychological models of function learning with the Successor Representation used in reinforcement learning, thereby building a bridge between different models of generalization.


Introduction
From social networks to subway maps, many decision-making environments can be described using graph structures, where relationships are defined by transition structure rather than by comparing features. Here, we propose the diffusion kernel as a similarity metric for functions on graphs, which, combined with the Gaussian Process (GP) framework, allows us to make Bayesian predictions about unobserved nodes. Using a graph-correlated bandit task, we study how people generalize and search for rewards in structured spaces. We show that the GP model best predicts choices, produces humanlike learning curves, and predicts judgments about expected reward and confidence for unobserved nodes. Overall, these results extend the scope of previous theories of generalization in spatial (Wu, Schulz, Speekenbrink, Nelson, & Meder, 2018; Schulz, Wu, Ruggeri, & Meder, 2018) and conceptual domains (Wu, Schulz, Garvert, Meder, & Schuck, 2018; Stojic, Schulz, Analytis, & Speekenbrink, 2018) to structured spaces.

Generalization on graph structures
Figure 1: Inference over graphs. a) An example of a graph structure, where nodes represent states and edges indicate the transition structure. b) A diffusion kernel is a similarity metric between nodes on a graph, allowing us to generalize the value of unobserved nodes based on the assumption that the correlation of rewards decays as an exponential function of the graph distance between two nodes. The diffusion parameter (α) governs the rate of decay. c) Screenshot of our graph-correlated bandit task, where each node is an arm of a bandit with rewards correlated across the graph structure.

We can specify a graph G = (S, E) with nodes s_i ∈ S and edges e_i ∈ E to represent a structured state space (Fig. 1a). Nodes represent states and edges represent allowed transitions. For now, we assume that all edges are undirected (i.e., if x → y then y → x). The connectivity structure of the graph determines which states are accessible from a given prior state, and is often described using the graph Laplacian L:

$$L = D - A \tag{1}$$

where A is the adjacency matrix and D is the degree matrix. Each element a_ij ∈ A is 1 when nodes i and j are connected, and 0 otherwise, while the diagonal entries of D give the number of connections of each node. The graph Laplacian can also describe graphs with weighted edges, where D becomes the weighted degree matrix and A becomes the weighted adjacency matrix.
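As a minimal illustration of the Laplacian L = D − A (a sketch, not the code used in the experiment), the matrix can be built directly from an adjacency matrix with NumPy. The function name `graph_laplacian` and the four-node path graph are hypothetical choices made for this example.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Return L = D - A for a symmetric (optionally weighted) adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))  # (weighted) degree matrix on the diagonal
    return D - A

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
L = graph_laplacian(A)
print(L)
```

Each row of L sums to zero, and the same function applies unchanged to weighted adjacency matrices.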

The diffusion kernel
The diffusion kernel (DF; Kondor & Lafferty, 2002) defines a similarity metric k(s, s′) between any two nodes based on the matrix exponentiation of the graph Laplacian:

$$k(s, s') = \exp(-\alpha L) \tag{2}$$

Intuitively, the diffusion kernel assumes that rewards diffuse along the graph similar to a heat diffusion process (i.e., by assuming a continuous random walk), with closely connected nodes assumed to have similar values. The parameter α models the level of diffusion, where α → 0 assumes complete independence between nodes, while α → ∞ assumes all nodes are perfectly correlated.
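The limiting behavior of α can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses the sign convention K = exp(−αL) with L = D − A, so that similarity decays with graph distance, and it computes the matrix exponential through an eigendecomposition of the symmetric Laplacian. The function name and the path graph are hypothetical.

```python
import numpy as np

def diffusion_kernel(L, alpha):
    """K = exp(-alpha * L), computed from the eigendecomposition of a symmetric L."""
    eigvals, eigvecs = np.linalg.eigh(L)
    # Scale each eigenvector column by exp(-alpha * eigenvalue), then recombine.
    return (eigvecs * np.exp(-alpha * eigvals)) @ eigvecs.T

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

K = diffusion_kernel(L, alpha=1.0)          # similarity decays with distance
K_indep = diffusion_kernel(L, alpha=0.01)   # close to the identity matrix
K_large = diffusion_kernel(L, alpha=50.0)   # all entries nearly equal
print(np.round(K[0], 3))
```

For small α the kernel approaches the identity (independent nodes), and for large α on a connected graph every entry approaches the same value, so all nodes become perfectly correlated.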

Gaussian Process regression
From the similarity metric defined by the diffusion kernel (Eq. 2), we use Gaussian Process (GP) regression (Rasmussen & Williams, 2006) to perform Bayesian inference on graph structures. A GP defines a distribution over functions f : S → ℝ^n that map the state space S to real-valued scalar outputs (e.g., rewards). Functions are modeled as a random draw from a multivariate normal distribution:

$$f \sim \mathcal{GP}\big(m(s), k(s, s')\big) \tag{3}$$

where m(s) is a mean function specifying the expected output of s, and k(s, s′) encodes prior assumptions about the underlying function. We use the diffusion kernel (Eq. 2) to represent covariance based on the connectivity structure of the graph (see Smola & Kondor, 2003; Kemp & Tenenbaum, 2009, for alternative implementations).

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0
Given some observations D_t = {s_t, y_t} of observed rewards y_t at states s_t, we can compute the posterior distribution p(f(s_*) | D_t) for any target state s_*. The posterior is a normal distribution with mean and variance defined as:

$$m(s_* \mid D_t) = k_{t,*}^\top (K_t + \sigma_\epsilon^2 I)^{-1} y_t \tag{4}$$

$$v(s_* \mid D_t) = k(s_*, s_*) - k_{t,*}^\top (K_t + \sigma_\epsilon^2 I)^{-1} k_{t,*} \tag{5}$$

where K_t is the t × t covariance matrix evaluated at each pair of observed inputs, k_{t,*} = [k(s_1, s_*), . . . , k(s_t, s_*)] is the covariance between each observed input and the target input s_*, and σ²_ε is the noise variance. Thus, for any node in the graph, we can make Bayesian predictions about the expected reward m(s_* | D_t) and the uncertainty v(s_* | D_t) attached to the prediction.
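The posterior computation can be sketched in a few lines of NumPy. This is an illustration, not the authors' code: it assumes the standard GP regression posterior with a constant prior mean (`prior_mean`, a hypothetical parameter; setting it to zero gives the zero-mean form), a diffusion kernel of the form exp(−αL), and a hypothetical four-node path graph with a single observation.

```python
import numpy as np

def gp_posterior(K, obs_idx, y, noise_var=1.0, prior_mean=0.0):
    """Posterior mean and variance for every node, given rewards y at nodes obs_idx."""
    obs_idx = np.asarray(obs_idx)
    y = np.asarray(y, dtype=float)
    K_t = K[np.ix_(obs_idx, obs_idx)]            # t x t covariance of observed nodes
    k_star = K[obs_idx, :]                       # column j holds k_{t,*} for target node j
    G = K_t + noise_var * np.eye(len(obs_idx))   # K_t + sigma^2 I
    mean = prior_mean + k_star.T @ np.linalg.solve(G, y - prior_mean)
    var = np.diag(K) - np.sum(k_star * np.linalg.solve(G, k_star), axis=0)
    return mean, var

# Hypothetical 4-node path graph with a diffusion kernel (alpha = 1)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)
K = (eigvecs * np.exp(-1.0 * eigvals)) @ eigvecs.T

# Observe a reward of 1.0 at node 0 only
mean, var = gp_posterior(K, obs_idx=[0], y=[1.0], noise_var=0.01)
print(np.round(mean, 3), np.round(var, 3))
```

With one observation at node 0, the predicted reward falls off with graph distance from that node, and the posterior variance is smallest at the observed node.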

Experiment: Graph-correlated bandit
We designed a task where rewards were defined by the connectivity structure of a graph. Participants searched for rewards by clicking the nodes of a graph, where connections between nodes influenced rewards. This provided a correlated reward structure allowing for similarity-based generalization to aid in search, but where similarity was defined based on connectivity structure rather than perceptual features.

Methods
Participants and design. We recruited 100 participants on Amazon MTurk (requiring a 95% approval rate and 100 previously completed HITs). Two participants were excluded because of missing data, leading to a total sample size of N = 98 (M_age = 34.3; SD = 8.7; 32 female). Participants were paid $2.00 for completing the task and earned an additional performance-contingent bonus of up to $3.00. Overall, the task took 7.2 ± 3.3 minutes and participants earned $4.32 ± $0.24 on average.

Materials and procedure.
Participants were instructed to earn as many points as possible by clicking on the nodes of a graph (Fig. 1c). Expected rewards were defined by a graph-correlated structure, such that connected nodes generated similar rewards. Along with instructions, participants were shown four fully revealed graphs to familiarize them with the reward structure and answered three comprehension questions before starting the task.
The task was performed over 10 rounds, each corresponding to a different graph structure (8 × 8 lattice graphs with 40% of edges randomly pruned). In each round, participants were initially shown a single randomly revealed node, and had 25 clicks to either explore unrevealed nodes or to reclick previously observed nodes. Each clicked node displayed the numerical value (the most recent observation if selected multiple times) and a color aid, where darker colors corresponded to larger rewards. Participants were informed about their performance after each round as a percentage of the best possible score (w.r.t. the global optimum). The final performance bonus (up to $3.00) was also calculated based on this percentage, averaged over all rounds.
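The graph structures described above can be approximated with a short sketch. This is an assumption-laden illustration, not the task's actual generation code: it removes a uniformly random 40% of lattice edges and does not check whether the pruned graph stays connected, which the text does not specify. The function name, the rounding rule, and the seed are hypothetical.

```python
import numpy as np

def pruned_lattice(side=8, prune_frac=0.4, seed=0):
    """Adjacency matrix of a side x side lattice with a random fraction of edges removed."""
    rng = np.random.default_rng(seed)
    edges = []
    for r in range(side):
        for c in range(side):
            i = r * side + c
            if c + 1 < side:
                edges.append((i, i + 1))      # edge to the right neighbor
            if r + 1 < side:
                edges.append((i, i + side))   # edge to the neighbor below
    n_keep = len(edges) - int(round(prune_frac * len(edges)))
    keep = rng.choice(len(edges), size=n_keep, replace=False)
    A = np.zeros((side * side, side * side), dtype=int)
    for e in keep:
        i, j = edges[e]
        A[i, j] = A[j, i] = 1                 # undirected edge
    return A

A = pruned_lattice()
print(A.shape, A.sum() // 2)  # number of nodes squared, number of remaining edges
```

An 8 × 8 lattice has 112 edges, so pruning 40% (rounded to 45 edges) leaves 67.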
Judgments. Participants were informed that the last round was a "bonus round", where the goal of maximizing points remained the same. However, after 20 clicks, participants were shown a series of 10 unrevealed nodes and asked to make judgments about the expected reward and their confidence. After making the judgments, participants were forced to choose one of the 10 options, and then the task was completed as normal. Behavioral and modeling results exclude the bonus round, except for the analyses of the judgment data.

Results
Participants achieved higher rewards over successive trials (r = .93, p < .001, BF > 100; Fig. 2a) and decisively outperformed a random baseline (t(97) = 29.6, p < .001, d = 3.0, BF > 100). We found no influence of round number on performance (r = .49, p = .182, BF = 1), indicating that the fully revealed environments in the instructions and comprehension questions were sufficient for conveying the goal of the task and the correlational structure.

Model Comparison
We used computational modeling to make predictions about choices and participants' judgments in order to understand how subjects reasoned about graph-correlated environments. Models were fit using leave-one-round-out cross validation, and then compared using the summed out-of-sample prediction accuracy of the left-out rounds. Altogether, we compared five different models corresponding to different strategies for generalization and exploration (see below).
Each model computes a value q(s) for each option, which is then transformed into a probability distribution using a softmax choice rule P(s_i) = exp(q(s_i)/τ) / ∑_j exp(q(s_j)/τ), where the temperature τ is a free parameter controlling the level of random exploration. In addition, all models use a stickiness parameter ω that adds a bonus onto the value of the most recently chosen option, a common feature of reinforcement learning models (e.g., Gershman, Pesaran, & Daw, 2009).

Gaussian process with diffusion kernel. The Gaussian process (GP) model uses the diffusion kernel (Eq. 2) to generalize across nodes, assuming that rewards diffuse along the graph structure. To model how participants balance exploiting high value rewards with exploring highly uncertain options, we use upper confidence bound (UCB) sampling (Auer, 2002):

$$q(s) = m(s \mid D_t) + \beta \sqrt{v(s \mid D_t)}$$

where the exploration bonus β is a free parameter governing the level of exploration directed towards uncertain options.
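To make the choice rule concrete, the sketch below combines UCB values, the stickiness bonus, and the softmax. It is an illustration under assumptions, not the fitted model code: the UCB value is taken to be the mean plus β times the posterior standard deviation, stickiness is implemented as adding ω to the value of the last chosen option, and all names and numbers are hypothetical.

```python
import numpy as np

def choice_probabilities(mean, var, beta, tau, omega=0.0, last_choice=None):
    """Softmax over UCB values, with an optional stickiness bonus."""
    q = np.asarray(mean, dtype=float) + beta * np.sqrt(np.asarray(var, dtype=float))
    if last_choice is not None:
        q[last_choice] += omega      # bonus for the most recently chosen option
    z = q / tau
    z -= z.max()                     # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Equal means: the most uncertain option is favored by the exploration bonus
p = choice_probabilities([50, 50, 50], [1, 100, 1], beta=0.5, tau=1.0)

# Equal means and variances: stickiness favors the previous choice
p_sticky = choice_probabilities([50, 50], [1, 1], beta=0.5, tau=1.0,
                                omega=1.0, last_choice=0)
print(np.round(p, 3), np.round(p_sticky, 3))
```

Larger τ flattens the distribution toward random choice, while larger β shifts probability toward uncertain options.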

Bayesian mean tracker. The Bayesian mean tracker (BMT) is a prototypical reinforcement learning model that can be interpreted as a Bayesian variant of the Rescorla-Wagner model (Rescorla & Wagner, 1972; Gershman, 2015). The BMT also produces normally distributed predictions of reward, with mean m(s) and variance v(s) for each node, but these are learned independently for each node, without generalization. Predictions for unobserved nodes default to a prior of m_0 = 50 and v_0 = 500. The BMT has the error variance σ²_ε as a free parameter, which can be interpreted as inverse sensitivity. Smaller values result in larger updates to the learned mean m(s) and larger reductions of uncertainty v(s).
The BMT also uses UCB as a sampling strategy, along with stickiness and a softmax choice rule.
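The effect of the error variance on BMT updates can be illustrated with a short sketch. The update rule is not given in the text, so the code assumes a standard Kalman-filter-style update without process noise, applied only to the chosen node; the function name and the example reward are hypothetical, while the priors m_0 = 50 and v_0 = 500 are those stated above.

```python
import numpy as np

def bmt_update(m, v, node, reward, error_var):
    """Update the mean and variance of the chosen node only (assumed Kalman-style rule)."""
    m, v = m.copy(), v.copy()
    gain = v[node] / (v[node] + error_var)   # larger when error variance is small
    m[node] += gain * (reward - m[node])     # move the mean toward the observed reward
    v[node] *= 1.0 - gain                    # shrink the uncertainty
    return m, v

n_nodes = 64
m0 = np.full(n_nodes, 50.0)    # prior mean for every node
v0 = np.full(n_nodes, 500.0)   # prior variance for every node

m1, v1 = bmt_update(m0, v0, node=3, reward=80.0, error_var=100.0)
m2, v2 = bmt_update(m0, v0, node=3, reward=80.0, error_var=10.0)
print(m1[3], v1[3], m2[3], v2[3])
```

All other nodes keep their prior values, which is why the BMT makes identical predictions for every unobserved node.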
Successor representation. The successor representation (SR; Dayan, 1993) is a reinforcement learning model that performs generalization based on building a predictive map of the connection structure. The successor representation matrix M(s, s′) = (I − γT)^{−1} models the similarity of node s to node s′ based on future expected state occupancy, where we assume a random walk policy by setting the transition matrix T to the row-normalized graph Laplacian T = I − D^{−1}L. The extent of generalization is governed by the temporal discount parameter γ, which we treat as a free parameter.
While the SR has theoretical equivalencies to the diffusion kernel (Stachenfeld, Botvinick, & Gershman, 2014; Machado et al., 2018), there are practical differences when it is computed on finite graphs, and the extent of generalization is governed by the temporal discount rate γ rather than the diffusion parameter α. The SR predicts the expected reward of each node as:

$$m(s) = \sum_{s'} M(s, s') R(s')$$

where R(s′) is the observed reward at state s′. Because there are no uncertainty estimates, the SR does not implement any directed sampling using UCB. Instead, we set q(s) = m(s) and apply stickiness along with a softmax choice rule.
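The SR computation can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the random-walk transition matrix is built as D⁻¹A (equal to I − D⁻¹L), every node is assumed to have at least one edge, value predictions are taken to be the SR-weighted sum of observed rewards, and unobserved nodes contribute a reward of zero. The function names and the path graph are hypothetical.

```python
import numpy as np

def successor_representation(A, gamma):
    """M = (I - gamma * T)^{-1} for a random walk on the graph with adjacency A."""
    A = np.asarray(A, dtype=float)
    T = A / A.sum(axis=1, keepdims=True)   # row-normalized: T = D^{-1} A = I - D^{-1} L
    return np.linalg.inv(np.eye(len(A)) - gamma * T)

def sr_values(M, observed_rewards):
    """Assumed value rule: m(s) = sum over s' of M(s, s') * R(s')."""
    return M @ np.asarray(observed_rewards, dtype=float)

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
M = successor_representation(A, gamma=0.5)

# A reward of 1.0 observed at node 0; other nodes unobserved (treated as 0 here)
m = sr_values(M, [1.0, 0.0, 0.0, 0.0])
print(np.round(m, 3))
```

Larger γ spreads the influence of an observed reward to more distant nodes, playing a role similar to a larger diffusion parameter α.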
Nearest neighbors models. In addition to reinforcement learning models, we also consider two simple nearest neighbor averaging models. The d-nearest neighbors (dNN) model estimates the expected reward of an unobserved node by averaging the rewards of all observed nodes within a distance of d.
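A sketch of the dNN prediction is given below. It rests on assumptions not stated in the text: distance is taken to be shortest-path length on the graph (computed by breadth-first search), and a hypothetical `default` value is returned when no observed node lies within distance d. All names and example values are illustrative.

```python
from collections import deque
import numpy as np

def graph_distances(A, source):
    """Shortest-path (hop) distances from source to all nodes via breadth-first search."""
    A = np.asarray(A)
    dist = np.full(len(A), np.inf)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in np.flatnonzero(A[u]):
            if np.isinf(dist[w]):          # not visited yet
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def dnn_prediction(A, target, observed, d, default=None):
    """Average reward of observed nodes within graph distance d of the target."""
    dist = graph_distances(A, target)
    near = [r for node, r in observed.items() if dist[node] <= d]
    return float(np.mean(near)) if near else default

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
observed = {0: 10.0, 3: 40.0}   # node -> observed reward

pred_d1 = dnn_prediction(A, target=1, observed=observed, d=1)
pred_d2 = dnn_prediction(A, target=1, observed=observed, d=2)
print(pred_d1, pred_d2)
```

With d = 1 only node 0 is within range of node 1, while d = 2 also includes node 3, so the prediction becomes the average of both rewards.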

Model results
We compared models in terms of predictive accuracy using out-of-sample loss. Figure 2b shows the relative performance of each model in terms of the protected exceedance probability (PXP), which is a Bayesian model selection framework for estimating the probability that a given model is more prevalent in the population than all others, corrected for chance (Rigoux, Stephan, Friston, & Daunizeau, 2014). Overall, the GP had the highest predictive accuracy, with a protected exceedance probability of PXP = .90.
We also simulated the behavior of each model by sampling (with replacement) from the set of participant parameter estimates (10k samples) and computing the average learning curves (Fig. 2a). Although all models under-performed compared to the human curves, the GP had the closest match.

Judgments.
To provide additional support for our modeling results, we predicted participant judgments in the bonus round using parameters estimated over all rounds except the bonus round. Comparing participant and model predictions about expected reward, the GP had the highest average correlation (r = .41; Fig. 2c), which was better than the dNN (comparing Z-transformed correlation coefficients: t(97) = 3.0, p = .004, d = 0.2, BF = 7) and kNN models (t(97) = 3.0, p = .003, d = 0.2, BF = 8), but on par with the SR (t(97) = 0.1, p = .901, d = 0.0, BF = .11). This is intuitive, because the GP and the SR should generate close-to-equivalent mean predictions in our task. Correlations are undefined for the BMT, since it invariably makes the same prediction for all unobserved nodes.
Additionally, the GP uncertainty predictions were predictive of participant confidence ratings (Fig. 2d; see also Wu, Schulz, & Gershman, 2019). Using mixed effects regression to predict the raw confidence judgment (Likert scale 1-11), with the GP uncertainty estimate as a fixed effect and participant as a random effect, we find that higher GP uncertainty estimates predicted lower confidence ratings (β = −.30, t(414) = −5.7, p < .001, BF > 100). The BMT assumes the same level of uncertainty for all unobserved nodes (i.e., making no predictions about confidence), while none of the other models represent uncertainty.

Conclusion
We studied how people generalize in structured spaces, where the transition structure, rather than the features of individual stimuli, defines the distribution of rewards in the environment, extending previous work (Wu, Schulz, Speekenbrink, et al., 2018). We find that a Gaussian process (GP) model using the diffusion kernel is able to capture how people use generalization to guide search in structured environments. The GP provides the best predictive accuracy of choices, produces learning curves similar to human performance, and can robustly predict judgments about expected reward and confidence. While the SR matches the GP in terms of the correspondence between participant judgments and model predictions, it performed less well in predicting choices and in simulating human-like learning curves. Thus, while there is a theoretical equivalency between the SR and the diffusion kernel, the ability to estimate uncertainty within the GP framework gives it a clear advantage in describing search behavior.