Prospective planning and retrospective learning in a large-scale combinatorial game

What algorithms do people use to make decisions with future consequences in complex environments? In order to investigate the cognitive processes underlying sequential planning, we collected large-scale behavioral data in a challenging variant of tic-tac-toe. This task is at an intermediate level of complexity, providing rich behavior for which modeling is still tractable. We argue that a data set of this nature is necessary for distinguishing theoretical frameworks for integration between prospective and retrospective decision-making, and show preliminary evidence for the existence of both systems in our task. We outline a computational model based on an intuitive value function and decision tree search to demonstrate that people engage in prospective planning. We then explain discrepancies between the model’s predictions and observed data in early game choices, finding behavioral patterns consistent with retrospective learning.


Introduction
Reinforcement learning (RL) is arguably the most successful theoretical framework available for explaining human sequential decision-making and planning (Sutton & Barto, 2018). A central finding in the human RL literature is that people can select actions by combining information from prospective and retrospective systems. To choose an action in a given state, the prospective system mentally simulates the consequences of possible actions multiple steps into the future, whereas the retrospective system considers the outcome of actions taken in the same or similar states in past experience. These dual systems have been discussed under various names and implementations, such as deliberative and habitual (Dolan & Dayan, 2013), goal-directed and Pavlovian (Huys et al., 2012), and model-based and model-free (Daw et al., 2005). In general, the prospective system is slow and computationally expensive, but can determine high-value actions from any state, including ones that the agent has never previously encountered. On the other hand, the retrospective system is fast but needs previous experience to inform its policy.
One outstanding question is how people combine information from these systems or decide whether prospective planning is worthwhile in terms of time and computational resources. This meta-level decision may be based on uncertainty estimates provided by both systems (Daw et al., 2005), the historical accuracy of their predictions (Kool et al., 2017), or estimation of the value of information gained by planning (Callaway et al., 2018; Sezener et al., 2019). A related problem is how these systems can benefit from each other's computations. An appealing framework for integrating prospective planning and retrospective learning is amortization (Dasgupta et al., 2018), in which the agent re-uses simulated experience from the prospective system as additional training data for the retrospective system. Similarly, the retrospective system might influence the prospective system by adapting its internal models and search heuristics.
Here, we establish that people engage in prospective planning by fitting a computational model to their choices in a combinatorial game. We illustrate two behavioral patterns that the model fails to predict, but which are consistent with retrospective learning. While we have not yet developed a complete theoretical framework to fit our data set, we argue that its size and complexity are necessary for understanding the integration of prospective and retrospective RL in a naturalistic setting.

Task
An ideal experimental task for comparing models of prospective and retrospective integration needs to satisfy multiple conditions:
1. The task needs to be novel, so that participants start with uninformed priors to initialize their retrospective system.
2. The task needs to exhibit a large state space so that participants will continually encounter novel states irrespective of experience level, thereby necessitating prospective planning.
3. The task needs to contain a natural division into phases where either prospective or retrospective strategies are likely to be more effective.
4. Participants need to perform the task over long periods of time, since shifts between strategies might require extensive experience.
5. The data set needs to contain many participants, so that some states occur often enough to enable nuanced statistical analyses of people's changing action distributions.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

We designed a combinatorial game in which two players alternate placing pieces on a 4-by-9 board, attempting to connect 4-in-a-row. The game has a large state space (1.18 × 10^16 possible states), and naturally encourages people to adopt retrospective strategies for early-game moves and prospective strategies in the middle and late game (Figure 1A). In the opening, forward search is inefficient, since a decision tree leading to terminal states is necessarily deep and wide. Informative heuristics are also difficult to find, as the empty board contains no patterns. The user always plays first, so opening sequences are likely to repeat across different games, and people can learn an opening policy by trial-and-error. In other words, people are encouraged to develop an "opening book": a tabular representation of state-action mappings which can be updated by model-free RL. By contrast, in the middle and late game, positions are unlikely to ever repeat, but board states tend to contain more patterns and be closer to terminal states, favoring forward planning over retrospective learning.
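To make the "opening book" idea concrete, the following is a minimal sketch of a tabular state-action store with a simple model-free update. The class name, the learning rate, and the coding of outcomes as +1 for a win and -1 for a loss are illustrative assumptions, not part of the task or the model described in this paper.

```python
from collections import defaultdict


class OpeningBook:
    """Hypothetical tabular opening book: state -> {move: value estimate}."""

    def __init__(self, learning_rate=0.1):
        self.q = defaultdict(dict)
        self.learning_rate = learning_rate  # assumed free parameter

    def update(self, state, move, outcome):
        """Move the stored value toward the game outcome (delta-rule update)."""
        old = self.q[state].get(move, 0.0)
        self.q[state][move] = old + self.learning_rate * (outcome - old)

    def best_move(self, state):
        """Return the highest-valued known move, or None for an unseen state."""
        moves = self.q.get(state)
        return max(moves, key=moves.get) if moves else None


book = OpeningBook(learning_rate=0.5)
book.update("empty", "e2", 1.0)   # assumed coding: win = +1
book.update("empty", "a1", -1.0)  # assumed coding: loss = -1
print(book.best_move("empty"))
```

A table like this only helps in states that recur, which is why it suits the opening and not the middle or late game.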
Additionally, we partnered with Peak, a cognitive exercise company based in London, to implement the game on their mobile platform (https://www.peak.net). We are currently collecting data at a rate of 1.5 million games per month, and here we analyze a subset consisting of approximately 3.2 million games from 430,000 unique users. Users play against an AI agent implementing a version of our computational model, with parameters adapted from fits on previously collected human-vs-human games (van Opheusden et al., 2017).

Computational model
In order to demonstrate that users engage in prospective planning, we developed a computational model which combines tree search with a feature-based value function, stochastic feature dropping, and value-based pruning (van Opheusden et al., 2017).

Value function
The core component of our model is an evaluation function $V(s)$ which assigns values to board states $s$. We use a weighted linear sum of 5 features: center, connected 2-in-a-row, unconnected 2-in-a-row, 3-in-a-row and 4-in-a-row. The center feature assigns a value to each square corresponding to inverse Euclidean distance from the board center, and sums up the values of all squares occupied by the player's pieces. The other 4 features count how often the associated pattern occurs on the board. We associate weights $w_i$ with these features, and define
$$V(s) = c_{\text{self}} \sum_{i=1}^{5} w_i f_i(s, \text{self}) - c_{\text{opp}} \sum_{i=1}^{5} w_i f_i(s, \text{opponent}),$$
where $f_i(s, \cdot)$ denotes feature $i$ computed for the given player's pieces, $c_{\text{self}} = C$ and $c_{\text{opp}} = 1$ when it is the player's move, and $c_{\text{self}} = 1$ and $c_{\text{opp}} = C$ when it is the opponent's move. $C$ captures value differences between active and passive features. For example, a three-in-a-row feature signals an immediate win on the player's own move, but not the opponent's.
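The evaluation function can be sketched in code as follows. This is an illustration under stated assumptions, not the implementation used for the fits: the pattern features are approximated by counting length-4 lines that contain a given number of the player's pieces and none of the opponent's, and the opponent's weighted features are subtracted from the player's. The names `features` and `value` are hypothetical.

```python
from itertools import product
from math import hypot

ROWS, COLS = 4, 9
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def windows():
    """All length-4 lines of squares that fit on the board."""
    out = []
    for r, c in product(range(ROWS), range(COLS)):
        for dr, dc in DIRECTIONS:
            cells = [(r + k * dr, c + k * dc) for k in range(4)]
            if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
                out.append(cells)
    return out


WINDOWS = windows()


def features(own, other):
    """Return [center, connected 2, unconnected 2, 3, 4] for the pieces in `own`.

    `own` and `other` are sets of (row, col) squares. The pattern
    definitions here are simplifying assumptions.
    """
    mid_r, mid_c = (ROWS - 1) / 2, (COLS - 1) / 2
    center = sum(1.0 / hypot(r - mid_r, c - mid_c) for r, c in own)
    counts = {"c2": 0, "u2": 0, 3: 0, 4: 0}
    for cells in WINDOWS:
        if any(cell in other for cell in cells):
            continue  # a line blocked by the opponent cannot become 4-in-a-row
        idx = [k for k, cell in enumerate(cells) if cell in own]
        if len(idx) == 2:
            counts["c2" if idx[1] - idx[0] == 1 else "u2"] += 1
        elif len(idx) in (3, 4):
            counts[len(idx)] += 1
    return [center, counts["c2"], counts["u2"], counts[3], counts[4]]


def value(player, opponent, weights, C, players_move):
    """Weighted feature difference with active-passive scaling C."""
    c_self, c_opp = (C, 1.0) if players_move else (1.0, C)
    f_self = features(player, opponent)
    f_opp = features(opponent, player)
    return (c_self * sum(w * f for w, f in zip(weights, f_self))
            - c_opp * sum(w * f for w, f in zip(weights, f_opp)))
```

With `C > 1`, the same pattern is worth more to whoever is about to move, which captures the three-in-a-row example in the text.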

Tree search
The evaluation function guides the construction of a decision tree with an iterative best-first search algorithm. On each iteration, the algorithm chooses a board position to explore, evaluates the positions resulting from each legal move, and prunes all moves with value below that of the best move minus a threshold θ. After each iteration, the algorithm stops with probability γ, so that the number of iterations follows a geometric distribution.
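A minimal sketch of such a search loop is given below. It assumes that the position to explore is found by following the currently best child from the root, that values are backed up by minimax, and that the game is supplied through three hypothetical callbacks (`legal_moves`, `apply_move`, `evaluate`); these details are our assumptions for illustration and may differ from the fitted model.

```python
import random


class Node:
    def __init__(self, state, value, maximizing):
        self.state = state
        self.value = value            # value from the root player's perspective
        self.maximizing = maximizing  # True if the root player moves here
        self.children = {}            # move -> Node


def best_child(node):
    pick = max if node.maximizing else min
    return pick(node.children.values(), key=lambda ch: ch.value)


def expand(node, legal_moves, apply_move, evaluate, theta):
    """Evaluate all legal moves, then prune those more than theta below the best."""
    for move in legal_moves(node.state):
        child_state = apply_move(node.state, move)
        node.children[move] = Node(child_state, evaluate(child_state),
                                   not node.maximizing)
    if not node.children:
        return  # terminal position: nothing to expand
    sign = 1 if node.maximizing else -1
    best = max(sign * ch.value for ch in node.children.values())
    node.children = {m: ch for m, ch in node.children.items()
                     if sign * ch.value >= best - theta}


def backup(path):
    """Propagate minimax values from the expanded node back to the root."""
    for node in reversed(path):
        if node.children:
            node.value = best_child(node).value


def search(root_state, legal_moves, apply_move, evaluate, theta, gamma, rng=random):
    """Best-first search; stops after each iteration with probability gamma (> 0)."""
    root = Node(root_state, evaluate(root_state), True)
    while True:
        path = [root]
        while path[-1].children:          # descend along the current best line
            path.append(best_child(path[-1]))
        expand(path[-1], legal_moves, apply_move, evaluate, theta)
        backup(path)
        if not root.children:
            return None                   # no legal moves at the root
        if rng.random() < gamma:
            break
    return max(root.children, key=lambda m: root.children[m].value)
```

Setting `gamma=1.0` stops the loop after a single iteration, so the returned move depends only on the immediate evaluations of the root's children.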

Noise
To account for variability in people's choices, we add three sources of noise. Before constructing the decision tree, we randomly drop features at specific locations and orientations, which are omitted during the calculation of V(s). During tree search, we add Gaussian noise to V(s) at each node. Finally, we include a lapse rate λ.
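The three noise sources can be sketched as small helper functions. The function names, the noise standard deviation `sigma`, and the reading of a lapse as a uniformly random legal move are assumptions made for illustration.

```python
import random


def drop_features(feature_instances, delta, rng):
    """Keep each feature instance independently with probability 1 - delta."""
    return [f for f in feature_instances if rng.random() >= delta]


def noisy_value(v, sigma, rng):
    """Add zero-mean Gaussian noise to a node's value (sigma is assumed)."""
    return v + rng.gauss(0.0, sigma)


def choose_move(legal_moves, planned_move, lapse, rng):
    """With probability `lapse`, ignore the plan and pick a random legal move."""
    if rng.random() < lapse:
        return rng.choice(legal_moves)
    return planned_move


rng = random.Random(0)
kept = drop_features(["3-in-a-row at a1, horizontal", "center"], delta=0.2, rng=rng)
move = choose_move([0, 1, 2], planned_move=1, lapse=0.05, rng=rng)
```

Feature dropping is applied once before the search, so a dropped feature stays invisible throughout the whole tree, whereas the Gaussian noise is drawn afresh at every node.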

Model fitting
When fitting the computational model to behavioral data, we infer parameters for individual users with maximum-likelihood estimation. The model has 10 parameters: the 5 feature weights, the active-passive scaling constant C, the pruning threshold θ, stopping probability γ, feature drop rate δ, and the lapse rate λ. We estimate the log probability of a user's move in a given board position with inverse binomial sampling, and optimize the log-likelihood function with multilevel coordinate search. We account for potential overfitting by reporting 5-fold cross-validated log-likelihoods, with the same testing-training splits for model comparison.
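Inverse binomial sampling estimates the log probability of an observed move from a simulator alone: the model is sampled repeatedly until it reproduces the observed move, and if this takes K samples the estimate is the negative sum of 1/k for k from 1 to K − 1. The sketch below illustrates this estimator; the `simulate_move` callback and the `max_samples` cap are hypothetical additions for illustration.

```python
import random


def ibs_log_prob(simulate_move, observed_move, rng, max_samples=10_000):
    """Estimate log P(observed_move) by inverse binomial sampling.

    `simulate_move(rng)` draws one move from the model. The estimate is
    unbiased unless the (assumed) cap `max_samples` is reached.
    """
    estimate = 0.0
    for k in range(1, max_samples + 1):
        if simulate_move(rng) == observed_move:
            return estimate
        estimate -= 1.0 / k
    return estimate  # truncated at the cap, so slightly biased


# Toy simulator (illustrative only): plays "hit" with probability 0.25.
def toy_model(rng):
    return "hit" if rng.random() < 0.25 else "miss"


rng = random.Random(0)
runs = [ibs_log_prob(toy_model, "hit", rng) for _ in range(20_000)]
average = sum(runs) / len(runs)  # should be close to log(0.25)
```

Because each estimate is noisy, the per-move estimates are summed over a user's moves, and repeated runs can be averaged to reduce variance.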

Evidence for prospective planning
The average accuracy of the computational model's predictions on the hold-out data is 23.5 ± 0.8%, which is much better than chance (5 ± 0.1%). To test whether the tree search component is necessary to fit human choices, we compared the model's log-likelihood per move with that of a myopic model. In the myopic model, we fix γ to 1, which implies that the tree search terminates after a single iteration. Because model fitting and comparison are computationally taxing, we ran this analysis on 50 pseudo-randomly selected users. The cross-validated log-likelihood per move of the computational model is significantly higher across users than that of the myopic model (t = 6.69, p = 2 × 10^−8, Figure 1B), demonstrating that tree search is indeed necessary to predict people's moves.
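The text reports a t statistic without naming the test. As one plausible reading, the sketch below computes a paired t statistic on per-user differences in cross-validated log-likelihood per move; the choice of a paired test and the function name are assumptions, and the numbers in the usage line are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(full_model_ll, myopic_ll):
    """Paired t statistic for per-user log-likelihood differences."""
    diffs = [a - b for a, b in zip(full_model_ll, myopic_ll)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Made-up per-user values, for illustration only.
t = paired_t([-2.0, -2.1, -1.9, -2.3], [-2.2, -2.2, -2.1, -2.4])
```

A positive value indicates that the full model predicts held-out moves better than the myopic model for the typical user.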

Discrepancies between data and the model
One major difference between the model and our observed data is predicted response times. Previously, we found that people's response times correlate on individual trials with the number of model iterations (van Opheusden et al., 2017). However, their average trend over the course of a game differs considerably (Figure 1C). Early in gameplay, the model predicts that people search larger decision trees and thus have longer response times, but the data show the opposite. Therefore, it is likely that in situations where the board is fairly empty and no player can immediately win the game, there is a faster retrospective process that takes place before prospective planning begins. In the middle and late game, response time trends roughly follow model predictions.

Evidence for retrospective learning
The size of our data set allowed us to uncover clear evidence for retrospective learning in early-game positions. We found that users were significantly more likely to repeat their opening moves following wins rather than losses, and that these moves were primarily distributed in the center or corners of the board (Figure 2A). This effect continued on the third move, where users most often elected to play in the center positions closest to the two pieces already on the board (Figure 2B). On the fifth and seventh moves, however, the proportion of move repetitions based on game outcome decreased, and varied by specific board position despite consistent move selections (Figure 2C,D). These population-wide trends suggest that, in their first two moves, people base decisions partly on whether an opening strategy was successful in previous games, and then turn to alternative strategies in subsequent moves, when board positions are more likely to be unique.
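The outcome-dependent repetition analysis can be sketched for a single user as follows. The data layout (a chronological list of opening move and win flag per game) and the function name are assumptions for illustration, and the example games are made up.

```python
def repeat_rates(games):
    """Proportion of repeated opening moves after wins and after losses.

    `games` is a chronological list of (opening_move, won) pairs for one
    user. Returns {True: rate after wins, False: rate after losses};
    a rate is None when the user has no games of that kind.
    """
    counts = {True: [0, 0], False: [0, 0]}  # outcome -> [repeats, total]
    for (prev_move, prev_won), (next_move, _) in zip(games, games[1:]):
        counts[prev_won][1] += 1
        counts[prev_won][0] += int(next_move == prev_move)
    return {won: (r / n if n else None) for won, (r, n) in counts.items()}


# Made-up example: the user keeps a winning opening and switches after a loss.
history = [("e2", True), ("e2", False), ("a1", True), ("a1", True)]
print(repeat_rates(history))  # {True: 1.0, False: 0.0}
```

A higher rate after wins than after losses is the win-stay, lose-shift signature described above.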
Next, we show that response times in early stages of a game also follow patterns predicted by retrospective learning. This was similarly mediated by previous game outcome: user response times across the first 7 moves were, on average, longer after losses than after wins (Figure 3A). Furthermore, third-move response times decreased significantly when users encountered repeated 2-piece board states (Figure 3B). This result could be confounded, since on average users play faster after playing multiple games regardless of which states occurred. Therefore, we ran a control in which we sampled the average response times of other users who had played the same number of games, which explained some, but not all, of the effect. Finally, we verified that the effect was not solely due to recent memory of encountered states. We averaged third-move response times by the number of games elapsed since the same 2-piece board state last occurred, and found that response times were consistent regardless of how long ago a given state had been seen (Figure 3C). These response times were also drastically lower than for novel 2-piece board states.

Discussion
In this article, we analyze human behavioral data in a two-player combinatorial game, and find strong evidence for both prospective planning and retrospective learning. We demonstrate that a computational model based on a forward search algorithm fits human choices well in the middle and late game, but not the early game. However, we find that people's early-game moves as well as their response times are affected by the outcome of previous games in which they encountered the same board positions. These results demonstrate that people learn from past experience and are consistent with many retrospective learning algorithms, ranging from simple win-stay-lose-shift to sophisticated policy gradient methods. Our findings suggest that people strategically integrate information from a prospective and a retrospective system, and that a data set of this nature is essential for differentiating between existing theoretical frameworks of prospective and retrospective integration.