1 Introduction

The combination of Monte Carlo Tree Search (MCTS) and deep learning led to the historic event of the computer program AlphaGo beating a human champion in the game of Go [99], a feat which had long been considered beyond the capabilities of computational approaches. Since then, such approaches, which we term neural MCTS in this review, have enjoyed enormous popularity. They have been applied to many other games and yield promising results in the field of general game playing [89, 100].

While the effectiveness of neural MCTS in game contexts has been clearly established, the transfer of such approaches to non-game-playing applications is still a fairly recent development and hence less well understood. This transfer to other applications may create significant value, as previously computationally intractable problems may become tractable, just as playing the game of Go became tractable with the introduction of AlphaGo. More generally, neural MCTS methods may be able to find higher quality solutions than previous approaches and occupy a unique niche by balancing the trade-off between computational cost and solution quality.

In principle, neural MCTS methods are applicable to any (discrete) problem that can be addressed by traditional model-free reinforcement learning methods. The promise of neural MCTS approaches is in spending additional computational budget to increase decision quality compared to model-free reinforcement learning. This additional computational budget is spent on a planning procedure guided by neural networks, therefore potentially combining the advantages of forward planning and generalising from past experience.

Since games have different characteristics than many other problems neural MCTS approaches can be applied to, a direct transfer of algorithms like AlphaZero without any modifications to problems other than games is often not possible. This leads researchers to tailor neural MCTS methods to their specific applications and hence creates a considerable amount of algorithmic variation in the field. An overarching understanding of why certain variants are especially suitable for certain problem settings has not been explicitly developed in the literature yet. Further, the existing algorithmic variation is not well documented, as different variants continue to be introduced in individual publications, but little attention has been devoted to creating a clear view of the bigger picture. The first step towards such a view of the bigger picture is a review of the current state of neural MCTS approaches and applications.

Some surveys and reviews of MCTS approaches have been published in the past. For example, Mandziuk [68] provides a survey of selected MCTS applications to non-game-playing problems; however, it examines only two applications, neither of which features any neural guidance.

A more extensive survey is provided in [161], which also features a brief section on the combination of MCTS and deep learning. However, its focus is on games rather than other applications.

We are not aware of any reasonably extensive review of applications of neural MCTS methods in non-game-playing domains. We believe that such a review can shed light on the extent to which such methods can be transferred to more practical problems and how neural MCTS methods can be designed to cope with the requirements of different use cases. To gain an understanding of the kinds of problems for which neural MCTS is suitable, we first review problem settings of already existing neural MCTS applications. In a second step, we review the existing variation in the design of neural MCTS algorithms. Our intention behind this review is two-fold: First, we aim to give practitioners an overview to evaluate whether their applications are suitable for the use of neural MCTS and what algorithmic design choices are available to them. Second, we hope to contribute towards a more thorough understanding of the effect of different design choices and towards a more principled guide to create neural MCTS algorithms for specific problem settings.

The following research questions guide our review:

  1. In which disciplines, domains, and application areas is neural MCTS used? What are the commonalities and differences in the observed applications?

  2. What differences in the design of neural MCTS methods can be observed compared to applications in games?

  3. Where and how can neural guidance be used during the tree search?

To address these questions, we perform a systematic literature review and analyze the resulting literature. Starting with a keyword search in multiple databases, we filter articles for relevance, perform additional forward and backward searches, and extract a set of predefined data items from each article included in the review. The detailed review process is described in Section 3. Before that, we provide a brief introduction to the concepts of reinforcement learning, MCTS, and AlphaZero in the next section. The remaining sections begin with a focus on the problems described in the surveyed publications in Section 4, and continue with an examination of the employed methods in Section 5. We end our review with a brief discussion in Section 6.

2 Reinforcement learning & neural MCTS

2.1 Reinforcement learning

Reinforcement Learning (RL) is a paradigm of machine learning, in which agents learn from experience collected from an environment. To do so, an agent observes the state s of the environment and executes an action a based on this state. Upon acting, agents receive a reward r and observe a new state \(s'\). A problem which follows this kind of formulation is called a Markov decision process (MDP) if the new state \(s'\) only depends on the state s immediately preceding it and the action a of the agent. The agent’s goal in such an MDP is to maximize the return, i.e. the expected long-term cumulative reward, by learning an appropriate policy \(\pi \), i.e. a behavioural strategy that prescribes an action, or a probability distribution over actions, for a given state [108].

Such a policy can be learned directly from experience, e.g. through policy gradient methods, or it can be derived from a learned action-value function. An action-value function \(Q^\pi (s,a)\) estimates the value, i.e. the expected return, of taking action a in state s. From such a learned action-value function, a deterministic policy can be derived by greedily choosing the best action, while a stochastic policy can be derived by sampling actions proportionally to their value. In addition to the policy- and value-based approaches described so far, hybrid approaches which synergistically learn both policy and value functions are often employed as well. Such methods are called actor-critic approaches and often learn the state-value function \(V^\pi (s)\) instead of the action-value function \(Q^\pi (s,a)\). The former computes the expected return of state s when following policy \(\pi \), while the latter computes the expected return of state s when first executing action a and following \(\pi \) in subsequent steps [108]. The superscript \(\pi \) is often omitted for more concise notation.
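
To make the distinction concrete, the following minimal Python sketch derives both kinds of policy from a learned action-value function. The function names and the non-negativity shift in the sampling variant are illustrative assumptions, not part of any specific method discussed here.

```python
import numpy as np

def greedy_policy(q_values):
    # deterministic policy: choose the action with the highest estimated value
    return int(np.argmax(q_values))

def proportional_policy(q_values, rng=None):
    # stochastic policy: sample actions with probability proportional to their value,
    # after shifting the values so that they are non-negative (an illustrative choice)
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    q = q - q.min()                     # shift so the smallest value becomes 0
    probs = q / q.sum() if q.sum() > 0 else np.full(len(q), 1.0 / len(q))
    return int(rng.choice(len(q), p=probs))

# Example: Q(s, .) = [0.1, 0.5, 0.4] -> greedy always picks action 1,
# proportional sampling favours action 1 but still explores actions 0 and 2.
```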

In contrast to model-free approaches, in model-based RL, a model of the environment is used for planning. A model simulates the dynamics of the environment either exactly or approximately. Planning simply refers to the simulation of experience using a model and planning approaches can be categorized into background planning and decision-time planning. In the former, the training data consisting of real experiences collected from the environment is augmented with imagined experience generated from a model. In the latter, the action selection at a given time-step is dependent on planning (ahead) using a model, i.e. the consequences of different choices of actions are imagined to improve the policy for the current state [73]. MCTS, further explained in the following, can be considered a form of decision-time planning.

2.2 The connection between RL and MCTS

MCTS arose as a heuristic search method to play combinatorial games by performing a type of tree search based on random sampling. In such games, a player has to decide which action to perform in a given state to maximize an outcome z at the terminal state of the game. While MCTS has not been traditionally thought of as a type of reinforcement learning, the scenario described here bears strong similarities to the formulation of reinforcement learning problems given earlier and some authors have explored this connection in detail [116]. To avoid ambiguity, we will not use the term reinforcement learning to refer to MCTS in this article. Similarly to RL, MCTS also produces a policy \(\pi _{MCTS}\) and a value estimate \(v_{MCTS}\). In MCTS, the policy is produced for a given state s by a multi-step look-ahead search, i.e. by considering future scenarios and determining which sequence of actions will lead to favourable outcomes starting from s. This policy is produced anew for every encountered state, i.e. the determined policy does not generalize to states other than the one currently encountered. In contrast, traditional RL produces policies by learning from past experience that aim to generalize to unseen situations. At decision-time, no forward search is performed and an action is simply chosen based on the policy learned from past experience. In a sense, MCTS looks into the future, while traditional RL looks back to the past to determine actions. As a consequence, RL requires computationally expensive upfront training but incurs negligible computational cost at decision time, while MCTS requires no training, but performs computationally expensive planning at decision time.

2.3 MCTS

The general idea of MCTS is to iteratively build up a search tree of the solution space by balancing the exploration of infrequently visited tree branches with the exploitation of known, promising tree branches. This is accomplished by the repeated execution of four different phases: selection, expansion, evaluation, and back-propagation. In the selection phase, starting from the root node, actions are chosen until a leaf node \(s_L\) is encountered. New children are then added to this leaf node in the expansion phase and their value is estimated in the evaluation phase. Finally, the values of the newly added nodes are back-propagated up the tree to update the values of nodes along the path to \(s_L\). In the following, we describe each of these phases in more detail.

Selection

In the selection phase, starting from the root node, an action is chosen according to some mechanism. This leads to a new state, in which the selection mechanism is applied again. The process is repeated until a leaf node is encountered.

The mechanism of action selection is referred to as the tree policy. While different mechanisms exist, the UCB1 [4] formula is a popular choice. When MCTS is used with the UCB1 formula, the resulting algorithm is called Upper Confidence Bound for Trees (UCT) [53]. In UCT, the action selection is defined as follows:

$$\begin{aligned} a = \underset{a}{\textrm{argmax}} \frac{W(s,a)}{N(s,a)} + c \ \sqrt{\frac{\ln \ N(s)}{N(s,a)}} \end{aligned}$$
(1)

where W(s, a) represents the number of wins encountered in the search up to this point when choosing action a in state s, N(s, a) the number of times a has been selected in s, and N(s) the number of times s has been visited. The left part of the sum encourages exploitation of actions known to lead to favourable results, where the fraction \(\frac{W(s,a)}{N(s,a)}\) can be seen as an approximation of Q(s, a). The right part of the sum encourages exploration by giving a higher weight to actions that have been visited less often compared to the total visit count of the state. Exploration and exploitation are balanced by the exploration constant c. For game outcomes \(z \in [0,1]\), the optimal choice of c is \(c = \frac{1}{\sqrt{2}}\) [54], but for rewards outside this range, c may have to be adjusted [8].
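
As an illustration, a minimal Python sketch of the UCT score in (1) could look as follows; the function name and the convention of returning infinity for unvisited actions are illustrative assumptions.

```python
import math

def uct_score(W_sa, N_sa, N_s, c=math.sqrt(0.5)):
    """UCB1/UCT score for choosing action a in state s, cf. (1).

    W_sa: cumulative outcome (e.g. number of wins) for action a in state s
    N_sa: visit count N(s, a)
    N_s:  visit count N(s)
    c:    exploration constant (1/sqrt(2) is the classic choice for outcomes in [0, 1])
    """
    if N_sa == 0:
        return float("inf")  # try unvisited actions first
    exploitation = W_sa / N_sa                        # approximates Q(s, a)
    exploration = c * math.sqrt(math.log(N_s) / N_sa)
    return exploitation + exploration
```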

Expansion

After repeated application of the selection step, the search may arrive at a node with unexpanded potential children. Once this happens, one or more children of the node will be expanded. There are some possible variations in this phase. In some cases, all possible children are expanded when a leaf node (a node with no children) is encountered. In other cases, a single child is expanded when an expandable node (a node with some as of yet unexpanded children) is encountered. Expanding all children right away may lead to undesirable tree growth depending on the application.

In some literature, expandable nodes are also called leaf nodes. For clarity, we will only use the term leaf node to refer to true leaf nodes without any children in this article. Note that a leaf node is not the same as a terminal node, with the former merely being the current end of a tree branch, while the latter is a node that represents an end state of the game (see Fig. 1).

Evaluation

Once a node has been expanded, it is evaluated to initialize W(s, a) and N(s, a). This evaluation is sometimes also called simulation, roll-out, or play-out and consists of playing the game starting from the newly expanded node until a terminal state is encountered. The outcome z at the terminal state is the result of the evaluation. The game is played according to a default policy, which determines the sequence of actions between the newly expanded node and the terminal one. In the simplest case, the default policy samples actions uniformly randomly [54].

Instead of evaluating newly expanded nodes, it is also possible to only evaluate leaf nodes and leave the evaluation of the newly expanded nodes for a later point in the search, when they are again encountered as leaf nodes themselves.

Back-propagation

The outcome z from the evaluation phase is propagated up the tree to update W(s, a) among the preceding nodes. The visit counts of all selected nodes are incremented as well.

Once the back-propagation phase is finished, the process starts anew from the selection phase until a predefined simulation budget is reached.
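
The following sketch ties the four phases together in plain Python. It is a minimal, non-neural MCTS with uniformly random roll-outs; the `env` interface (`legal_actions`, `step`, `is_terminal`, `outcome`) and all function names are assumptions made for illustration, and the variant shown expands all children of a leaf at once.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []   # expanded children
        self.W = 0.0         # cumulative outcome backed up through this node
        self.N = 0           # visit count

def uct_value(child, c):
    if child.N == 0:
        return float("inf")  # unvisited children are tried first
    return child.W / child.N + c * math.sqrt(math.log(child.parent.N) / child.N)

def mcts(root_state, env, n_iterations=1000, c=math.sqrt(0.5)):
    """Plain (non-neural) MCTS with uniformly random roll-outs (a sketch)."""
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1) selection: descend with the tree policy until a leaf node is reached
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct_value(ch, c))
        # 2) expansion: add all children of the leaf (one of several possible variants)
        if not env.is_terminal(node.state):
            for a in env.legal_actions(node.state):
                node.children.append(Node(env.step(node.state, a), parent=node, action=a))
            node = random.choice(node.children)
        # 3) evaluation: roll out with the random default policy until a terminal state
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        z = env.outcome(state)
        # 4) back-propagation: update statistics along the selected path up to the root
        while node is not None:
            node.N += 1
            node.W += z
            node = node.parent
    # after the search budget is exhausted, act greedily w.r.t. root visit counts
    return max(root.children, key=lambda ch: ch.N).action
```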

Fig. 1
figure 1

Search tree as seen at a given time during the search. The current node is indicated with a thick border, as are the edges that were traversed in the current iteration of the search. \(s_1\) is currently a leaf node and its potential children \(s_3\) and \(s_4\) are considered for expansion, as their dotted edges signify. The terminal nodes \(s_7\), \(s_8\), \(s_9\), and \(s_{10}\) are represented by rectangular nodes

2.4 AlphaZero

The program known as AlphaGo drew attention for being the first computer program to beat a professional human player in a full-size game of Go [99] by combining deep learning and MCTS. While AlphaGo relied on supervised pre-training on human expert moves prior to reinforcement learning, its successor, AlphaGo Zero, was only trained using reinforcement learning by self-play. It further simplified the training by reducing the number of employed neural networks. AlphaGo Zero was still developed specifically for the board game Go and incorporated some game-specific mechanisms. In contrast, the next iteration of the AlphaGo family, AlphaZero, is more generic and can be applied to a variety of board games.

The algorithms introduced in this subsection are all examples of neural MCTS, i.e. MCTS guided by neural networks. While we focus on the AlphaZero family here due to its popularity, similar ideas were independently proposed under the name of Expert Iteration [2]. In the following, we provide more details on AlphaZero as one representative of neural MCTS methods utilized for games.

Like regular MCTS, AlphaZero follows the four phases of selection, expansion, evaluation, and back-propagation. Some of the phases are assisted by a neural network \(f_\theta \), which, given a state, produces a policy vector \(\textbf{p}\), i.e. a probability distribution over all actions, and an estimate v of the state value.

The selection phase in AlphaZero uses a variant of the Predictor + UCT (PUCT) formula [83]:

$$\begin{aligned} a = \underset{a}{\text {argmax}}~Q(s,a) + c \ P(s,a) \ \frac{\sqrt{N(s)}}{1+N(s,a)} \end{aligned}$$
(2)

where P(s, a) denotes a prior probability of choosing action a in state s given by \(f_\theta \) [100].

Once the selection phase reaches a leaf node \(s_L\), it is evaluated by the neural network \((\textbf{p}, v) = f_\theta (s_L)\). The leaf node is then fully expanded and its children initialized with \(N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a\). In the back-propagation step, the statistics of each node including and preceding \(s_L\) are then updated as: \(N(s_t, a_t) = N(s_t, a_t) + 1, W(s_t, a_t) = W(s_t, a_t) + v, Q(s_t, a_t) = \frac{W(s_t, a_t)}{N(s_t, a_t)}\) [100].
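
A minimal sketch of one such search iteration, combining PUCT selection (2) with the expansion and back-propagation updates just described, might look as follows. The `env` interface and `f_theta(state) -> (p, v)`, with `p` a mapping from actions to prior probabilities, are illustrative assumptions; sign handling for two-player self-play and further AlphaZero details such as root exploration noise are omitted.

```python
import math

class AZNode:
    def __init__(self, state, prior=0.0):
        self.state = state
        self.P = prior        # prior probability P(s, a) from the policy head
        self.N = 0            # visit count N(s, a) of the edge leading here
        self.W = 0.0          # cumulative value W(s, a)
        self.Q = 0.0          # mean value Q(s, a)
        self.children = {}    # action -> AZNode

def puct_select(node, c=1.0):
    # Eq. (2): argmax_a Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
    N_s = sum(ch.N for ch in node.children.values())
    return max(node.children.items(),
               key=lambda kv: kv[1].Q + c * kv[1].P * math.sqrt(N_s) / (1 + kv[1].N))

def run_simulation(root, env, f_theta, c=1.0):
    """One AlphaZero-style search iteration (a sketch, single-player sign convention)."""
    path, node = [root], root
    # selection: descend with PUCT until a leaf node is reached
    while node.children:
        action, node = puct_select(node, c)
        path.append(node)
    # evaluation + expansion: the network evaluates the leaf and supplies priors
    if env.is_terminal(node.state):
        v = env.outcome(node.state)
    else:
        p, v = f_theta(node.state)
        for a in env.legal_actions(node.state):
            node.children[a] = AZNode(env.step(node.state, a), prior=p[a])
    # back-propagation: N += 1, W += v, Q = W / N along the selected path
    for n in path:
        n.N += 1
        n.W += v
        n.Q = n.W / n.N
```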

Note that the expansion and evaluation phases are interwoven here to some degree and do not strictly follow the order of the phases in standard MCTS. In this review, we are generally not overly concerned with the MCTS phases as a strictly ordered set of algorithmic steps, but more with the function each phase fulfills in the tree search.

Once the search budget is exhausted, an improved policy \(\pi _{MCTS}\) is derived from the visit counts N(s, a) in the tree and a corresponding value estimate \(v_{MCTS}\) is extracted. The produced policy \(\pi _{MCTS}\) and value estimate \(v_{MCTS}\) are then used as training targets to further improve \(f_\theta \). Since AlphaZero plays two-player games, some mechanism is required to determine the actions of the second player. In what is called self-play, the actions for the second player are chosen by (some version of) the same policy currently being trained for the first player [100].
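
A small sketch of how \(\pi _{MCTS}\) can be derived from the root visit counts is given below; the temperature exponent is the mechanism used in the AlphaZero family, while the function name is illustrative.

```python
import numpy as np

def mcts_policy_from_visits(visit_counts, temperature=1.0):
    """Derive pi_MCTS from the root visit counts N(s, a).

    temperature = 1 gives probabilities proportional to visit counts;
    temperature -> 0 approaches greedy selection of the most visited action.
    """
    n = np.asarray(visit_counts, dtype=float)
    if temperature < 1e-8:
        pi = np.zeros_like(n)
        pi[np.argmax(n)] = 1.0
        return pi
    n = n ** (1.0 / temperature)
    return n / n.sum()

# Example: visit counts [10, 75, 15] yield pi_MCTS = [0.10, 0.75, 0.15] at temperature 1.
```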

As a model-based RL algorithm, AlphaZero needs a model of the environment to perform the tree search. This model is simply assumed to be given, although further extensions such as MuZero [89] demonstrate that such a model can be learned from collected experience during the search.

To recapitulate, the neural guidance in AlphaZero consists of evaluating nodes by using \(f_\theta \) to compute v and \(\textbf{p}\), which are then used in the selection phase. The way neural guidance is used in AlphaZero is not the only possible form of neural guidance. Other possibilities to guide the search exist, as will become apparent in Section 5.

3 Research methodology

We perform a systematic literature review, i.e. our review follows a structured, explicit, and reproducible method to identify and evaluate a body of literature relevant to our research questions [117]. Our approach is sequential, meaning that we follow a series of pre-defined steps in a given sequence consisting of a keyword search in multiple databases, a screening process to filter for relevant articles, a forward and backward search, data extraction from all included articles, followed by analysis and synthesis of the results.

While we aim to take a neutral position and hence do not want to limit the collected literature on arbitrary grounds, a comprehensive literature search attempting to capture all the relevant literature is infeasible due to the time requirements it would incur. Instead, we aim to balance feasibility and coverage by collecting a representative sample of the existing literature, limiting ourselves to a keyword search with a defined set of keywords in a limited number of databases. Both the set of keywords as well as the set of databases could be enlarged to arrive at more comprehensive results.

Fig. 2
figure 2

The literature search and screening process starting from a set of keywords to the final set of publications to be included in the review

Table 1 Information extracted from each article after screening

3.1 Search query & databases

To find relevant publications, we derive three types of keywords:

  1. Based on neural MCTS being a combination of MCTS and neural networks or traditional reinforcement learning:

     • “reinforcement learning” AND “monte carlo tree search”

     • “neural monte carlo tree search”

     • “neural MCTS”

  2. Based on MCTS providing the ability to perform decision-time planning in a model-based reinforcement learning setting:

     • “decision-time planning” AND “reinforcement learning”

  3. Based on the names given to algorithms in the AlphaGo family:

     • AlphaGo

     • AlphaZero

     • MuZero

Each of the partial search strings expressed after a bullet point above is connected with an OR operator to arrive at one overall search query.
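
Assembled in this way, the overall query takes roughly the following form (the exact operator syntax differs between databases):

```
("reinforcement learning" AND "monte carlo tree search")
OR "neural monte carlo tree search" OR "neural MCTS"
OR ("decision-time planning" AND "reinforcement learning")
OR AlphaGo OR AlphaZero OR MuZero
```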

We use this query to search for publications in the databases Web of Science, IEEE Xplore, Scopus, ScienceDirect, and PubMed. The search query is applied to the abstract, title, and keywords.

3.2 Eligibility criteria & screening

To be included in the review, a given publication must fulfill a predefined set of eligibility criteria:

  1. Must feature an application of MCTS guided by a neural network. This excludes publications which are purely reviews or surveys. By guidance, we mean that a learned policy or value function is used in at least one of the phases of the tree search. These functions can of course be learned by other means than neural networks. We explicitly only consider neural network based approaches here because preliminary searches showed that attempting to include other methods leads to many more irrelevant search results while providing relatively little additional value.

  2. The publication must contain at least some amount of validation of the presented approach. Purely conceptual articles are not considered.

  3. The problem to which neural MCTS is applied must not be a game. While many impressive results have been achieved using neural MCTS in game playing, we are interested in determining whether such approaches transfer to other applications as well. We do consider applications that are not typically considered a game, but have been modelled as a game to facilitate the use of neural MCTS.

  4. Publication language must be English.

For each of the publications retrieved during the keyword search described in the previous section, we assess its eligibility according to the above criteria. After the removal of duplicates, the screening process is conducted in two phases. In the first phase, we only examine the abstract of each article and discard it if it is clear that at least one of the above criteria is not fulfilled. If there is any ambiguity, we reexamine the article in a second phase, where we repeat the process using the full text of the article. Any article that is not discarded in this second phase will be included in our review.

3.3 Forward and backward search

After the abstract and full-text screening described above, we perform a forward and backward search [130] based on the set of publications which has passed the screening process. That is, for every article, we check its references for further relevant publications and also look for publications that in turn reference the articles which passed our screening.

In this step, we notice that among the additionally identified literature, many simply cite an approach already included in our review for a similar application without performing any modifications or providing additional details. We do not include such publications in our review since they do not provide any benefit in addressing our research questions. The eligibility criteria described above apply to the results from forward and backward search as well.

The full search process from keyword search to our final set of publications is visualized in Fig. 2.

3.4 Data extraction

For every article resulting from the search process described above, we extract information along a set of predefined categories. These are given in Table 1, where problem refers to a short description of the examined problem, time is either continuous or discrete, horizon either finite or infinite, transitions either deterministic or stochastic.

Since it can be difficult to extract information from works originating from different disciplines with differing terminology and varying descriptions of details, we generally err on the side of providing incomplete rather than wrong information.

Some of the collected information did not lead to notable insights and will hence not be discussed in this review. This includes the activation functions, the author affiliations, and comparisons of the neural MCTS approach with model-free RL or MCTS without neural guidance. In such comparisons, neural MCTS tends to find solutions with superior quality, but this could simply be due to positive-results publication bias.

Table 2 Applications in chemistry
Table 3 Applications in material science

4 Neural MCTS applications

4.1 Application fields

To determine the applicability of neural MCTS outside of game-playing, we survey the areas of application to which neural MCTS has been transferred. The algorithmic details of individual approaches are analyzed in Section 5. Here, we simply outline where such approaches are applied. We find applications in a wide variety of domains including chemistry, medicine, production, electrical engineering, and computer science. In the following, we assign each publication to a specific application area. Note that this merely serves the purpose of creating an overview of the research landscape. Many articles could be assigned to more than one category, and the choice of categories itself could have been made in many different ways.

Chemistry

In the chemical literature in particular, neural MCTS has received considerable attention. It has been used to perform synthesis planning, de novo molecular design, protein folding, and more (see Table 2). In many cases, states are represented by a simplified molecular-input line-entry system (SMILES) string, a notation allowing for the representation of molecular structure [132]. Such a string can then be iteratively constructed during the MCTS process until a viable molecule is found or the attempt is discarded.

Molecular applications appear to be a comparatively mature branch of neural MCTS research, as evidenced by the fact that authors are building on each other's work and by the existence of standardized, commonly used implementations such as ChemTS [145]. This is an exception rather than the norm, as most literature is more disjointed and most other works either do not specify any implementation or use a custom one (see Appendix A for a detailed list of observed implementations).

Material science

Closely related is the domain of material science (see Table 3). In some cases, the SMILES representation is used here as well, such as in the design of metal-organic frameworks [110, 135], where metal ions and organic ligands are combined to create structures of various shapes.

In other cases, neural MCTS is used to optimize the thickness of alternating layers of two materials in a multilayer structure such that certain desired properties are achieved [24] and to generate models which describe the mechanical behaviour of materials in various circumstances [122].

Electronics design

In the design of electronic circuits, neural MCTS is applied to solve routing problems in multiple cases [15, 41, 80] (see Table 4), where it can outperform e.g. traditional A* based approaches [41]. Thacker et al. [111] further explore performing redundancy analysis in memory chips using neural MCTS.

Table 4 Applications in electronic design
Table 5 Applications in energy systems

Energy systems

Neural MCTS finds many applications in the operation of energy systems, especially in innovative grid concepts which aim to enable a more sustainable energy supply (see Table 5). This includes optimizing the operation of residential microgrids in an online fashion [96] as well as non-intrusive load monitoring and identification [50].

Table 6 Applications in production systems
Table 7 Applications in combinatorial optimization problems

Some attention is also directed towards managing a grid consisting of renewable energy sources and battery systems, which absorb the former’s fluctuations in power output. For instance, Al-Saffar et al. [1] devise a system to coordinate voltage regulation in a distributed energy network with battery systems at multiple locations, while [133] use neural MCTS to address predictive maintenance problems in such systems.

Production

Applications in production systems mainly concern themselves with various kinds of scheduling approaches. Here, the processing sequence of jobs or operations on different machines is to be determined to e.g. minimize the total time until all jobs have finished processing (see Table 6). Traditional RL approaches are also increasingly being investigated for these types of problems [34, 51, 86, 149]. A closer examination of the advantages and disadvantages of traditional RL and neural MCTS methods for scheduling approaches may be an interesting line of future research.

Further applications in production include line buffer planning in car manufacturing [37] as well as assembly planning in collaborative human-robot scenarios [147].

Combinatorial optimization

While the scheduling problems described above are problems from the field of combinatorial optimization, the authors approach them from a production perspective and pay close attention to the details of their individual use cases. A second group of combinatorial optimization applications can be found in Table 7. Here, the problems are more abstract and investigated through a computer science lens.

Table 8 Applications in cloud and edge computing
Table 9 Applications on graphs

Combinatorial optimization problems share many similarities with combinatorial board games. In a reinforcement learning context, they are typically solved constructively by building a solution iteratively from scratch, or they are solved by improvement, i.e. by iteratively improving some existing solution. In both cases, the problem features inherently discrete time steps and an inherently discrete action space. Differences can be observed in that there is no obvious notion of winning or losing, but rather a sense of relative performance. In addition, combinatorial games feature a fixed board and a fixed set of game pieces. In a traveling salesman problem, for example, the equivalent of a board may be considered to be a graph with weighted edges connecting different cities. Such a graph, however, will vary, with each problem instance consisting of different cities to be traveled through. In machine scheduling problems, operations may be considered the game pieces. Depending on the exact problem formulation, each operation needs to be processed on a specific machine for a specific duration. An operation could therefore be described as a game piece which can be freely parameterized by properties such as the duration, contrary to pieces in typical games.

Cloud & edge computing

As before in the production domain, a primary application in cloud and edge computing concerns scheduling problems (see Table 8). Again, the scheduling problems are combinatorial optimization problems, but the authors’ interests arise from the domain of cloud computing itself and the presented problems are less abstract.

Table 10 Applications in networking and communications

Graph navigation

Navigating a graph from a given node to a target node is a relevant task in many settings, but is gaining attention, particularly in knowledge graph research. Here, a common task is knowledge graph completion, which involves the prediction of missing relations between individual entities [128]. Graph navigation is an important sub-task of knowledge graph completion [95], for which neural MCTS has been investigated (see Table 9) and been shown to outperform existing baselines [95].

Networking & communications

Applications in networking and communications (see Table 10) range from network function virtualization [30, 57], to network topology optimization [69, 120, 144, 146, 159, 160], to spectrum sharing in mobile networks with multiple radio access technologies [10, 142].

One notable example here is the use of neural MCTS for intrusion defense in software-defined networking scenarios [32]. Here, the defense problem is actually modelled as a two-player game, which is an exception among mostly single player scenarios within the surveyed literature.

Table 11 Applications in autonomous driving, as well as path and motion planning
Table 12 Applications in Natural Language Processing

Autonomous driving & motion planning

Autonomous driving applications make up a comparatively large group (Table 11), including general motion planning tasks [39, 61, 76, 131], motion planning tasks in autonomous parking scenarios [104, 150], and motion planning tasks in multi-agent settings [82, 103]. More specialised tasks such as lane keeping [56], overtaking [70], and higher-level decision making during autonomous driving [44] are considered as well.

Such problems are often fundamentally different from combinatorial games. For instance, Weingertner et al. [131] consider a motion planning problem, in which the acceleration of a vehicle is controlled along a predetermined path. In its natural formulation, such a problem requires selecting continuous actions in continuous time. To apply neural MCTS, both time and action space are discretized. The resulting solution demonstrates good performance and outperforms A* search, pure deep learning approaches, and model predictive control.

Natural language processing

In Table 12, several applications of conversational agents are shown, in which agents assist users in completing tasks [125], negotiate with users to divide a given set of resources [49], and try to convince users of a certain view by framing messages in different ways [9]. While humans are difficult to simulate as conversational partners explicitly, models that approximate narrow conversational behaviour of humans can be trained on historical data and then utilized as part of the tree search [9, 125].

Natural language processing is itself a diverse field, in which topics such as sentiment analysis [20] and named entity recognition [59] are being addressed with neural MCTS.

Machine learning

MCTS guided by machine learning models can in turn be used in certain machine learning tasks (see Table 13). For instance, [47, 129] apply neural MCTS to reduce the size of neural networks, by network distillation in the former case and convolutional neural network (CNN) filter pruning in the latter.

Lu et al. [64] further approach the task of symbolic regression with neural MCTS. Here, instead of solving regression tasks by adjusting the coefficients of, e.g., a linear or polynomial function, the terms of a function themselves (e.g. sinusoids, square operations, constants) are determined and connected through mathematical operators such as addition and division. In MCTS, the full expression of a function can be built up step by step.

Computer science

Computer Science offers a wide range of opportunities for the application of neural MCTS (see Table 14), many of which are presented in separate sections. Others do not warrant their own section due to the small number of publications in their specific niche, but are nevertheless interesting. The number of publications this applies to demonstrates the wide applicability of neural MCTS.

One notable example is AlphaTensor [26], where neural MCTS is used to find efficient algorithms for matrix multiplication. Others include the optimization of database queries [151], the recovery of sparse signals [18, 155], and various applications in quantum computing [16, 22, 102].

Table 13 Applications in machine learning
Table 14 Applications in computer science
Table 15 Applications in various other fields

Finally, Table 15 shows applications in various other fields that do not fit into any of the previous categories. These feature a diverse set of problems including the optimization of user interfaces [98, 113], control of a pneumatic actuator [55], as well as design tasks for trusses [80] and fluid structures [3]. As in many examples here, similar design tasks are also being approached with traditional RL [29]. In future studies, direct comparisons of traditional RL and neural MCTS methods on specific problems may help to decide what approach is preferable under which conditions.

In summary, the applications described in the above sections originate in a variety of different disciplines including chemistry, medicine, computer science, mathematics, and electrical engineering. The types of problems include optimization tasks of various kinds, control problems, generative design tasks, and many others. Clearly, neural MCTS shows wide applicability beyond combinatorial games, to problems which in part share and do not share the properties of games.

4.2 Application characteristics

Like games, the problems surveyed here can be formulated as an MDP. Playing combinatorial games involves choosing discrete actions at discrete time-steps in a finite horizon setting, i.e. games are episodic with well-defined terminal conditions. Rewards are typically sparse and correspond to a small set of possible game outcomes: loss (-1), draw (0), and win (1). While the state transitions of each individual player are typically deterministic, the presence of a second player introduces uncertainty about the states which will be encountered at the next turn. While many of the applications surveyed here share many of these properties, neural MCTS is also applied to applications which differ from combinatorial games in one or multiple dimensions.

Time

Many settings do not have a turn-based nature, but allow for the execution of actions at arbitrary points in time, i.e. time often has a continuous nature. This does not appear to hinder the application of neural MCTS, as many authors simply discretize the time dimension in their problem formulation [5, 28, 33, 39, 47, 55, 56, 59, 76, 104, 131, 154].

Finite & infinite horizons

Like combinatorial games, most of the applications surveyed here consist of a finite horizon problem, i.e. the problem is solved in episodes of finite length. In some cases, the natural formulation of the problem features an infinite horizon. To apply neural MCTS, episodes can then be created artificially by setting a maximum number of steps after which the episode always terminates, as is done in [58, 96, 113].

Transitions

Many of the surveyed problems are of a completely deterministic nature, which is a fundamental difference compared to the combinatorial games domain. In such cases, the tree search may be modified to take advantage of the deterministic transitions (see Section 5.3 for more details).

Nevertheless, some problem formulations with stochastic state transitions can be observed [32, 38, 58, 69, 92, 113, 125, 142].

Rewards

The reward structure of typical problem settings often does not share the simplicity of the reward function present in games. Instead of a set with two or three distinct reward values, rewards are typically given on a continuum corresponding to the quality of the obtained solutions. Often, the rewards are not even clearly bounded on one or both sides (see e.g. [5, 45, 67, 90, 96, 125, 147, 154]).

In some cases, the reward is transformed by self-play inspired mechanisms. This will be investigated in more detail in Section 5.2.

While the majority of surveyed problems feature some kind of sparse reward at the end of an episode, in some cases, more fine-grained rewards after each action are incorporated into the tree search [26, 44, 105].

Action spaces

MCTS naturally lends itself well to discrete action spaces, as is the case in combinatorial board games. While modifications of (neural) MCTS for continuous action spaces exist [72], the vast majority of applications surveyed here exhibit discrete action spaces. Notable exceptions are the approaches of Lei et al. [61] and Paxton et al. [76].

Further, Raina et al. [80] apply a hierarchical reinforcement learning approach, in which neural MCTS is used for an overarching set of discrete actions while subsequent, continuous actions are determined by another mechanism.

Finally, it is always possible to discretize a naturally continuous action space. While this reduces the amount of precision with which actions can be chosen, some applications can nevertheless be successfully approached in this manner [26, 96].

State spaces

While the exact characteristics of state spaces depend not only on the underlying problem, but also on how the problem is modelled, games such as Go have a well-defined, regular board, which is helpful in formulating a state space. In Go, the board of fixed size consisting of cells which are positioned in spatial relation to each other lends itself well to processing by a CNN. The problems surveyed here feature a diverse range of state spaces which are processed by different kinds of neural networks. In addition to CNNs, the employed neural networks include recurrent neural networks such as long short-term memory networks [5, 25, 28, 35, 40, 42, 46, 96, 113, 119] and gated recurrent units [20, 33, 49, 59, 65, 106, 110, 122, 123, 141, 145, 158], as well as graph neural networks [48, 52, 61, 63, 75, 77, 102, 105, 109, 123, 136, 137, 141, 157]. Less frequent types include transformers [26, 78, 87] and the DeepSet architecture [146]. In many cases, a simple multi-layer perceptron is sufficient [10, 11, 12, 13, 16, 22, 32, 36, 37, 38, 41, 45, 55, 56, 58, 76, 81, 90, 91, 97, 112, 125, 126, 131, 140, 150, 151, 155].

The chosen architecture and its depth will to some degree determine what kind of hardware is required to train a neural MCTS approach.

4.3 Hardware requirements

The training of AlphaZero involved more than 5000 tensor processing units [100]. One might hence question whether the application of neural MCTS is a viable option for researchers and practitioners who do not have access to resources of that magnitude.

Some of the publications we review do report the usage of significant resources. For example, Huang et al. [46] use up to 300 NVIDIA 1080Ti and 2080Ti GPUs during training. The majority of the reported hardware, however, is not out of reach for typical organizations and even private individuals. Genheden et al. [36] report that they use a single Intel Xeon CPU with a single NVIDIA 2080Ti GPU on a machine with 64 GB memory. Many others use a single high-end consumer CPU and GPU [57, 58, 84, 94, 155].

On the lower end, some researchers even use consumer notebooks to train neural MCTS methods [45, 141, 143].

Fig. 3
figure 3

MCTS as a policy improvement operator. The learned policy and value function are used to guide the tree search, which then produces an improved policy \(\pi _{MCTS}^{*}\) and value estimate \(V_{MCTS}^{\pi ^*}\) for a given state. As visualized by the dotted lines, \(\pi _{MCTS}^{*}\) and \(V_{MCTS}^{\pi ^*}\) can then also be used as training targets for the neural network

Clearly, hardware requirements vary by application and the complexity of the employed neural networks. While it is difficult to predict what level of hardware is required for a given application and desired solution quality, it is clear that moderately powerful hardware can be successfully utilized in many applications.

5 Neural MCTS methodologies

After gaining an overview of the breadth of possible neural MCTS applications in the previous section, we now turn our attention to the design of neural MCTS approaches as they were encountered during the review.

5.1 Guidance network training

Before delving into the inner mechanisms of neural MCTS, we content ourselves with the knowledge that learned policy and value networks are used to guide the tree search in some way. In the following, we first dedicate some attention to the training procedure used in AlphaZero [100], before discussing alternatives found during the review.

Policy improvement by MCTS

In AlphaZero [100], a learned policy is iteratively improved by guiding an MCTS search and in turn using the search results to improve the learned policy (see Fig. 3). We refer to this procedure as policy improvement by MCTS. More concretely, the learned policy \(\pi _\theta \) guides the tree search in one or multiple of the search phases and a new policy \(\pi _{MCTS}\) for the state under consideration is obtained after a given number of MCTS simulations. Since this new policy is typically stronger than the initial, learned one, it can be used as a training target for the policy network. More precisely, the policy network and MCTS produce policy probability vectors \(\textbf{p}_\theta \) and \(\textbf{p}_{MCTS}\) for a given state, where the former can be seen as the actual prediction and the latter as the prediction target. These can then be used in a cross-entropy loss function to train the policy network: \(L_{CE} = - \ \textbf{p}_{MCTS}^T \ \log \ \textbf{p}_\theta \). Accordingly, if a value function is learned alongside the policy, its value estimates are adjusted in the direction of those found by MCTS by using the mean squared error (MSE) as a loss function: \(L_{MSE} = (v_{MCTS}-v)^2\). Both terms are typically combined with a regularization term into a single loss function

$$\begin{aligned} L = (v_{MCTS}-v)^2 - \ \textbf{p}_{MCTS}^T \ \log \ \textbf{p}_\theta + c \ \Vert \theta \Vert ^2 \end{aligned}$$
(3)

The vast majority of the articles we reviewed that improve a learned policy by MCTS use the loss function given in (3). We find two exceptions to this, one in which the Kullback-Leibler divergence is used instead of the cross-entropy [17] and one in which the Kullback-Leibler divergence is also used instead of the cross-entropy, but a quantile regression distributional loss is additionally used instead of the MSE [26]. It may be worth investigating what effect these loss functions have on training, but the combination of cross-entropy and MSE clearly emerges as the default choice during our review.
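
For concreteness, a minimal PyTorch sketch of the combined loss in (3) is given below; it assumes the policy head outputs logits and that the L2 term is applied over all network parameters, and all function and variable names are illustrative.

```python
import torch

def alphazero_loss(policy_logits, v_theta, p_mcts, v_mcts, params, c=1e-4):
    # value loss: mean squared error between predicted value and MCTS value estimate
    value_loss = torch.mean((v_mcts - v_theta) ** 2)
    # policy loss: cross-entropy between the MCTS policy and the predicted policy
    log_p_theta = torch.log_softmax(policy_logits, dim=-1)
    policy_loss = -torch.mean(torch.sum(p_mcts * log_p_theta, dim=-1))
    # L2 regularization over the network parameters theta
    l2 = c * sum(torch.sum(p ** 2) for p in params)
    return value_loss + policy_loss + l2
```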

Fig. 4
figure 4

Depending on the approach, policy and value networks are trained before the search, during the search, or both. Depending on this choice, different training methods are employed. Policy gradient and Actor-Critic are families of algorithms that encompass multiple specific algorithms

Alternative training approaches

We find two broad groups of training approaches during our review: (1) training before the networks are used to guide the tree search and (2) training during the tree search, i.e. iteratively performing tree search and using its results for training. In group (1), training is facilitated either by supervised learning on labelled examples or by classical reinforcement learning algorithms in the policy-based, value-based, and actor-critic families. In group (2), the dominant approach is policy improvement by MCTS, but some variations on this exist. Finally, both approaches can be combined by first performing what can be considered a pre-training and then training further during the tree search. This is sometimes referred to as a warm-start.

Figure 4 shows the distribution of these different approaches as found in the surveyed publications and which specific methods are employed in each approach. A large portion of authors train their networks during the search by using policy improvement by MCTS. In some cases, training during the search occurs by other methods such as Q-Learning [1, 37, 95, 123] and Maximum Entropy RL [157] on the MCTS trajectories. Instead of improving the policy (alongside the value function) by MCTS, it is also feasible to refine the value function individually [62, 137], without any learned policy. We term this approach value function refinement by MCTS here. Likewise, while policy improvement by MCTS typically trains both a policy and a value function, sometimes the policy is also trained in isolation [93].

When trained before the tree search, networks are most often trained by supervised learning on labelled demonstrations. Q-Learning [75, 76, 119, 125, 131], policy gradient [24, 45, 55, 57, 64, 77] and actor-critic methods [17, 41, 121, 129] are also employed.

When both training phases are performed, the most common approach is to combine supervised pre-training with policy improvement by MCTS [16, 81, 107, 126, 154].

Overall, there is some variety in the employed training approaches, but the dominant strategies are supervised training before the search and policy improvement by MCTS, sometimes combined in one approach as in the original AlphaGo publication [99]. In a combinatorial game setting, policy improvement by MCTS requires some mechanism by which the opponent’s moves are generated. In the following, such a mechanism and its relevance for applications beyond games are discussed.

5.2 Self-play beyond games

One of the components leading to the success of AlphaGo is the concept of self-play [99]. In a self-play setting, opponents in a multi-player game are controlled by (some version of) the same policy, i.e. the policy plays against itself [43]. Learning in such a scenario has the advantage that the policy always faces an opponent of comparable skill, which evolves as the training progresses. However, since many non-game-playing applications have an inherently single-player nature, the role of self-play beyond games is not obvious.

To gain a clearer understanding of the applicability of self-play in such cases, we surveyed its usage among the publications included in our review. During this process, it became apparent that many authors use the term self-play, but that the meaning of the term varies. This may be due to the lack of an accepted, standardized definition. In the following, we first delineate different meanings of the term we encountered and then report the usages of different versions of self-play in our review.

In a two-player setting, the term self-play can be understood intuitively. As used in the original AlphaGo publication [99], self-play entails that the policy currently being learned plays a game against some older version of itself in a two-player turn-based setting. This means that this other version of the current policy is being used to generate new states by playing every other turn of the game, as well as to obtain the final reward of the game.

In single-player settings, the term self-play is also often used, but its meaning is less obvious. In a single-player setting, the state generating property described above is not applicable, since the state transitions do not depend on the actions of another player. Generating a new state requires only the current state and the agent’s action. The reward generating property described above, however, is applicable if the reward function is designed accordingly. If the reward is not simply dependent on the performance of the current policy, but on the relative performance compared to some prior version of the policy, the reward generating property of self-play is transferred to the single-player setting. In other words, the process of a policy trying to beat its own high score has similarities to the concept of self-play in two-player settings. While this is sometimes also called self-play, Mandhane et al. [66] introduce the term self-competition for this type of approach. In the remainder of this review, we will adopt this term and reserve self-play for multi-player settings to avoid confusion. While simple versions of self-competition can be implemented trivially, Laterre et al. [60] introduced a more substantiated form of self-competition named ranked reward, followed by the approaches of Mandhane et al. [66] and Schmidt et al. [88].

However, many authors claim to implement self-play without obviously applying any of the concepts described above (see e.g. [80, 111]). While it is hard to be certain about what is meant in such instances, we suspect that two further concepts are sometimes termed self-play in the literature. The first is the practice of keeping track of the best policy by evaluating the current policy against the previous best one. If the current policy can outperform the previous best one on some defined set of problems, it replaces the currently saved best policy. This simply serves the purpose of having access to the best policy after training completes since training does not necessarily improve the policy monotonically. In such cases, the outcomes of evaluation are not used as rewards to train the policy. Consequently, no learning follows from the policy playing against another version of itself, i.e. it is not a mechanism by which the current policy is improved, but merely evaluated.

The last concept, which we suspect is sometimes described as self-play, is policy improvement by MCTS as introduced above (see e.g. [12]).

To be clear, we will use the term self-play only to describe multi-player cases where the state generating property as well as the reward generating property hold, and the term self-competition only for single-player cases where the reward generating property holds.

Table 16 Self-competition

While we would ideally like to report the usage of self-play and self-competition for all publications included in our review, we refrain from doing so when the terms are used ambiguously and instead only report a selection of notable examples where their meaning has been clearly established.

Self-play

Actual self-play appears to be fairly rare in the non-game-playing literature. We can only attribute the use of self-play to a single work [27], in which a problem is modelled as a two-player game and a policy learned by self-play. In some other cases, problems are modelled as two-player games as well, but the resulting games are asymmetric, i.e. the players have different action spaces and hence require different policies [32, 138,139,140,141]. In such cases, two different neural networks each learn a policy.

Self-competition

In terms of self-competition, we observe instances of ranked reward [109, 123, 127] as well as naive approaches (see Table 16). In a naive approach, the performance of the current policy on the current problem is simply evaluated as some score and compared against the score of the best policy observed up to this point on the same problem. If the current policy’s score is better, the game is won (\(r=1\)), if it is worse, the game is lost (\(r=-1\)), and if it is equivalent, the outcome is a draw (\(r=0\)) [46, 52]. A variation of this is to not use the best policy, but to evaluate against the average score of a group of saved policies [122].
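
A naive self-competition reward of this kind can be sketched in a few lines; the function name and the tie handling are illustrative assumptions.

```python
def self_competition_reward(current_score, best_score, higher_is_better=True):
    """Naive self-competition: compare the current policy's score on a problem
    instance against the best score achieved so far on the same instance."""
    if current_score == best_score:
        return 0                        # draw
    better = (current_score > best_score) if higher_is_better \
        else (current_score < best_score)
    return 1 if better else -1          # win against the past self, or loss
```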

In one case, a naive approach, as described above, is applied, but instead of a past version of the policy, a second, completely independent policy is learned and the two policies continually compete against each other [3].

While self-competition can be used to generate rewards based on the relative performance of the policy, this is not strictly necessary, as the absolute performance can be used to compute rewards just as well. One benefit of self-competition may simply be having a reward in a clearly defined range of \([-1, 1]\) or similar, as optimal choices of MCTS hyperparameters depend on this range [8]. However, there appear to be benefits beyond this, as the ranked reward approach has been shown to outperform agents trained using a standard reward in the range [0, 1] [60]. Whether this is the case for the naive self-competition approaches as well is unclear.

Table 17 Variations of tree policies based on UCT

5.3 Guided selection

The previous sections argue that MCTS functions as a policy improvement operator. We now explore the mechanisms of this policy improvement, i.e. the inner workings of neural MCTS. A search iteration in MCTS begins with the selection phase, in which actions are iteratively chosen starting from the root state until a leaf node is encountered. As described in Section 2.3, the choice of action is determined by a tree policy, which generally takes the form

$$\begin{aligned} a = \underset{a}{\text {argmax}} \ Q(s,a) + U(s,a) \end{aligned}$$
(4)

where Q(s, a) encourages exploitation of known high-value actions, while U(s, a) encourages exploration of the search tree. Variations exist both in the exact formulation of (4) and in how individual terms of the equation are determined, i.e. by learned policies and value functions or by conventional means. We investigate each aspect individually in the following.

Tree policy formulations

The tree policies encountered during the review are usually based on some version or extension of the UCT rule, but some variation in the exact formulation of the rule, especially in the exploration part, can be observed.

We provide an overview of variations of U(s, a) identified during our review in Table 17. While compiling the table, we modified the exact formulations reported in individual publications to arrive at a consistent notation. To this end, we assumed that all reported logarithms are natural logarithms and that \(N(s) = \sum _b N(s,b)\), i.e. N(s) refers to the total visit count over all children of state s, while N(s, a) refers to the visit count of action a in state s. The exploration constant, sometimes given as \(c_{uct}\), \(c_{puct}\) or similar, is simply referred to as c in this review. P(s, a) represents some prior probability of choosing action a in state s, whether it is given by a learned policy or obtained by other means.

Among the surveyed publications, a large proportion still use (some variant of) the UCT formula (see Table 17), but PUCT as it is used in AlphaZero [100] (PUCT variant 0 in Table 17) is the most frequently used selection mechanism. There are a number of less frequently used PUCT variations, mostly concerning the presence of logarithms, constant factors in the numerator and denominator, and the scope of the square root. These differences impact the overall magnitude of the exploration term as well as its decay as individual actions are visited more often (see Fig. 5). It is difficult to judge the impact of different formulations on the search, since authors usually do not directly compare them. In a rare exception, Xu and Lieberherr [138] try both the AlphaZero PUCT variant and PUCT variant 9 from Table 17 and report that the AlphaZero variant performs much better, although they do not quantify this difference.
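To make the differences concrete, the following minimal Python sketch contrasts a classic UCT-style exploration term with the AlphaZero-style PUCT term (PUCT variant 0) inside the tree policy of (4). The node attributes `visit_count`, `mean_value`, and `prior` are hypothetical, and the further variants in Table 17 differ in constants, logarithms, and the scope of the square root.

```python
import math

def uct_exploration(N_s: int, N_sa: int, c: float) -> float:
    """Classic UCT-style term: c * sqrt(ln N(s) / N(s,a))."""
    return c * math.sqrt(math.log(max(N_s, 1)) / max(N_sa, 1))

def puct_exploration(N_s: int, N_sa: int, P_sa: float, c: float) -> float:
    """AlphaZero-style PUCT term: c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return c * P_sa * math.sqrt(N_s) / (1 + N_sa)

def select_child(children, c: float, use_prior: bool = True):
    """Tree policy of (4): pick the child maximizing Q(s,a) + U(s,a)."""
    N_s = sum(child.visit_count for child in children)

    def score(child):
        if use_prior:
            u = puct_exploration(N_s, child.visit_count, child.prior, c)
        else:
            u = uct_exploration(N_s, child.visit_count, c)
        return child.mean_value + u

    return max(children, key=score)
```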

Fig. 5 Different UCT-style fractions with a fixed \(N(s) = 1000\). Note that the vertical axis is logarithmic and shows the value the expressions in the legend produce for different N(s, a)

One notable PUCT variant, variant 3, introduces a new constant \(\mu \) which determines the impact of the prior probabilities as \(P(s,a)^\mu \). This variant seems to have been independently suggested in [95, 104, 123].

Aside from UCT and PUCT variants, the MuZero [89] selection formula or a variant of it is used by three authors. We further find two unique modifications of typical selection formulae that we cannot assign to any of the other groups: \({UCT}_D\) and \({PUCT}_B\). The former will be discussed at a later point. \({PUCT}_B\) aims to exploit the nature of deterministic single-player settings, in which future trajectories are not influenced by the choices of another player. In such cases, rather than simply looking at average state values, it may be advantageous to keep track of the best values encountered during the search. Making decisions based on average values is problematic because most of the actions in a given state may be bad choices, while a single specific action may be a good choice. On average, the value of the node will then be low, even though a promising child exists. In a deterministic setting, the best path can be executed reliably and, accordingly, it makes sense to choose nodes based on their expected best value rather than the average one. Deng et al. [23] design a selection formula that makes use of this fact, which we refer to as \({PUCT}_B\) here. In \({PUCT}_B\), the best value of an action is simply scaled by a constant and then added to PUCT variant 0.
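As a rough illustration, and with the same hypothetical node attributes as above plus a `best_value` field, the \({PUCT}_B\) score can be sketched as PUCT variant 0 plus a constant-scaled best observed value; the exact formulation is given in [23].

```python
import math

def puct_b_score(child, N_s: int, c: float, c_best: float) -> float:
    """Sketch of PUCT_B: AlphaZero-style score plus a scaled best observed value.

    `child.best_value` is assumed to hold the best value found below this action,
    which a deterministic single-player setting can reproduce reliably."""
    puct = child.mean_value + c * child.prior * math.sqrt(N_s) / (1 + child.visit_count)
    return puct + c_best * child.best_value
```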

Fig. 6 Neural Selection. Each of the children of the current state s is considered and the one maximizing \(Q(s,a) + U(s,a)\) is chosen. Both Q(s, a) and U(s, a) may be influenced by neural guidance in some way

Neural guidance in the tree policy

Neural guidance may be used in both the exploitation and the exploration part of (4) (see Fig. 6). When used in the exploitation part, neural guidance typically serves to estimate Q(s, a). This does not change how the selection mechanism works, only how the corresponding value is determined. Since value estimation is part of the evaluation step, this kind of neural guidance will be explored in Section 5.5 and not discussed further at this point.

In the exploration part of (4), neural guidance is typically used to determine the prior probabilities P(s, a) in PUCT-style formulae. As shown in Fig. 7, about 62% of all reviewed articles report guiding the tree search in this way, with less than 30% reporting selection phases without neural guidance. The remaining articles do not report how the selection phase is performed at all. While the latter can probably be interpreted as selection without neural guidance, we try to refrain from interpretations as much as possible and hence give separate categories for standard selection and unreported selection.

Most approaches for neurally guided selection phases take the form described above, with the exception of a few special cases. Zombori et al. [157] argue that a learned policy network tends to make predictions with high confidence even if they are of low quality, which leads to a strong, unfounded bias in the search. It may be more desirable to have a policy which makes less confident predictions if the prediction quality is not sufficiently high. To achieve this, they employ maximum entropy reinforcement learning and use the resulting policy to compute prior probabilities for the selection phase.

In one exception, neural guidance occurs in a form other than providing prior probabilities, as can be seen in the \({UCT}_D\) formula in Table 17. It is named after its use of a dueling network, which produces action advantage estimates A(s, a) in addition to state values. In \({UCT}_D\), the action advantages are used in place of the prior probabilities P(s, a). Vaguely related to the reasoning of Zombori et al., the authors argue that a policy network trained with policy gradient methods tends to concentrate on the best action for a given state, while not assigning probabilities proportional to the expected usefulness of the other actions [125]. In other words, an overly low-entropy policy vector may bias the search to an undesirable degree. In contrast, the action advantages do not overly focus on the best action.

Fig. 7 Proportion of choices in the selection phase among the surveyed articles. Standard refers to some selection strategy that does not involve the use of learned functions

5.4 Guided expansion

Once a leaf node \(s_L\) is encountered during the selection phase in MCTS, the expansion step is performed to create child nodes of \(s_L\). Guidance by neural networks can be employed in this step as well to bias and hence speed up the search. To avoid confusion, we will first give some details on possible alternate ways to implement the expansion step and only then return to the topic of neural guidance.

As discussed in Section 2.3, nodes are typically expanded either one at a time whenever an expandable node \(s_E\) is encountered, or all children of a node are expanded simultaneously if a leaf node \(s_L\) is encountered.

Fig. 8 Neural Expansion. When a leaf node is encountered, possible actions in the leaf node's state are sampled by some mechanism involving the learned policy. For every sampled action, a new child is created

Clearly, implementing an MCTS approach requires deciding how many children are expanded at a given time. There is, however, an additional, related decision to be made: How many and which children are considered for expansion? In the naive case, the search is free to choose any action from the set of all possible actions \(A(s_L)\) in state \(s_L\). However, it is also possible to be more selective in the expansion step. To limit the growth of the tree, only a limited number of children may be considered for expansion, either randomly or according to some rule or heuristic. In other words, the search may be restricted to only choose actions from a set \(\tilde{A}(s_L) \subset A(s_L)\). This is especially relevant for continuous cases, where the number of potential children is infinite and necessarily has to be limited in some way. Once such a set \(\tilde{A}(s_L)\) has been defined, the corresponding child nodes may be expanded all at once when the leaf is encountered, or one by one, whenever the expandable node is encountered during the tree search.

Neural guidance during the expansion step is possible in both paradigms, i.e. when expanding on encountering a true leaf node and when expanding on encountering an expandable node. In the former case, neural guidance means using a learned policy to determine \(\tilde{A}(s_L)\), while in the latter case, neural guidance means choosing an action in \(A(s_E)\) to create a new child node. Theoretically, it is possible to combine both of these approaches by first determining and saving \(\tilde{A}(s_L)\) when a leaf is encountered for the first time, but not expanding all corresponding nodes at this point. The children can then be expanded one by one whenever the node is encountered again by choosing some action \(a \in \tilde{A}(s_E)\). However, we do not observe this combined approach in the collected literature.

We do observe some form of neurally guided expansion in a sizeable portion of publications (see Figs. 8 and 9) and categorize them in Table 18. In some additional examples, neurally guided expansion is used, but the specifics are not reported [25, 36, 90].

\(\tilde{A}(s_L)\) can be determined by randomly sampling actions from a learned policy, but it can also be determined by enumerating all actions in the policy distribution and choosing the top k ones [48, 91, 112]. In the approach of Thakkar et al. [112], the top k actions with a cumulative policy probability of 0.995, or at most 50 actions, are selected. In both types of neurally guided expansion, a learned value function can of course be used in place of a learned policy by converting it into a policy with a softmax operator, as is done in [119].
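As an illustration of the enumeration-based option, the following sketch restricts \(\tilde{A}(s_L)\) to the most probable actions under the learned policy, using a cumulative-probability cut-off and a hard cap in the spirit of Thakkar et al. [112]; the function and parameter names are hypothetical.

```python
import numpy as np

def restrict_actions(policy_probs: np.ndarray, cum_threshold: float = 0.995,
                     max_actions: int = 50) -> np.ndarray:
    """Return indices of the most probable actions whose cumulative probability
    reaches `cum_threshold`, capped at `max_actions` (sketch in the spirit of [112])."""
    order = np.argsort(policy_probs)[::-1]           # actions sorted by prior, descending
    cumulative = np.cumsum(policy_probs[order])
    k = int(np.searchsorted(cumulative, cum_threshold) + 1)
    k = min(k, max_actions, len(order))
    return order[:k]                                 # indices forming the restricted set
```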

Fig. 9 Proportion of choices in the expansion phase among the surveyed articles. Standard refers to some form of expansion that does not involve the use of neural networks

Fig. 10 Evaluation by learned policy roll-out: After arriving at a leaf node with state s according to the tree policy, the value of s needs to be determined. Here, the learned policy \(\pi _\theta \) is used to generate a roll-out by iteratively sampling actions until a terminal state \(s_T\) is reached. The reward of this terminal state serves as an estimate for the value of s

Table 18 Types of neurally guided expansion

While the exact impact of neurally guided expansion will vary from application to application, its general potential is demonstrated in [82], where the authors report a twentyfold reduction in computation time with neural expansion while achieving higher-quality solutions.

5.5 Guided evaluation

The evaluation step in MCTS serves to estimate the (state-)value of a leaf node encountered during the tree search. While it is often also called the roll-out step or the simulation step, its purpose is the value estimation of a leaf. Roll-outs or simulations are simply approaches to produce a value estimate. Here, we use the term evaluation, because not all evaluation approaches in the neural MCTS literature are based on roll-outs.

There are two obvious ways to use learned policies and value functions during the evaluation phase: a roll-out using the learned policy and a direct prediction by a learned value function. In the former, actions are iteratively sampled from the learned policy starting from the encountered leaf node until a terminal node is reached (see Fig. 10), while in the latter, the value of the leaf node is simply predicted by the learned value function without any roll-out (see Fig. 11).
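Schematically, and assuming hypothetical `policy` and `value_net` callables together with a stateless dynamics model offering `step` and `is_terminal` (cf. Section 5.7), the two options look as follows.

```python
import numpy as np

def evaluate_by_policy_rollout(state, policy, model, rng=None):
    """Roll out the learned policy from the leaf state until a terminal state is
    reached and use the observed return as the value estimate (cf. Fig. 10).

    Depending on the application, the accumulated return or only the terminal
    reward may serve as the estimate; here the rewards are accumulated."""
    if rng is None:
        rng = np.random.default_rng()
    value = 0.0
    while not model.is_terminal(state):
        probs = policy(state)                      # pi_theta(. | s), a probability vector
        action = rng.choice(len(probs), p=probs)   # sample an action from the policy
        state, reward = model.step(state, action)
        value += reward
    return value

def evaluate_by_value_net(state, value_net):
    """Predict the leaf value directly with the learned value function (cf. Fig. 11)."""
    return value_net(state)
```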

Most authors use either learned policy or value functions as described above, with 54 occurrences of learned value functions and 26 occurrences of learned policy functions (see Fig. 12). The remaining publications either do not use neural guidance for the evaluation phase or their approach is unclear.

Fig. 11 Evaluation by learned value function: After arriving at a leaf node with state s by following the tree policy, the value of s needs to be determined. Here, a learned value function \(V_\theta \) is used to estimate the value of s directly, without any need for a roll-out

Among the neural evaluation approaches, some authors employ different evaluation approaches depending on how far the training has progressed. Song et al. [104] combine both approaches by performing roll-outs according to a learned policy network in the early phases of training and using a learned value network for estimation in later stages. Zhang et al. [150] use an initial phase of random roll-outs to pre-train a policy network and employ the learned policy network for roll-outs in later stages.

In some cases, roll-outs are not performed by naively using a learned policy, but more complex roll-out procedures are still guided by learned functions. He et al. [40] use a value network to guide a problem-specific roll-out procedure, while Xing et al. [136] use a learned policy function to guide a beam search. Kumar et al. [58] combine a value estimate as predicted by a neural network with a domain-specific roll-out policy, motivated by the fact that their reward function is more fine-grained than those typically observed in board games. Finally, Lu et al. [64] use a value network in a symbolic regression task to estimate whether a leaf node merits further refinement by an optimization method, but their approach is highly problem-specific.

Fig. 12 Proportion of choices in the evaluation phase among the surveyed articles. Value function refers to an evaluation by a learned value function, policy to a roll-out using the learned policy, and standard to a random roll-out

Deng et al. [23] perform different evaluation approaches depending on the depth of the node to be evaluated. If the node is closer to the root of the tree, they use a neural network to estimate the value of the state, while they perform a random roll-out for nodes at deeper levels of the tree. Their experiments show that this hybrid approach can balance solution quality and computation time. Since neural inferences are associated with non-negligible computational cost, replacing them with random roll-outs can decrease the search time especially at deeper levels, where roll-outs are short.
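A hedged sketch of such a depth-dependent switch is given below; the threshold and interfaces are hypothetical, and the concrete criterion used in [23] may differ.

```python
def evaluate_leaf(state, depth, value_net, random_rollout, depth_threshold=20):
    """Use the (comparatively costly) value network near the root and cheap random
    roll-outs deeper in the tree, where roll-outs are short anyway."""
    if depth < depth_threshold:
        return value_net(state)       # one neural inference
    return random_rollout(state)      # no inference cost, short random roll-out
```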

Fig. 13 Parallel sets diagram of surveyed neural guidance configurations. Each vertical bar signifies an option in one of the MCTS phases: selection (left), expansion (middle), evaluation (right). In the expansion step, each option (standard and neural expansion) is displayed twice to allow for easier tracing of the visualized configuration

In the application described by He et al. [41], episodes can result in either a successful or a failed solution. Successful solutions can still differ in quality and are rewarded accordingly. They perform a roll-out by a learned policy, but if this results in a failed terminal state, they backtrack until they find a successful solution. They argue that this leads to better search efficiency because preceding trajectories are not repeated unnecessarily.

Design choices in other MCTS phases can also influence what is required from the evaluation phase. Kovari et al. [56] replace the exploitation term, i.e. the estimated value, of a UCT-style formula with the probability of taking an action as predicted by the policy network. Instead of a roll-out or direct value prediction, they hence simply predict this probability in the evaluation step.

5.6 Guidance in multiple phases

As discussed above, neural guidance can be employed in the selection, expansion, and evaluation phases of MCTS. Of course, it is not necessary to limit this guidance to one phase at a time, and different types of neural guidance can be combined in one approach.

To gain an overview of how different types of neural guidance are typically combined, we visualize their use in Fig. 13.

The most common approach is to guide the selection step, perform standard MCTS expansion, and then use a learned value function for evaluation. Many other combinations exist, but none of them are used as often. When standard selection is used, the relative incidence of neural expansion is higher than when using neural selection. This could, however, be explained by the fact that the respective authors simply wanted to highlight the effect of neural expansion, since it is often the main focus of their publications.

Infrequently occurring combinations may indicate a need for further research.

5.7 Use of dynamics models

Neural MCTS is a model-based reinforcement learning approach, i.e., it requires access to a dynamics model of the environment to perform planning. Typically, this dynamics model is given [99, 101] and can be readily used in the tree search. In contrast to a regular reinforcement learning environment as defined, e.g., in the OpenAI Gym standard [7], a dynamics model allows for the computation of the next state and reward given an arbitrary initial state and action to be executed. An environment, on the other hand, is in a specific state at any given time which can be influenced by actions, but does not allow for dynamics computations on arbitrary states. In other words, an environment is stateful, while a dynamics model is not.
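The distinction can be summarized by the two interfaces below; this is a schematic sketch, not the API of any particular library.

```python
from typing import Any, Protocol, Tuple

class Environment(Protocol):
    """Stateful: maintains a current state that is advanced by actions."""
    def reset(self) -> Any: ...
    def step(self, action: Any) -> Tuple[Any, float, bool]: ...  # observation, reward, done

class DynamicsModel(Protocol):
    """Stateless: computes successor state and reward for arbitrary state-action
    pairs, which is what the tree search needs to expand hypothetical nodes."""
    def step(self, state: Any, action: Any) -> Tuple[Any, float]: ...
    def is_terminal(self, state: Any) -> bool: ...
```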

The difficulty of developing such a dynamics model will differ from application to application. In any case, its development will require some additional effort. To circumvent this, it is also possible to learn the dynamics model and then use this learned model for planning in MCTS [89, 115]. While this adds additional complexity to the training process, it can be helpful in scenarios where an exact and efficient model of the environment cannot be easily obtained.

During our review we found that the vast majority of approaches utilize an existing dynamics model, but learned models also find some application in practice.

For instance, Chen et al. [14] investigate an autonomous driving task where a model is needed to predict the vehicle state. In this case, the vehicle state is an image, meaning that the model needs to produce an output image given an input image (corresponding to the initial state) and an action. While such a model is not trivial to implement manually, Chen et al. [14] are able to train a convolutional neural network to serve this purpose.

Similarly, Challita et al. [10] apply neural MCTS to enable dynamic spectrum sharing between LTE and NR systems and report that this requires a model of individual schedulers for LTE and NR, which is not trivial to design. Instead, they learn the model in an approach similar to the one proposed in MuZero [89]. That is, the dynamics are not computed on the raw observations, but on hidden representations, which are computed from the observations by a learned representation function. This approach is also taken by others [32, 96], but dynamics models which work directly on the observations can also be observed [21, 125].

In many cases, the dynamics model is not learned during neural MCTS training, but trained separately in advance and then simply used for inference during the tree search [9, 28, 39, 122].

As described above, one motivation to learn a dynamics model may be the difficulty of creating one manually. Another motivation is the speed with which a learned model can be evaluated [109].

In some cases, the state transitions can be modelled fairly easily, while the computation of the reward is time-consuming. Some authors do not train a full dynamics model, but a scoring model, which can be used to assess the quality of a given solution quickly. In contrast to a learned value function, which can be used to evaluate newly expanded nodes at arbitrary depths, a scoring model only assigns a score to full solutions, i.e. terminal nodes. The resulting scores can then be used as training targets for the value function [42, 106, 151].
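Schematically, the scoring model only replaces the expensive reward computation at terminal nodes, and its outputs then serve as regression targets for the value network; the names below are hypothetical.

```python
def make_value_targets(terminal_states, scoring_model):
    """Score complete solutions with the fast learned scoring model and use the
    resulting scores as training targets for the value function (sketch)."""
    return [(state, scoring_model(state)) for state in terminal_states]
```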

5.8 MCTS modifications

We now turn our attention to selected modifications of typical (neural) MCTS procedures as encountered during the review.

Average and best values

As briefly mentioned in Section 5.3, deterministic single-player settings pose different requirements than combinatorial games. Action selection based on the average value of a node will lead to sub-optimal results, because a strong child node can be surrounded by weak siblings. While the \({PUCT}_B\) mechanism described in Section 5.3 is one option to address this, other authors have identified this issue as well and proposed their own solutions.

In empirical experiments, Deng et al. [23] find that their final search results are usually worse than the best solutions encountered during roll-outs. They point out that the final action selection after tree search is performed based on the node visit counts N(s). To rectify the problem, they introduce an oversampling mechanism for good solutions. Whenever a solution found during a roll-out outperforms all previously found solutions, this solution is given preference in subsequent selection phases for a certain amount of time and is hence visited more often.

A simpler approach is taken by Peng et al. [77] and Xing et al. [137] to address the same problem. Here, the exploitation part of (4), i.e. the average value of the node, is simply replaced with the best observed value for the node. Fawzi et al. [26] follow a similar strategy.

Zhang et al. [154] simply keep track of the maximum reward encountered in the search and the action sequence that led to it, which is then returned after the search.

Value normalization

While combinatorial games lend themselves well to reward function formulations in the range \([-1, 1]\), in other applications, rewards are often less regular and sometimes completely unbounded. As mentioned earlier, the exploration constant c needs to be tuned for different reward ranges [8]. Further, even with a perfectly tuned c, rewards outside of ranges like \([-1, 1]\) or [0, 1] are typically not conducive to algorithm convergence [97]. A number of authors therefore suggest normalizing Q-values according to the minimum and maximum values observed in the tree search until the current point [78, 97, 137].
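A minimal sketch of this kind of normalization tracks the Q-value range observed so far in the current search and rescales values to [0, 1]; the exact schemes used in [78, 97, 137] may differ in detail.

```python
class MinMaxStats:
    """Track the Q-value range observed during the search and rescale to [0, 1]."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, q: float) -> None:
        self.minimum = min(self.minimum, q)
        self.maximum = max(self.maximum, q)

    def normalize(self, q: float) -> float:
        if self.maximum > self.minimum:                       # avoid division by zero
            return (q - self.minimum) / (self.maximum - self.minimum)
        return q                                              # range not yet known; leave unchanged
```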

The cost of neural inference

The main idea behind neural MCTS approaches is to increase the efficiency of the tree search with neural inferences. While an individual neural inference is not overly expensive, when inferences are performed in large numbers, the required computation time can add up significantly.

Deng et al. [23] vary the number of neural inferences by switching neural guidance off for some proportion of the search. They find that neural guidance generally helps the search, but that further neural guidance after a certain point only increases computational cost without providing additional benefits.

Designing mechanisms to limit the application of neural guidance to where it provides maximum benefits in a targeted way may be an interesting line of research.

Fig. 14 Distribution of reported MCTS hyper-parameters: number of MCTS simulations (top) and exploration constant c (bottom). Note that a logarithmic axis is used in both cases. Variation on the vertical axis is not meaningful, but contains random jitter for better visibility of individual data points

5.9 MCTS hyper-parameters

One last aspect in the design of neural MCTS approaches is the choice of appropriate hyper-parameters. While choosing hyper-parameters is highly problem-specific, it can nevertheless be useful to look at average hyper-parameter values to serve as a starting point and determine reasonable bounds for a problem-specific hyper-parameter optimization. To facilitate this, we summarize the values for the MCTS-specific hyper-parameters found during our review in Fig. 14. A large amount of variation in values exists for both the exploration constant c used in UCT-style selection formulae and the MCTS-budget \(n_{MCTS}\), i.e. the number of simulations or play-outs performed during MCTS for a given time-step.

It is known that the optimal choice of c depends on the scale of the encountered rewards [8]. While reward scales vary wildly and are not always reported among the literature we survey, the most commonly reported reward scale is in the interval \([-1, 1]\), for which six different values of c ranging from 0.5 to 5.0 are chosen. In other words, the scale of the rewards does not appear to be the only criterion on which authors choose hyper-parameter values.

Instead of having a fixed value, in some cases, the exploration constant c and the MCTS budget \(n_{MCTS}\) are changed dynamically depending on circumstances.

Sometimes the exploration constant c is decreased as the training progresses [16, 157], presumably because later training iterations profit more from perfecting the current policy instead of performing further exploration. Wang et al. [126] tune c dynamically based on the currently observed maximum Q-value to balance exploration and exploitation as Q-value estimates evolve and report a significant improvement in the observed results.

Some problem settings feature varying instance sizes, where a larger instance size is generally associated with higher difficulty. Zhong et al. [155] increase c as the problem size grows, presumably because more exploration is required to adequately cover the larger state space. For similar reasons, they and others [123, 136] further choose larger \(n_{MCTS}\) for larger instance sizes. Wang et al. [121] even explicitly parameterize \(n_{MCTS}\) by the problem size.

In many problems, the size of the remaining search space decreases with increasing depth of the tree. Hu et al. [45] argue that the search budget should depend on the depth of the node the search starts from and present a mechanism that decays \(n_{MCTS}\) with increasing tree depth.
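One possible way to express such a depth-dependent budget is sketched below; the exponential schedule and parameter values are hypothetical and not the mechanism published in [45].

```python
def mcts_budget(base_budget: int, depth: int, decay: float = 0.9,
                min_budget: int = 16) -> int:
    """Exponentially decay the number of MCTS simulations with the depth of the
    node the search starts from (hypothetical schedule)."""
    return max(min_budget, int(base_budget * decay ** depth))
```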

Fawzi et al. [26] report an increase in \(n_{MCTS}\) after a certain amount of training steps, presumably because later training iterations can profit more from a higher search budget and a larger proportion of the overall training time budget should hence be allocated to those later iterations.

Some authors reduce the number of MCTS simulations at test time. For instance, Chen et al. [14] reduce \(n_{MCTS}\) by a factor of ten at test time compared to the training phase. They further limit the search depth in both phases, but do so to a larger degree during test time.

In some cases, MCTS is only used to train a policy network, which is then applied without further tree search at test time [9, 10, 18, 20, 155]. This can be due to the specific requirements of the application, i.e., some applications require fast inferences at test time that render the application of tree search infeasible, but can still profit from MCTS at training time [10]. In some applications, applying MCTS at test time after having used it for training simply does not improve performance to a significant degree [20].

6 Discussion & conclusion

Focusing on uses of neural MCTS outside of games, we investigated the diversity of applications, their characteristics, and the design of the employed neural MCTS approaches by performing a systematic literature review.

With regard to research question 1 posed in the introduction, we find that neural MCTS is applied in a wide variety of domains to solve different problems. While most problems exhibit similar characteristics such as discrete time and actions, finite horizon, and deterministic transitions, many authors also demonstrate that neural MCTS can be applied to problems with differing properties.

The applications encountered during the review usually have slightly different requirements and properties than combinatorial games. This does affect the way solutions are designed, an aspect investigated as part of research question 2. The concept of self-play, for instance, is generally not applicable to single-player problems. In some cases, single-player problems can be modelled as multi-player problems, but this is the exception rather than the rule. It is possible to employ a mechanism called self-competition, which replicates the way self-play generates rewards in a single-player setting.

Many authors further point out that selecting actions based on average node values is not ideal in single-player deterministic environments. Different mechanisms, all based around tracking maximal node value, can be employed to adjust the typical MCTS mechanisms in this regard.

Compared to the neural MCTS ecosystem for games, as well as the traditional RL ecosystem, the neural MCTS landscape is almost completely devoid of standardized implementations and components. Instead, almost all implementations are entirely custom-made. In a few exceptions, domain-specific implementations are reused by others, but they can only be applied to a very narrow set of problems (see e.g. [145]). While this is understandable for more fundamental research, where implementations are inherently in flux, we believe that standardized components could significantly speed up progress on the applied research side.

One reason for the lack of standardized components may be that the design of neural MCTS methods varies substantially, beginning with their training approaches, to their use of self-play related mechanisms, forms of neural guidance, and other modifications to traditional MCTS setups. A widely applicable neural MCTS framework would have to be highly configurable to accommodate different disciplines and problem settings. While this is a difficult task, the traditional RL ecosystem demonstrates that standards [7] and publicly available implementations of algorithms [79] accelerate progress, and that flexible frameworks suitable for research [6] can be designed.

In response to research question 3, it can be concluded that the forms of neural guidance used in the game literature are often used in other applications as well. The most common type of guidance consists of neural selection and evaluation by a learned value function, just as in AlphaZero [100]. Other types and combinations of neural guidance can be found in the literature as well. Given the amount of variation in different neural MCTS systems, a central question for practitioners is how to set up their own systems depending on the characteristics of their applications. Ideally, we would be able to map observed problem characteristics to observed neural MCTS configurations to provide a guideline for others to use. While some problem characteristics, e.g. the discrete or continuous nature of the action space, can be determined fairly reliably when reviewing existing publications, others are not so easy to ascertain. The breadth and depth of the (full) tree, for instance, are rarely reported explicitly. In some cases, it may be possible to infer them, but in a review spanning a multitude of different disciplines, doing so reliably is difficult.

Some insights can nevertheless be derived from the collected literature. In games and beyond, it is clear that neural guidance can help to increase the efficiency of the tree search, but can also incur computational cost without much additional benefit in some situations. When and how to employ neural guidance should hence be carefully weighed. During the evaluation phase, for instance, the right choice of mechanism depends on the depth of the overall tree as well as the depth of the node to be evaluated. If a roll-out is short, it may be preferable to an estimate by a learned value function with its associated inference cost. At what exact depth one is preferable to the other will depend on the size of the employed neural network, which influences the inference cost, as well as the quality of its predictions. Competing with a high-quality estimate of a learned value function may require multiple roll-outs, since individual roll-outs are high-variance estimates of a node's value. A good initial estimate can help focus the search on promising regions of the solution space, while a bad estimate can lead the search astray.

In applications with large tree breadth, neural guidance may be especially helpful in the expansion phase, where it can be used to exclude certain paths in the search tree from consideration altogether. Of course, this comes at the risk of cutting off high-quality solutions. Here as well, the quality of neural network predictions determines whether such an approach is sensible. Especially in applications with very large tree breadth or even continuous domains, however, a search may not be feasible at all without limiting the solution space to some degree.

Clearly, many questions remain unanswered. Additionally, a purely backward-looking review tends to summarize what has been done in the past rather than what should have been done. What is presented in this review is therefore primarily a map of existing approaches and less so a collection of prescriptive knowledge. The results gathered in this review can, however, serve as a foundation for further experimental studies. It is clear that different applications can benefit from different variants of neural MCTS and that no single algorithmic formulation will be the best choice for all possible problems. Practitioners will hence continue to be faced with the task of designing a suitable algorithm for their specific problem setting. Our review can serve as a starting point for this, as it provides a summary of the large set of known possible design choices. Explicitly performing experiments for multiple applications with different properties, in which the factors identified in this review are systematically controlled, can serve to create a more robust understanding of the design of neural MCTS approaches. From such an understanding, prescriptive rules (of thumb) can be derived to aid practitioners in making appropriate design choices for applications with given properties.