1 Introduction

The combination of Monte Carlo Tree Search (MCTS) and deep learning led to the historic event of the computer program AlphaGo beating a human champion in the game of Go [99], a feat which had long been considered beyond the capabilities of computational approaches. Since then, such approaches, which we term neural MCTS in this review, have enjoyed enormous popularity. They have been applied to many other games and yield promising results in the field of general game playing [89, 100].

While the effectiveness of neural MCTS in game contexts has been clearly established, the transfer of such approaches to non-game-playing applications is still a fairly recent development and hence less well understood. This transfer to other applications may create significant value, as previously computationally intractable problems may become tractable, just as playing the game of Go became tractable with the introduction of AlphaGo. More generally, neural MCTS methods may be able to find higher quality solutions than previous approaches and occupy a unique niche by balancing the trade-off between computational cost and solution quality.

In principle, neural MCTS methods are applicable to any (discrete) problem that can be addressed by traditional model-free reinforcement learning methods. The promise of neural MCTS approaches is in spending additional computational budget to increase decision quality compared to model-free reinforcement learning. This additional computational budget is spent on a planning procedure guided by neural networks, therefore potentially combining the advantages of forward planning and generalising from past experience.

Since games have different characteristics than many other problems neural MCTS approaches can be applied to, a direct transfer of algorithms like AlphaZero without any modifications to problems other than games is often not possible. This leads researchers to tailor neural MCTS methods to their specific applications and hence creates a considerable amount of algorithmic variation in the field. An overarching understanding of why certain variants are especially suitable for certain problem settings has not been explicitly developed in the literature yet. Further, the existing algorithmic variation is not well documented, as different variants continue to be introduced in individual publications, but little attention has been devoted to creating a clear view of the bigger picture. The first step towards such a view of the bigger picture is a review of the current state of neural MCTS approaches and applications.

Some surveys and reviews of MCTS approaches have been published in the past. For example, Mandziuk [68] provides a survey of selected MCTS applications to non-game-playing problems; however, it examines only two applications, neither of which features any neural guidance.

A more extensive survey is provided in [161], which also features a brief section on the combination of MCTS and deep learning. However, its focus is on games rather than other applications.

We are not aware of any reasonably extensive review of applications of neural MCTS methods in non-game-playing domains. We believe that such a review can shed light on the extent to which such methods can be transferred to more practical problems and how neural MCTS methods can be designed to cope with the requirements of different use cases. To gain an understanding of the kinds of problems for which neural MCTS is suitable, we first review problem settings of already existing neural MCTS applications. In a second step, we review the existing variation in the design of neural MCTS algorithms. Our intention behind this review is two-fold: First, we aim to give practitioners an overview to evaluate whether their applications are suitable for the use of neural MCTS and what algorithmic design choices are available to them. Second, we hope to contribute towards a more thorough understanding of the effect of different design choices and towards a more principled guide to create neural MCTS algorithms for specific problem settings.

The following research questions guide our review:

  1. In which disciplines, domains, and application areas is neural MCTS used? What are the commonalities and differences in the observed applications?

  2. What differences in the design of neural MCTS methods can be observed compared to applications in games?

  3. Where and how can neural guidance be used during the tree search?

To address these questions, we perform a systematic literature review and analyze the resulting literature. Starting with a keyword search in multiple databases, we filter articles for relevance, perform additional forward and backward searches, and extract a set of predefined data items from each article included in the review. The detailed review process is described in Section 3. Before that, we provide a brief introduction to the concepts of reinforcement learning, MCTS, and AlphaZero in the next section. The remaining sections begin with a focus on the problems described in the surveyed publications in Section 4, and continue with an examination of the employed methods in Section 5. We end our review with a brief discussion in Section 6.

2 Reinforcement learning & neural MCTS

2.1 Reinforcement learning

Reinforcement Learning (RL) is a paradigm of machine learning, in which agents learn from experience collected from an environment. To do so, an agent observes the state s of the environment and executes an action a based on this state. Upon acting, agents receive a reward r and observe a new state \(s'\). A problem which follows this kind of formulation is called a Markov decision process (MDP) if the new state \(s'\) only depends on the state s immediately preceding it and the action a of the agent. The agent’s goal in such an MDP is to maximize the return, i.e. the expected long-term cumulative reward, by learning an appropriate policy \(\pi \), i.e. a behavioural strategy that prescribes an action, or a probability distribution over actions, for a given state [108].

Such a policy can be learned directly from experience, e.g. through policy gradient methods, or it can be derived from a learned action-value function. An action-value function \(Q^\pi (s,a)\) estimates the value, i.e. the expected return, of taking action a in state s. From such a learned action-value function, a deterministic policy can be derived by greedily choosing the best action, while a stochastic policy can be derived by sampling actions proportionally to their value. In addition to the policy- and value-based approaches described so far, hybrid approaches which synergistically learn both policy and value functions are often employed as well. Such methods are called actor-critic approaches and often learn the state-value function \(V^\pi (s)\) instead of the action-value function \(Q^\pi (s,a)\). The former computes the expected return of state s when following policy \(\pi \), while the latter computes the expected return of state s when first executing action a and following \(\pi \) in subsequent steps [108]. The superscript \(\pi \) is often omitted for more concise notation.
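
To make the distinction concrete, the following minimal Python sketch derives both kinds of policy from a learned action-value function. The function names and the non-negativity shift in the sampling variant are illustrative assumptions, not part of any specific method discussed here.

```python
import numpy as np

def greedy_policy(q_values):
    # deterministic policy: choose the action with the highest estimated value
    return int(np.argmax(q_values))

def proportional_policy(q_values, rng=None):
    # stochastic policy: sample actions with probability proportional to their value,
    # after shifting the values so that they are non-negative (an illustrative choice)
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    q = q - q.min()                     # shift so the smallest value becomes 0
    probs = q / q.sum() if q.sum() > 0 else np.full(len(q), 1.0 / len(q))
    return int(rng.choice(len(q), p=probs))

# Example: Q(s, .) = [0.1, 0.5, 0.4] -> greedy always picks action 1,
# proportional sampling favours action 1 but still explores actions 0 and 2.
```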

In contrast to model-free approaches, in model-based RL, a model of the environment is used for planning. A model simulates the dynamics of the environment either exactly or approximately. Planning simply refers to the simulation of experience using a model and planning approaches can be categorized into background planning and decision-time planning. In the former, the training data consisting of real experiences collected from the environment is augmented with imagined experience generated from a model. In the latter, the action selection at a given time-step is dependent on planning (ahead) using a model, i.e. the consequences of different choices of actions are imagined to improve the policy for the current state [73]. MCTS, further explained in the following, can be considered a form of decision-time planning.

2.2 The connection between RL and MCTS

MCTS arose as a heuristic search method to play combinatorial games by performing a type of tree search based on random sampling. In such games, a player has to decide which action to perform in a given state to maximize an outcome z at the terminal state of the game. While MCTS has not been traditionally thought of as a type of reinforcement learning, the scenario described here bears strong similarities to the formulation of reinforcement learning problems given earlier and some authors have explored this connection in detail [116]. To avoid ambiguity, we will not use the term reinforcement learning to refer to MCTS in this article. Similarly to RL, MCTS also produces a policy \(\pi _{MCTS}\) and a value estimate \(v_{MCTS}\). In MCTS, the policy is produced for a given state s by a multi-step look-ahead search, i.e. by considering future scenarios and determining which sequence of actions will lead to favourable outcomes starting from s. This policy is produced anew for every encountered state, i.e. the determined policy does not generalize to states other than the one currently encountered. In contrast, traditional RL produces policies by learning from past experience that aim to generalize to unseen situations. At decision-time, no forward search is performed and an action is simply chosen based on the policy learned from past experience. In a sense, MCTS looks into the future, while traditional RL looks back to the past to determine actions. As a consequence, RL requires computationally expensive upfront training but incurs negligible computational cost at decision time, while MCTS requires no training, but performs computationally expensive planning at decision time.

2.3 MCTS

The general idea of MCTS is to iteratively build up a search tree of the solution space by balancing the exploration of infrequently visited tree branches with the exploitation of known, promising tree branches. This is accomplished by the repeated execution of four different phases: selection, expansion, evaluation, and back-propagation. In the selection phase, starting from the root node, actions are chosen until a leaf node \(s_L\) is encountered. New children are then added to this leaf node in the expansion phase and their value is estimated in the evaluation phase. Finally, the values of the newly added nodes are back-propagated up the tree to update the values of nodes along the path to \(s_L\). In the following, we describe each of these phases in more detail.

Selection

In the selection phase, starting from the root node, an action is chosen according to some mechanism. This leads to a new state, in which the selection mechanism is applied again. The process is repeated until a leaf node is encountered.

The mechanism of action selection is referred to as the tree policy. While different mechanisms exist, the UCB1 [4] formula is a popular choice. When MCTS is used with the UCB1 formula, the resulting algorithm is called Upper Confidence Bound for Trees (UCT) [53]. In UCT, the action selection is defined as follows:

$$\begin{aligned} a = \underset{a}{\textrm{argmax}} \frac{W(s,a)}{N(s,a)} + c \ \sqrt{\frac{\ln \ N(s)}{N(s,a)}} \end{aligned}$$
(1)

where W(s, a) represents the number of wins encountered in the search up to this point when choosing action a in state s, N(s, a) the number of times a has been selected in s, and N(s) the number of times s has been visited. The left part of the sum encourages exploitation of actions known to lead to favourable results, where the fraction \(\frac{W(s,a)}{N(s,a)}\) can be seen as an approximation of Q(s, a). The right part of the sum encourages exploration by giving a higher weight to actions that have been visited less often compared to the total visit count of the state. Exploration and exploitation are balanced by the exploration constant c. For game outcomes \(z \in [0,1]\), the optimal choice of c is \(c = \frac{1}{\sqrt{2}}\) [54], but for rewards outside this range, c may have to be adjusted [8].
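
As an illustration, a minimal Python sketch of the UCT score in (1) could look as follows; the function name and the convention of returning infinity for unvisited actions are illustrative assumptions.

```python
import math

def uct_score(W_sa, N_sa, N_s, c=math.sqrt(0.5)):
    """UCB1/UCT score for choosing action a in state s, cf. (1).

    W_sa: cumulative outcome (e.g. number of wins) for action a in state s
    N_sa: visit count N(s, a)
    N_s:  visit count N(s)
    c:    exploration constant (1/sqrt(2) is the classic choice for outcomes in [0, 1])
    """
    if N_sa == 0:
        return float("inf")  # try unvisited actions first
    exploitation = W_sa / N_sa                        # approximates Q(s, a)
    exploration = c * math.sqrt(math.log(N_s) / N_sa)
    return exploitation + exploration
```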

Expansion

After repeated application of the selection step, the search may arrive at a node with unexpanded potential children. Once this happens, one or more children of the node will be expanded. There are some possible variations in this phase. In some cases, all possible children are expanded when a leaf node (a node with no children) is encountered. In other cases, a single child is expanded when an expandable node (a node with some as of yet unexpanded children) is encountered. Expanding all children right away may lead to undesirable tree growth depending on the application.

In some literature, expandable nodes are also called leaf nodes. For clarity, we will only use the term leaf node to refer to true leaf nodes without any children in this article. Note that a leaf node is not the same as a terminal node, with the former merely being the current end of a tree branch, while the latter is a node that represents an end state of the game (see Fig. 1).

Evaluation

Once a node has been expanded, it is evaluated to initialize W(s, a) and N(s, a). This evaluation is sometimes also called simulation, roll-out, or play-out and consists of playing the game starting from the newly expanded node until a terminal state is encountered. The outcome z at the terminal state is the result of the evaluation. The game is played according to a default policy, which determines the sequence of actions between the newly expanded node and the terminal one. In the simplest case, the default policy samples actions uniformly randomly [54].

Instead of evaluating newly expanded nodes, it is also possible to only evaluate leaf nodes and leave the evaluation of the newly expanded nodes for a later point in the search, when they are again encountered as leaf nodes themselves.

Back-propagation

The outcome z from the evaluation phase is propagated up the tree to update W(s, a) among the preceding nodes. The visit counts of all selected nodes are incremented as well.

Once the back-propagation phase is finished, the process starts anew from the selection phase until a predefined simulation budget is reached.
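
The following sketch ties the four phases together in plain Python. It is a minimal, non-neural MCTS with uniformly random roll-outs; the `env` interface (`legal_actions`, `step`, `is_terminal`, `outcome`) and all function names are assumptions made for illustration, and the variant shown expands all children of a leaf at once.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []   # expanded children
        self.W = 0.0         # cumulative outcome backed up through this node
        self.N = 0           # visit count

def uct_value(child, c):
    if child.N == 0:
        return float("inf")  # unvisited children are tried first
    return child.W / child.N + c * math.sqrt(math.log(child.parent.N) / child.N)

def mcts(root_state, env, n_iterations=1000, c=math.sqrt(0.5)):
    """Plain (non-neural) MCTS with uniformly random roll-outs (a sketch)."""
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1) selection: descend with the tree policy until a leaf node is reached
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct_value(ch, c))
        # 2) expansion: add all children of the leaf (one of several possible variants)
        if not env.is_terminal(node.state):
            for a in env.legal_actions(node.state):
                node.children.append(Node(env.step(node.state, a), parent=node, action=a))
            node = random.choice(node.children)
        # 3) evaluation: roll out with the random default policy until a terminal state
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        z = env.outcome(state)
        # 4) back-propagation: update statistics along the selected path up to the root
        while node is not None:
            node.N += 1
            node.W += z
            node = node.parent
    # after the search budget is exhausted, act greedily w.r.t. root visit counts
    return max(root.children, key=lambda ch: ch.N).action
```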

Fig. 1
figure 1

Search tree as seen at a given time during the search. The current node is indicated with a thick border, as are the edges that were traversed in the current iteration of the search. \(s_1\) is currently a leaf node and its potential children \(s_3\) and \(s_4\) are considered for expansion, as their dotted edges signify. The terminal nodes \(s_7\), \(s_8\), \(s_9\), and \(s_{10}\) are represented by rectangular nodes

2.4 AlphaZero

The program known as AlphaGo drew attention for being the first computer program to beat a professional human player in a full-size game of Go [99] by combining deep learning and MCTS. While AlphaGo relied on supervised pre-training on human expert moves prior to reinforcement learning, its successor, AlphaGo Zero, was only trained using reinforcement learning by self-play. It further simplified the training by reducing the number of employed neural networks. AlphaGo Zero was still developed specifically for the board game Go and incorporated some game-specific mechanisms. In contrast, the next iteration of the AlphaGo family, AlphaZero, is more generic and can be applied to a variety of board games.

The algorithms introduced in this subsection are all examples of neural MCTS, i.e. MCTS guided by neural networks. While we focus on the AlphaZero family here due to its popularity, similar ideas were independently proposed under the name of Expert Iteration [2]. In the following, we provide more details on AlphaZero as one representative of neural MCTS methods utilized for games.

Like regular MCTS, AlphaZero follows the four phases of selection, expansion, evaluation, and back-propagation. Some of the phases are assisted by a neural network \(f_\theta \), which, given a state, produces a policy vector \(\textbf{p}\), i.e. a probability distribution over all actions, and an estimate v of the state value.

The selection phase in AlphaZero uses a variant of the Predictor + UCT (PUCT) formula [83]:

$$\begin{aligned} a = \underset{a}{\text {argmax}}~Q(s,a) + c \ P(s,a) \ \frac{\sqrt{N(s)}}{1+N(s,a)} \end{aligned}$$
(2)

where P(s, a) denotes a prior probability of choosing action a in state s given by \(f_\theta \) [100].

Once the selection phase reaches a leaf node \(s_L\), it is evaluated by the neural network \((\textbf{p}, v) = f_\theta (s_L)\). The leaf node is then fully expanded and its children initialized with \(N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a\). In the back-propagation step, the statistics of each node including and preceding \(s_L\) are then updated as: \(N(s_t, a_t) = N(s_t, a_t) + 1, W(s_t, a_t) = W(s_t, a_t) + v, Q(s_t, a_t) = \frac{W(s_t, a_t)}{N(s_t, a_t)}\) [100].
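
A minimal sketch of one such search iteration, combining PUCT selection (2) with the expansion and back-propagation updates just described, might look as follows. The `env` interface and `f_theta(state) -> (p, v)`, with `p` a mapping from actions to prior probabilities, are illustrative assumptions; sign handling for two-player self-play and further AlphaZero details such as root exploration noise are omitted.

```python
import math

class AZNode:
    def __init__(self, state, prior=0.0):
        self.state = state
        self.P = prior        # prior probability P(s, a) from the policy head
        self.N = 0            # visit count N(s, a) of the edge leading here
        self.W = 0.0          # cumulative value W(s, a)
        self.Q = 0.0          # mean value Q(s, a)
        self.children = {}    # action -> AZNode

def puct_select(node, c=1.0):
    # Eq. (2): argmax_a Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
    N_s = sum(ch.N for ch in node.children.values())
    return max(node.children.items(),
               key=lambda kv: kv[1].Q + c * kv[1].P * math.sqrt(N_s) / (1 + kv[1].N))

def run_simulation(root, env, f_theta, c=1.0):
    """One AlphaZero-style search iteration (a sketch, single-player sign convention)."""
    path, node = [root], root
    # selection: descend with PUCT until a leaf node is reached
    while node.children:
        action, node = puct_select(node, c)
        path.append(node)
    # evaluation + expansion: the network evaluates the leaf and supplies priors
    if env.is_terminal(node.state):
        v = env.outcome(node.state)
    else:
        p, v = f_theta(node.state)
        for a in env.legal_actions(node.state):
            node.children[a] = AZNode(env.step(node.state, a), prior=p[a])
    # back-propagation: N += 1, W += v, Q = W / N along the selected path
    for n in path:
        n.N += 1
        n.W += v
        n.Q = n.W / n.N
```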

Note that the expansion and evaluation phases are interwoven here to some degree and do not strictly follow the order of the phases in standard MCTS. In this review, we are generally not overly concerned with the MCTS phases as a strictly ordered set of algorithmic steps, but more with the function each phase fulfills in the tree search.

Once the search budget is exhausted, an improved policy \(\pi _{MCTS}\) is derived from the visit counts N(s, a) in the tree and a corresponding value estimate \(v_{MCTS}\) is extracted. The produced policy \(\pi _{MCTS}\) and value estimate \(v_{MCTS}\) are then used as training targets to further improve \(f_\theta \). Since AlphaZero plays two-player games, some mechanism is required to determine the actions of the second player. In what is called self-play, the actions for the second player are chosen by (some version of) the same policy currently being trained for the first player [100].
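
A small sketch of how \(\pi _{MCTS}\) can be derived from the root visit counts is given below; the temperature exponent is the mechanism used in the AlphaZero family, while the function name is illustrative.

```python
import numpy as np

def mcts_policy_from_visits(visit_counts, temperature=1.0):
    """Derive pi_MCTS from the root visit counts N(s, a).

    temperature = 1 gives probabilities proportional to visit counts;
    temperature -> 0 approaches greedy selection of the most visited action.
    """
    n = np.asarray(visit_counts, dtype=float)
    if temperature < 1e-8:
        pi = np.zeros_like(n)
        pi[np.argmax(n)] = 1.0
        return pi
    n = n ** (1.0 / temperature)
    return n / n.sum()

# Example: visit counts [10, 75, 15] yield pi_MCTS = [0.10, 0.75, 0.15] at temperature 1.
```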

As a model-based RL algorithm, AlphaZero needs a model of the environment to perform the tree search. This model is simply assumed to be given, although further extensions such as MuZero [89] demonstrate that such a model can be learned from collected experience during the search.

To recapitulate, the neural guidance in AlphaZero consists of evaluating nodes by using \(f_\theta \) to compute v and \(\textbf{p}\), which are then used in the selection phase. The way neural guidance is used in AlphaZero is not the only possible form of neural guidance. Other possibilities to guide the search exist, as will become apparent in Section 5.

3 Research methodology

We perform a systematic literature review, i.e. our review follows a structured, explicit, and reproducible method to identify and evaluate a body of literature relevant to our research questions [117]. Our approach is sequential, meaning that we follow a series of pre-defined steps in a given sequence consisting of a keyword search in multiple databases, a screening process to filter for relevant articles, a forward and backward search, data extraction from all included articles, followed by analysis and synthesis of the results.

While we aim to take a neutral position and hence do not want to limit the collected literature on arbitrary grounds, a comprehensive literature search attempting to capture all the relevant literature is infeasible due to the time requirements it would incur. Instead, we aim to balance feasibility and coverage by collecting a representative sample of the existing literature, limiting ourselves to a keyword search with a defined set of keywords in a limited number of databases. Both the set of keywords as well as the set of databases could be enlarged to arrive at more comprehensive results.

Fig. 2
figure 2

The literature search and screening process starting from a set of keywords to the final set of publications to be included in the review

Table 1 Information extracted from each article after screening

3.1 Search query & databases

To find relevant publications, we derive three types of keywords:

  1. Based on neural MCTS being a combination of MCTS and neural networks or traditional reinforcement learning:

     • “reinforcement learning” AND “monte carlo tree search”

     • “neural monte carlo tree search”

     • “neural MCTS”

  2. Based on MCTS providing the ability to perform decision-time planning in a model-based reinforcement learning setting:

     • “decision-time planning” AND “reinforcement learning”

  3. Based on the names given to algorithms in the AlphaGo family:

     • AlphaGo

     • AlphaZero

     • MuZero

Each of the partial search strings expressed after a bullet point above is connected with an OR operator to arrive at one overall search query.
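
Assembled in this way, the overall query takes roughly the following form (the exact operator syntax differs between databases):

```
("reinforcement learning" AND "monte carlo tree search")
OR "neural monte carlo tree search" OR "neural MCTS"
OR ("decision-time planning" AND "reinforcement learning")
OR AlphaGo OR AlphaZero OR MuZero
```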

We use this query to search for publications in the databases Web of Science, IEEE Xplore, Scopus, ScienceDirect, and PubMed. The search query is applied to the abstract, title, and keywords.

3.2 Eligibility criteria & screening

To be included in the review, a given publication must fulfill a predefined set of eligibility criteria:

  1. Must feature an application of MCTS guided by a neural network. This excludes publications which are purely reviews or surveys. By guidance, we mean that a learned policy or value function is used in at least one of the phases of the tree search. These functions can of course be learned by other means than neural networks. We explicitly only consider neural network based approaches here because preliminary searches showed that attempting to include other methods leads to many more irrelevant search results while providing relatively little additional value.

  2. The publication must contain at least some amount of validation of the presented approach. Purely conceptual articles are not considered.

  3. The problem to which neural MCTS is applied must not be a game. While many impressive results have been achieved using neural MCTS in game playing, we are interested in determining whether such approaches transfer to other applications as well. We do consider applications that are not typically considered a game, but have been modelled as a game to facilitate the use of neural MCTS.

  4. Publication language must be English.

For each of the publications retrieved during the keyword search described in the previous section, we assess its eligibility according to the above criteria. After the removal of duplicates, the screening process is conducted in two phases. In the first phase, we only examine the abstract of each article and discard it if it is clear that at least one of the above criteria is not fulfilled. If there is any ambiguity, we reexamine the article in a second phase, where we repeat the process using the full text of the article. Any article that is not discarded in this second phase will be included in our review.

3.3 Forward and backward search

After the abstract and full-text screening described above, we perform a forward and backward search [130] based on the set of publications which has passed the screening process. That is, for every article, we check its references for further relevant publications and also look for publications that in turn reference the articles which passed our screening.

In this step, we notice that among the additionally identified literature, many simply cite an approach already included in our review for a similar application without performing any modifications or providing additional details. We do not include such publications in our review since they do not provide any benefit in addressing our research questions. The eligibility criteria described above apply to the results from forward and backward search as well.

The full search process from keyword search to our final set of publications is visualized in Fig. 2.

3.4 Data extraction

For every article resulting from the search process described above, we extract information along a set of predefined categories. These are given in Table 1, where problem refers to a short description of the examined problem, time is either continuous or discrete, horizon either finite or infinite, transitions either deterministic or stochastic.

Since it can be difficult to extract information from works originating from different disciplines with differing terminology and varying descriptions of details, we generally err on the side of providing incomplete rather than wrong information.

Some of the collected information did not lead to notable insights and will hence not be discussed in this review. This includes the activation functions, the author affiliations, and comparisons of the neural MCTS approach with model-free RL or MCTS without neural guidance. In such comparisons, neural MCTS tends to find solutions with superior quality, but this could simply be due to positive-results publication bias.

Table 2 Applications in chemistry
Table 3 Applications in material science

4 Neural MCTS applications

4.1 Application fields

To determine the applicability of neural MCTS outside of game-playing, we survey the areas of application to which neural MCTS has been transferred. The algorithmic details of individual approaches are analyzed in Section 5. Here, we simply outline where such approaches are applied. We find applications in a wide variety of domains including chemistry, medicine, production, electrical engineering, and computer science. In the following, we assign each publication to a specific application area. Note that this merely serves the purpose of creating an overview of the research landscape. Many articles could be assigned to more than one category, and the choice of categories itself could have been made in many different ways.

Chemistry

In the chemical literature in particular, neural MCTS has received considerable attention. It has been used to perform synthesis planning, de novo molecular design, protein folding, and more (see Table 2). In many cases, states are represented by a simplified molecular-input line-entry system (SMILES) string, a notation allowing for the representation of molecular structure [132]. Such a string can then be iteratively constructed during the MCTS process until a viable molecule is found or the attempt is discarded.

Molecular applications appear to be a comparatively mature branch of neural MCTS research, as evidenced by the fact that authors are building on each other's work and by the existence of standardized, commonly used implementations such as ChemTS [145]. This is an exception rather than the norm, as most literature is more disjointed and most other works either do not specify any implementation or use a custom one (see Appendix A for a detailed list of observed implementations).

Material science

Closely related is the domain of material science (see Table 3). In some cases, the SMILES representation is used here as well, such as in the design of metal-organic frameworks [110, 135], where metal ions and organic ligands are combined to create structures of various shapes.

In other cases, neural MCTS is used to optimize the thickness of alternating layers of two materials in a multilayer structure such that certain desired properties are achieved [24] and to generate models which describe the mechanical behaviour of materials in various circumstances [122].

Electronics design

In the design of electronic circuits, neural MCTS is applied to solve routing problems in multiple cases [15, 41, 80] (see Table 4), where it can outperform e.g. traditional A* based approaches [41]. Thacker et al. [111] further explore performing redundancy analysis in memory chips using neural MCTS.

Table 4 Applications in electronic design
Table 5 Applications in energy systems

Energy systems

Neural MCTS finds many applications in the operation of energy systems, especially in innovative grid concepts which aim to enable a more sustainable energy supply (see Table 5). This includes optimizing the operation of residential microgrids in an online fashion [96] as well as non-intrusive load monitoring and identification [50].

Table 6 Applications in production systems
Table 7 Applications in combinatorial optimization problems

Some attention is also directed towards managing a grid consisting of renewable energy sources and battery systems, which absorb the former’s fluctuations in power output. For instance, Al-Saffar et al. [1] devise a system to coordinate voltage regulation in a distributed energy network with battery systems at multiple locations, while [133] use neural MCTS to address predictive maintenance problems in such systems.

Production

Applications in production systems mainly concern themselves with various kinds of scheduling approaches. Here, the processing sequence of jobs or operations on different machines is to be determined to e.g. minimize the total time until all jobs have finished processing (see Table 6). Traditional RL approaches are also increasingly being investigated for these types of problems [34, 51, 86, 149]. A closer examination of the advantages and disadvantages of traditional RL and neural MCTS methods for scheduling approaches may be an interesting line of future research.

Further applications in production include line buffer planning in car manufacturing [37] as well as assembly planning in collaborative human-robot scenarios [147].

Combinatorial optimization

While the scheduling problems described above are problems from the field of combinatorial optimization, the authors approach them from a production perspective and pay close attention to the details of their individual use cases. A second group of combinatorial optimization applications can be found in Table 7. Here, the problems are more abstract and investigated through a computer science lens.

Table 8 Applications in cloud and edge computing
Table 9 Applications on graphs

Combinatorial optimization problems share many similarities with combinatorial board games. In a reinforcement learning context, they are typically solved constructively by building a solution iteratively from scratch, or they are solved by improvement, i.e. by iteratively improving some existing solution. In both cases, the problem features inherently discrete time steps and an inherently discrete action space. Differences can be observed in that there is no obvious notion of winning or losing, but rather a sense of relative performance. In addition, combinatorial games feature a fixed board and a fixed set of game pieces. In a traveling salesman problem, for example, the equivalent of a board may be considered to be a graph with weighted edges connecting different cities. Such a graph, however, will vary, with each problem instance consisting of different cities to be traveled through. In machine scheduling problems, operations may be considered the game pieces. Depending on the exact problem formulation, each operation needs to be processed on a specific machine for a specific duration. An operation could therefore be described as a game piece which can be freely parameterized by properties such as the duration, contrary to pieces in typical games.

Cloud & edge computing

As before in the production domain, a primary application in cloud and edge computing concerns scheduling problems (see Table 8). Again, the scheduling problems are combinatorial optimization problems, but the authors’ interests arise from the domain of cloud computing itself and the presented problems are less abstract.

Table 10 Applications in networking and communications

Graph navigation

Navigating a graph from a given node to a target node is a relevant task in many settings, but is gaining attention, particularly in knowledge graph research. Here, a common task is knowledge graph completion, which involves the prediction of missing relations between individual entities [128]. Graph navigation is an important sub-task of knowledge graph completion [95], for which neural MCTS has been investigated (see Table 9) and been shown to outperform existing baselines [95].

Networking & communications

Applications in networking and communications (see Table 10) range from network function virtualization [30, 57], to network topology optimization [69, 120, 144, 146, 159, 160], to spectrum sharing in mobile networks with multiple radio access technologies [10, 142].

One notable example here is the use of neural MCTS for intrusion defense in software-defined networking scenarios [32]. Here, the defense problem is actually modelled as a two-player game, which is an exception among mostly single player scenarios within the surveyed literature.

Table 11 Applications in autonomous driving, as well as path and motion planning
Table 12 Applications in Natural Language Processing

Autonomous driving & motion planning

Autonomous driving applications make up a comparatively large group (Table 11), including general motion planning tasks [39, 61, 76, 131], motion planning tasks in autonomous parking scenarios [104, 150], and motion planning tasks in multi-agent settings [82, 103]. More specialised tasks such as lane keeping [56], overtaking [70], and higher-level decision making during autonomous driving [44] are considered as well.

Such problems are often fundamentally different from combinatorial games. For instance, Weingertner et al. [131] consider a motion planning problem, in which the acceleration of a vehicle is controlled along a predetermined path. In its natural formulation, such a problem requires selecting continuous actions in continuous time. To apply neural MCTS, both time and action space are discretized. The resulting solution demonstrates good performance and outperforms A* search, pure deep learning approaches, and model predictive control.

Natural language processing

In Table 12, several applications of conversational agents are shown, in which agents assist users in completing tasks [125], negotiate with users to divide a given set of resources [49], and try to convince users of a certain view by framing messages in different ways [9]. While humans are difficult to simulate as conversational partners explicitly, models that approximate narrow conversational behaviour of humans can be trained on historical data and then utilized as part of the tree search [9, 125].

Natural language processing is itself a diverse field, in which topics such as sentiment analysis [20] and named entity recognition [59] are being addressed with neural MCTS.

Machine learning

MCTS guided by machine learning models can in turn be used in certain machine learning tasks (see Table 13). For instance, [47, 129] apply neural MCTS to reduce the size of neural networks, by network distillation in the former case and convolutional neural network (CNN) filter pruning in the latter.

Lu et al. [64] further approach the task of symbolic regression with neural MCTS. Here, instead of solving regression tasks by adjusting the coefficients of, e.g., a linear or polynomial function, the terms of a function themselves (e.g. sinusoids, square operations, constants) are determined and connected through mathematical operators such as addition and division. In MCTS, the full expression of a function can be built up step by step.

Computer science

Computer Science offers a wide range of opportunities for the application of neural MCTS (see Table 14), many of which are presented in separate sections. Others do not warrant their own section due to the small number of publications in their specific niche, but are nevertheless interesting. The number of publications this applies to demonstrates the wide applicability of neural MCTS.

One notable example is AlphaTensor [26], where neural MCTS is used to find efficient algorithms for matrix multiplication. Others include the optimization of database queries [151], the recovery of sparse signals [18, 155], and various applications in quantum computing [16, 22, 102].

Table 13 Applications in machine learning
Table 14 Applications in computer science
Table 15 Applications in various other fields

Finally, Table 15 shows applications in various other fields that do not fit into any of the previous categories. These feature a diverse set of problems including the optimization of user interfaces [98, 113], control of a pneumatic actuator [55], as well as design tasks for trusses [80] and fluid structures [3]. As in many examples here, similar design tasks are also being approached with traditional RL [29]. In future studies, direct comparisons of traditional RL and neural MCTS methods on specific problems may help to decide what approach is preferable under which conditions.

In summary, the applications described in the above sections originate in a variety of different disciplines including chemistry, medicine, computer science, mathematics, and electrical engineering. The types of problems include optimization tasks of various kinds, control problems, generative design tasks, and many others. Clearly, neural MCTS shows wide applicability beyond combinatorial games, to problems which in part share and do not share the properties of games.

4.2 Application characteristics

Like games, the problems surveyed here can be formulated as an MDP. Playing combinatorial games involves choosing discrete actions at discrete time-steps in a finite horizon setting, i.e. games are episodic with well-defined terminal conditions. Rewards are typically sparse and correspond to a small set of possible game outcomes: loss (-1), draw (0), and win (1). While the state transitions of each individual player are typically deterministic, the presence of a second player introduces uncertainty about the states which will be encountered at the next turn. While many of the applications surveyed here share many of these properties, neural MCTS is also applied to applications which differ from combinatorial games in one or multiple dimensions.

Time

Many settings do not have a turn-based nature, but allow for the execution of actions at arbitrary points in time, i.e. time often has a continuous nature. This does not appear to hinder the application of neural MCTS, as many authors simply discretize the time dimension in their problem formulation [5, 28, 33, 39, 47, 55, 56, 59, 76, 104, 131, 154].

Finite & infinite horizons

Like combinatorial games, most of the applications surveyed here consist of a finite horizon problem, i.e. the problem is solved in episodes of finite length. In some cases, the natural formulation of the problem features an infinite horizon. To apply neural MCTS, episodes can then be created artificially by setting a maximum number of steps after which the episode always terminates, as is done in [58, 96, 113].

Transitions

Many of the surveyed problems are of a completely deterministic nature, which is a fundamental difference compared to the combinatorial games domain. In such cases, the tree search may be modified to take advantage of the deterministic transitions (see Section 5.3 for more details).

Nevertheless, some problem formulations with stochastic state transitions can be observed [32, 38, 58, 69, 92, 113, 125, 142].

Rewards

The reward structure of typical problem settings often does not share the simplicity of the reward function present in games. Instead of a set with two or three distinct reward values, rewards are typically given on a continuum corresponding to the quality of the obtained solutions. Often, the rewards are not even clearly bounded on one or both sides (see e.g. [5, 45, 67, 90, 96, 125, 147, 154]).

In some cases, the reward is transformed by self-play inspired mechanisms. This will be investigated in more detail in Section 5.2.

While the majority of surveyed problems feature some kind of sparse reward at the end of an episode, in some cases, more fine-grained rewards after each action are incorporated into the tree search [26, 44, 105].

Action spaces

MCTS naturally lends itself well to discrete action spaces, as is the case in combinatorial board games. While modifications of (neural) MCTS for continuous action spaces exist [72], the vast majority of applications surveyed here exhibit discrete action spaces. Notable exceptions are the approaches of Lei et al. [61] and Paxton et al. [76].

Further, Raina et al. [80] apply a hierarchical reinforcement learning approach, in which neural MCTS is used for an overarching set of discrete actions while subsequent, continuous actions are determined by another mechanism.

Finally, it is always possible to discretize a naturally continuous action space. While this reduces the amount of precision with which actions can be chosen, some applications can nevertheless be successfully approached in this manner [26, 96].

State spaces

While the exact characteristics of state spaces depend not only on the underlying problem, but also on how the problem is modelled, games such as Go have a well-defined, regular board, which is helpful in formulating a state space. In Go, the board of fixed size consisting of cells which are positioned in spatial relation to each other lends itself well to processing by a CNN. The problems surveyed here feature a diverse range of state spaces which are processed by different kinds of neural networks. In addition to CNNs, the employed neural networks include recurrent neural networks such as long short-term memory networks [5, 25, 28, 35, 40, 42, 46, 96, 113, 119] and gated recurrent units [20, 33, 49, 59, 65, 106, 110, 122, 123, 141, 145, 158], as well as graph neural networks [48, 52, 61, 63, 75, 77, 102, 105, 109, 123, 136, 137, 141, 157]. Less frequent types include transformers [26, 78, 87] and the DeepSet architecture [146]. In many cases, a simple multi-layer perceptron is sufficient [10, 11, 12, 13, 16, 22, 32, 36, 37, 38, 41, 45, 55, 56, 58, 76, 81, 90, 91, 97, 112, 125, 126, 131, 140, 150, 151, 155].

The chosen architecture and its depth will to some degree determine what kind of hardware is required to train a neural MCTS approach.

4.3 Hardware requirements

The training of AlphaZero involved more than 5000 tensor processing units [100]. One might hence question whether the application of neural MCTS is a viable option for researchers and practitioners who do not have access to resources of that magnitude.

Some of the publications we review do report the usage of significant resources. For example, Huang et al. [46] use up to 300 NVIDIA 1080Ti and 2080Ti GPUs during training. The majority of the reported hardware, however, is not out of reach for typical organizations and even private individuals. Genheden et al. [36] report that they use a single Intel Xeon CPU with a single NVIDIA 2080Ti GPU on a machine with 64 GB memory. Many others use a single high-end consumer CPU and GPU [57, 58, 84, 94, 155].

On the lower end, some researchers even use consumer notebooks to train neural MCTS methods [45, 141, 143].

Fig. 3
figure 3

MCTS as a policy improvement operator. The learned policy and value function are used to guide the tree search, which then produces an improved policy \(\pi _{MCTS}^{*}\) and value estimate \(V_{MCTS}^{\pi ^*}\) for a given state. As visualized by the dotted lines, \(\pi _{MCTS}^{*}\) and \(V_{MCTS}^{\pi ^*}\) can then also be used as training targets for the neural network

Clearly, hardware requirements vary by application and the complexity of the employed neural networks. While it is difficult to predict what level of hardware is required for a given application and desired solution quality, it is clear that moderately powerful hardware can be successfully utilized in many applications.

5 Neural MCTS methodologies

After gaining an overview of the breadth of possible neural MCTS applications in the previous section, we now turn our attention to the design of neural MCTS approaches as they were encountered during the review.

5.1 Guidance network training

Before delving into the inner mechanisms of neural MCTS, we content ourselves with the knowledge that learned policy and value networks are used to guide the tree search in some way. In the following, we first dedicate some attention to the training procedure used in AlphaZero [100], before discussing alternatives found during the review.

Policy improvement by MCTS

In AlphaZero [100], a learned policy is iteratively improved by guiding an MCTS search and in turn using the search results to improve the learned policy (see Fig. 3). We refer to this procedure as policy improvement by MCTS. More concretely, the learned policy \(\pi _\theta \) guides the tree search in one or multiple of the search phases and a new policy \(\pi _{MCTS}\) for the state under consideration is obtained after a given number of MCTS simulations. Since this new policy is typically stronger than the initial, learned one, it can be used as a training target for the policy network. More precisely, the policy network and MCTS produce policy probability vectors \(\textbf{p}_\theta \) and \(\textbf{p}_{MCTS}\) for a given state, where the former can be seen as the actual prediction and the latter as the prediction target. These can then be used in a cross-entropy loss function to train the policy network: \(L_{CE} = - \ \textbf{p}_{MCTS}^T \ \log \ \textbf{p}_\theta \). Accordingly, if a value function is learned alongside the policy, its value estimates are adjusted in the direction of those found by MCTS by using the mean squared error (MSE) as a loss function: \(L_{MSE} = (v_{MCTS}-v)^2\). Both terms are typically combined with a regularization term into a single loss function

$$\begin{aligned} L = (v_{MCTS}-v)^2 - \ \textbf{p}_{MCTS}^T \ \log \ \textbf{p}_\theta + c \ \Vert \theta \Vert ^2 \end{aligned}$$
(3)

The vast majority of the articles we reviewed that improve a learned policy by MCTS use the loss function given in (3). We find two exceptions to this, one in which the Kullback-Leibler divergence is used instead of the cross-entropy [17] and one in which the Kullback-Leibler divergence is also used instead of the cross-entropy, but a quantile regression distributional loss is additionally used instead of the MSE [26]. It may be worth investigating what effect these loss functions have on training, but the combination of cross-entropy and MSE clearly emerges as the default choice during our review.
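
For concreteness, a minimal PyTorch sketch of the combined loss in (3) is given below; it assumes the policy head outputs logits and that the L2 term is applied over all network parameters, and all function and variable names are illustrative.

```python
import torch

def alphazero_loss(policy_logits, v_theta, p_mcts, v_mcts, params, c=1e-4):
    # value loss: mean squared error between predicted value and MCTS value estimate
    value_loss = torch.mean((v_mcts - v_theta) ** 2)
    # policy loss: cross-entropy between the MCTS policy and the predicted policy
    log_p_theta = torch.log_softmax(policy_logits, dim=-1)
    policy_loss = -torch.mean(torch.sum(p_mcts * log_p_theta, dim=-1))
    # L2 regularization over the network parameters theta
    l2 = c * sum(torch.sum(p ** 2) for p in params)
    return value_loss + policy_loss + l2
```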

Fig. 4
figure 4

Depending on the approach, policy and value networks are trained before the search, during the search, or both. Depending on this choice, different training methods are employed. Policy gradient and Actor-Critic are families of algorithms that encompass multiple specific algorithms

Alternative training approaches

We find two broad groups of training approaches during our review: (1) training before the networks are used to guide the tree search and (2) training during the tree search, i.e. iteratively performing tree search and using its results for training. In group (1), training is facilitated either by supervised learning on labelled examples or by classical reinforcement learning algorithms in the policy-based, value-based, and actor-critic families. In group (2), the dominant approach is policy improvement by MCTS, but some variations on this exist. Finally, both approaches can be combined by first performing what can be considered a pre-training and then training further during the tree search. This is sometimes referred to as a warm-start.

Figure 4 shows the distribution of these different approaches as found in the surveyed publications and which specific methods are employed in each approach. A large portion of authors train their networks during the search by using policy improvement by MCTS. In some cases, training during the search occurs by other methods such as Q-Learning [1, 37, 95, 123] and Maximum Entropy RL [157] on the MCTS trajectories. Instead of improving the policy (alongside the value function) by MCTS, it is also feasible to refine the value function individually [62, 137], without any learned policy. We term this approach value function refinement by MCTS here. Likewise, while policy improvement by MCTS typically trains both a policy and a value function, sometimes the policy is also trained in isolation [93].

When trained before the tree search, networks are most often trained by supervised learning on labelled demonstrations. Q-Learning [75, 76, 119, 125, 131], policy gradient [24, 45, 55, 57, 64, 77] and actor-critic methods [17, 41, 121, 129] are also employed.

When both training phases are performed, the most common approach is to combine supervised pre-training with policy improvement by MCTS [16, 81, 107, 126, 154].

Overall, there is some variety in the employed training approaches, but the dominant strategies are supervised training before the search and policy improvement by MCTS, sometimes combined in one approach as in the original AlphaGo publication [99]. In a combinatorial game setting, policy improvement by MCTS requires some mechanism by which the opponent’s moves are generated. In the following, such a mechanism and its relevance for applications beyond games are discussed.

5.2 Self-play beyond games

One of the components leading to the success of AlphaGo is the concept of self-play [99]. In a self-play setting, opponents in a multi-player game are controlled by (some version of) the same policy, i.e. the policy plays against itself [43]. Learning in such a scenario has the advantage that the policy always faces an opponent of comparable skill, which evolves as the training progresses. However, since many non-game-playing applications have an inherently single-player nature, the role of self-play beyond games is not obvious.

To gain a clearer understanding of the applicability of self-play in such cases, we surveyed its usage among the publications included in our review. During this process, it became apparent that many authors use the term self-play, but that the meaning of the term varies. This may be due to the lack of an accepted, standardized definition. In the following, we first delineate different meanings of the term we encountered and then report the usages of different versions of self-play in our review.

In a two-player setting, the term self-play can be understood intuitively. As used in the original AlphaGo publication [99], self-play entails that the policy currently being learned plays a game against some older version of itself in a two-player turn-based setting. This means that this other version of the current policy is being used to generate new states by playing every other turn of the game, as well as to obtain the final reward of the game.

In single-player settings, the term self-play is also often used, but its meaning is less obvious. In a single-player setting, the state generating property described above is not applicable, since the state transitions do not depend on the actions of another player. Generating a new state requires only the current state and the agent’s action. The reward generating property described above, however, is applicable if the reward function is designed accordingly. If the reward is not simply dependent on the performance of the current policy, but on the relative performance compared to some prior version of the policy, the reward generating property of self-play is transferred to the single-player setting. In other words, the process of a policy trying to beat its own high score has similarities to the concept of self-play in two-player settings. While this is sometimes also called self-play, Mandhane et al. [66] introduce the term self-competition for this type of approach. In the remainder of this review, we will adopt this term and reserve self-play for multi-player settings to avoid confusion. While simple versions of self-competition can be implemented trivially, Laterre et al. [60] introduced a more substantiated form of self-competition named ranked reward, followed by the approaches of Mandhane et al. [66] and Schmidt et al. [88].

However, many authors claim to implement self-play without obviously applying any of the concepts described above (see e.g. [80, 111]). While it is hard to be certain about what is meant in such instances, we suspect that two further concepts are sometimes termed self-play in the literature. The first is the practice of keeping track of the best policy by evaluating the current policy against the previous best one. If the current policy can outperform the previous best one on some defined set of problems, it replaces the currently saved best policy. This simply serves the purpose of having access to the best policy after training completes since training does not necessarily improve the policy monotonically. In such cases, the outcomes of evaluation are not used as rewards to train the policy. Consequently, no learning follows from the policy playing against another version of itself, i.e. it is not a mechanism by which the current policy is improved, but merely evaluated.

The last concept, which we suspect is sometimes described as self-play, is policy improvement by MCTS as introduced above (see e.g. [12]).

To be clear, we will use the term self-play only to describe multi-player cases where the state generating property as well as the reward generating property hold, and the term self-competition only for single-player cases where the reward generating property holds.

Table 16 Self-competition

While we would ideally like to report the usage of self-play and self-competition for all publications included in our review, we refrain from doing so when the terms are used ambiguously and instead only report a selection of notable examples where their meaning has been clearly established.

Self-play

Actual self-play appears to be fairly rare in the non-game-playing literature. We can only attribute the use of self-play to a single work [27], in which a problem is modelled as a two-player game and a policy learned by self-play. In some other cases, problems are modelled as two-player games as well, but the resulting games are asymmetric, i.e. the players have different action spaces and hence require different policies [32, 138,139,140,141]. In such cases, two different neural networks each learn a policy.

Self-competition

In terms of self-competition, we observe instances of ranked reward [109, 123, 127] as well as naive approaches (see Table 16). In a naive approach, the performance of the current policy on the current problem is simply evaluated as some score and compared against the score of the best policy observed up to this point on the same problem. If the current policy’s score is better, the game is won (\(r=1\)), if it is worse, the game is lost (\(r=-1\)), and if it is equivalent, the outcome is a draw (\(r=0\)) [46, 52]. A variation of this is to not use the best policy, but to evaluate against the average score of a group of saved policies [122].
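
A naive self-competition reward of this kind can be sketched in a few lines; the function name and the tie handling are illustrative assumptions.

```python
def self_competition_reward(current_score, best_score, higher_is_better=True):
    """Naive self-competition: compare the current policy's score on a problem
    instance against the best score achieved so far on the same instance."""
    if current_score == best_score:
        return 0                        # draw
    better = (current_score > best_score) if higher_is_better \
        else (current_score < best_score)
    return 1 if better else -1          # win against the past self, or loss
```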

In one case, a naive approach, as described above, is applied, but instead of a past version of the policy, a second, completely independent policy is learned and the two policies continually compete against each other [3].

While self-competition can be used to generate rewards based on the relative performance of the policy, this is not strictly necessary, as the absolute performance can be used to compute rewards just as well. One benefit of self-competition may simply be having a reward in a clearly defined range of \([-1, 1]\) or similar, as optimal choices of MCTS hyperparameters depend on this range [8]. However, there appear to be benefits beyond this, as the ranked reward approach has been shown to outperform agents trained using a standard reward in the range [0, 1] [60]. Whether this is the case for the naive self-competition approaches as well is unclear.

Table 17 Variations of tree policies based on UCT

5.3 Guided selection

The previous sections argue that MCTS functions as a policy improvement operator. We now explore the mechanisms of this policy improvement, i.e. the inner workings of neural MCTS. A search iteration in MCTS begins with the selection phase, in which actions are iteratively chosen starting from the root state until a leaf node is encountered. As described in Section 2.3, the choice of action is determined by a tree policy, which generally takes the form

$$\begin{aligned} a = \underset{a}{\text {argmax}} \ Q(s,a) + U(s,a) \end{aligned}$$
(4)

where Q(s, a) encourages exploitation of known high-value actions, while U(s, a) encourages exploration of the search tree. Variations exist both in the exact formulation of (4) and in how individual terms of the equation are determined, i.e. by learned policies and value functions or by conventional means. We investigate each aspect individually in the following.

Tree policy formulations

The tree policies encountered during the review are usually based on some version or extension of the UCT rule, but some variation in the exact formulation of the rule, especially in the exploration part, can be observed.

We provide an overview of variations of U(s, a) identified during our review in Table 17. While compiling the table, we modified the exact formulations reported in individual publications to arrive at a consistent notation. To this end, we assumed that all reported logarithms are natural logarithms and that \(N(s) = \sum _b N(s,b)\), i.e. N(s) refers to the total visit count over all children of state s, while N(s, a) refers to the visit count of action a in state s. The exploration constant, sometimes given as \(c_{uct}\), \(c_{puct}\) or similar, is simply referred to as c in this review. P(s, a) represents some prior probability of choosing action a in state s, whether it is given by a learned policy or obtained by other means.

Among the surveyed publications, a large proportion still use (some variant of) the UCT formula (see Table 17), but PUCT as it is used in AlphaZero [100] (PUCT variant 0 in Table 17) is the most frequently used selection mechanism. There are a number of less frequently used PUCT variations, mostly concerning the presence of logarithms, constant factors in the numerator and denominator, and the scope of the square root. These differences impact the overall magnitude of the exploration term as well as its decay as individual actions are visited more often (see Fig. 5). It is difficult to judge the impact of different formulations on the search, since authors usually do not directly compare them. In a rare exception, Xu and Lieberherr [138] try both the AlphaZero PUCT variant and PUCT variant 9 from Table 17 and report that the AlphaZero variant performs much better, although they do not quantify this difference.
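To make the differences concrete, the following minimal Python sketch contrasts a classic UCT-style exploration term with the AlphaZero-style PUCT term (PUCT variant 0) inside the tree policy of (4). The node attributes `visit_count`, `mean_value`, and `prior` are hypothetical, and the further variants in Table 17 differ in constants, logarithms, and the scope of the square root.

```python
import math

def uct_exploration(N_s: int, N_sa: int, c: float) -> float:
    """Classic UCT-style term: c * sqrt(ln N(s) / N(s,a))."""
    return c * math.sqrt(math.log(max(N_s, 1)) / max(N_sa, 1))

def puct_exploration(N_s: int, N_sa: int, P_sa: float, c: float) -> float:
    """AlphaZero-style PUCT term: c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return c * P_sa * math.sqrt(N_s) / (1 + N_sa)

def select_child(children, c: float, use_prior: bool = True):
    """Tree policy of (4): pick the child maximizing Q(s,a) + U(s,a)."""
    N_s = sum(child.visit_count for child in children)

    def score(child):
        if use_prior:
            u = puct_exploration(N_s, child.visit_count, child.prior, c)
        else:
            u = uct_exploration(N_s, child.visit_count, c)
        return child.mean_value + u

    return max(children, key=score)
```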

Fig. 5 Different UCT-style fractions with a fixed \(N(s) = 1000\). Note that the vertical axis is logarithmic and shows the value the expressions in the legend produce for different N(s, a)

One notable PUCT variant, variant 3, introduces a new constant \(\mu \) which determines the impact of the prior probabilities as \(P(s,a)^\mu \). This variant seems to have been independently suggested in [95, 104, 123].

Aside from UCT and PUCT variants, the MuZero [89] selection formula or a variant of it is used by three authors. We further find two unique modifications of typical selection formulae that we cannot assign to any of the other groups: \({UCT}_D\) and \({PUCT}_B\). The former will be discussed at a later point. \({PUCT}_B\) aims to exploit the nature of deterministic single-player settings, in which future trajectories are not influenced by the choices of another player. In such cases, rather than simply looking at average state values, it may be advantageous to keep track of the best values encountered during the search. Making decisions based on average values is problematic because most of the actions in a given state may be bad choices, while a single specific action may be a good choice. On average, the value of the node will then be low, even though a promising child exists. In a deterministic setting, the best path can be executed reliably and, accordingly, it makes sense to choose nodes based on their expected best value rather than the average one. Deng et al. [23] design a selection formula that makes use of this fact, which we refer to as \({PUCT}_B\) here. In \({PUCT}_B\), the best value of an action is simply scaled by a constant and then added to PUCT variant 0.
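As a rough illustration, and with the same hypothetical node attributes as above plus a `best_value` field, the \({PUCT}_B\) score can be sketched as PUCT variant 0 plus a constant-scaled best observed value; the exact formulation is given in [23].

```python
import math

def puct_b_score(child, N_s: int, c: float, c_best: float) -> float:
    """Sketch of PUCT_B: AlphaZero-style score plus a scaled best observed value.

    `child.best_value` is assumed to hold the best value found below this action,
    which a deterministic single-player setting can reproduce reliably."""
    puct = child.mean_value + c * child.prior * math.sqrt(N_s) / (1 + child.visit_count)
    return puct + c_best * child.best_value
```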

Fig. 6 Neural Selection. Each of the children of the current state s is considered and the one maximizing \(Q(s,a) + U(s,a)\) is chosen. Both Q(s, a) and U(s, a) may be influenced by neural guidance in some way

Neural guidance in the tree policy

Neural guidance may be used in both the exploitation and the exploration part of (4) (see Fig. 6). When used in the exploitation part, neural guidance typically serves to estimate Q(s, a). This does not change how the selection mechanism works, only how the corresponding value is determined. Since value estimation is part of the evaluation step, this kind of neural guidance will be explored in Section 5.5 and not discussed further at this point.

In the exploration part of (4), neural guidance is typically used to determine the prior probabilities P(s, a) in PUCT-style formulae. As shown in Fig. 7, about 62% of all reviewed articles report guiding the tree search in this way, with less than 30% reporting selection phases without neural guidance. The remaining articles do not report how the selection phase is performed at all. While the latter can probably be interpreted as selection without neural guidance, we try to refrain from interpretations as much as possible and hence give separate categories for standard selection and unreported selection.

Most approaches for neurally guided selection phases take the form described above, with the exception of a few special cases. Zombori et al. [157] argue that a learned policy network tends to make predictions with high confidence even if they are of low quality, which leads to a strong, unfounded bias in the search. It may be more desirable to have a policy which makes less confident predictions if the prediction quality is not sufficiently high. To achieve this, they employ maximum entropy reinforcement learning and use the resulting policy to compute prior probabilities for the selection phase.

In one exception, neural guidance occurs in a form other than providing prior probabilities, as can be seen in the \({UCT}_D\) formula in Table 17. It is named after its use of a dueling network, which produces action advantage estimates A(s, a) in addition to state values. In \({UCT}_D\), the action advantages are used in place of the prior probabilities P(s, a). Vaguely related to the reasoning of Zombori et al., the authors argue that a policy network trained with policy gradient methods tends to concentrate on the best action for a given state, while not assigning probabilities proportional to the expected usefulness of the other actions [125]. In other words, an overly low-entropy policy vector may bias the search to an undesirable degree. In contrast, the action advantages do not overly focus on the best action.

Fig. 7 Proportion of choices in the selection phase among the surveyed articles. Standard refers to some selection strategy that does not involve the use of learned functions

5.4 Guided expansion

Once a leaf node \(s_L\) is encountered during the selection phase in MCTS, the expansion step is performed to create child nodes of \(s_L\). Guidance by neural networks can be employed in this step as well to bias and hence speed up the search. To avoid confusion, we will first give some details on possible alternate ways to implement the expansion step and only then return to the topic of neural guidance.

As discussed in Section 2.3, nodes are typically expanded either one at a time whenever an expandable node \(s_E\) is encountered, or all children of a node are expanded simultaneously if a leaf node \(s_L\) is encountered.

Fig. 8 Neural Expansion. When a leaf node is encountered, possible actions in the leaf node's state are sampled by some mechanism involving the learned policy. For every sampled action, a new child is created

Clearly, implementing an MCTS approach requires deciding how many children are expanded at a given time. There is, however, an additional, related decision to be made: How many and which children are considered for expansion? In the naive case, the search is free to choose any action from the set of all possible actions \(A(s_L)\) in state \(s_L\). However, it is also possible to be more selective in the expansion step. To limit the growth of the tree, only a limited number of children may be considered for expansion, either randomly or according to some rule or heuristic. In other words, the search may be restricted to only choose actions from a set \(\tilde{A}(s_L) \subset A(s_L)\). This is especially relevant for continuous cases, where the number of potential children is infinite and necessarily has to be limited in some way. Once such a set \(\tilde{A}(s_L)\) has been defined, the corresponding child nodes may be expanded all at once when the leaf is encountered, or one by one, whenever the expandable node is encountered during the tree search.

Neural guidance during the expansion step is possible in both paradigms, i.e. when expanding on encountering a true leaf node and when expanding on encountering an expandable node. In the former case, neural guidance means using a learned policy to determine \(\tilde{A}(s_L)\), while in the latter case, neural guidance means choosing an action in \(A(s_E)\) to create a new child node. Theoretically, it is possible to combine both of these approaches by first determining and saving \(\tilde{A}(s_L)\) when a leaf is encountered for the first time, but not expanding all corresponding nodes at this point. The children can then be expanded one by one whenever the node is encountered again by choosing some action \(a \in \tilde{A}(s_E)\). However, we do not observe this combined approach in the collected literature.

We do observe some form of neurally guided expansion in a sizeable portion of publications (see Figs. 8 and 9) and categorize them in Table 18. In some additional examples, neurally guided expansion is used, but the specifics are not reported [25, 36, 90].

\(\tilde{A}(s_L)\) can be determined by randomly sampling actions from a learned policy, but it can also be determined by enumerating all actions in the policy distribution and choosing the top k ones [48, 91, 112]. In the approach of Thakkar et al. [112], the top k actions with a cumulative policy probability of 0.995, or at most 50 actions, are selected. In both types of neurally guided expansion, a learned value function can of course be used in place of a learned policy by converting it into a policy with a softmax operator, as is done in [119].
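As an illustration of the enumeration-based option, the following sketch restricts \(\tilde{A}(s_L)\) to the most probable actions under the learned policy, using a cumulative-probability cut-off and a hard cap in the spirit of Thakkar et al. [112]; the function and parameter names are hypothetical.

```python
import numpy as np

def restrict_actions(policy_probs: np.ndarray, cum_threshold: float = 0.995,
                     max_actions: int = 50) -> np.ndarray:
    """Return indices of the most probable actions whose cumulative probability
    reaches `cum_threshold`, capped at `max_actions` (sketch in the spirit of [112])."""
    order = np.argsort(policy_probs)[::-1]           # actions sorted by prior, descending
    cumulative = np.cumsum(policy_probs[order])
    k = int(np.searchsorted(cumulative, cum_threshold) + 1)
    k = min(k, max_actions, len(order))
    return order[:k]                                 # indices forming the restricted set
```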

Fig. 9 Proportion of choices in the expansion phase among the surveyed articles. Standard refers to some form of expansion that does not involve the use of neural networks

Fig. 10 Evaluation by learned policy roll-out: After arriving at a leaf node with state s according to the tree policy, the value of s needs to be determined. Here, the learned policy \(\pi _\theta \) is used to generate a roll-out by iteratively sampling actions until a terminal state \(s_T\) is reached. The reward of this terminal state serves as an estimate for the value of s

Table 18 Types of neurally guided expansion

While the exact impact of neurally guided expansion will vary from application to application, its general potential is demonstrated in [82], where the authors report a twentyfold reduction in computation time with neural expansion while achieving higher-quality solutions.

5.5 Guided evaluation

The evaluation step in MCTS serves to estimate the (state-)value of a leaf node encountered during the tree search. While it is often also called the roll-out step or the simulation step, its purpose is the value estimation of a leaf. Roll-outs or simulations are simply approaches to produce a value estimate. Here, we use the term evaluation, because not all evaluation approaches in the neural MCTS literature are based on roll-outs.

There are two obvious ways to use learned policies and value functions during the evaluation phase: a roll-out using the learned policy and a direct prediction by a learned value function. In the former, actions are iteratively sampled from the learned policy starting from the encountered leaf node until a terminal node is reached (see Fig. 10), while in the latter, the value of the leaf node is simply predicted by the learned value function without any roll-out (see Fig. 11).
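Schematically, and assuming hypothetical `policy` and `value_net` callables together with a stateless dynamics model offering `step` and `is_terminal` (cf. Section 5.7), the two options look as follows.

```python
import numpy as np

def evaluate_by_policy_rollout(state, policy, model, rng=None):
    """Roll out the learned policy from the leaf state until a terminal state is
    reached and use the observed return as the value estimate (cf. Fig. 10).

    Depending on the application, the accumulated return or only the terminal
    reward may serve as the estimate; here the rewards are accumulated."""
    if rng is None:
        rng = np.random.default_rng()
    value = 0.0
    while not model.is_terminal(state):
        probs = policy(state)                      # pi_theta(. | s), a probability vector
        action = rng.choice(len(probs), p=probs)   # sample an action from the policy
        state, reward = model.step(state, action)
        value += reward
    return value

def evaluate_by_value_net(state, value_net):
    """Predict the leaf value directly with the learned value function (cf. Fig. 11)."""
    return value_net(state)
```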

Most authors use either learned policy or value functions as described above, with 54 occurrences of learned value functions and 26 occurrences of learned policy functions (see Fig. 12). The remaining publications either do not use neural guidance for the evaluation phase or their approach is unclear.

Fig. 11 Evaluation by learned value function: After arriving at a leaf node with state s by following the tree policy, the value of s needs to be determined. Here, a learned value function \(V_\theta \) is used to estimate the value of s directly, without any need for a roll-out

Among the neural evaluation approaches, some authors employ different evaluation approaches depending on how far the training has progressed. Song et al. [104] combine both approaches by performing roll-outs according to a learned policy network in the early phases of training and using a learned value network for estimation in later stages. Zhang et al. [150] use an initial phase of random roll-outs to pre-train a policy network and employ the learned policy network for roll-outs in later stages.

In some cases, roll-outs are not performed by naively using a learned policy, but more complex roll-out procedures are still guided by learned functions. He et al. [40] use a value network to guide a problem-specific roll-out procedure, while Xing et al. [136] use a learned policy function to guide a beam search. Kumar et al. [58] combine a value estimate as predicted by a neural network with a domain-specific roll-out policy, motivated by the fact that their reward function is more fine-grained than those typically observed in board games. Finally, Lu et al. [64] use a value network in a symbolic regression task to estimate whether a leaf node merits further refinement by an optimization method, but their approach is highly problem-specific.

Fig. 12 Proportion of choices in the evaluation phase among the surveyed articles. Value function refers to an evaluation by a learned value function, policy to a roll-out using the learned policy, and standard to a random roll-out

Deng et al. [23] perform different evaluation approaches depending on the depth of the node to be evaluated. If the node is closer to the root of the tree, they use a neural network to estimate the value of the state, while they perform a random roll-out for nodes at deeper levels of the tree. Their experiments show that this hybrid approach can balance solution quality and computation time. Since neural inferences are associated with non-negligible computational cost, replacing them with random roll-outs can decrease the search time especially at deeper levels, where roll-outs are short.
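A hedged sketch of such a depth-dependent switch is given below; the threshold and interfaces are hypothetical, and the concrete criterion used in [23] may differ.

```python
def evaluate_leaf(state, depth, value_net, random_rollout, depth_threshold=20):
    """Use the (comparatively costly) value network near the root and cheap random
    roll-outs deeper in the tree, where roll-outs are short anyway."""
    if depth < depth_threshold:
        return value_net(state)       # one neural inference
    return random_rollout(state)      # no inference cost, short random roll-out
```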

Fig. 13 Parallel sets diagram of surveyed neural guidance configurations. Each vertical bar signifies an option in one of the MCTS phases: selection (left), expansion (middle), evaluation (right). In the expansion step, each option (standard and neural expansion) is displayed twice to allow for easier tracing of the visualized configuration

In the application described by He et al. [41], episodes can result in either a successful or a failed solution. Successful solutions can still differ in quality and are rewarded accordingly. They perform a roll-out by a learned policy, but if this results in a failed terminal state, they backtrack until they find a successful solution. They argue that this leads to better search efficiency because preceding trajectories are not repeated unnecessarily.

Design choices in other MCTS phases can also influence what is required from the evaluation phase. Kovari et al. [56] replace the exploitation term, i.e. the estimated value, of a UCT-style formula with the probability of taking an action as predicted by the policy network. Instead of a roll-out or direct value prediction, they hence simply predict this probability in the evaluation step.

5.6 Guidance in multiple phases

As discussed above, neural guidance can be employed in the selection, expansion, and evaluation phases of MCTS. Of course, it is not necessary to limit this guidance to one phase at a time, and different types of neural guidance can be combined in one approach.

To gain an overview of how different types of neural guidance are typically combined, we visualize their use in Fig. 13.

The most common approach is to guide the selection step, perform standard MCTS expansion, and then use a learned value function for evaluation. Many other combinations exist, but none of them are used as often. When standard selection is used, the relative incidence of neural expansion is higher than when using neural selection. This could, however, be explained by the fact that the respective authors simply wanted to highlight the effect of neural expansion, since it is often the main focus of their publications.

Infrequently occurring combinations may indicate a need for further research.

5.7 Use of dynamics models

Neural MCTS is a model-based reinforcement learning approach, i.e., it requires access to a dynamics model of the environment to perform planning. Typically, this dynamics model is given [99, 101] and can be readily used in the tree search. In contrast to a regular reinforcement learning environment as defined, e.g., in the OpenAI Gym standard [7], a dynamics model allows for the computation of the next state and reward given an arbitrary initial state and action to be executed. An environment, on the other hand, is in a specific state at any given time which can be influenced by actions, but does not allow for dynamics computations on arbitrary states. In other words, an environment is stateful, while a dynamics model is not.
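The distinction can be summarized by the two interfaces below; this is a schematic sketch, not the API of any particular library.

```python
from typing import Any, Protocol, Tuple

class Environment(Protocol):
    """Stateful: maintains a current state that is advanced by actions."""
    def reset(self) -> Any: ...
    def step(self, action: Any) -> Tuple[Any, float, bool]: ...  # observation, reward, done

class DynamicsModel(Protocol):
    """Stateless: computes successor state and reward for arbitrary state-action
    pairs, which is what the tree search needs to expand hypothetical nodes."""
    def step(self, state: Any, action: Any) -> Tuple[Any, float]: ...
    def is_terminal(self, state: Any) -> bool: ...
```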

The difficulty of developing such a dynamics model will differ from application to application. In any case, its development will require some additional effort. To circumvent this, it is also possible to learn the dynamics model and then use this learned model for planning in MCTS [89, 115]. While this adds additional complexity to the training process, it can be helpful in scenarios where an exact and efficient model of the environment cannot be easily obtained.

During our review we found that the vast majority of approaches utilize an existing dynamics model, but learned models also find some application in practice.

For instance, Chen et al. [14] investigate an autonomous driving task where a model is needed to predict the vehicle state. In this case, the vehicle state is an image, meaning that the model needs to produce an output image given an input image (corresponding to the initial state) and an action. While such a model is not trivial to implement manually, Chen et al. [14] are able to train a convolutional neural network to serve this purpose.

Similarly, Challita et al. [10] apply neural MCTS to enable dynamic spectrum sharing between LTE and NR systems and report that this requires a model of individual schedulers for LTE and NR, which is not trivial to design. Instead, they learn the model in an approach similar to the one proposed in MuZero [89]. That is, the dynamics are not computed on the raw observations, but on hidden representations, which are computed from the observations by a learned representation function. This approach is also taken by others [32, 96], but dynamics models which work directly on the observations can also be observed [21, 125].

In many cases, the dynamics model is not learned during neural MCTS training, but trained separately in advance and then simply used for inference during the tree search [9, 28, 39, 122].

As described above, one motivation to learn a dynamics model may be the difficulty of creating one manually. Another motivation is the speed with which a learned model can be evaluated [109].

In some cases, the state transitions can be modelled fairly easily, while the computation of the reward is time-consuming. Some authors do not train a full dynamics model, but a scoring model, which can be used to assess the quality of a given solution quickly. In contrast to a learned value function, which can be used to evaluate newly expanded nodes at arbitrary depths, a scoring model only assigns a score to full solutions, i.e. terminal nodes. The resulting scores can then be used as training targets for the value function [42, 106, 151].
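Schematically, the scoring model only replaces the expensive reward computation at terminal nodes, and its outputs then serve as regression targets for the value network; the names below are hypothetical.

```python
def make_value_targets(terminal_states, scoring_model):
    """Score complete solutions with the fast learned scoring model and use the
    resulting scores as training targets for the value function (sketch)."""
    return [(state, scoring_model(state)) for state in terminal_states]
```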

5.8 MCTS modifications

We now turn our attention to selected modifications of typical (neural) MCTS procedures as encountered during the review.

Average and best values

As briefly mentioned in Section 5.3, deterministic single-player settings pose different requirements than combinatorial games. Action selection based on the average value of a node will lead to sub-optimal results, because a strong child node can be surrounded by weak siblings. While the \({PUCT}_B\) mechanism described in Section 5.3 is one option to address this, other authors have identified this issue as well and proposed their own solutions.

In empirical experiments, Deng et al. [23] find that their final search results are usually worse than the best solutions encountered during roll-outs. They point out that the final action selection after tree search is performed based on the node visit counts N(s). To rectify the problem, they introduce an oversampling mechanism for good solutions. Whenever a solution found during a roll-out outperforms all previously found solutions, this solution is given preference in subsequent selection phases for a certain amount of time and is hence visited more often.

A simpler approach is taken by Peng et al. [77] and Xing et al. [137] to address the same problem. Here, the exploitation part of (4), i.e. the average value of the node, is simply replaced with the best observed value for the node. Fawzi et al. [26] follow a similar strategy.

Zhang et al. [154] simply keep track of the maximum reward encountered in the search and the action sequence that led to it, which is then returned after the search.

Value normalization

While combinatorial games lend themselves well to reward function formulations in the range \([-1, 1]\), in other applications, rewards are often less regular and sometimes completely unbounded. As mentioned earlier, the exploration constant c needs to be tuned for different reward ranges [8]. Further, even with a perfectly tuned c, rewards outside of ranges like \([-1, 1]\) or [0, 1] are typically not conducive to algorithm convergence [97]. A number of authors therefore suggest normalizing Q-values according to the minimum and maximum values observed in the tree search until the current point [78, 97, 137].
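A minimal sketch of this kind of normalization tracks the Q-value range observed so far in the current search and rescales values to [0, 1]; the exact schemes used in [78, 97, 137] may differ in detail.

```python
class MinMaxStats:
    """Track the Q-value range observed during the search and rescale to [0, 1]."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, q: float) -> None:
        self.minimum = min(self.minimum, q)
        self.maximum = max(self.maximum, q)

    def normalize(self, q: float) -> float:
        if self.maximum > self.minimum:                       # avoid division by zero
            return (q - self.minimum) / (self.maximum - self.minimum)
        return q                                              # range not yet known; leave unchanged
```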

The cost of neural inference

The main idea behind neural MCTS approaches is to increase the efficiency of the tree search with neural inferences. While an individual neural inference is not overly expensive, when inferences are performed in large numbers, the required computation time can add up significantly.

Deng et al. [23] vary the number of neural inferences by switching neural guidance off for some proportion of the search. They find that neural guidance generally helps the search, but that further neural guidance after a certain point only increases computational cost without providing additional benefits.

Designing mechanisms to limit the application of neural guidance to where it provides maximum benefits in a targeted way may be an interesting line of research.

Fig. 14 Distribution of reported MCTS hyper-parameters: number of MCTS simulations (top) and exploration constant c (bottom). Note that a logarithmic axis is used in both cases. Variation on the vertical axis is not meaningful, but contains random jitter for better visibility of individual data points

5.9 MCTS hyper-parameters

One last aspect in the design of neural MCTS approaches is the choice of appropriate hyper-parameters. While choosing hyper-parameters is highly problem-specific, it can nevertheless be useful to look at average hyper-parameter values to serve as a starting point and determine reasonable bounds for a problem-specific hyper-parameter optimization. To facilitate this, we summarize the values for the MCTS-specific hyper-parameters found during our review in Fig. 14. A large amount of variation in values exists for both the exploration constant c used in UCT-style selection formulae and the MCTS-budget \(n_{MCTS}\), i.e. the number of simulations or play-outs performed during MCTS for a given time-step.

It is known that the optimal choice of c depends on the scale of the encountered rewards [8]. While reward scales vary wildly and are not always reported among the literature we survey, the most commonly reported reward scale is in the interval \([-1, 1]\), for which six different values of c ranging from 0.5 to 5.0 are chosen. In other words, the scale of the rewards does not appear to be the only criterion on which authors choose hyper-parameter values.

Instead of having a fixed value, in some cases, the exploration constant c and the MCTS budget \(n_{MCTS}\) are changed dynamically depending on circumstances.

Sometimes the exploration constant c is decreased as the training progresses [16, 157], presumably because later training iterations profit more from perfecting the current policy instead of performing further exploration. Wang et al. [126] tune c dynamically based on the currently observed maximum Q-value to balance exploration and exploitation as Q-value estimates evolve and report a significant improvement in the observed results.

Some problem settings feature varying instance sizes, where a larger instance size is generally associated with higher difficulty. Zhong et al. [155] increase c as the problem size grows, presumably because more exploration is required to adequately cover the larger state space. For similar reasons, they and others [123, 136] further choose larger \(n_{MCTS}\) for larger instance sizes. Wang et al. [121] even explicitly parameterize \(n_{MCTS}\) by the problem size.

In many problems, the size of the remaining search space decreases with increasing depth of the tree. Hu et al. [45] argue that the search budget should depend on the depth of the node the search starts from and present a mechanism that decays \(n_{MCTS}\) with increasing tree depth.
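One possible way to express such a depth-dependent budget is sketched below; the exponential schedule and parameter values are hypothetical and not the mechanism published in [45].

```python
def mcts_budget(base_budget: int, depth: int, decay: float = 0.9,
                min_budget: int = 16) -> int:
    """Exponentially decay the number of MCTS simulations with the depth of the
    node the search starts from (hypothetical schedule)."""
    return max(min_budget, int(base_budget * decay ** depth))
```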

Fawzi et al. [26] report an increase in \(n_{MCTS}\) after a certain amount of training steps, presumably because later training iterations can profit more from a higher search budget and a larger proportion of the overall training time budget should hence be allocated to those later iterations.

Some authors reduce the number of MCTS simulations at test time. For instance, Chen et al. [14] reduce \(n_{MCTS}\) by a factor of ten at test time compared to the training phase. They further limit the search depth in both phases, but do so to a larger degree during test time.

In some cases, MCTS is only used to train a policy network, which is then applied without further tree search at test time [9, 10, 18, 20, 155]. This can be due to the specific requirements of the application, i.e., some applications require fast inferences at test time that render the application of tree search infeasible, but can still profit from MCTS at training time [10]. In some applications, applying MCTS at test time after having used it for training simply does not improve performance to a significant degree [20].

6 Discussion & conclusion

Focusing on uses of neural MCTS outside of games, we investigated the diversity of applications, their characteristics, and the design of the employed neural MCTS approaches by performing a systematic literature review.

With regard to research question 1 posed in the introduction, we find that neural MCTS is applied in a wide variety of domains to solve different problems. While most problems exhibit similar characteristics such as discrete time and actions, finite horizon, and deterministic transitions, many authors also demonstrate that neural MCTS can be applied to problems with differing properties.

The applications encountered during the review usually have slightly different requirements and properties than combinatorial games. This does affect the way solutions are designed, an aspect investigated as part of research question 2. The concept of self-play, for instance, is generally not applicable to single-player problems. In some cases, single-player problems can be modelled as multi-player problems, but this is the exception rather than the rule. It is possible to employ a mechanism called self-competition, which replicates the way self-play generates rewards in a single-player setting.

Many authors further point out that selecting actions based on average node values is not ideal in single-player deterministic environments. Different mechanisms, all based around tracking maximal node value, can be employed to adjust the typical MCTS mechanisms in this regard.

Compared to the neural MCTS ecosystem for games, as well as the traditional RL ecosystem, the neural MCTS landscape is almost completely devoid of standardized implementations and components. Instead, almost all implementations are entirely custom-made. In a few exceptions, domain-specific implementations are reused by others, but they can only be applied to a very narrow set of problems (see e.g. [145]). While this is understandable for more fundamental research, where implementations are inherently in flux, we believe that standardized components could significantly speed up progress on the applied research side.

One reason for the lack of standardized components may be that the design of neural MCTS methods varies substantially, beginning with their training approaches, to their use of self-play related mechanisms, forms of neural guidance, and other modifications to traditional MCTS setups. A widely applicable neural MCTS framework would have to be highly configurable to accommodate different disciplines and problem settings. While this is a difficult task, the traditional RL ecosystem demonstrates that standards [7] and publicly available implementations of algorithms [79] accelerate progress, and that flexible frameworks suitable for research [6] can be designed.

In response to research question 3, it can be concluded that the forms of neural guidance used in the game literature are often used in other applications as well. The most common type of guidance consists of neural selection and evaluation by a learned value function, just as in AlphaZero [100]. Other types and combinations of neural guidance can be found in the literature as well. Given the amount of variation in different neural MCTS systems, a central question for practitioners is how to set up their own systems depending on the characteristics of their applications. Ideally, we would be able to map observed problem characteristics to observed neural MCTS configurations to provide a guideline for others to use. While some problem characteristics, e.g. the discrete or continuous nature of the action space, can be determined fairly reliably when reviewing existing publications, others are not so easy to ascertain. The breadth and depth of the (full) tree, for instance, are rarely reported explicitly. In some cases, it may be possible to infer them, but in a review spanning a multitude of different disciplines, doing so reliably is difficult.

Some insights can nevertheless be derived from the collected literature. In games and beyond, it is clear that neural guidance can help to increase the efficiency of the tree search, but can also incur computational cost without much additional benefit in some situations. When and how to employ neural guidance should hence be carefully weighed. During the evaluation phase, for instance, the right choice of mechanism depends on the depth of the overall tree as well as the depth of the node to be evaluated. If a roll-out is short, it may be preferable to an estimate by a learned value function with its associated inference cost. At what exact depth one is preferable to the other will depend on the size of the employed neural network, which influences the inference cost, as well as the quality of its predictions. Competing with a high-quality estimate of a learned value function may require multiple roll-outs, since individual roll-outs are high-variance estimates of a node's value. A good initial estimate can help focus the search on promising regions of the solution space, while a bad estimate can lead the search astray.

In applications with large tree breadth, neural guidance may be especially helpful in the expansion phase, where it can be used to exclude certain paths in the search tree from consideration altogether. Of course, this comes at the risk of cutting off high-quality solutions. Here as well, the quality of neural network predictions determines whether such an approach is sensible. Especially in applications with very large tree breadth or even continuous domains, however, a search may not be feasible at all without limiting the solution space to some degree.

Clearly, many questions remain unanswered. Additionally, a purely backward-looking review tends to summarize what has been done in the past rather than what should have been done. What is presented in this review is therefore primarily a map of existing approaches and less so a collection of prescriptive knowledge. The results gathered in this review can, however, serve as a foundation for further experimental studies. It is clear that different applications can benefit from different variants of neural MCTS and that no single algorithmic formulation will be the best choice for all possible problems. Practitioners will hence continue to be faced with the task of designing a suitable algorithm for their specific problem setting. Our review can serve as a starting point for this, as it provides a summary of the large set of known possible design choices. Explicitly performing experiments for multiple applications with different properties, in which the factors identified in this review are systematically controlled, can serve to create a more robust understanding of the design of neural MCTS approaches. From such an understanding, prescriptive rules (of thumb) can be derived to aid practitioners in making appropriate design choices for applications with given properties.