BI-DIRECTIONAL MONTE CARLO TREE SEARCH

This paper describes a new algorithm called Bi-Directional Monte Carlo Tree Search (Bi-directional MCTS). The essential idea is to run an MCTS forwards from the start state, simultaneously run an MCTS backwards from the goal state, and stop when the two searches meet. Bi-directional MCTS is tested on the 8-Puzzle and the Pancakes Problem, two single-agent search problems which allow control over the optimal solution length d and the average branching factor b respectively. Preliminary results indicate that enhancing Monte Carlo Tree Search by making it Bi-directional speeds up the search. The speedup of Bi-directional MCTS grows as the problem size increases, in terms of both optimal solution length d and branching factor b. Furthermore, Bi-directional Search has been applied to a Reinforcement Learning algorithm. It is hoped that the speed enhancement of Bi-directional MCTS will also apply to other planning problems.


INTRODUCTION
The shortest path problem consists of finding the path through a graph which minimizes the total edge cost. In 1959, Dijkstra (Dijkstra, 1959) proposed an algorithm for finding the shortest path. The algorithm expands nodes, starting from the start node, in order of the cost from the start to node n, called g(n). At each step the unexpanded node with the minimum g value of all nodes found so far is expanded, until the goal node is reached.
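As an illustration of expanding nodes in order of g(n), the following is a minimal sketch rather than the implementation used in this paper. The adjacency-list format of `graph` and the names `dijkstra` and `example_graph` are assumptions made for the example.

```python
import heapq


def dijkstra(graph, start, goal):
    """Return the cost of the cheapest path from start to goal, or None.

    graph is a hypothetical adjacency list: node -> list of (neighbour, edge_cost).
    """
    best_g = {start: 0}
    frontier = [(0, start)]  # priority queue ordered by g(n)
    while frontier:
        g, node = heapq.heappop(frontier)
        if node == goal:
            return g  # the first time the goal is popped, g is minimal
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                heapq.heappush(frontier, (new_g, neighbour))
    return None


# Illustrative graph: the cheapest route is S -> A -> B -> G.
example_graph = {
    "S": [("A", 1), ("B", 4)],
    "A": [("B", 2), ("G", 6)],
    "B": [("G", 1)],
}
shortest_cost = dijkstra(example_graph, "S", "G")
```

The priority queue holds g values only; adding a heuristic term to the queue key is what turns this scheme into A*.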
In 1966 Nicholson proposed Bi-directional Search (Nicholson, 1966) to enhance shortest path search. If Dijkstra's algorithm performs a unidirectional search to find the shortest path of length d, where the average number of edges from a node is b, then the search will expand $O(b^d)$ nodes. Bi-directional Search runs forwards from the start and backwards from the goal. Each direction in Bi-directional Search ideally expands $O(b^{d/2})$ nodes, so the sum of the two directions is less than the effort required by a full Dijkstra search.
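To make the size of this saving concrete, the sketch below evaluates the idealised node counts $b^d$ and $2 b^{d/2}$. It assumes a uniform branching factor and an even solution length, and the function names are hypothetical.

```python
def unidirectional_nodes(b, d):
    """Idealised node count for a single search of depth d."""
    return b ** d


def bidirectional_nodes(b, d):
    """Idealised node count for two searches of depth d / 2 (d assumed even)."""
    return 2 * b ** (d // 2)


# Illustrative values only: branching factor 3, solution length 20.
uni = unidirectional_nodes(3, 20)  # 3486784401
bi = bidirectional_nodes(3, 20)    # 118098
```

These are order-of-magnitude estimates; real searches revisit states and rarely meet exactly in the middle.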
In 1968 a new algorithm, called A*, extended Dijkstra's algorithm in a different way, by expanding the node with the smallest f value, where f(n) = g(n) + h(n) (Hart, 1968). The new term h(n) is a domain heuristic that underestimates the distance from node n to the goal node. The heuristic directs the search towards the goal node, and therefore reduces the number of expanded nodes compared to Dijkstra's algorithm. Let $C^*$ be the value of the goal node at the end of the shortest path, i.e. the cost of the shortest path. Since A* expands the node with the smallest f value, and because $h$ is an underestimate of the actual cost, the first node n which has $f(n) = g(n) = C^*$ will be the first goal node encountered which is also on the shortest path.
(Pohl, 1969) proposed a combination, Bi-directional A*, which led to investigations as to whether the forwards and backwards searches would actually meet in the middle. Eventually, (Holte et al, 2017) proposed a new Bi-directional A* Search algorithm called MM that is guaranteed to meet in the middle. MM expands the next node with the lowest priority according to two f values, $f_F(n) = g_F(n) + h_F(n)$ in the forwards direction and $f_B(n) = g_B(n) + h_B(n)$ in the backwards direction. Here $g_F(n)$ is the forwards cost from start to node n, $h_F(n)$ is an underestimate of the cost from node n to the goal in the forwards direction, $g_B(n)$ is the backwards cost from goal to node n, and $h_B(n)$ is an underestimate of the cost from node n to the start in the backwards direction. Any node n which has $g_F(n) > C^*/2$ will have a priority greater than $C^*$ because $h_F$ is an underestimate, which means the MM algorithm will encounter the backwards frontier before expanding any node n which has $g_F(n) > C^*/2$. The logic works analogously in the backwards direction, and therefore MM will meet in the middle.
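The meet-in-the-middle argument can be checked numerically with a small sketch. It assumes the node priority is the one defined for MM by Holte et al., max(g + h, 2g) in each direction, which the text above does not spell out. The function name and the numbers are illustrative.

```python
def mm_priority(g, h):
    """Priority of a node in one search direction, assumed to be max(f, 2g)."""
    return max(g + h, 2 * g)


C_STAR = 10  # hypothetical optimal solution cost

# A node more than halfway along (g > C*/2) has priority above C*,
# even with the most optimistic heuristic value h = 0.
past_midpoint = mm_priority(g=6, h=0)  # 12

# A node exactly at the midpoint with a perfect heuristic has priority C*.
at_midpoint = mm_priority(g=5, h=5)    # 10
```

Under this assumption the 2g term alone pushes any node beyond the midpoint above $C^*$, whatever the heuristic returns.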
The current paper describes a new extension of Bi-directional heuristic search which is based on Monte Carlo Tree Search (Coulom, 2006) (Kocsis, 2006). MCTS was highly influential in the development of strong Go playing programs. For Go the main challenge was extending the αβ algorithm. There was no known way of formalizing heuristic information about the value of a given Go position. Monte Carlo Tree Search uses statistical information from many simulated games to evaluate a Go position instead. (Gelly, 2008) reports "the first program to achieve human master level" in 9x9 Computer Go. Eventually, a combination of Monte Carlo Tree Search with deep neural networks trained by supervised learning and Reinforcement Learning defeated the human grand-master of Go (Silver et al, 2016). Afterwards Monte Carlo Tree Search showed promising results in non-game applications (see (Browne et al, 2012), (Goh et al, 2019), (Matsumoto et al, 2010)).
The new algorithm proposed in this paper is called Bi-directional Monte Carlo Tree Search (Bi-directional MCTS). It can be viewed as a hybrid combining Bi-directional search with Monte Carlo Tree Search (MCTS), and can also be viewed as an enhancement of Monte Carlo Tree Search. The essential idea of Bi-directional MCTS is to run MCTS forwards from the start state, and simultaneously run MCTS backwards from the goal state, and stop when the two searches meet. Bi-directional MCTS is compared with MCTS for optimally solving 8-Puzzle and Pancakes Problem, two single-agent search problems, and the speedup of the Bi-directional enhancement is analyzed. The motivation for the present analysis is to learn how the Bi-directional MCTS scales to larger problems.

BI-DIRECTIONAL MONTE CARLO TREE SEARCH
This section of the paper describes the original MCTS algorithm, how it can be modified to solve single-agent search problems, and also how MCTS has been extended to become Bi-directional MCTS. Monte Carlo Tree Search (see Figure 1) constructs a tree, starting from a single root node, which grows as the search iterations proceed. One iteration of the Monte Carlo Tree Search algorithm applies four operators:
1. Selection - repeatedly select the best of the children nodes, typically using UCT (Kocsis, 2006), until a leaf node in the tree is reached.
2. Expansion - optionally add the children of the leaf node, using one of several heuristics (Yajima et al, 2010), into the tree.
3. Playout - play from the leaf node of the tree, typically using pure random moves or guided moves, until the end of the game.
4. Backpropagation - update the statistics for each visited node in the tree.
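The selection operator can be sketched as follows. This is a minimal illustration using the standard UCB1 form of UCT. The exploration constant `c`, the (wins, visits) tuple layout, and the function names are assumptions, not values taken from this paper.

```python
import math


def uct_value(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    exploitation = child_wins / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT value.

    children is a list of (wins, visits) tuples, a hypothetical layout.
    """
    return max(
        range(len(children)),
        key=lambda i: uct_value(children[i][0], children[i][1], parent_visits),
    )


# The unvisited second child is selected ahead of a well-performing sibling.
chosen = select_child([(5, 10), (0, 0)], parent_visits=10)
```

Selection repeats this choice from the root downwards until it reaches a leaf, after which expansion, playout and backpropagation are applied.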
The work in (Schadd et al, 2008) details an application of MCTS to a single-player game called SameGame. The present work proposes an alternative design of MCTS for single-agent search. The purpose of the present study is not to compare with the Single Player MCTS in (Schadd et al, 2008), but rather to provide a way to make MCTS work as a Bi-directional algorithm.
Standard Monte Carlo Tree Search can be adapted to single-agent search problems like 8-Puzzle and Pancakes Problem by modifying two of the MCTS operators as follows. The selection operator stops when the tree depth reaches solution_length_limit. The solution_length_limit parameter effectively limits the search, which is essential because solutions to 8-Puzzle can form loops and might never reach a terminal state. The playout operator first checks whether the leaf in the tree is a solved game state, and if so returns a special SOLVED_IN_TREE flag. Otherwise the playout plays pure random moves: it returns 1 (success) if the resulting game state is solved, and returns 0 if the total playout length reaches solution_length_limit.
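The modified playout operator might look like the sketch below. The callback parameters (`is_solved`, `legal_moves`, `apply_move`) are hypothetical, and the sketch assumes that the total playout length counts the tree depth plus the random moves, which the description above leaves open.

```python
import random

SOLVED_IN_TREE = "SOLVED_IN_TREE"


def playout(state, depth, solution_length_limit,
            is_solved, legal_moves, apply_move, rng=random):
    """Random playout from a tree leaf for a single-agent problem."""
    if is_solved(state):
        return SOLVED_IN_TREE  # the leaf itself is a solution
    length = depth  # assumption: the limit covers tree depth plus playout moves
    while length < solution_length_limit:
        state = apply_move(state, rng.choice(legal_moves(state)))
        length += 1
        if is_solved(state):
            return 1  # success
    return 0  # limit reached without solving


# Toy problem for illustration: count up from 0, solved when the state is 3.
def toy_solved(s):
    return s == 3


def toy_moves(s):
    return [1]


def toy_apply(s, m):
    return s + m


reward = playout(0, 0, 5, toy_solved, toy_moves, toy_apply)
```

With a limit of 2 the same toy playout returns 0, because the state cannot reach 3 within two moves.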
In the newly proposed algorithm, Bi-directional MCTS, two trees are kept in memory: an up-tree and a down-tree. The four MCTS operators are applied in sequence to the down-tree first and then applied in sequence to the up-tree. See Figure 2 for the pseudo-code of the proposed algorithm. Simple logic is added to the playout operator: if the leaf of the current tree is found in the opposing tree then a special MET_IN_THE_MIDDLE flag is returned, otherwise a playout is performed as described above.
This section of the paper describes 8-Puzzle and Pancakes Problem, the two single-agent search problems which are used to test the proposed Bi-directional MCTS algorithm. The problems were chosen for testing because they are simple to implement, both have a well-known algorithm (A*) and admissible heuristics for producing optimal solutions, and MCTS and Bi-directional MCTS can easily be modified to solve both problems. Additionally, the Pancakes Problem makes it easy to modify the branching factor b by increasing the number of pancakes N on the stack.
The Pancakes Problem is a famous search problem consisting of a stack of N pancakes of unique sizes. The starting state can be any non-solved state like that shown on the right hand side of Figure 4. A flip at a given position reverses the order of all pancakes on the stack from that position upwards. The state shown on the right hand side of Figure 4 can be optimally solved using the flip sequence (5, 6, 3, 6, 7, 6). The problem is solved when the pancakes are stacked in order like that shown on the left hand side of Figure 4.

EXPERIMENTAL SETUP
A total of 5,000 random 8-Puzzle problems were generated by applying 25 random moves backwards from the solved 8-Puzzle state, and 1,000 random Pancakes Problem states were generated by applying N random moves (N being the number of pancakes on the stack) backwards from the solved Pancakes Problem state.
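Generating a random pancake instance by walking backwards from the solved state can be sketched as below. The tuple representation (bottom of the stack first), the choice of flip sizes from 2 to N, and the function names are assumptions for illustration. Because a flip is its own inverse, applying random flips to the solved state is the same as moving backwards from it.

```python
import random


def flip(stack, k):
    """Reverse the top k pancakes; the top of the stack is the end of the tuple."""
    cut = len(stack) - k
    return stack[:cut] + stack[cut:][::-1]


def solved_stack(n):
    """Solved state: largest pancake (n) at the bottom, smallest (1) on top."""
    return tuple(range(n, 0, -1))


def random_problem(n, num_moves, rng):
    """Apply num_moves random flips to the solved stack of n pancakes."""
    stack = solved_stack(n)
    for _ in range(num_moves):
        stack = flip(stack, rng.randint(2, n))  # flipping 1 pancake does nothing
    return stack


# One instance with N = 6 pancakes and N random flips, seeded for repeatability.
problem = random_problem(6, 6, random.Random(0))
```

An instance produced this way can have an optimal solution shorter than the number of flips applied, which is why the optimal length is established separately with A*.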
For each random 8-Puzzle and Pancakes Problem starting state, A* was run to solve the problem optimally, and then MCTS and Bi-directional MCTS were each run to find a solution. The solutions produced by MCTS and Bi-directional MCTS were compared with the optimal A* solution. Puzzles with an optimal solution length of 1 were discarded from the analysis, as were puzzles which were not optimally solved by both the MCTS and the Bi-directional MCTS algorithms. Essentially, the current analysis only considers the number of iterations required to produce the optimal 8-Puzzle and Pancakes Problem solutions.
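The analysis concerns the speed gain from the Bi-directional enhancement of MCTS, so a measurement of speedup was calculated using the formula

$$\text{speedup} = \frac{\text{playouts}_{MCTS}}{\text{playouts}_{BDMCTS}} \qquad (1)$$

where playoutsMCTS and playoutsBDMCTS are the numbers of playouts required by MCTS and Bi-directional MCTS respectively to optimally solve the 8-Puzzle and Pancakes Problem instances.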
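A minimal sketch of this measurement follows, assuming that speedup is the ratio of the playouts needed by MCTS to the playouts needed by Bi-directional MCTS on the same instance. The record layout and the playout counts are made up for illustration and are not experimental data.

```python
from collections import defaultdict


def speedup(playouts_mcts, playouts_bdmcts):
    """Ratio of playouts needed by MCTS to those needed by Bi-directional MCTS."""
    return playouts_mcts / playouts_bdmcts


def average_speedup_by_length(records):
    """Group (solution_length, playouts_mcts, playouts_bdmcts) records by length."""
    grouped = defaultdict(list)
    for length, mcts, bdmcts in records:
        grouped[length].append(speedup(mcts, bdmcts))
    return {length: sum(v) / len(v) for length, v in grouped.items()}


# Hypothetical playout counts, not measurements from the experiments.
illustrative_records = [(4, 1200, 300), (4, 800, 400), (6, 9000, 1000)]
averages = average_speedup_by_length(illustrative_records)
```

A value above 1 means that Bi-directional MCTS needed fewer playouts than MCTS on that instance.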
The parameters used for the experiments are shown in Figure 5. Solution_length_limit stops selection and playout during MCTS and Bi-directional MCTS, and in theory ensures that the tree and playout are deep enough to solve the puzzle. In this analysis the Bi-directional MCTS algorithm was implemented sequentially, not in parallel. All puzzles and algorithms were implemented in C++. Experiments were run on a 64-bit Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 3.6 GB RAM.
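The sequential alternation between the two trees, together with the MET_IN_THE_MIDDLE check, can be sketched in heavily simplified form. Here each tree is reduced to the set of states it contains, and `down_step` and `up_step` are hypothetical callables standing in for one selection and expansion pass that returns the leaf state reached. The experiments themselves were implemented in C++, and this sketch is not that code.

```python
MET_IN_THE_MIDDLE = "MET_IN_THE_MIDDLE"


def leaf_flag(leaf_state, opposing_tree_states):
    """Return the flag when the leaf already exists in the opposing tree."""
    if leaf_state in opposing_tree_states:
        return MET_IN_THE_MIDDLE
    return None


def run_sequentially(start, goal, down_step, up_step, max_iterations):
    """Alternate down-tree then up-tree; return (iteration, meeting state) or None."""
    down_states = {start}  # states held in the tree rooted at the start
    up_states = {goal}     # states held in the tree rooted at the goal
    for iteration in range(1, max_iterations + 1):
        for step, own, other in ((down_step, down_states, up_states),
                                 (up_step, up_states, down_states)):
            leaf = step()
            if leaf_flag(leaf, other) == MET_IN_THE_MIDDLE:
                return iteration, leaf
            own.add(leaf)
    return None


# Toy number line: the down search visits 1, 2, 3, ... and the up search 9, 8, 7, ...
down_leaves = iter(range(1, 11))
up_leaves = iter(range(9, -1, -1))
meeting = run_sequentially(0, 10, lambda: next(down_leaves),
                           lambda: next(up_leaves), 10)
```

On this toy line the two searches meet at state 5 in the fifth iteration, roughly half the number of steps either search would need alone.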

RESULTS
This section presents the experimental results, first for 8-Puzzle and then for the Pancakes Problem. Table 2 shows the relevant statistics for the 8-Puzzle analysis, including the average speedup, the number of samples included in the analysis, the standard deviation, and the 95% confidence interval. Figure 7 shows optimal solution length l plotted against the average speedup of Bi-directional MCTS vs MCTS over the runs that are included in the analysis. The error bars represent the 95% confidence interval. Also plotted is the curve

$$\text{speedup}(l) = 0.02644031 \cdot e^{0.52552977\, l} + 1.1205256 \qquad (3)$$

which was fitted to the experimental data using scipy curve_fit, which performs a non-linear least squares fit to the data.
In the case of the Pancakes Problem the analysis was performed on a varying number of pancakes N where 6 ≤ N ≤ 9, and for solutions of length 6 only. Table 3 shows the relevant statistics for the Pancakes Problem analysis. The speedup results are shown in Figure 9, which shows branching factor b plotted against the average speedup of Bi-directional MCTS vs MCTS for the Pancakes Problem. The error bars represent the 95% confidence interval. Also plotted is the curve

$$\text{speedup}(b) = 2.20277846 \cdot e^{0.44500042\, b} - 9.19309151 \qquad (4)$$

which was fitted to the experimental data using scipy curve_fit, which performs a non-linear least squares fit to the data.
Figure 5 shows that increasing the length of the optimal solution for 8-Puzzle is marked by an increase in the speedup of Bi-directional MCTS compared to MCTS. The speedup of Bi-directional MCTS grows exponentially with the optimal solution length. Figure 6 shows that increasing the number of pancakes (the branching factor b) for the Pancakes Problem is marked by an increase in the speedup of the Bi-directional enhancement of MCTS.
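The statistics reported here can be reproduced in outline with the sketch below. It assumes the conventional normal-approximation interval (mean ± 1.96 · sd / √n) with the sample standard deviation, and an exponential form for the fitted 8-Puzzle curve using the coefficients quoted above. Neither assumption is confirmed by the text, and the sample values are invented.

```python
import math
import statistics


def confidence_interval_95(samples):
    """Normal-approximation 95% interval for the mean (assumed form)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation
    half_width = 1.96 * sd / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def fitted_speedup_8puzzle(solution_length):
    """Fitted 8-Puzzle curve, assuming the form a * exp(k * l) + c."""
    return 0.02644031 * math.exp(0.52552977 * solution_length) + 1.1205256


# Invented speedup samples, for illustration only.
low, high = confidence_interval_95([2.0, 4.0, 6.0])
predicted = [fitted_speedup_8puzzle(l) for l in (2, 6, 10)]
```

Under the exponential assumption the predicted speedup rises with every additional move in the optimal solution, consistent with the trend described for the 8-Puzzle.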

DISCUSSIONS
The present work is a comparison between Monte Carlo Tree Search and a Bi-directional enhancement of Monte Carlo Tree Search. A* was used as a tool to construct optimal solutions to the 8-Puzzle and Pancakes Problem, and has not been compared with either of the MCTS algorithms in terms of time complexity. This is because A* is guaranteed to find an optimal solution to the puzzles when there is an admissible heuristic, whereas MCTS and Bi-directional MCTS do not have the same guarantee. Future research will analyse optimality guarantees of Bi-directional MCTS.
Sturtevant and Felner (Sturtevant, 2018) compared four algorithms (A*, backwards A*, Bi-directional Brute Force Search, and Near-optimal Bi-directional Search) for solving four problems (including Pancakes Problem, Roads of Colorado, and Grid Mazes). They showed that A* expands fewer nodes than Bi-directional Search if A* has a "strong" heuristic. Otherwise Bi-directional search expands fewer nodes. This mirrors the results shown in Figures 5 and 6 of the present work, which suggests that the version of uni-directional MCTS in the present work does not act as a "strong" heuristic for 8-Puzzle and Pancakes Problem. This would be because the majority of the search effort in uni-directional MCTS (the majority of playouts) would be done in the second half of the search. Running a Bi-directional MCTS forwards and backwards will therefore probably remove the majority of the search effort, since the second half of the search is run in neither the forward nor the backward direction. The version of uni-directional MCTS in the present work not being a "strong" heuristic for 8-Puzzle and Pancakes Problem is one explanation for the good results of Bi-directional MCTS.
Since the newly proposed algorithm, Bi-directional MCTS, is a form of Bi-directional search, it can only be applied to problems where the goal state is known, e.g. roads or computer networks. The results described in the present article suggest that Bi-directional MCTS scales well as problems grow larger, since the speedup of Bi-directional MCTS compared to MCTS increases with problem size. This is encouraging for the use of Bi-directional MCTS on larger problems because it will likely be faster than MCTS.
This paper proposes a new algorithm called Bi-Directional Monte Carlo Tree Search, and presents preliminary results indicating that enhancing Monte Carlo Tree Search by making it Bi-directional speeds up the search. The speedup of Bi-directional MCTS over MCTS grows exponentially with solution length d, as shown in the 8-Puzzle results, and also grows with average branching factor b, as shown in the Pancakes Problem results. This makes Bi-Directional Monte Carlo Tree Search a potentially effective search enhancement that can be applied to other planning problems where there is no known heuristic. Additionally, Bi-directional Search has been applied to a Reinforcement Learning algorithm, as previously reported in (Baldassarre, 2003).