Efficient Searching With MCTS and Imitation Learning: A Case Study in Pommerman

Pommerman is a popular reinforcement learning environment because it poses several challenges, such as sparse and deceptive rewards and delayed action effects. In this paper, we propose an efficient reinforcement learning approach that combines Monte Carlo tree search with action pruning and flexible imitation learning to accelerate the search, allowing the agent to avoid meaningless exploration and discover high-level strategies. On the Pommerman benchmark, we evaluate the agent driven by the proposed approach against heuristic and pure reinforcement learning baselines, and the results show that our method yields a relatively high level of agent performance in combat, demonstrating its efficiency in this domain and its potential for broader application.


I. INTRODUCTION
Reinforcement Learning (RL) is a branch of machine learning (ML) that has achieved great success in recent years. Agents developed with RL algorithms have shown the potential to reach human-level intelligence in several classic computer games, such as Atari [1], Go [2], and StarCraft [3]. The methodology behind RL is to optimally represent the state-transition process of a complex interactive system. In this system, the agent controlled by the RL algorithm improves its strategy over many rounds of training in order to reach acceptable future rewards. However, training an RL agent to learn good policies takes a long time and a large number of trials, especially in domains where rewards are delayed, sparse, or deceptive. During this interactive process, the agent encounters difficulties in exploring and exploiting such an environment.
In the RL setting, temporal difference (TD) approaches are commonly used, which rely on the Markov assumption. Although this assumption makes the problem easier to solve, it fails in situations where future states are closely related to the current state. The main problem of TD learning is that the step updates are based on the initial conditions of the learning parameters; this bias can lead to poor convergence, a failure mode known as the deadly triad [4]. An alternative approach, Monte Carlo tree search (MCTS) [5], does not suffer from this bias, since each update is made using a true sample of the decision trace. However, the main drawback of MCTS is the high variance arising from the random rollout process, which leads to a considerable computational cost.

(The associate editor coordinating the review of this manuscript and approving it for publication was Alicia Fornés.)
In this paper, we focus on improving learning performance when using the MCTS approach. The idea comes from the observation that the conventional learning process always trains an agent from a low level, while some high-level strategies are thrown away or diluted because of the increased time cost they incur. In a tree search or decision process, time is wasted on many unnecessary explorations that make the learning process a heavy burden. We therefore propose an efficient approach to accelerate the learning phase that contains three key elements of novelty: 1) We construct a hierarchical searching framework in which the top level is adapted to the final result, which only indicates a win, tie, or loss, and the lower level is adapted to sub-tasks that determine whether the agent has captured a power-up item, evaded a bomb, or beaten an enemy. 2) To find high-level policies, the action space is pruned to avoid meaningless explorations, and the action filter is redesigned for different search levels and tasks. 3) We use a convolutional neural network, trained by imitation learning, to distinguish the filtered actions and bring prior knowledge into the tree policy. This idea is evaluated on a popular RL benchmark named Pommerman [6], a simple but challenging environment, to show its potential. Our approach achieves significantly better performance than a baseline heuristic method and state-of-the-art RL techniques, including vanilla MCTS, DQN (Deep Q-Network) [7], A2C (Advantage Actor-Critic), and A3C (Asynchronous Advantage Actor-Critic) [8].
The remainder of this paper is organized as follows: Section 2 discusses the background of Pommerman and related methods. Section 3 describes the presented learning methods and strategies. Section 4 presents our experimental details and results. Finally, we conclude with a discussion of the present work and future directions in Section 5.

II. BACKGROUND

A. POMMERMAN FRAMEWORK
Pommerman [6] is derived from the classic console game Bomberman (Hudson Soft, 1983). The game is played on a randomly generated 11 × 11 grid where four agents try to eliminate each other. In each round, the agents start in separate corners, each equipped with a single bomb, as shown in Fig. 1. They step simultaneously, and six actions are available: STOP, UP, LEFT, DOWN, RIGHT, and BOMB. In addition to the agents, the board consists of passages, wooden walls, and rigid walls, and always preserves an accessible path between the agents. Rigid walls are indestructible. Wooden walls can be destroyed by bombs and then become either a passage or a power-up item. Power-ups can increase the agent's ammo or blast strength, or make its bombs kickable.
There are three game settings in Pommerman: FFA (free-for-all, a single agent against 3 opponents), Team (2v2), and TeamRadio (a team variant with communication). Here, we only consider the FFA mode with no communication. An agent who survives to the end receives a reward of 1, and the others receive a reward of −1.

B. RELATED WORK IN POMMERMAN
The first Pommerman competition was organized in 2018 and focused on the FFA game mode. First place was taken by Gong's agent [9], which depends on a finite-state-machine tree search approach. The second competition, part of NeurIPS 2018, received more attention and mainly focused on the TEAM mode. The agents ranked first and third were both MCTS-based approaches [10], and the second was another MCTS implementation with a search depth of 2. During NeurIPS 2019, the third competition was held, and first place was taken by Gorog M., who further enhanced his previous MCTS agent.
The Pommerman environment involves several unique challenges: 1) Sparse and deceptive rewards. As mentioned before, a non-zero reward is only obtained at the end of an episode; moreover, the final reward cannot indicate whether the agent was eliminated by an opponent or by itself, so the utility of a BOMB action is hard to evaluate. 2) Delayed action effects. Killing an enemy, removing wood, and collecting a power-up all depend on executing a BOMB action, but 10 time-steps elapse between the step the bomb is placed and the step it explodes. Meanwhile, the kickable property further complicates the decision process. 3) Non-zero-sum game. In the FFA mode, each independent agent interacts with the others while competitive and non-competitive behaviour coexist. The existence of an optimal solution still lacks certification at a mathematical level. 4) Partial observation. In some cases, the agent can only see its nearby area, which makes the enemies' actions unpredictable and intractable.
These unique challenges attract many researchers, and some representative studies are summarized as follows. Multi-agent strategies [11] is a review that presents and evaluates various strategies for solving reinforcement learning problems in Pommerman. Backplay [12] is an approach that improves the sample efficiency of model-free RL by constructing a curriculum learning framework from demonstrations: the agent starts from a state near the end of the game, and the starting point moves backward toward the initial state of an episode. MAGNet Agent [13] utilized a relevance-graph representation of the state, accompanied by self-attention and message generation, which brought the relational properties of each element into learning. Continual learning [14] proposed a Continual Match Based Training (COMBAT) framework for training a group of Advantage Actor-Critic (A2C) agents in Pommerman, which won first place among the learning agents in the 2nd Pommerman competition. Skynet Agent [15] used proximal policy optimization (PPO) with reward shaping, action pruning, and opponent curriculum learning. A3C-TP [16] extended the Asynchronous Advantage Actor-Critic (A3C) method with a novel auxiliary task of terminal prediction that predicts temporal closeness to terminal states. Later, the same authors proposed PI-A3C [17], which integrates MCTS as a demonstrator for A3C to avoid meaningless suicides. In the hybrid search agent [9], the author used heuristics and tree search algorithms such as breadth-first search (BFS), MCTS, and flat Monte Carlo search (FMCS) in Pommerman; the results show that heuristic agents using depth-limited tree search can slightly outperform hand-made heuristics. The study in [18] proposed the AMS-A3C and AMF-A3C architectures to perform agent modelling, meaning that the network performs policy improvement and opponent policy recognition simultaneously. The Pommerman agent [19] compared different methods such as Q-learning, Snorkel, random forest, and MCTS.
When these agents were tested against the baseline agent called SimpleAgent, MCTS won most of the time. The statistical agent study [20] analysed several agents that depend on statistical planning methods and indicated that MCTS outperforms the others.

C. MONTE CARLO TREE SEARCH
MCTS is a highly selective best-first tree search method. The first practical success was seen in the computer game Go [2]. MCTS has since been applied successfully to general game playing, real-time and continuous domains, multi-player games, single-player games, imperfect information games, computer games, and more [5].
In MCTS, a search tree is generated in which each node represents a state of the domain and each edge corresponds to one possible action. MCTS proceeds in four phases (Fig. 2): selection, expansion, simulation, and backpropagation. The algorithm proceeds by repeatedly adding one node at a time to the current tree. MCTS uses random actions, a.k.a. rollouts, to estimate state-action values: each rollout is played out until a terminal state is reached. After each simulation, the collected reward is back-propagated through all visited nodes.
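As a concrete illustration of these four phases, the following is a minimal, generic UCT-style sketch in Python. This is not the paper's implementation; the environment-model callbacks `actions`, `step`, `is_terminal`, and `reward` are assumed to be supplied by the caller.

```python
import math, random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0
        self.value = 0.0       # running mean of rollout returns

def uct_select(node, c=1.4):
    # Selection: pick the child maximizing the UCB1 score.
    return max(node.children, key=lambda ch: ch.value +
               c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, actions, step, is_terminal, reward, n_iter=100):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1) Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            node = uct_select(node)
        # 2) Expansion: add one untried child node.
        if not is_terminal(node.state):
            tried = {ch.action for ch in node.children}
            a = random.choice([a for a in actions(node.state) if a not in tried])
            node = Node(step(node.state, a), parent=node, action=a)
            node.parent.children.append(node)
        # 3) Simulation: random rollout until a terminal state.
        s = node.state
        while not is_terminal(s):
            s = step(s, random.choice(actions(s)))
        r = reward(s)
        # 4) Backpropagation: update running means along the visited path.
        while node is not None:
            node.visits += 1
            node.value += (r - node.value) / node.visits
            node = node.parent
    # Act greedily with respect to visit counts at the root.
    return max(root.children, key=lambda ch: ch.visits).action
```

For example, in a toy one-step game where moving right wins and moving left loses, the search concentrates visits on the winning child and returns that action.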

D. BACKPLAY
Model-free reinforcement learning usually requires a large number of trials to learn a good policy, especially in environments with sparse rewards, such as Pommerman. Backplay [12] was proposed as a method for reducing training time on such problems. The idea is to create a curriculum for the agent by reversing a single trajectory (i.e., state sequence) of reasonably good, but not necessarily optimal, behaviour. The agent starts at the end of a demonstration and learns a policy in this easier setup. Then, the starting point is moved backward until the agent is training only on the initial state of the task.

III. PROPOSED APPROACH
Since we assume the standard reinforcement-learning setting of an agent interacting with an environment over a discrete number of steps, for the sake of brevity we first define the symbols used in the remainder of the paper. At time step t, the agent in state s_t takes an action a_t and receives a reward r_t. The discounted return is defined as R_t = Σ_{k=0}^{∞} γ^k r_{t+k}, where γ ∈ (0, 1] is the discount factor. The state-value function, V^π(s) = E[R_t | s_t = s], is the expected return from state s following a policy π(a|s), and the action-value function, Q^π(s, a) = E[R_t | s_t = s, a_t = a], is the expected return following policy π after taking action a from state s.

A. STATE SPACE AND NETWORK ARCHITECTURE
At each Pommerman environment time step, the agent receives an observation as a dictionary that includes an 11 × 11 board matrix, the agent's ammo, position, blast strength, bomb-kick ability, etc. Continual learning and terminal prediction approaches have mostly used CNN architectures for this input [14], [21]. In our approach, to fit such information to a CNN model, we maintain a feature map of 11 × 11 × 20. The first two dimensions match the board size and the last is the number of channels. These channels include the item positions on the board (passages, walls, teammate, enemies, etc.), the agent's abilities (ammo, blast strength, and bomb kick), and the bomb information (bomb life, bomb strength, bomb movement direction, and flame life), all encoded with a one-hot scheme. As shown in Fig. 3, the neural network we use contains 2 convolutional layers followed by policy and value heads. Each convolutional layer has 64 3 × 3 filters with a stride of 1, and ReLU is used as the activation function, which reduces the interdependence of parameters, alleviates overfitting, and avoids the vanishing-gradient problem [22]. The input consists of 20 feature planes, each of shape 11 × 11. After the 2 convolutional layers, the result has a shape of 11 × 11 × 64. Each head then convolves this with 64 3 × 3 kernels, and the outputs are finally squashed into action probabilities and a value estimate.
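To make the encoding concrete, here is a minimal sketch of building such feature planes in plain Python. The channel order, the helper signature, and the scalar normalization of ammo and blast strength (the paper one-hot encodes everything) are all illustrative assumptions, not the paper's exact layout.

```python
# Sketch of feature-plane encoding for a Pommerman-style observation.
BOARD = 11
N_PLANES = 20

def encode_observation(board, ammo, blast_strength, can_kick):
    """Return an 11 x 11 x 20 nested-list tensor of feature planes."""
    planes = [[[0.0] * N_PLANES for _ in range(BOARD)] for _ in range(BOARD)]
    for r in range(BOARD):
        for c in range(BOARD):
            item = board[r][c]            # integer item id at this cell
            if 0 <= item < 11:            # planes 0-10: one-hot board items
                planes[r][c][item] = 1.0
            # planes 11-13: agent abilities broadcast over the whole board
            planes[r][c][11] = min(ammo, 10) / 10.0
            planes[r][c][12] = min(blast_strength, 10) / 10.0
            planes[r][c][13] = 1.0 if can_kick else 0.0
            # planes 14-19 would hold bomb life, bomb strength, bomb
            # movement direction, and flame life in the same fashion.
    return planes
```

Broadcasting the scalar abilities over every cell keeps all inputs in a uniform 11 × 11 × 20 shape, which is what lets a purely convolutional network consume them.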
We now propose a general hierarchical searching framework, shown in Algorithm 1. The framework starts at any state s at which the agent must make a decision. This state is treated as a root state s_root from which the search is carried out, and all available actions compose the action space A. To decide which action is best in state s, a tree is constructed with a specified action filter f_A_main. This filter relates to the final game result of winning or losing. The top-level searching process terminates when the root node reaches the maximum visit count. Following the standard MCTS fashion, it yields a utility V(s_root) for each action, which is further normalized by the softmax function.

The sub-level searching process can be treated as a supplementary search that satisfies auxiliary sub-tasks. First, the state is classified into one of three sub-tasks: evade, attack, and balance. These sub-tasks are mainly used to avoid negative or lazy behaviours. Once the state has been mapped to a specific sub-task, another search is carried out to find an optimized solution, and the action utilities are evaluated in the same way.

B. HIERARCHICAL SEARCHING FRAMEWORK
The top searching process, which uses the MCTS estimate, is unreliable at the beginning of the search but converges to a more reliable estimate given sufficient time, and the optimal estimate can be reached in the limit of infinite time. The sub searching process determines a probability distribution from a partial perspective. Therefore, these two search results need to be balanced. Here, we adopt the Kullback-Leibler (KL) divergence to compare the two action-utility distributions:

D_KL(P_main || P_sub) = Σ_a P_main(a) log(P_main(a) / P_sub(a)),

where P_main is the distribution related to the top-level search and P_sub is related to a specific sub-task. The action is then selected by the following rule:

a = argmax_a P_main(a), if D_KL(P_main || P_sub) < ε,
a = argmax_a (P_main(a) + w · P_sub(a)), otherwise,

where w is a weight that determines the influence of the sub-task and ε is a threshold value. It is worth mentioning that, although the two searching processes have a subordinate relationship from a design perspective, from an implementation perspective they can be fully parallelized, and each individual tree search can also be parallelized.
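The selection rule above can be sketched in a few lines of Python. The softmax and KL computations follow the text; the default values of `w` and `eps` are assumptions, since the paper does not report them.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of utilities.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, tiny=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); `tiny` guards log(0).
    return sum(pi * math.log((pi + tiny) / (qi + tiny)) for pi, qi in zip(p, q))

def select_action(v_main, v_sub, w=0.5, eps=0.1):
    p_main = softmax(v_main)
    p_sub = softmax(v_sub)
    if kl_divergence(p_main, p_sub) < eps:
        # Searches agree: follow the top-level distribution alone.
        return max(range(len(p_main)), key=lambda a: p_main[a])
    # Searches disagree: blend in the sub-task distribution with weight w.
    return max(range(len(p_main)), key=lambda a: p_main[a] + w * p_sub[a])
```

When the two searches nearly agree, the sub-task result adds no information and is ignored; when they diverge, the sub-task can override a marginal top-level preference.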

Algorithm 1 Hierarchical Searching Procedure
Require: root state s_root, action space A
Ensure: action a
  Initialize search tree T_main with global filter f_A_main
  node_root = T_main(s_root, f_A_main)
  while N_root < N do
    node_expand = tree_policy(node_root)
    reward = default_policy(node_expand, f_A_main)
    backup(node_expand, reward)
  end while
  Q_main(s_root, a) = (1 / N(s_root, a)) · Σ_{s_child | s_root, a → s_child} V(s_child)
  P_main = softmax(V(s_root))
  if evade_condition(s_root) then
    task_sub = ''evade''
  else if attack_condition(s_root) then
    task_sub = ''attack''
  else
    task_sub = ''balance''
  end if
  Initialize search tree T_sub with global filter f_A_sub
  node_root = T_sub(s_root, f_A_sub)
  while N_root < M do
    node_expand = tree_policy(node_root)
    sub_reward = default_policy(node_expand, f_A_sub)
    backup(node_expand, sub_reward)
  end while
  Q_sub(s_root, a) = (1 / N(s_root, a)) · Σ_{s_child | s_root, a → s_child} V(s_child)
  P_sub = softmax(V(s_root))
  k = D_KL(P_main || P_sub)
  if k < ε then
    a = argmax(P_main)
  else
    a = argmax(P_main + w · P_sub)
  end if

C. ACTION FILTERS
The problem with the MCTS game tree with more than one enemy is that the branching factor is vast: it is 6 for our agent and 6³ for the three opponents combined. This made our agent's runtime very slow. Therefore, we use action filters to reduce the runtime and accelerate training.
We needed to create an action filter that allows the agent to ''predict'' the future value of the current state. We began to notice different results and gameplay when choosing different action filters and weights. We started with some basic weighting vectors and features, such as how many bombs are currently on the board, how many agents are left, how many breakable walls our agent has destroyed, the distance to our enemies, and similar conjectures. Then, we added features that allow the agent to avoid committing suicide, such as checking whether the agent is on or near flames (which mark the explosion area), with the same rationale applied to bombs and enemies. Finally, we became offensive: we placed bombs near enemies to cause their death down that branch, cleared our way toward other agents, and so on.
Three types of action filters using these features are applied in our method, as shown in Algorithm 2: evade, attack, and balance, accompanied by three individual score functions (different weights are assigned to the features). All the filters are designed with different rules and unique entry conditions.
Evade: This filter tries to escape the bombs surrounding the agent, keeps all passable positions, and avoids fights. We assign a punishment for being near enemies or bombs/flames.

Attack: This filter is used when an enemy approaches within a Manhattan distance of 6, and all available directions toward the enemy are listed. The agent prefers moving closer to the enemies and placing bombs near them or near breakable walls, although in this situation the agent risks dying in the process. We assign high weights to placing bombs near enemies/walls.

Balance: This filter takes a reasonable action in the current state. The agent knows how to be both offensive and defensive, balancing between the two. This proved to be the best action filter.
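As an illustration of how such rule-based filters prune the 6-action space, here is a sketch of simplified evade and attack filters. The concrete rules, distance thresholds, and helper arguments (`danger_cells`, `passable`) are assumptions; the paper describes the filters only qualitatively.

```python
# Illustrative rule-based action filters for a grid agent.
ACTIONS = ["STOP", "UP", "LEFT", "DOWN", "RIGHT", "BOMB"]
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1),
         "STOP": (0, 0)}

def evade_filter(pos, danger_cells, passable):
    """Keep only moves that end on a safe, passable cell; never bomb."""
    allowed = []
    for a in ACTIONS:
        if a == "BOMB":
            continue                     # an evading agent places no bombs
        r, c = pos[0] + MOVES[a][0], pos[1] + MOVES[a][1]
        if (r, c) in passable and (r, c) not in danger_cells:
            allowed.append(a)
    return allowed or ["STOP"]           # never return an empty action set

def attack_filter(pos, enemy_pos, passable):
    """Keep moves that shrink Manhattan distance to the enemy; bomb if close."""
    dist = lambda p: abs(p[0] - enemy_pos[0]) + abs(p[1] - enemy_pos[1])
    allowed = ["BOMB"] if dist(pos) <= 2 else []
    for a in ACTIONS:
        if a == "BOMB":
            continue
        r, c = pos[0] + MOVES[a][0], pos[1] + MOVES[a][1]
        if (r, c) in passable and dist((r, c)) < dist(pos):
            allowed.append(a)
    return allowed or ["STOP"]
```

Restricting the tree policy to the filtered set is what shrinks the effective branching factor during search.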

D. IMITATION LEARNING WITH BACKPLAY
Although the action filters accelerate the search, we consider them insufficient on their own for real-time play in Pommerman. The reward is only granted at the end of each game, and since the state space is large, it takes a long time for the reward to propagate to the values of earlier states. We therefore use a more efficient form of imitation learning, Backplay, to give the filtered actions biased prior probabilities, which makes tree-policy selection more efficient.
The key idea of Backplay is that we do not initialize the Markov Decision Process (MDP) at only a fixed s_0. Instead, we assume access to a demonstration that reaches a sequence of states s^d_0, s^d_1, ..., s^d_T. For each training episode, we uniformly sample a starting state from the sub-sequence s^d_{T−k}, s^d_{T−k+1}, ..., s^d_{T−j} for some window [j, k]. As training continues, we advance the window according to a curriculum by increasing the values of j and k until we are training on the initial state in every episode. We do this based on SimpleAgent playing against itself from the Pommerman repository. To implement imitation learning with neural networks, we implemented the key idea of Backplay by saving the full states of an entire game. Then, for every window of 5 states, starting from the end of the game, we randomly selected one of the states to be the initial state from which we ran a learning episode. After each learning episode, we shifted the window 5 states backward and repeated the process until we reached the true initial state.
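The window logic described above can be sketched as follows. The window size of 5 follows the text; the episode runner itself is omitted, and the function names are illustrative.

```python
import random

def backplay_schedule(demo_states, window=5):
    """Yield windows of demo states, sliding backward from the end."""
    end = len(demo_states)
    while end > 0:
        start = max(0, end - window)
        yield demo_states[start:end]
        end = start                      # shift the window backward

def sample_start_state(window_states, rng=random):
    # Uniformly sample one starting state from the current window.
    return rng.choice(window_states)
```

Each learning episode would call `sample_start_state` on the current window, run from that state to termination, and move to the next window only after the episode finishes.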

E. REWARD SHAPING
As noted, one of the challenges of Pommerman is its sparse and delayed rewards: the environment only provides a reward when the game ends (with a maximum episode length of 800), either 1 or −1. To address the sparse-reward problem, we add a shaping reward function during learning.
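The paper does not specify its shaped reward, so the event names, terms, and weights below are purely illustrative assumptions: small bonuses for destroying wood and collecting power-ups, a mild per-step penalty, and the terminal +1/−1 result.

```python
def shaped_reward(event, terminal_result=None):
    """Hypothetical dense reward added on top of the sparse game outcome."""
    bonuses = {
        "wood_destroyed": 0.01,
        "powerup_collected": 0.02,
        "enemy_eliminated": 0.5,
        "step": -0.001,                  # mild penalty to discourage stalling
    }
    r = bonuses.get(event, 0.0)
    if terminal_result is not None:      # +1 for a win, -1 otherwise
        r += 1.0 if terminal_result == "win" else -1.0
    return r
```

The shaping terms must stay small relative to the terminal reward so that they guide early learning without changing which outcomes the agent ultimately prefers.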

IV. EXPERIMENTS AND RESULTS
In this section, we show the experimental results against the baseline (3 SimpleAgents) under the FFA mode. It should be noted that a SimpleAgent has an average win rate of 23.4% when playing against 3 other SimpleAgents. SimpleAgent is a purely rule-based agent developed by the authors of the Pommerman environment.
To illustrate our approach, we compare our proposed method with some traditional reinforcement learning methods, such as Q-learning, DQN, A2C, and vanilla MCTS. We tested 200 games for each agent against SimpleAgent; the initial position of our agent was randomized, and the expertise level of MCTS was 200 rollouts. We calculate the win ratio of our agent, and the results are shown in Fig. 6. The Q-learning agent performs worst and H-MCTS-NN (our hierarchical MCTS with neural network and imitation learning) performs best. This is because the Q-learner can choose a ''good'' action only if it has been in the same state before; since the Pommerman environment has a large state space and every battle starts on a random map, there is almost no chance of encountering the same scenario twice. The A2C win rate was 23.6%, slightly better than that of SimpleAgent; it converged faster than A3C, and the training time was 5 days. With more time and hardware, we envision the possibility of further improving these results. The best performers were vanilla MCTS and its variants. As noted, the main disadvantage of vanilla MCTS is its runtime: the time required for it to act is longer than the Pommerman environment's time limit. Therefore, we use a neural network to lower its runtime, and the results show an improvement. Fig. 4 presents the learning curves of H-MCTS-NN, MCTS, A2C, and A3C against the rule-based opponent. Our method, H-MCTS-NN, outperforms the standard MCTS, A2C, and A3C in terms of both learning faster and converging to a better policy against rule-based opponents. Our trained models have an average win rate of more than 60% after 20,000,000 episodes and converge at 70%, whereas the original MCTS, A2C, and A3C reach no more than 40% after 40,000,000 episodes.
To fully evaluate our approach, we also include the conventional MCTS with the same settings, the hierarchical MCTS with action filters (H-MCTS), and the H-MCTS with neural-network and imitation-learning bias (H-MCTS-NN). We can vary the expertise level of the MCTS variants by changing the number of rollouts per action selection. We experimented with 200 games against SimpleAgent with 100 (Fig. 7) and 200 (Fig. 8) rollouts per move.
The results show that the H-MCTS-NN agent performs better than the others; compared to the conventional MCTS, it both improves the winning percentage and reduces the portion of ties. The same method with 200 rollouts learns better than with 100 rollouts. Although more searches achieve better results, the number of rollouts should be carefully balanced against the per-step response-time limit. It should be noted that this is an original version with serial logic; the MCTS scheme can easily be parallelized, which will be included in our future research. In general, the presented approach achieves better performance than the baseline agent and conventional MCTS, which shows its potential to be applied in other real-time games.
After we obtained very good results by implementing the ''balance'' H-MCTS-NN approach for the heuristic, we started running it against our own attack agent and evade agent, and obtained the following results shown in Fig. 5. As can be seen, the balance agent managed to yield reasonably good results, and wins most of the games played against other agents.

V. CONCLUSION
This paper has proposed an efficient searching approach that uses hierarchical searching, action pruning, and imitation learning, and has tested it on the Pommerman benchmark. The results show that our method performs better than the provided baseline agent (SimpleAgent). There is still considerable room for our agent to improve. Future work may include a parallel MCTS algorithm, more comparisons against other types of agents to analyse and measure the performance of our agents, and further improvements and evaluations for the multi-agent modes, such as Team and TeamRadio.
HAILAN YANG received the B.S. degree in computer science and technology from the Ocean University of China, in 2018. She is currently pursuing the master's degree with the National University of Defense Technology. Her research interests include multi-agent reinforcement learning algorithms and basic software.