Training an Agent for FPS Doom Game using Visual Reinforcement Learning and VizDoom

Because of recent successes and advancements in deep learning, it is now possible to train agents for first-person shooter games that often outperform human players while using only raw screen pixels to make their decisions. A Visual Doom AI Competition is organized each year on two different tracks, a limited death-match on a known map and a full death-match on an unknown map, for evaluating AI agents, since computer games are excellent testbeds for testing and evaluating different AI techniques and approaches. The competition is ranked by the number of frags each agent achieves. In this paper, we propose training a competitive agent for playing Doom's (FPS game) basic scenario(s) in the semi-realistic 3D world 'VizDoom', using a combination of convolutional deep learning and Q-learning and considering only raw screen pixels, in order to demonstrate the agent's usefulness in Doom. Experimental results show that the trained agent outperforms the average human player and the in-built game agents in basic scenario(s) where only the move-left, move-right, and shoot actions are allowed.

Keywords—Visual reinforcement learning; Deep Q-learning; FPS; CNN; computational intelligence; Game-AI; VizDoom; agent; bot; DOOM


I. INTRODUCTION
Doom is an FPS (first-person shooter) game developed by id Software. Its first installment was released on December 10, 1993, for DOS, and its second installment, "Doom II: Hell on Earth", was released the following year (1994) for Microsoft Windows, PlayStation, and Xbox 360. Its third installment, Doom 3, was released for Microsoft Windows on August 3, 2004, and was later adapted for Linux and Mac OS X; later still, Vicarious Visions ported the game to the Xbox, releasing it on April 3, 2005. The most recent installment is "DOOM", developed by id Software and published by Bethesda Softworks; it was released worldwide for Microsoft Windows, PlayStation 4, and Xbox One on May 13, 2016. A typical screen of the Doom game is shown in Fig. 1.
These days the research community is very active in research on Doom, which has become a popular subject for techniques such as deep reinforcement learning and visual reinforcement learning. Several Doom-based research platforms, such as VizDoom, CocoDoom, and ResearchDoom [1], have been developed for implementing deep learning techniques and methods. Likewise, visual Doom AI competitions are organized every year, in which agents (bots) are expected to exhibit human-like actions and to show that visual reinforcement learning in 3D FPS game environments is feasible. As in other domains, deep learning has become well known in computer video games, showing improved performance over conventional approaches in managing high-dimensional data such as bulky visual inputs [2]. Playing games has also often been used in artificial intelligence (AI) as a method for benchmarking agents [3]-[6]. For these reasons, we propose combining deep learning with Q-learning to train an agent using the Doom-based research platform VizDoom, similar to the approach proposed in [7] but different in the learning method (parameters) and the environment used for the experiments: the proposed agent's learning parameters, experimental environment, total learning and testing time, and some settings differ (see Section III), which is part of the contribution of this paper. Further, compared to the total reward achieved by the authors' agent, the proposed agent's total score is higher and always positive; the initial training reward of the authors' agent is negative, whereas the proposed agent's training reward is always positive. The proposed neural network architecture is also different from the one proposed by those authors. To introduce and explain the proposed work in detail, this paper is organized as follows: Section II presents related work, Section III describes the proposed method and experimental work, and Section IV shows the results. Finally, Section V concludes the paper with future work.

II. RELATED WORK
The initial implementations of reinforcement learning based on visual inputs were performed in [8] and [9], in which robots' football-playing skills were developed. Since VizDoom became available to the research community, there have been many contemporary works on training Doom AI agents using the platform. These include the effort in [10], where the authors presented CLYDE, a deep reinforcement learning Doom-playing agent that participated in the Visual Doom AI Competition held at the IEEE Conference on Computational Intelligence and Games 2016, where CLYDE competed with 8 other agents and managed to achieve 3rd place. Table I shows CLYDE's performance and the results of the Visual Doom AI Competition 2016. Considering its relative simplicity and the fact that the authors deliberately avoided a high level of customization to keep the algorithm generic, CLYDE performed very well in a partially observable multi-agent 3D environment, using deep reinforcement learning techniques that had traditionally been applied in fully observable 2D environments. The CLYDE architecture is shown in Fig. 2 for further reference. Similar to CLYDE, another agent called Arnold, a comprehensive and independent agent that plays FPS games from raw screen pixels and demonstrated its usefulness in Doom, is presented in [11]. Arnold was trained using deep reinforcement learning by means of an "Action-Navigation" architecture that uses a distinct deep neural network for exploring the map and for confronting adversaries. The agent also utilized techniques such as augmenting high-level game features, reward shaping, and progressive updates for effective training and real performance; Arnold later outperformed typical human players and in-built game agents on different variations of death-match, obtaining the highest kill-to-death ratio in both tracks of the Visual Doom AI Competition and placing 2nd by number of frags. Table II shows Arnold's performance and the results of the Visual Doom AI Competition 2016. Similarly, work more closely related to the proposed approach is performed in [12], where agents are trained for two different scenarios, a simple basic scenario and a complex navigation-maze problem, using convolutional deep neural networks with Q-learning and experience replay; the trained agents were able to exhibit human-like behaviors.
A framework is proposed in [13] for training vision-based agents using deep learning and curriculum learning for FPS games; it won Track 1 of the VizDoom AI Competition 2016 on a known map with a 35% higher score than the second-place agent. The framework combines the state-of-the-art reinforcement learning approach, the A3C model, with curriculum learning. The model is simpler in design and uses game stats from the AI itself only, rather than information about opponents [14]. The basic framework of the actor-critic model is shown in Fig. 3 for understanding and further reference.

A. Deep Reinforcement Learning
The techniques commonly applied for learning agents or bots are deep reinforcement learning techniques, which support logical and efficient decision making. A similar deep reinforcement learning technique is employed in [15] for learning agents that can make generic, interactive decisions and whose mathematical framework is based on Markov Decision Processes (MDPs). An MDP is a tuple (S, A, P, R, γ), where S is the set of states, A is the set of actions the agent can take at each time step t, P is the transition probability of moving from one state s to another state s′ by taking an action a, R is the reward function representing the signal the agent receives after taking actions and changing states, and γ is the discount factor. As usual, the goal of reinforcement learning is to learn a policy π: s → a that maximizes the overall expected discounted reward over the agent's run. A commonly used technique for learning such a policy is to learn the action-value function Q(s, a) iteratively, so as to gradually approximate the expected reward in a model-free fashion. The augmented framework employed, which consistently learns better, is shown in Fig. 4.

Fig. 5. The proposed model architecture [16] for estimating the policy given a natural-language instruction and an image showing the first-person view of the environment.
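To make the iterative, model-free approximation of the action-value function concrete, a minimal tabular Q-learning update might look as follows. This is only an illustrative sketch: the toy state/action counts, learning rate, and the single sample transition are hypothetical and not taken from the paper.

```python
# Minimal tabular Q-learning sketch (toy MDP, hypothetical values).
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA = 0.99, 0.1  # discount factor and learning rate (illustrative)

# Q-table initialised to zero: Q[s][a] approximates the expected return.
Q = [[0.0 for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def q_update(s, a, r, s_next):
    """One model-free Q-learning step: move Q(s, a) towards the
    bootstrapped target r + gamma * max_a' Q(s', a')."""
    target = r + GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])

# A single illustrative transition: action 1 in state 0 yields reward 1.0.
q_update(0, 1, 1.0, 2)
print(Q[0][1])  # 0.1 after one update, since Q started at zero
```

Repeating such updates over many sampled transitions gradually approximates the optimal action-value function without any model of P or R.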
A similar work (training an agent using VizDoom), an end-to-end trainable neural architecture for task-oriented language grounding in a 3D environment, is proposed in [16]; it assumes no prior linguistic or perceptual data and needs only raw pixels from the environment and a natural-language instruction as input. The model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural-language instruction using standard reinforcement and imitation learning methods. The authors showed the usefulness of the suggested model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. They also introduced a new environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states. The proposed model is shown in Fig. 5 for further details.

B. Deep Q-Networks
A model is trained in [14] to simultaneously learn game-feature statistics, such as the presence of enemies or items, while minimizing the Q-learning objective, which showed a dramatic improvement in the training speed and performance of the model. The authors' proposed architecture is modularized to permit several models to be trained independently for different phases of the game. The architecture substantially outperformed the game's built-in AI agents as well as human players in death-match scenarios; it is shown in Fig. 6. A short summary of the DQN model followed in [14] is briefly presented here to support the proposed concept. Reinforcement learning deals with learning a policy for an agent that maximizes the expected sum of discounted rewards, represented mathematically as

R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'},

where T is the game termination time and γ ∈ [0, 1] is a discount factor that weights the importance of future rewards. The Q-function for the expected return from performing an action a in a state s under a given policy π is defined as

Q^π(s, a) = E[R_t | s_t = s, a_t = a].

The highest return can be expected by using an approximation function to estimate the action-value function Q. In particular, DQN uses a neural network parametrized by θ, and the idea is to obtain an estimate of the Q-function of the current policy that is close to the optimal Q-function Q* [14]; that is, the goal is to find θ such that Q(s, a; θ) ≈ Q*(s, a). The optimal Q-function satisfies the Bellman optimality equation

Q*(s, a) = E[r + γ max_{a'} Q*(s', a')].

If Q(s, a; θ) ≈ Q*(s, a), it is natural to require that Q(s, a; θ) approximately satisfy the Bellman equation as well, which leads to the loss function

L_t(θ_t) = E_{s, a, r, s'}[(y_t − Q(s, a; θ_t))²],

where t is the current time step and y_t = r + γ max_{a'} Q(s', a'; θ_{t−1}). The value of y_t is held fixed, which leads to the gradient

∇_{θ_t} L_t(θ_t) = E_{s, a, r, s'}[(y_t − Q(s, a; θ_t)) ∇_{θ_t} Q(s, a; θ_t)].
In practice, the above gradient is computed with a stochastic approximation, replacing the expectation with samples drawn from the replay memory.
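The fixed target y_t = r + γ max_{a'} Q(s', a') can be computed for a whole mini-batch at once. The sketch below illustrates this; the Q-values, rewards, and batch size are hypothetical numbers, not results from the paper.

```python
import numpy as np

GAMMA = 0.99  # discount factor

def dqn_targets(rewards, q_next, terminal):
    """Compute y_t = r + gamma * max_a' Q(s', a') for a mini-batch.
    Terminal transitions bootstrap nothing, so there y_t = r."""
    return rewards + GAMMA * q_next.max(axis=1) * (~terminal)

# Hypothetical mini-batch of 3 transitions with 2 actions each.
rewards = np.array([1.0, -5.0, 101.0])
q_next = np.array([[0.5, 2.0],   # network's Q(s', .) estimates
                   [1.0, 0.0],
                   [0.0, 0.0]])
terminal = np.array([False, False, True])
print(dqn_targets(rewards, q_next, terminal))  # approx. [2.98, -4.01, 101.0]
```

The squared difference between these targets and the network's Q(s, a; θ_t) outputs gives the loss whose gradient is descended at each step.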
Experience replay is a well-known technique for breaking the correlation between successive samples. The agent's experiences (s_t, a_t, r_t, s_{t+1}) at each time step are saved in a replay memory, and the Q-learning updates are then performed on batches of experiences sampled at random from that memory.
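A minimal replay memory along these lines can be sketched as follows; the capacity matches the 10,000-element setting used later in the experiments, while the dummy transitions are purely illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (s, a, r, s', done) transitions and samples mini-batches
    uniformly at random, decorrelating successive Q-learning updates."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(100):
    memory.push((t, 0, -1.0, t + 1, False))  # dummy transitions
batch = memory.sample(64)
print(len(batch))  # 64
```

Because `deque(maxlen=...)` discards the oldest entries automatically, the memory stays bounded no matter how long training runs.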
An ε-greedy policy is used to generate the next action at each training step: with probability ε the next action is selected at random, and with probability 1 − ε it is the best action according to the network. In practice, it is common to start with ε = 1 and decay it gradually [17].
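An ε-greedy policy with linear decay, as used in the experiments later in this paper, can be sketched as below. The start/end values and the decay horizon are assumptions for illustration; the paper does not state them explicitly.

```python
import random

EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.1, 10_000  # assumed schedule

def epsilon(step):
    """Linearly decay epsilon from EPS_START to EPS_END over DECAY_STEPS."""
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon(step):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon(0), epsilon(10_000))  # 1.0 0.1
```

Early in training the agent explores almost uniformly; as ε shrinks, it increasingly exploits the network's current Q-value estimates.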

C. Supervised Learning Techniques
A similar approach to training an agent via the VizDoom platform is presented in [18], but using supervised learning techniques for sensorimotor control in immersive environments. The approach uses a high-dimensional sensory stream and a lower-dimensional measurement stream. The cotemporal structure of these streams offers a rich supervisory signal that allows training a sensorimotor control model by interacting with the environment. The model learns to act based on raw sensory input from a complex three-dimensional environment. The authors offered a formulation that permits learning without a fixed objective at training time and pursues dynamically changing goals at test time. They also conducted a number of experiments in three-dimensional simulations based on the classic FPS game Doom; the results demonstrated that the applied approach outperformed sophisticated earlier formulations, specifically on challenging tasks, and that trained models generalize effectively across environments and goals. The model trained with this approach won the Full Deathmatch track of the Visual Doom AI Competition, which was held in previously unseen environments. The network structure the authors used in their experiments is shown in Fig. 7.

Fig. 7. Network structure: the three input modules first process the image s, the measurements m, and the goal g separately; a joint representation j then contains the concatenated output of these modules. Two parallel streams process the joint representation, predicting the normalized action-conditional differences Â(j) and the expected measurements E(j), which are then combined to produce the final prediction for each action [18].

III. PROPOSED METHOD AND EXPERIMENTAL WORK
A. Basic Objective
The primary purpose of the experiments is to train a competent agent using visual reinforcement learning and "VizDoom" for first-person shooter games, particularly "Doom" to exhibit human-like behaviors and to outperform average human players and existing in-built game agents.

B. Scenario(s)
A rectangular chamber is used as the basic scenario (see Fig. 8), in which the agent spawns at the center of the room's long wall. Along the opposite wall, an immobile monster spawns at an arbitrary position. The agent can move left, move right, and shoot. A single shot is sufficient to eliminate the monster. The episode finishes once 300 frames have elapsed or the monster is killed, whichever occurs first. The agent receives 101 points for killing the monster, −5 for a missed shot, and −1 for each individual action, so the best strategy the agent can learn is to kill the monster as quickly as possible, preferably with a single shot.
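The reward structure just described can be summarised as a small helper (a sketch only; the game engine computes this internally, and whether the shot itself also incurs the −1 living penalty is an interpretation):

```python
def episode_reward(actions_taken, shots_missed, monster_killed):
    """Basic-scenario reward: +101 for the kill, -5 per missed shot,
    -1 per individual action (living penalty)."""
    reward = -1 * actions_taken - 5 * shots_missed
    if monster_killed:
        reward += 101
    return reward

# Killing with a single accurate shot after five moves (6 actions total):
print(episode_reward(actions_taken=6, shots_missed=0, monster_killed=True))  # 95
```

The structure makes the optimal behavior explicit: every wasted move or missed shot directly lowers the final score, which is why fast, single-shot kills dominate.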

C. Deep Q-Learning
A Markov Decision Process is used to model the problem and Q-learning to learn the policy. An ε-greedy policy with linear decay is used for selecting actions. The Q-function is approximated with a convolutional neural network trained by stochastic gradient descent using experience replay.

D. Experimental Setup
 Neural Network Architecture
The network used in the experiments includes two convolutional layers with 32 square filters, 7 and 4 pixels wide, respectively, as shown in Fig. 9. Each convolutional layer is followed by a max-pooling layer of size 2 with ReLU (rectified linear unit) activation. Moreover, the network contains a fully connected layer with 800 leaky rectified linear units and an output layer with 8 linear units corresponding to the 8 combinations of the 3 available actions: right, left, and shoot.
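Assuming stride 1 and no padding for the convolutions (the paper does not state these), the layer output sizes for the 30×45 input used in the experiments can be checked with a short calculation:

```python
def conv_out(size, kernel, stride=1):
    """Output length of a valid (unpadded) convolution along one axis."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2):
    """Output length after non-overlapping max pooling of the given size."""
    return size // pool

h, w = 30, 45  # input resolution used in the experiments
h, w = pool_out(conv_out(h, 7)), pool_out(conv_out(w, 7))  # conv1 (7x7) + pool
h, w = pool_out(conv_out(h, 4)), pool_out(conv_out(w, 4))  # conv2 (4x4) + pool
flat = 32 * h * w  # 32 feature maps flattened
print(h, w, flat)  # 4 8 1024 -> fed into the 800-unit fully connected layer
```

Under these assumptions, the flattened convolutional output has 1,024 values, which the 800-unit dense layer then maps down before the 8-way linear output.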

 Learning Settings
In the experiments, the discount factor is set to γ = 0.99, the learning rate to α = 0.00025, the replay memory capacity to 10,000 elements, the resolution to (30, 45), and the mini-batch size to 64. The agent learned from 23,239 steps, each consisting of performing an action, observing a transition, and updating the network. To monitor the learning process, 100 testing episodes are played after each epoch (approx. 2,000 learning steps).
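Collected in one place, the learning settings above can be expressed as a configuration sketch (the values are taken directly from this section; the key names are illustrative, not from any particular API):

```python
# Hyperparameters of the basic-scenario experiments (names are illustrative).
config = {
    "discount_factor": 0.99,        # gamma
    "learning_rate": 0.00025,       # alpha, for stochastic gradient descent
    "replay_memory_capacity": 10_000,
    "resolution": (30, 45),         # downsampled screen buffer (height, width)
    "batch_size": 64,               # mini-batch size for Q-learning updates
    "learning_steps_per_epoch": 2_000,
    "test_episodes_per_epoch": 100,
}
print(config["batch_size"])  # 64
```

Keeping the settings in a single structure like this makes it straightforward to vary them per scenario, as planned for the future work on larger scenarios.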

IV. RESULTS
In the experiments, a total of 10,188 training episodes (basic scenarios) are played. The agent's learning details are presented in Table III and explained as follows.

A. Learning
The total number of epochs performed is 20, shown in column "Ep #". The number of steps learned by the agent appears in column "SL/2,000"; in the initial epochs learning remains low but improves in the following epochs, although in some epochs learning remains unsuccessful. Similarly, the learning percentage also improves progressively and reaches almost 100%, which can be observed in Fig. 10. Further, column "IPS" gives the iterations per second performed in each learning epoch. A different number of episodes (basic scenarios) is played during each epoch. The minimum, maximum, and mean values are the actual learning and testing outputs achieved by the agent during each epoch and are displayed in the corresponding columns under the learning and testing results in the table. "ETT" denotes the agent's elapsed testing time in minutes. The learning steps are limited to 2,000 per epoch in the current experiments; they will be made larger and dynamic for different scenarios, such as rocket basic, deadly corridor, defend the center, defend the line, and health gathering, in future work, in order to train and develop competitive agents.

B. Final Testing
Similarly, in the final testing phase, the agent is tested on 100 basic scenario(s) once the whole training has finished (after the 20th epoch). The agent's total score after each testing episode is shown in Table IV, and its performance can be observed in the graph shown in Fig. 11.
As the graph shows, the agent's behavior in shooting the spawning monster is balanced; its minimum, maximum, and average shooting scores are 17, 94, and 74, which indicates that its performance in the basic "move and shoot" scenario(s) is decent and near-optimal. The agent was also tested after each epoch during training in order to monitor its performance, which was found to improve gradually over time. As far as the agent's overall testing output on the basic scenario(s) is concerned, it performed well, moving to the proper position and shooting accurately.

V. CONCLUSION AND FUTURE WORK
In this paper, an agent is trained using deep Q-learning and VizDoom. The agent is tested on almost 2,000 (finally on 100) Doom scenario episodes, where it demonstrated intelligent behavior, and the results achieved are better and always positive in comparison with the results reported by Hyunsoo and Kyung-J. K. (2016). After analysis, monitoring, and observation of the simple "move and shoot" basic scenario results, it is also observed that the speed of the learning system largely depends on the number of frames the agent is permitted to skip while learning; in particular, skipping 4 to 10 frames is profitable. This is considered future work, together with a larger number of learning steps and Doom scenarios (episodes) and allowing the agent to access the sound buffer, as the present agents are deaf.

Fig. 4. Framework overview: (a) observing image and depth from VizDoom; (b) running Faster R-CNN for object detection and (c) SLAM for pose estimation; (d) doing the 3D reconstruction using the pose and bounding boxes; (e) building semantic maps from projection; and (f) training the DQN using these new inputs [15].

Fig. 6. The proposed architecture of the model in [14]. The convolutional layers are given an input image, and their output is split into two streams. The first one (bottom) flattens the output of layer 3 and feeds it to an LSTM, as in the DRQN model. The second one (top) directs it to an extra hidden layer (layer 4) and then to a final layer representing the game features. During training, the game features and the Q-learning objective are trained jointly.
TABLE II. ARNOLD PERFORMANCE AND RESULTS OF THE VISUAL DOOM AI COMPETITION 2016

TABLE I. CLYDE PERFORMANCE IN TERMS OF THE TOTAL NUMBER OF FRAGS IN COMPARISON WITH OTHER BOTS IN THE VISUAL DOOM AI COMPETITION 2016

Fig. 2. An architecture of the agent CLYDE.

Fig. 10. Learning performance and dynamics.

Fig. 11. Agent testing performance over episodes.

TABLE IV. AGENT'S TESTING SCORES IN 100 EPISODES (SCENARIOS)