Optimal Skipping Rates: Training Agents with Fine-Grained Control Using Deep Reinforcement Learning

These days, game AI is one of the most active research areas in artificial intelligence because computer games are among the best test-beds for trying out theoretical ideas in AI before applying them in the real world. ViZDoom is a game AI research platform based on Doom that is used for visual deep reinforcement learning in 3D game environments such as first-person shooters (FPS). During training, the speed of the learning agent depends greatly on the number of frames the agent is permitted to skip. This paper investigates how the frame skipping rate influences the agent's learning and final performance, using deep Q-learning, experience replay memory, and the ViZDoom game AI research platform. The agent is trained and tested on Doom's basic scenario(s), and the results are found to be 10% better than the existing state-of-the-art research work on Doom-based agents. The experiments show that the optimal frame skipping rate falls in the range of 3 to 11, which provides the best balance between the learning speed and the final performance of the agent; the resulting agent exhibits human-like behavior and outperforms an average human player and inbuilt game agents.


Introduction and Research Motivation
Due to the success and achievements of Google DeepMind [1] technologies, deep learning methods are widely employed in video games, particularly in first-person-shooter (FPS) games such as Doom, to obtain human-level control from raw pixel information [2]. For this purpose ViZDoom was introduced, a test-bed platform based on the FPS game Doom for deep learning research from raw visual information. It uses the first-person perspective in a partially realistic 3D world and permits programming agents that play the game by consuming the screen memory buffer.
Achieving such objectives has become possible now that computer systems have surpassed humans in complex calculations and massive raw data processing; however, they still struggle to match human ability to respond in complex, realistic 3D environments [3].
In addition, visual signals are the major source of information about the surroundings for living and artificial beings. Because of the advancements in dealing with visual information, progress has been observed in this area of research in the form of deep architectures applied to a set of Atari 2600 games [4] from raw pixel information, where Atari 2600 games have been widely accepted as a standard benchmark for visual learning systems. More significant progress is now expected in realistic 3D environments due to the increase in computing power (GPUs [5] and TPUs [6]) and advancements in machine learning, particularly visual learning and the evolution of neural networks [7].
Visual deep reinforcement learning is an appealing research area of artificial intelligence because it places the human player and the artificial intelligence agent on a similar playing field, particularly in partially observable environments [8]. Before the introduction of ViZDoom, no first-person-shooter based environment existed that permitted research on agents depending entirely on raw visual information. This absence can be considered a serious obstacle to the development of vision-based deep reinforcement learning, because engaging in vision-based reinforcement learning requires a large amount of programming effort [9]. The presence of a ready-to-use tool assists in managing experiments and concentrating on the objective of the research, as playing a first-person-shooter game in a realistic 3D environment is far more difficult than playing many Atari games: it requires a wide variety of skills such as navigating through a map, gathering items, and identifying and fighting opponents [10]. To support computationally intensive machine learning research, ViZDoom provides off-screen rendering. Off-screen rendering removes the performance cost of actually displaying the game on the screen and makes it feasible to run simulations on servers by eliminating the need for a GUI [11].
After studying the existing research on first-person-shooter games, specifically Doom, we propose a dedicated study of how the number of skipped frames influences the learning process, using the ViZDoom AI research platform. A state-of-the-art frame skipping scale could be of great benefit to the research community when training agents or bots in any realistic 3D environment such as ViZDoom. Therefore, the work proposed in this paper is based on finding an optimal frame skipping scale that provides the best balance between the learning speed and the agent's final performance, which might help lay the basis for further improvement and research on FPS games. A sample screen of the Doom game is shown in Figure 1.
The rest of the paper is organized as follows. Section 2 explains the research on Doom using the ViZDoom AI research platform. Section 3 presents the proposed methodology. Section 4 shows results with experiments. Finally, Section 5 concludes the paper with future work.

Research on Doom Using the ViZDoom AI Research Platform
ViZDoom is a Doom-based AI research platform used for reinforcement learning from raw visual information. It allows developing AI bots that play Doom using only the visual information (the screen buffer). It is primarily intended for research in machine visual learning, and in particular for deep reinforcement learning. One of the recent research works based on visual reinforcement learning and the ViZDoom AI research platform is proposed in [12], which trains an AI agent for the game Doom. The agent outperformed both human players and inbuilt game agents. In comparison, the concept proposed in this paper differs in that it proposes an optimal scale for frame skipping while training game AI agents or bots.
The early research works based on visual reinforcement learning were performed long ago in [13,14], which simply developed robot soccer skills. These were followed by state-of-the-art works using the ViZDoom AI research platform for training intelligent agents, such as [15], in which a deep reinforcement learning based agent called Clyde was developed to play the game Doom. Clyde participated in the Visual Doom AI Competition held at the IEEE Conference on Computational Intelligence and Games in 2016 [16]. In this competition, Clyde competed with 8 other bots and achieved 3rd place. Moreover, it also performed well in partially observable multiagent 3D virtual environments using deep visual reinforcement learning methods that had conventionally been applied to fully observable 2D environments.
In the same way, another autonomous agent based on deep visual reinforcement learning, known as Arnold, showed strong performance on the first-person-shooter game Doom. It performed well by considering only the information on the screen in the form of raw pixels. Deep reinforcement learning architectures for action and navigation, based on convolutional neural networks, were used to train Arnold to explore the game maps and fight opponents. Moreover, several techniques were employed for effective training, such as augmenting high-level game information, reward shaping, and sequential updates. These helped Arnold outperform average human players and inbuilt game agents on different variations of the death-match, obtaining the highest kill-to-death ratio on both tracks of the Visual Doom AI Competition, where Arnold placed 2nd in terms of the number of frags [17].
AI agents have been trained using the ViZDoom AI research platform in [18], a closely related research work in which agents were trained on two different scenarios, i.e., a simple basic move-and-shoot scenario and a more complex maze exploration scenario, using convolutional deep neural networks, Q-learning [19], and experience replay memory for storing the game transitions [20]. The agents were tested on similar game scenarios or maps, demonstrated human-like behaviors, and were able to outperform inbuilt game agents.
In [21], an AI agent is trained on two different maps, i.e., FlatMap and CIGTrack-1, using deep visual reinforcement learning and curriculum learning for the first-person-shooter game Doom. This game AI agent later won Track-1 (known maps) of the ViZDoom AI Competition held at the IEEE Conference on Computational Intelligence and Games in 2016, with a score 35% higher than that of the agent that secured second position. The proposed framework for this agent is simple and combines the state-of-the-art reinforcement learning A3C model [22] with curriculum learning. It does not rely on opponent (adversary) information; rather, it uses the game-state information available to the AI in real time.
Reinforcement learning and deep learning are generic and useful methods for training AI bots or agents that result in rational and well-organized behavior for making intelligent decisions. A related example can be found in [23], which employed deep visual reinforcement learning methods for training AI bots or agents to make basic and interactive intelligent decisions. Such RL and DL based methods are mathematically modeled using Markov decision processes (MDPs) [24]. An MDP is a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of states, $A$ is the set of actions the agent can take at each time step $t$, $P$ is the transition probability of moving from one state $s$ to another state $s'$ when taking an action $a$, $R$ is the reward function, which represents the signal that the agent receives after taking actions and changing states, and $\gamma$ is the discount factor. Normally, with deep visual reinforcement learning methods, the goal is to find a policy $\pi: s \to a$ that maximizes the expected discounted reward, commonly by learning the action-value function $Q^{\pi}(s, a)$, which estimates the expected return of the policy.
In addition, as technology has advanced considerably, it is important for research on first-person-shooter games to analyze the impact of skip counts (frame skip rates) while training AI agents, particularly using the ViZDoom AI research platform. Beyond computer games, the effect of frame skipping is also measured in research on image and video processing through large user studies that observe performance over different sets of experiments. In some cases the data is examined over a range of frame repeats, normally to assess video streaming, which shows the significance of frame skip rates in the research domain of artificial intelligence.
Moreover, frame repeats are found to be significant for an agent's movements: performance on tasks related to the agent's or player's movement degrades less quickly at lower frame skipping rates than performance on shooting-related tasks does at high frame skipping rates. Furthermore, the existing literature on training bots and agents reveals that only optimal frame skipping rates (frame repeats) yield well-balanced agents with better performance [25].
Because frame repeats have a major impact on agents' performance in first-person-shooter games, the chosen frame repeat can sometimes be preserved even if the scenario resolution has to be sacrificed in order to reduce the training burden and complexity. On the other hand, it is worth noting that scenario details such as the maps, their dark and light backgrounds, and the several kinds of weapons add interest to research on first-person-shooter games, so a tradeoff needs to be decided before choosing any option that could spoil the desired results or requirements.
As further related work on first-person-shooter games, a generic model can be trained similarly to the one in [26] to simultaneously learn game feature information, such as the presence of opponents (adversaries) or objects, along with minimizing the Q-learning objective, which improves the training speed and performance of the model. The architecture proposed in the referred article is modularized so that two separate models are trained independently for different phases of the game. The architecture substantially outperformed inbuilt game AI agents and human players in death-match scenarios.
To model mathematically the overall concept presented in this section so far, a state-of-the-art DQN model is chosen that uses deep visual reinforcement learning to learn a policy for training agents to maximize the expected sum of discounted rewards $R_t$, which can be represented as follows:
$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \tag{1}$$
where $T$ denotes the game termination time and $\gamma \in [0, 1]$ is the discount factor that determines the importance of future rewards. The Q-function, which predicts the expected return after executing an action $a$ in a state $s$ under a given policy $\pi$, can be defined as follows:
$$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right] \tag{2}$$
To achieve a maximum return, the DQN uses a neural network parametrized by $\theta$ as a function approximator for the action-value function $Q$, so as to obtain an estimate $Q_{\theta}$ of the Q-function of the current policy that is close to the optimal Q-function $Q^{*}$, which can be represented as follows:
$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right] = \max_{\pi} Q^{\pi}(s, a) \tag{3}$$
In other words, the goal is to find $\theta$ such that $Q_{\theta}(s, a) \approx Q^{*}(s, a)$. The optimal Q-function satisfies the Bellman optimality equation:
$$Q^{*}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a\right] \tag{4}$$
If $Q_{\theta} \approx Q^{*}$, then $Q_{\theta}$ should come close to satisfying the Bellman equation, which leads to the following loss function:
$$L_t(\theta_t) = \mathbb{E}_{s, a, r, s'}\left[\left(y_t - Q_{\theta_t}(s, a)\right)^2\right] \tag{5}$$
where $t$ is the current time step and $y_t = r + \gamma \max_{a'} Q_{\theta_t}(s', a')$. The value of $y_t$ is treated as fixed, which gives the following gradient:
$$\nabla_{\theta_t} L_t(\theta_t) = -2\, \mathbb{E}_{s, a, r, s'}\left[\left(y_t - Q_{\theta_t}(s, a)\right) \nabla_{\theta_t} Q_{\theta_t}(s, a)\right] \tag{6}$$
Instead of the exact expectation in (6), the following single-sample approximation can also be used to compute the gradient:
$$\nabla_{\theta_t} L_t(\theta_t) \approx -2 \left(y_t - Q_{\theta_t}(s, a)\right) \nabla_{\theta_t} Q_{\theta_t}(s, a) \tag{7}$$
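As an illustration of the target and loss described above, the following minimal Python sketch computes the Q-learning target and the mean squared temporal-difference error for a small batch. The function names, the example numbers, and the convention of dropping the bootstrap term on terminal transitions are assumptions made for this sketch; they are not taken from the paper.

```python
# Illustrative sketch only: names and the terminal-state convention are assumptions.

def td_targets(rewards, q_next, terminals, gamma=0.99):
    """Return y = r + gamma * max_a' Q(s', a') for each transition.

    On terminal transitions the bootstrap term is dropped (y = r), which is
    a common convention assumed here.
    """
    targets = []
    for r, q_row, done in zip(rewards, q_next, terminals):
        bootstrap = 0.0 if done else gamma * max(q_row)
        targets.append(r + bootstrap)
    return targets


def mean_squared_td_error(q_taken, targets):
    """Average of (y - Q(s, a))^2 over the batch; y is treated as a constant."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Hypothetical batch of two transitions with two actions each.
rewards = [-1.0, 101.0]
q_next = [[10.0, 20.0], [5.0, 7.0]]   # made-up network outputs for s'
terminals = [False, True]
targets = td_targets(rewards, q_next, terminals)
loss = mean_squared_td_error([18.0, 100.0], targets)
```

In a real implementation the values in `q_next` would come from the network itself, and the gradient of this loss with respect to the network parameters would be obtained by automatic differentiation.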
One of the well-known approaches for breaking the correlation between sequential samples is to use the experience replay memory; i.e., at each time step the agent's experiences $(s_t, a_t, r_t, s_{t+1})$ are saved in the replay memory, and the Q-learning updates are executed on batches of experiences randomly sampled from the replay memory. An $\epsilon$-greedy strategy [27] can be used to select the next action at every training step: with probability $\epsilon$ the next action is selected randomly, and with probability $1 - \epsilon$ the best action according to the network is selected. In practice, it is common to start with $\epsilon = 1$ and to progressively decay $\epsilon$ to its end limit.
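The replay memory and the action-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the class name `ReplayMemory`, the `terminal` flag stored with each transition, and the function `epsilon_greedy` are hypothetical names chosen for this sketch.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, terminal) transitions."""

    def __init__(self, capacity):
        # A bounded deque silently discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, terminal):
        self.buffer.append((s, a, r, s_next, terminal))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A training loop would call `add` after every environment step and, once the memory holds at least one minibatch, call `sample` to obtain the batch used for the Q-learning update.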
An approach that uses supervised learning techniques for sensorimotor control in immersive environments is proposed in [28], which is also related to training agents or bots using the ViZDoom AI research platform. The approach uses a high-dimensional sensory stream and a lower-dimensional measurement stream. The concurrent structure of the streams delivers rich supervisory signals that enable training a sensorimotor control model by interacting with the environment. The model learns to act based on raw sensory input from a complex 3D environment. Such a formulation enables learning without a fixed goal at training time and pursuing dynamically changing goals at test time [29]. Extensive experiments were conducted in 3D simulations based on the classical first-person-shooter video game Doom. The results showed that such an approach can outperform the previous state of the art, mainly on challenging tasks, and that models trained with such an approach can generalize efficiently across environments and goals; for example, one of the models trained with this concept won the full death-match track of the Visual Doom AI Competition, which was held in previously unseen environments.
After studying the research works on such artificial intelligence competitions in computer games and the related research on developing and training agents using the ViZDoom AI research platform, we found that none of the articles presents a dedicated study of the optimal frame skipping rates for training AI agents based on Doom, which is a serious factor obstructing the improvement of vision-based reinforcement learning. In short, it is important and of interest to the community researching agents and bots to have a basic frame skipping scale for game AI research platforms such as ViZDoom used in visual deep reinforcement learning. To sum up, there is an essential research question, significant enough (at least in its current state), concerning an optimal scale of frame skipping, which can be briefly defined as follows.

Research Question.
What is the optimal skip count scale (range) needed in order to develop a balanced, well-trained, and robust agent, particularly using any 3D AI research platform such as ViZDoom? Learning is slowest when the agent does not skip any frame, and it is faster and smoother when the agent skips more frames. Conversely, skip counts that are too large can make the agent clumsy due to the lack of fine-grained control, which results in suboptimal final results. The primary purpose of the research is therefore to examine how the number of skip counts influences the learning process and to find a standard and optimized skip count scale (range) that provides a balance or tradeoff between the final performance and the learning speed, specifically using any 3D AI research platform such as ViZDoom.

Proposed Methodology
A rectangular chamber is considered as the basic scenario, shown in Figure 2, where the agent spawns in the middle of the room's long wall and a static monster spawns at an arbitrary position along the opposite wall. The agent can move left and right and shoot. A single shot is sufficient to kill the monster. The scenario finishes either when the monster is killed or after 300 frames, whichever comes first. The agent gets a score of 101 if it kills the monster, -5 for a missed shot, and -1 for each action (living reward).
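To make the role of the skip count concrete, the following Python sketch repeats one decision for several consecutive frames in a toy one-dimensional stand-in for the basic scenario. The class `ToyBasicScenario` is hypothetical and is not the ViZDoom API; it also assumes that the living reward of -1 is charged on every frame in addition to the kill or miss reward, which the text does not specify.

```python
class ToyBasicScenario:
    """Hypothetical 1-D stand-in for the basic scenario (not the ViZDoom API)."""

    LEFT, RIGHT, SHOOT = 0, 1, 2

    def __init__(self, agent_x=5, monster_x=8, timeout=300):
        self.agent_x = agent_x
        self.monster_x = monster_x
        self.timeout = timeout
        self.frame = 0
        self.done = False

    def step(self, action):
        """Advance one frame and return the reward for that frame."""
        self.frame += 1
        reward = -1                      # living reward (assumed per frame)
        if action == self.LEFT:
            self.agent_x -= 1
        elif action == self.RIGHT:
            self.agent_x += 1
        elif self.agent_x == self.monster_x:
            reward += 101                # shot hits: monster killed
            self.done = True
        else:
            reward -= 5                  # missed shot
        if self.frame >= self.timeout:
            self.done = True             # episode times out after 300 frames
        return reward


def repeat_action(env, action, skip):
    """Apply one decision for `skip` consecutive frames and sum the rewards."""
    total = 0
    for _ in range(skip):
        total += env.step(action)
        if env.done:
            break
    return total


env = ToyBasicScenario()
moved = repeat_action(env, ToyBasicScenario.RIGHT, skip=3)  # one decision, three frames
shot = repeat_action(env, ToyBasicScenario.SHOOT, skip=1)
```

With a skip count of 3 the agent covers the three cells to the monster in a single decision, whereas a skip count of 4 from the same start would overshoot the target, which is the loss of fine-grained control that large skip counts cause.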
A convolutional neural network (CNN) architecture with three convolutional layers, each with 32 square filters that are 7, 4, and 2 pixels wide, respectively, is used, as shown in Figure 3. Each convolutional layer is followed by a max-pooling layer of size 2 and a ReLU activation function. Moreover, there is a fully connected layer with 800 leaky rectified linear units and an output layer with 8 linear units corresponding to the 8 combinations of the 3 available actions, i.e., left, right, and shooting [12].
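The 8 output units follow from treating each of the 3 buttons as either pressed or released. The short Python sketch below enumerates these combinations and, as a rough sanity check, traces the spatial size of the feature maps through the three convolution and pooling stages. The stride of 1, the absence of padding, and floor division in pooling are assumptions made here, since the text does not state them; the names `BUTTONS`, `ACTIONS`, and `feature_map_size` are illustrative.

```python
from itertools import product

# The three buttons named in the text; each is released (0) or pressed (1).
BUTTONS = ("left", "right", "shoot")
ACTIONS = [list(combo) for combo in product((0, 1), repeat=len(BUTTONS))]


def feature_map_size(height, width, filter_widths=(7, 4, 2), pool=2):
    """Spatial size after each convolution + max-pooling stage.

    Assumes stride-1 convolutions without padding and floor division in
    pooling; the source does not state these details.
    """
    for f in filter_widths:
        height, width = height - f + 1, width - f + 1   # valid convolution
        height, width = height // pool, width // pool   # max pooling
    return height, width


final_h, final_w = feature_map_size(45, 60)
flattened_units = 32 * final_h * final_w  # inputs to the fully connected layer
```

Under these assumptions a 45 x 60 input shrinks to a small grid before the fully connected layer; a different stride or padding choice would change the numbers but not the count of 8 actions.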
Deep Q-learning, a method of deep reinforcement learning (see Section 2), is used to learn the policy. For the experiments, the problem is modeled as a Markov decision process (MDP). An $\epsilon$-greedy policy with linearly decaying $\epsilon$ is used to select actions. The convolutional neural network is used to approximate the Q-function and is trained with stochastic gradient descent [30]. In addition, a replay memory is used to store the game transitions.
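A linear decay of the exploration rate can be sketched as follows. The end value of 0.1 and the number of decay steps are placeholders chosen for illustration, since the text only states that the decay is linear; `linear_epsilon` is a hypothetical helper name.

```python
def linear_epsilon(step, decay_steps, start=1.0, end=0.1):
    """Linearly interpolate epsilon from `start` to `end` over `decay_steps`.

    After `decay_steps` learning steps epsilon stays at `end`. The default
    `end` value is an assumption, not a setting reported in the paper.
    """
    progress = min(max(step, 0) / decay_steps, 1.0)
    return start + (end - start) * progress


# Example: exploration rate at a few points of a hypothetical 20,000-step decay.
schedule = [linear_epsilon(s, decay_steps=20_000) for s in (0, 10_000, 20_000, 40_000)]
```

The value returned here would be passed as the exploration probability when selecting each action during training.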

Experiments and Results
The main objective of the experiments is to determine an optimal skip count scale (range) for developing well-balanced and robust agents or bots and to show how the number of skipped frames affects the learning process, using a 3D game AI research platform such as ViZDoom.

Experiment-1 (Step-Size=2,000). The effect of skip counts is determined by training the agent for each skip count for up to 20 epochs. The discount factor is set to $\gamma = 0.99$, the learning rate to $\alpha = 0.00025$, the replay memory capacity to 10,000 elements, the resolution to (45, 60), and the minibatch size to 32. Each time, the agent learned for 40,000 steps, each consisting of executing an action, observing a transition, and updating the network. To monitor how the number of skip counts affects the learning process while the agent learns, 100 test episodes are played after every 2,000 learning steps and also after the agent is fully trained. All of the experiments are performed in PyCharm 2017.2 (professional version) using ViZDoom 1.1.5, OpenCV 3.3 [31], CMake 2.8+, GCC 4.6+, and Python 3.6 (64-bit) with NumPy on an Ubuntu 16.04.3 server with an Intel Core i7-7700 CPU @ 3.60 GHz x 8 and an NVIDIA GeForce GTX 1080/PCIe/SSE2 GPU for processing CNNs. The whole learning and testing process lasted approximately 2 hours and 30 minutes for Figure 4 and approximately 1 hour and 30 minutes for Figure 5, playing more than 35,000 game episodes in total.
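The settings listed above can be collected in a small configuration object, which also makes the step arithmetic explicit (20 epochs of 2,000 steps give 40,000 learning steps). The class and field names below are illustrative choices for this sketch, not identifiers from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters reported for Experiment-1; field names are illustrative."""

    steps_per_epoch: int = 2_000        # the "step size" of the experiment
    epochs: int = 20
    test_episodes_per_epoch: int = 100
    discount: float = 0.99
    learning_rate: float = 0.00025
    replay_capacity: int = 10_000
    resolution: tuple = (45, 60)
    batch_size: int = 32

    @property
    def total_learning_steps(self):
        return self.steps_per_epoch * self.epochs

    @property
    def total_test_episodes(self):
        # Counts only the per-epoch tests, not the final test after training.
        return self.test_episodes_per_epoch * self.epochs


experiment_1 = ExperimentConfig()
```

Changing only `steps_per_epoch` is enough to describe a run with a different step size while keeping every other setting fixed.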
In Figure 4(L), the x-axis denotes the learning steps and y-axis denotes average learning results of the agent.The skip count legend shows the labels for 19 different skip counts considered during the experiments.
The performance of the agent for each considered skip count can be observed in the graph, where the agent gradually learns to approach the perfect score. However, the average learning score (result) is not high for all skip counts, as some skip counts give poor performance, even below a score of 50. In contrast, the results for the skip counts that are considered optimal (via experiments) are high and reach above 70, as can be clearly observed in graph (L).
In Figure 4(R), the x-axis represents the testing steps and the y-axis represents the average test score (results) of the agent.
To verify the agent's learning ability, the agent is tested on the same scenario(s) on which it was trained. It is observed that the shooting performance of the agent is not highly accurate for all skip counts; the best performance is obtained for the optimal skip counts, which range from 3 to 11, as can be observed in graph (R).
To study the performance of the agent for the optimal skip counts (3-11) only, a clearer and simpler view of the graph in Figure 4 is provided in Figure 5, which omits the skip counts that result in irrational behavior and the worst performance.

Experiment-2 (Step-Size=6,000). A second experiment is conducted in order to validate the answer to the proposed research question. The learning rate and the experimental setup are the same as described in Experiment-1, except that the learning step-size is set to 6,000 in order to observe any improvement or change in the agent's learning and testing performance, in other words, to observe the effect of skip counts. With the learning step-size set to 6,000, each time the agent learned for 120,000 steps, each consisting of executing an action, observing a transition, and updating the network. To analyze the behavior and performance of the agent, 100 test episodes are again played after every 6,000 learning steps and also after the agent is fully trained.
This time the whole learning and testing process lasted almost 8 hours and 30 minutes, playing more than 579,292 game episodes.
Table 1 shows the agent's average final score for each skip count, together with the total number of episodes played and the total amount of time taken. It is worth noting that the "episodes" column indicates that the speed of the learning system depends greatly on the number of frames the agent is permitted to skip during learning: the higher the skip count, the more episodes are played, and vice versa. In the table, the "average final score" column presents the final performance of the agent for each skip count, where the highest scores for the optimal skip count scale (range 3-11) are in italic font.
In Figure 6, the x-axis signifies the skip counts and the y-axis signifies the average final scores of the agent. The graph shows the performance (average scores) for all the considered skip counts where only the data points on or above the dotted [...] 2016), in which a neural network architecture comprising two convolutional layers was employed in the experiments, which provided the basis for suggesting an optimal skip count scale of 4 to 10.
However, in this paper, the proposed neural network architecture consists of three convolutional layers, with different learning and game settings. The optimal skip count scale is thus determined on a neural network architecture of three convolutional layers with modified hyperparameters, and according to the experiments and results it is proposed that the optimal skip count scale lies in the range of 3 to 11.
Furthermore, in a simple move-and-shoot basic scenario, no reward shaping is applied; reward shaping does not contribute to the final score but is instead used during training to help the agent understand its goal. In this type of scenario, the agent's movement matters, as the agent can move only left and right and is not allowed to move forward or backward. In addition, the experiments of Michal et al. (2016) were based on 15 skip counts, 7 of which are graphed (Figure 7). In comparison, our experiments are based on 19 skip counts (Figure 4) and were performed on a recent, powerful GPU machine. Unlike in Michal et al., the experimental environment and the learning settings are also partially different, as the learning rate is set to $\alpha = 0.00025$ and the square filter width to 2 (third layer), with a minibatch size of 32.
In this paper, the final average results for the agent trained on different skip counts are at least 10% better than the results reported by Michal et al. (2016) in [18], who faced a few sudden, but transient, drops in the best and average scores of the learning dynamics. This can be observed by comparing Figure 4 with Figure 7. Agents that learned with skip counts less than 3 are less robust and do not give accurate, high results, while agents trained with higher skip counts are more susceptible to irrational behaviors such as waiting idle or moving away from the monster, which results in larger fluctuations on the plots. Also, excessively large skip counts make the agent clumsy due to the absence of fine-grained control, which results in suboptimal final scores. On the other hand, agents trained with certain lower skip counts are found to be somewhat robust, but the learning consumes a lot of time and results in fewer played episodes. In short, a skip count in the range of 3-11 delivers the best balance between the learning speed and the final performance. The outcomes also suggest that it would be profitable to begin learning with high skip counts to exploit the steep learning curve and then progressively decrease the skip count to fine-tune the performance.
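The suggestion of starting with a high skip count and lowering it over time could, for example, be realized with a simple linear schedule over epochs. The sketch below is purely hypothetical: the function name, the linear shape, and the use of the endpoints 11 and 3 are assumptions for illustration, and such a schedule was not evaluated in the experiments reported here.

```python
def annealed_skip_count(epoch, epochs, high=11, low=3):
    """Linearly lower the skip count from `high` to `low` over training.

    Hypothetical schedule: `epoch` runs from 0 to `epochs - 1`, and the
    result is rounded to the nearest integer number of frames.
    """
    if epochs <= 1:
        return low
    fraction = epoch / (epochs - 1)
    return round(high - (high - low) * fraction)


# Skip count to use in each of 20 training epochs.
skip_schedule = [annealed_skip_count(e, epochs=20) for e in range(20)]
```

Each value in `skip_schedule` would be used as the number of frames for which every chosen action is repeated during the corresponding epoch.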

Conclusion and Future Work
In this paper, we investigated how the number of skip counts influences the learning process by employing convolutional deep neural networks (CDNN) with Q-learning and experience replay in a new game learning environment known as ViZDoom. According to the experiments, the results achieved are at least 10% better than those in the publication by Michal et al. (2016). Thus, it is concluded that skipping 3 to 11 frames is profitable for achieving human-like behavior in an agent that outperforms an average human player or inbuilt game agents. The learning steps were set to 2,000 and 6,000, with 100 test episodes after each 2,000 and 6,000 learning steps for each epoch; in future work these settings will be made dynamic and larger for different scenarios (collections of maps) such as deadly corridor, defend the center, defend the line, and health gathering.

Figure 1: A sample screen from Doom showing the first-person perspective.

Figure 2: A rectangular chamber as a basic scenario.

Figure 4: The agent's learning and testing performance for 19 different skip counts.

Figure 5: The agent's learning and testing performance for the optimal skip counts (3-11) only.

Figure 6: The proposed optimal (standard) skip count scale, showing that the number of skip counts influences the learning process.

Table 1: Agent final performance for each skip count that affects the learning performance.