Increasing sample efficiency in deep reinforcement learning using generative environment modelling

Reinforcement learning is a broad family of learning algorithms that has recently shown astonishing performance in controlling agents in environments presented as Markov decision processes. Several unsolved problems in the current state of the art cause algorithms to learn suboptimal policies, or even to diverge and collapse completely. Part of the solution to these issues may lie in short- and long-term planning, memory management and exploration for reinforcement learning algorithms. Games are frequently used to benchmark reinforcement learning algorithms because they provide flexible, reproducible and easy-to-control environments. Nevertheless, few games make it possible to observe how an algorithm performs exploration, memorization and planning. This article presents the Dreaming Variational Autoencoder with Stochastic Weight Averaging and Generative Adversarial Networks (DVAE-SWAGAN), a neural network-based generative modelling architecture for exploration in environments with sparse feedback. We present deep maze, a novel and flexible maze game engine that challenges DVAE-SWAGAN in partially and fully observable state-spaces, long-horizon tasks, and deterministic and stochastic problems. We show results for different variants of the algorithm and encourage future study in reinforcement learning driven by generative exploration.

environments such as the Atari Arcade (Mnih et al. 2013, 2015), AlphaZero (Silver et al. 2017), OpenAI Five and AlphaStar (Arulkumaran, Cully and Togelius 2019), but no algorithm is capable of human-level performance without extensive hardware that is accessible to only a few institutions, such as DeepMind and OpenAI. Because of this, reinforcement learning still has several open research questions to address before the general public can deploy these algorithms successfully. It is possible that many of the issues in the current state of the art can be resolved with algorithms that adequately account for planning, exploration and memory efficiency at different time horizons.
In current state-of-the-art RL algorithms, long-horizon RL tasks are difficult to master because there is as of yet no optimal exploration algorithm that is capable of proper state-space pruning. Exploration strategies such as ϵ-greedy are widely used in RL, but cannot find an adequate exploration/exploitation balance without exhaustive hyperparameter-tuning. Environment modelling is a promising exploration method where the goal is to imitate the behaviour of a target environment. By constructing an artificial model of the environment, the need to interact with the ground-truth environment is reduced significantly. This enables the RL-algorithm to explore large parts of the state space without the cost of exhausting the ground-truth environment. A balance between exploration and exploitation must be accounted for to learn the underlying dynamics of the environment. The algorithm must, therefore, select observations carefully to only learn essential features during the exploration phase.
By combining generative modelling with deep reinforcement learning, we find that it is possible for agents to learn optimal policies using only generated training data samples. The approach that we present is the dreaming variational autoencoder with two extensions, based on generative adversarial networks and stochastic weight averaging. We also present a new learning environment, deep maze, that aims to bring a vast set of challenges for reinforcement learning algorithms and is the environment used for testing the presented algorithms.
This article is organized as follows. Section 2 surveys recent study in the field. Section 3 briefly introduces the reader to preliminaries.
Section 4 proposes the Dreaming Variational Autoencoder for environment modelling to improve exploration in RL. Section 5 introduces the deep maze learning environment for exploration, planning and memory management research for reinforcement learning. Section 6 presents results in the deep line wars environment and shows that RL agents can be trained to navigate through the deep maze environment using only artificial training data. Finally, we summarize our findings and outline future study in Section 7.

| LITERATURE REVIEW
In machine learning, the goal is to create an algorithm that is capable of accurately representing a target environment. There is, however, little literature on generative modelling for game environments on the scale we propose in this article. The primary focus of recent RL research has been on improvements to on- and off-policy algorithms, while less attention has been paid to generative exploration. This section reviews the literature on reinforcement-based generative modelling.
Bangaru, Suhas and Ravindran (2016) proposed a method of deducing the Markov Decision Process (MDP) by introducing an adaptive exploration signal (pseudo-reward), which was obtained using a deep generative model. Their approach was to compute the Jacobian of each state and use it as the pseudo-reward when using deep neural networks to learn the state-generalization.
Xiao and Kesineni (2016) proposed the use of generative adversarial networks (GAN) for model-based reinforcement learning. The goal was to use a GAN to learn the dynamics of the environment over a short horizon and to combine this with the strength of far-horizon value iteration RL algorithms. The proposed GAN architecture produced near-authentic generated images.
Chua, Calandra, McAllister and Levine (2018) recently proposed Probabilistic Ensembles with Trajectory Sampling (PETS). The algorithm uses an ensemble of bootstrap neural networks to learn a dynamics model of the environment over future states, and then uses this model to predict the best action for future states. The authors show that the algorithm significantly lowers sampling requirements for environments such as half-cheetah compared to SAC and PPO.
Stochastic optimal control with latent representations (SOLAR) is an algorithm that learns the dynamics of the environment by exploiting the knowledge from a reinforcement learning policy. This enables the algorithm to learn local models, which are used in policy learning for complex systems. SOLAR is built around a probabilistic graphical model (PGM) structure that allows efficient learning of the environment model. By exploiting the locality of the model, the authors show that the gradients give a good direction for policy improvements during training. The algorithm was compared to model-free methods and showed significantly better performance and data efficiency than algorithms such as LQR-FLM (Zhang, Patras and Haddadi 2018).
Ha and Schmidhuber (2018) proposed, in Recurrent World Models Facilitate Policy Evolution, a novel architecture for training RL algorithms using variational autoencoders. The article showed that agents could successfully learn the environment dynamics and use this as an exploration technique requiring no interaction with the target domain. The architecture consists of three main components: vision, controller and model. The vision component is a variational autoencoder that outputs a latent-space variable of an observation. The latent-space variable is processed in the model and is fed into the controller for action decisions. Their algorithms show state-of-the-art performance in self-supervised generative modelling for reinforcement learning agents.
One of the most recent advancements in generative modelling for reinforcement learning is the Neural Differential Information Gain Optimisation (NDIGO) algorithm by Azar et al. (2019), a self-supervised exploration model that learns a world model representation from noisy data. The primary feature of NDIGO is its robustness to noise, which comes from its method of cancelling out negative loss and giving positive learning more value. The authors show in their maze environment that the model successfully converges towards an optimal world model even when noise is introduced. The authors claim that the algorithm outperforms the previous state of the art, the Recurrent World Model from Ha and Schmidhuber (2018).

| BACKGROUND
We base our work on the well-established theory of reinforcement learning originally formulated in Sutton, Precup and Singh (1999), defining the problem as an MDP given by the tuple $\langle S, A, \mathcal{R}, T \rangle$. The state-space $S$ represents all possible states, while the action-space $A$ represents all actions the agent can perform in the environment. $\mathcal{R}$ is the reward function, while $T$ denotes the transition function $T : S \times A \to S$, which is a mapping from state $s_t \in S$ and action $a_t \in A$ to the future state $s_{t+1}$. After each performed action, the environment dispatches a reward signal, $\mathcal{R} : S \to r$.
We call a sequence of states and actions a trajectory, denoted $\tau = (s_0, a_0, \ldots, s_t, a_t)$. The sequence is sampled through the use of a stochastic policy that predicts the optimal action in any state: $\pi_\theta(a_t \mid s_t)$, where $\pi$ is the policy and $\theta$ are the parameters. The primary goal of reinforcement learning is to reinforce good behaviour. The algorithm should try to learn the policy that maximizes the total expected discounted reward (Mnih et al. 2015).
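As a small illustration of the quantity being maximized, the discounted reward of a single trajectory can be computed as below. This is a minimal sketch rather than the authors' implementation: the discount factor `gamma` and the function name `discounted_return` are assumptions, since the text does not name them.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * rewards[t] for one trajectory.

    `gamma` is an assumed discount factor; the source does not state a value.
    """
    total = 0.0
    # Walk backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A policy that maximizes the expected value of this quantity over sampled trajectories is the learning target described above.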
Several algorithms try to address the problem of reinforcement learning. They are commonly divided into three categories: Value Iteration, Policy Iteration and Actor-Critic algorithms, with variations of on-policy and off-policy learning.
Note: The purpose of this illustration is to show that the proposed algorithm uses generative modelling to inherit sample efficient generative modelling for sequential data for use in reinforcement learning algorithms.
Autoencoders are commonly used in supervised learning to encode arbitrary input to a compact representation, using a decoder to reconstruct the original data from the encoding. The purpose of autoencoders is to store redundant data in a densely packed vector form. In its simplest form, an autoencoder consists of a feed-forward neural network where the input and output layers have equal neuron capacity and the hidden layer is smaller, used to compress the data. The encoder of such an autoencoder can be defined as $\phi : X \to Z$, with the decoder mapping $Z$ to $\hat{X}$. In this notation, $X$ is the input, $Z$ the encoding and $\hat{X}$ the reconstructed data. There are, however, several issues with autoencoders: they are not generative, in the sense that they cannot transition between features in the input data. To remedy this, Kingma and Welling (2013) proposed the Variational Autoencoder (VAE) algorithm, which enables interpolation between features in the latent space. The interpolation is done by introducing a vector of means $\mu$ and a vector of standard deviations $\sigma$. These vectors, along with the additional KL-loss, give the algorithm the ability to learn variations in the input data, enabling the output to be vastly more diverse.
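The role of the mean and standard-deviation vectors can be sketched in a few lines. The example below assumes a diagonal Gaussian posterior and a standard normal prior, as in Kingma and Welling (2013); the function names are illustrative and not taken from the source.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], sigma: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu: Sequence[float], sigma: Sequence[float]) -> float:
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1) - ln(sigma); requires sigma > 0.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))
```

The KL term is zero when the encoder outputs exactly the prior (mean 0, standard deviation 1) and grows as the encoding moves away from it, which is what keeps neighbouring latent points decodable.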
Our background for choosing the following branch of algorithms is described in Table 1. Algorithms based on reinforcement learning learn by hands-on interaction. Experience is attained by exploring the environment, but in some cases, the environment may be exhausted before the agent can learn to behave optimally. To remedy this, we introduce a model-based exploration approach that uses a VAE to model the environment with sequential data as input. This increases the sampling efficiency of the RL algorithm by a significant amount. To increase the quality of the environment model, a GAN is used alongside the VAE, so that the VAE and GAN continually reinforce each other's performance.

| THE DREAMING VARIATIONAL AUTOENCODER
The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for generating probable future states $\hat{s}_{t+n}$ from an arbitrary state-space $S$ using state-action pairs explored prior to $s_{t+n}$ and $a_{t+n}$. Figure 1 illustrates the DVAE model, and Algorithm 1 works as follows. First, the agent collects experiences for utilizing experience-replay in the Run-Agent function. At this stage, the agent explores the state-space guided by a Gaussian distributed policy. The agent acts, observes, and stores the observations into the experience-replay buffer $D$. After the agent reaches a terminal state, the DVAE algorithm encodes state-action pairs from the replay-buffer $D$ into probable future states. These are stored in the replay-buffer for artificial future states, $\hat{D}$.

ALGORITHM 1 (fragment) for i = 0 to N_EPOCHS do
FIGURE 1 Illustration of the DVAE model. The model consumes state and action pairs, yielding the input encoded in latent-space. Latent-space can then be decoded to a probable future state. $Q(z \mid X)$ is the encoder, $z_t$ is latent-space and $P(X \mid z)$ is the decoder. DVAE can also use LSTM to better learn longer sequences in continuous state-spaces.
$T_\theta = P(X \mid z)$ is the decoder. With state $s_0$ and action $A_{right}$ as input, the algorithm generates state $\hat{s}_1$, which, as can be observed in the table, is similar to the real state $s_1$. With the next input, $A_{down}$, the DVAE algorithm generates the next state $\hat{s}_2$, which again can be observed to be equal to $s_2$.
Note that this is without ever observing state $s_1$. Hence, the DVAE algorithm needs to be initiated with a state, for example $s_0$, followed by a sequence of actions. It then generates (dreams) the next states.
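The dreaming procedure described here, starting from one real state and feeding each prediction back in as the next input, can be sketched as follows. This is an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the learned transition model, and states are reduced to a player position on a grid rather than images.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, int]  # (row, column) of the player in a toy grid


def toy_model(state: State, action: str) -> State:
    """Hypothetical stand-in for the learned transition model."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    d_row, d_col = moves[action]
    return (state[0] + d_row, state[1] + d_col)


def dream(model: Callable[[State, str], State], s0: State,
          actions: Sequence[str]) -> List[State]:
    """Roll the model forward, feeding every predicted state back in as input."""
    dreamed: List[State] = []
    state = s0
    for action in actions:
        state = model(state, action)  # the prediction becomes the next input
        dreamed.append(state)
    return dreamed
```

For example, `dream(toy_model, (0, 0), ["right", "down"])` returns the two dreamed states without the real environment being queried after the initial state.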
The requirement is that the environment must be partially discovered so that the algorithm can learn to behave similarly to the target environment. To predict a trajectory of three timesteps, the algorithm nests its own predictions to generate the whole sequence: $\hat{s}_3 = T_\theta(T_\theta(T_\theta(s_0, a_0), a_1), a_2)$.
To combat the divergence behaviour in continuous state-spaces, we extend the model to use generative adversarial networks. The most common cases of divergence are found where there are (a) complex state-transitions, (b) many state-transitions and (c) sparse state-transitions. In general, we find these properties in continuous and stochastic environments. By using an adversarial approach, we see that DVAE-GAN generalizes better for such problems and is far more stable due to increased diversity compared to DVAE (see Table 2 for a detailed comparison). Figure 3 illustrates the proposed DVAE-GAN extension to the original DVAE architecture. In this model, two new components are introduced: the generator $G$ and the discriminator $D$. This is an adversarial approach previously proposed by Makhzani, Shlens, Jaitly, Goodfellow and Frey (2015). The generator $G(n \mid S_t, A_t)$ samples from a Gaussian distribution, while being conditioned on the current state and action, to predict the latent-space distribution $z_{gan}$. The discriminator $D(z_{vae}, z_{gan})$ is a neural network that tries to predict the validity of its input, in this case whether the latent-space variable is from the ground-truth distribution. A min-max game between the generator and the discriminator fuels learning: the generator minimizes its error towards the real latent-space, and the discriminator learns to distinguish between the real and fake latent distributions. In the VAE model, the latent-space variable is sampled as $z_{vae} = \mu_t + (\sigma_t \times \mathcal{N}(0, 1))$, as proposed by Kingma and Welling (2013). The discriminator should then evaluate $z_{vae}$ to be genuine, and its parameters are updated according to the confidence of the prediction.
To increase the stability of the VAE, we add the loss of the discriminator to the VAE loss, where we use the original loss function first proposed by Kingma and Welling (2013).
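The min-max game and the combined VAE loss can be illustrated with scalar losses. The sketch below is not the authors' implementation: it assumes binary cross-entropy for the discriminator, a non-saturating generator loss, and a hypothetical `weight` on the discriminator term, none of which are specified in the text.

```python
import math


def bce(prediction: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one probability, clipped for numerical safety."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """d_real is the discriminator output for the VAE latent, d_fake for the
    generator latent. The VAE latent should be judged genuine (target 1)."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake: float) -> float:
    """The generator improves when its latent is mistaken for a genuine one."""
    return bce(d_fake, 1.0)


def vae_total_loss(reconstruction: float, kl: float, d_loss: float,
                   weight: float = 1.0) -> float:
    """VAE loss with the discriminator loss added; `weight` is an assumption."""
    return reconstruction + kl + weight * d_loss
```

A confident, correct discriminator yields a low `discriminator_loss` and a high `generator_loss`, which is the opposing pressure that drives both networks.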

| Stochastic weight averaging approach (DVAE-SWAGAN)
FIGURE 3 The proposed DVAE-GAN architecture. While the original VAE architecture persists, as described in Algorithm 1, a new generative adversarial networks component is added for increased generalization across the latent-space.
We introduced DVAE-GAN, which tries to combat divergence in complex environments. The GAN architecture increases the diversity of the latent-space, but model collapse and instability become a problem. Table 2 gives a quick overview of the benefits of each of the introduced generative models. VAE is known for its stability but has limited latent-space capacity. The quality of VAE output is good but suffers from severe blurring of the decoded latent-space. Compared to other algorithms, the VAE architecture does not generalize well to a diverse data set. Generative adversarial networks have received increased attention due to their diverse latent-space. Images produced by GANs are known for their sharpness, but they suffer from artefacts. These networks are state-of-the-art in the sense that they generalize well to the data set. DVAE-GAN, which follows the approach of Makhzani et al. (2015), has few artefacts, which is an improvement over the regular GAN architecture. There is, however, high variance in the model, which makes it unstable beyond what is seen in vanilla VAE and GAN. To reduce the model instability, we introduce stochastic weight averaging (SWA), first proposed by Izmailov, Podoprikhin, Garipov, Vetrov and Wilson (2018). DVAE-SWAGAN is a combination of VAE, SWA and GAN that significantly reduces the variance of the predictions. The algorithm is in principle the averaged ensemble of DVAE-GAN along the trajectory of SGD. We use a cyclical learning rate (Smith 2015) and average the weights at each training iteration, creating the DVAE-SWAGAN model seen in Figure 4.
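The two ingredients named above, a cyclical learning rate and a running average of the weights, can be sketched independently of any deep learning library. The triangular schedule shape, the parameter names and the `WeightAverager` class are assumptions made for illustration; weights are represented as a flat list of floats.

```python
from typing import List


def cyclical_lr(step: int, cycle_len: int, lr_min: float, lr_max: float) -> float:
    """Triangular schedule: lr_min -> lr_max -> lr_min over `cycle_len` steps."""
    position = (step % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return lr_min + (lr_max - lr_min) * (1.0 - abs(2.0 * position - 1.0))


class WeightAverager:
    """Keeps the running mean of every weight vector it has been shown."""

    def __init__(self) -> None:
        self.count = 0
        self.average: List[float] = []

    def update(self, weights: List[float]) -> None:
        if self.count == 0:
            self.average = list(weights)
        else:
            n = self.count
            # Incremental mean: new_avg = (old_avg * n + w) / (n + 1)
            self.average = [(avg * n + w) / (n + 1)
                            for avg, w in zip(self.average, weights)]
        self.count += 1
```

In a training loop, `update` would be called with the current model weights after each optimizer step, and `average` would be used for prediction in place of the last iterate.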

| ENVIRONMENTS
The DVAE algorithm was tested in two environments. The first is deep line wars, a simplified real-time strategy game introduced by Andersen et al. (2017). The second is deep maze, a flexible environment that we present here, with a wide range of challenges suited for reinforcement learning research. Deep line wars is a continuous environment where complex strategies must be planned. The deep maze environment provides simpler rules and a non-continuous action and state-space, and is in comparison far simpler than deep line wars.

| The deep maze environment
FIGURE 4 By using ensembles of DVAE-GAN models, the training is significantly more stable and prediction across sparse states yields better results for sequential input.
The deep maze is a flexible learning environment for controlled research in exploration, planning and memory for reinforcement learning algorithms. Maze solving is a well-known problem and is used heavily throughout the RL literature (Sutton et al. 1999), but is often limited to small and fully observable scenarios. The deep maze environment extends the maze problem to over 540 unique scenarios, including Partially Observable Markov Decision Processes (POMDPs). Figure 5 illustrates a small subset of the available environments in the deep maze, ranging from small-scale MDPs to large-scale POMDPs. The deep maze further features custom game mechanics such as relocated exits and dynamically changing mazes. RL agents depend on sensory input to evaluate and predict the best action at the current timestep. Preprocessing of data is essential so that agents can extract features from the input data. For this reason, deep maze has built-in state representations for images and raw state matrices. The game engine is modularized and has an SDK that enables the development of third-party scenarios. This extends the capabilities of deep maze to support nearly all possible scenario combinations in the realm of maze solving.
The deep maze learning environment presents the following scenarios: (a) Normal, (b) POMDP, (c) Limited POMDP and (d) Timed Limited POMDP. The first mode exposes a seed-based randomly generated maze where the state-space is an MDP. The second mode narrows the state-space observation to a configurable area around the player. In addition to radius-based vision, the POMDP mode also features ray-tracing vision that better mimics the sight of a physical agent. The third and fourth modes are intended for memory research, where the agent must find the goal in a limited number of time-steps. In addition, the agent is presented with the solution, which fades after a few initial time steps. The objective is for the agent to remember the solution to find the goal. All scenario setups have a variable map-size ranging between 2 × 2 and 56 × 56 tiles.
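The radius-based vision of the POMDP mode can be illustrated on a raw state matrix. The sketch below does not use the deep maze SDK: the `HIDDEN` marker, the function name `observe` and the choice of a square (Chebyshev-distance) window are assumptions made for illustration.

```python
from typing import List, Tuple

HIDDEN = -1  # hypothetical marker for tiles outside the agent's view


def observe(maze: List[List[int]], player: Tuple[int, int],
            radius: int) -> List[List[int]]:
    """Return a copy of `maze` in which tiles farther than `radius`
    (Chebyshev distance) from the player are replaced by HIDDEN."""
    p_row, p_col = player
    return [
        [
            tile if max(abs(r - p_row), abs(c - p_col)) <= radius else HIDDEN
            for c, tile in enumerate(row)
        ]
        for r, row in enumerate(maze)
    ]
```

With a radius at least as large as the map, the observation equals the full state matrix, which corresponds to the fully observable Normal mode.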
FIGURE 7 The training loss for DVAE in the 2 × 2 No-Wall and 8 × 8 deep maze scenarios. The experiment is run for a total of 1000 (5000 for 8 × 8) epochs. The model trains on only 50% of the state-space in the 2 × 2 environment, while the whole state-space is trainable in the 8 × 8 environment.
FIGURE 8 For the 2 × 2 scenario, only 50% of the environment is explored, leaving artefacts on states where the model is uncertain of the transition function. In more extensive examples, the player disappears, teleports or gets stuck in unexplored areas.
FIGURE 6 The graphical user interface of the deep line wars environment.

| The deep line wars environment
The deep line wars environment was originally introduced in Andersen et al. (2017). Deep line wars is a real-time strategy environment that makes an extensive state-space reduction to enable swift research in reinforcement learning for RTS games.
The game objective of deep line wars is to invade the enemy player with mercenary units until all health points are depleted, see Figure 6.
For every friendly unit that enters the far edge of the enemy base, the enemy health pool is reduced by one. When a player purchases a mercenary unit, it spawns at a random location inside the edge area of the buyer's base. Mercenary units automatically move towards the enemy base.
To protect the base, players can construct towers that shoot projectiles at the opponent's mercenaries. When a mercenary dies, a fair percentage of its gold value is awarded to the opponent. When a player sends a unit, the player's income is increased by a portion of the unit's gold value. As a part of the income system, players gain gold at fixed intervals.

| EXPERIMENTS
We centre our experiments around the DVAE, DVAE-GAN, DVAE-SWA and DVAE-SWAGAN. The goal of the proposed extensions is to improve the model stability so that the DVAE algorithm can produce better quality output for continuous and sparse state-spaces.
In this section, we show results of model-based reinforcement learning using DVAE in the deep maze and deep line wars environments.
We show the performance of encoding raw pixel input to a compact representation and to decode this representation to probable future states.
FIGURE 9 Results of 8 × 8 Deep Maze modelling using the DVAE algorithm. To simplify the environment, no reward signal is received per iteration. The left caption describes the current state, $s_t$, while the right caption is the action performed to compute $s_{t+1} = T(s_t, a_t)$.
TABLE 3 Results of the deep maze 11 × 11 and 21 × 21 environments, comparing DQN (Mnih et al. 2015), TRPO (Schulman et al. 2015) and PPO (Schulman et al. 2017).
FIGURE 10 A typical deep maze of size 11 × 11. The lower-right square indicates the goal state, the dotted line is a retrace of the predicted optimal path for that maze, while the final square represents the player's current position in the state-space. The controller agent is DQN, TRPO and PPO (from left to right).
FIGURE 11 The DVAE algorithm applied to the deep line wars environment. Each epoch illustrates the quality of generated states in the game, where the left image is the real state $s$ and the right image is the generated state $\hat{s}$.
6.1 | The dreaming variational autoencoder

| Deep maze
For the deep maze environment, the algorithm must be able to generalize over many similar states to model a large state-space. DVAE aims to learn the transition function $T(s_t, a_t)$, bringing the state from $s_t$ to $s_{t+1}$. We use the deep maze environment because it provides simple rules with a controllable state-space complexity. Also, we can omit the importance of reward for some scenarios.
We trained the DVAE model on two No-Wall deep maze scenarios of size 2 × 2 and 8 × 8. For the encoder and decoder, we used the same convolution architecture as proposed by Pu et al. (2016) and trained for 5000 epochs in the 8 × 8 scenario and 1000 epochs in the 2 × 2 scenario. The epoch counts differ because we expect simpler environments to converge faster. For the encoding of actions and states, we concatenate the flattened state-space and action-space and pass the result through a fully connected layer with ELU activation before calculating the latent-space. We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 3e-05 to update the parameters. To calculate the loss, we used the same loss function as proposed by Kingma and Welling (2013). Figure 7 illustrates the loss during the training phase for the DVAE algorithm in the No-Wall Deep Maze scenario. In the 2 × 2 scenario, DVAE is trained on only 50% of the state space, which results in noticeable graphics artefacts in the prediction of future states, as seen in Figure 8.
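The encoding step described above, concatenating the flattened state with the action and applying a fully connected ELU layer, can be sketched without a deep learning framework. The one-hot action encoding, the function names and the list-of-lists weight layout are assumptions; the source does not specify how the action is represented.

```python
import math
from typing import List, Sequence


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def encoder_input(state: Sequence[Sequence[float]], action: int,
                  n_actions: int) -> List[float]:
    """Flatten the state matrix and append a one-hot action vector (assumed encoding)."""
    flat_state = [value for row in state for value in row]
    one_hot = [1.0 if i == action else 0.0 for i in range(n_actions)]
    return flat_state + one_hot


def dense_elu(inputs: Sequence[float], weights: Sequence[Sequence[float]],
              biases: Sequence[float]) -> List[float]:
    """Fully connected layer followed by ELU; one weight row per output unit."""
    return [elu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]
```

The output of `dense_elu` would then feed the layers that produce the latent mean and standard deviation.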
In the 8 × 8 environment, the algorithm is allowed to train on all possible states, and we observe in Figure 9 that there is significantly better image quality throughout the sampling process.
The goal of this experiment is to observe the performance of RL agents using the generated experience-replay $\hat{D}$ from Algorithm 1 in Deep Maze environments of size 11 × 11 and 21 × 21. In Table 3, we compare the performance of DQN (Mnih et al. 2013), TRPO (Schulman, Levine, Abbeel, Jordan and Moritz 2015) and PPO (Schulman, Wolski, Dhariwal, Radford and Klimov 2017) using the DVAE-generated $\hat{D}$ to tune the parameters.
Because TRPO and PPO are on-policy algorithms, the states must be generated on the fly so that the algorithm remains on-policy. Figure 10 illustrates three maze variations of size 11 × 11, where the agent has learned the optimal path. We see that the best performing algorithm, PPO (Schulman et al. 2017), beats DQN and TRPO using either $\hat{D}$ or $D$. The DQN-$\hat{D}$ agent did not converge in the 21 × 21 environment; it is likely that value-based algorithms struggle to map inaccurate states with graphical artefacts generated by the DVAE algorithm. These artefacts increase the state-space significantly, but empirical data suggest that on-policy algorithms perform better on noisy state-spaces.

| Deep line wars
The DVAE algorithm works well in more complex environments, such as the deep line wars game environment (Andersen et al. 2017). Here, we extend the DVAE algorithm with LSTM to improve its capability of generating time-bound data, such as animations seen in Figure 1. Figure 11 illustrates the state quality during training of DVAE over a total of 6000 epochs. Both players draw actions from a Gaussian distributed policy. The algorithm understands that the player units can be located in any tile after only 50 epochs, and at 1000 epochs we observe that the algorithm makes significantly better predictions of the probability of unit locations (i.e. some units show more densely in the output state). At the end of training, the DVAE algorithm is to some degree capable of determining both tower and unit locations at any given time-step during the game episode.

| Extending the dreaming variational autoencoder
The goal of DVAE-SWA, DVAE-GAN and DVAE-SWAGAN is to perform better in continuous state-spaces, such as the deep line wars environment. The experiments were performed using a map size of 11 × 11 sampling actions from a PPO policy.
FIGURE 13 The first row represents the ground truth future state, while the second row is the predicted future state. The DVAE-SWAGAN algorithm sampled states after training for 1000 epochs, see Figure 12. Notice that the quality is notably better compared to Figure 11. We found that states were generalized too much, producing some inaccurate predictions. Despite this inaccuracy, the RL agents learned a policy capable of beating random agents. We believe this is because similar states often represent similar value functions.
Figure 12 shows the training loss of the algorithms DVAE, DVAE-SWA, DVAE-GAN and DVAE-SWAGAN for 1000 epochs (x-axis), where the y-axis describes the loss value. The new architectures perform significantly better than DVAE across all loss components of the architecture.
The consequence of lower loss is better image quality (autoencoder loss) and better transitions (variational loss). For these experiments, we tried to model the deep line wars environment using RGB input. Figure 13 illustrates the resulting images for each of the algorithms. Here, we see that DVAE-SWAGAN performs best in terms of both quality and accuracy.

| CONCLUSION AND FUTURE STUDY
This article introduces The Dreaming Variational Autoencoder along with its extensions DVAE-SWA, DVAE-GAN and DVAE-SWAGAN as a neural network-based generative modelling architecture to enable exploration in environments with sparse reward.
The DVAE algorithm successfully generates authentic world models in non-continuous state-spaces where the dynamics of the environment are simple. It works well for small environments but is limited when the state-sequence becomes too large. The algorithm performs marginally better when using LSTM for sequence prediction, but there is a significant performance drop due to the increased model complexity. For most environments, such as deep line wars and deep maze, it is sufficient to run DVAE using only fully connected nodes.
The DVAE-SWAGAN improves the original model significantly and enables the algorithm to imitate environment models with continuous state-space better. DVAE-SWAGAN performs better in all environments including deep line wars and most environments found in the GYM reinforcement learning environment.
There are, however, several fundamental issues that limit DVAE and DVAE-SWAGAN from fully modelling environments. In some situations, exploration may be a costly act that makes it impossible to explore the environment in its entirety. The algorithms cannot accurately predict the outcome of unexplored areas of the state-space, making the prediction blurry or incorrect. To combat this, the model should be improved further to include some sense of logic and understanding of the environment dynamics. In the current state of the art, this is frequently introduced as domain knowledge that is manually crafted by the programmer, but the hope is that future research will find a method for self-supervised domain knowledge modelling.
Reinforcement learning has many unresolved problems, and the hope is that the deep maze and deep line wars learning environments can be useful tools for future research. For future study, we plan to introduce an inverse reinforcement learning component to learn the reward function $\hat{\mathcal{R}}$. We also plan to explore non-parametric variants. DVAE and environment modelling is an ongoing research question, and the goal is that reinforcement learning algorithms could utilize this form of dreaming to become far more sample efficient.
Prof. Ole-Christoffer Granmo is director and founder of the Centre for Artificial Intelligence Research (CAIR) at the University of Agder, Norway. He obtained his master's degree in 1999 and the Ph.D. degree in 2004, both from the University of Oslo, Norway. Granmo develops theory and algorithms for systems that explore, experiment and learn in complex real-world environments. His research interests include artificial intelligence, machine learning, learning automata, bandit algorithms, deep reinforcement learning, Bayesian reasoning and computational linguistics. Within these areas of research, Dr. Granmo has written more than 125 refereed journal and conference publications. He is co-founder of the Norwegian Artificial Intelligence Consortium (NORA). Apart from his academic endeavours, Granmo is also co-founder of the company Anzyz Technologies AS.