1 Introduction

In recent years, Imitation Learning (IL) has been successfully applied in many domains, including robotics (Wu et al. 2018), natural language processing (Li et al. 2019), and autonomous driving (Bhattacharyya et al. 2018). IL uses limited expert knowledge (usually expert trajectories) to recover the policy or even the environment (Shi et al. 2019; Chen et al. 2020) by matching the actions of the learned policy to those in the expert data. Inverse Reinforcement Learning (IRL; Russell 1998) is one of the successful approaches in this setting; it learns a cost function from expert demonstrations. Lately, the generative-discriminative framework (Ho and Ermon 2016) has received considerable research attention in IRL (Song et al. 2018; Chen et al. 2020), resulting in performance breakthroughs.

However, such a framework may still fail in many environments, mainly due to a poorly-learned RL generator. We conducted studies on this phenomenon and found that an imbalance between the generator and the discriminator frequently occurs in such a framework, especially in GAIL, and causes severe training problems. One of the key issues is that the supervised-learning discriminator learns much faster than the RL generator. The discriminator is designed to tell the difference between expert data and generator data. A fast learner tends to separate the two at the very beginning of training, but at that time the generator is still in its initial stage and cannot produce expert-like data. The discriminator can then easily distinguish expert data from generated data and provides uniformly low rewards to the generator without any prominent learning direction. In this case, training commonly ends with a well-trained discriminator but a poor generator that suffers from vanishing gradients. In GAIL, however, our goal is a well-trained generator, while the discriminator should act more like a teacher, guiding the generator to the ultimate convergence by giving it proper surrogate rewards that keep its gradients from vanishing.

In this paper, we show that the gradient of generator training can be enhanced by increasing the reward variance of the generator. Going a step further, we observe that in actor-critic RL algorithms the advantage function drives the policy update, so tuning the variance of the advantage value is a more direct way to guide the RL agent. We prove that the variance of the advantage value mainly comes from two aspects: (1) the cumulative reward of the generated state-action pairs; (2) the value function of the states. Thus, the optimization can be staged into two steps: (1) optimizing the value network to obtain the optimal value function parameters; (2) regularizing the discriminator network to obtain the best variance of the advantage value. We therefore present our algorithm, GAIL with variance regularization (GAIL-VR), to perform this optimization.

Experiments on various tasks are conducted. An average 30% convergence speedup is recorded, as well as imitation score increases in at least 4 Atari games and 4 MuJoCo/PyBullet locomotion environments. We also find that GAIL-VR helps prevent the reward drop that occurs after extensively long training and performs better under data or parameter constraints, indicating that GAIL-VR is more adaptable to complex environments.

2 Background

2.1 Markov decision process and reinforcement learning

A Markov Decision Process (MDP; Sutton and Barto 1998) is often formalized by a tuple (\({\mathcal {S}}, {\mathcal {A}}, P, r, \rho _0, \gamma\)), where \({\mathcal {S}}\) is the state space, \({\mathcal {A}}\) the action space, \(P(s'|s,a): {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow [0,1]\) the transition distribution, \(r(s,a):{\mathcal {S}} \times {\mathcal {A}} \rightarrow {\mathbb {R}}\) the reward function, \(\rho _0(s): {\mathcal {S}}\rightarrow [0,1]\) the initial state distribution, and \(\gamma \in [0,1)\) the discount factor. A policy \(\pi (a|s):{\mathcal {S}}\times {\mathcal {A}}\rightarrow [0,1]\) is the distribution over actions conditioned on the current state. The discounted cumulative return is \({\hat{R}}(s_0, a_0) = \sum _{t=0}^{T}\gamma ^tr(s_t,a_t)\), where \(a_t \sim \pi (a_t|s_t), s_{t+1} \sim P(s_{t+1}| s_t, a_t)\). \(Q^\pi (s, a) = {\mathbb {E}}[{\hat{R}}(s_0, a_0)|s_0=s,a_0=a]\) is the action value function that measures the expected return at the state-action pair \((s,a)\). \(V^\pi (s) = \mathop {{\mathbb {E}}}_{a\sim \pi (a|s)}[Q^\pi (s, a)]\) is the value function. The advantage function \(A^\pi (s,a) = Q^\pi (s,a)-V^\pi (s)\) denotes the advantage of action a at state s. We mainly focus on actor-critic algorithms (Schulman et al. 2017, 2015), which consist of two components: actors that execute the policies and critics that score the policies.
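To make these quantities concrete, the following NumPy sketch (our illustration; the toy numbers are arbitrary) computes the discounted return \({\hat{R}}\) and a Monte-Carlo estimate of the advantage \(A^\pi\) for a single trajectory, given per-step rewards and value estimates.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_hat(s_t, a_t) = sum_{k >= t} gamma^(k - t) * r(s_k, a_k)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy trajectory: per-step rewards and (assumed) value estimates V(s_t).
rewards = np.array([0.1, 0.0, 0.5, 1.0])
values = np.array([0.6, 0.5, 0.9, 0.8])

returns = discounted_returns(rewards)
advantages = returns - values   # Monte-Carlo estimate of A(s_t, a_t)
print(returns, advantages)
```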

2.2 Generative adversarial imitation learning (GAIL)

Imitation learning (IL) recovers the expert policy from demonstrations and does not need any information about the reward function. Current IL methods can be roughly divided into two categories: Behaviour Cloning (BC) and Inverse Reinforcement Learning (IRL) (Ng and Russell 2000; Abbeel and Ng 2004). BC imitates the expert by maximizing the likelihood of the state-action pairs in the demonstrations. However, since the objective of BC does not consider the distribution the policy will generate in the environment, BC often suffers from covariate shift issues (Ross and Bagnell 2010; Ross et al. 2011).

In contrast to BC, IRL does not directly optimize the policy according to the expert data. In IRL, a reward function is first recovered, and the policy is then optimized with this reward function via reinforcement learning, instead of the supervised learning used in BC. Recently, several methods based on the adversarial framework have been proposed and outperform conventional IRL methods in terms of data efficiency (Ho and Ermon 2016; Finn et al. 2016b, a; Fu et al. 2017). Among these methods, Generative Adversarial Imitation Learning (GAIL; Ho and Ermon 2016) adopts a discriminator network \(D(s,a): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) to provide the agent with rewards. The target of the discriminator is to distinguish whether a state-action pair comes from the expert demonstrations or is generated by the agent. The optimization objective of the discriminator \(D(s,a)\) is formalized as a standard cross-entropy loss:

$$\begin{aligned} \mathop {\max }_{D}~~\mathop {{\mathbb {E}}}_{s,a\in {\mathcal {D}}^\pi }\left[ \log D(s,a)\right] + \mathop {{\mathbb {E}}}_{s,a\in {\mathcal {D}}^E}\left[ \log (1 - D(s,a))\right] , \end{aligned}$$

where \({\mathcal {D}}^\pi\) and \({\mathcal {D}}^E\) denote the state-action pair sets generated by the agent and the expert respectively. The reward of a state-action pair measures its similarity to the expert data: \(r(s,a)=-\log (1-D(s,a))\). To minimize the discrepancy between the distributions generated by the agent and the expert, the agent needs to maximize its cumulative reward, which can be achieved via RL. Hence, the optimization objective of the agent can be written as:

$$\begin{aligned} \mathop {\max }_{\pi } ~~\mathop {{\mathbb {E}}}_{s,a\in {\mathcal {D}}^\pi }\left[ - \log (1-D(s,a))\right] . \end{aligned}$$
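For concreteness, the following PyTorch sketch (our illustration, not the authors' code) implements the discriminator objective and the surrogate reward \(r(s,a)=-\log (1-D(s,a))\) following the sign conventions of the equations above; the network architecture, batch handling, and the small epsilon for numerical stability are our own choices.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # D(s, a) in (0, 1)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(D, s_pi, a_pi, s_exp, a_exp, eps=1e-8):
    # Maximize E_pi[log D] + E_E[log(1 - D)]  <=>  minimize the negation.
    d_pi, d_exp = D(s_pi, a_pi), D(s_exp, a_exp)
    return -(torch.log(d_pi + eps).mean() + torch.log(1.0 - d_exp + eps).mean())

def gail_reward(D, s, a, eps=1e-8):
    # r(s, a) = -log(1 - D(s, a)), used as the RL reward for the generator.
    with torch.no_grad():
        return -torch.log(1.0 - D(s, a) + eps)
```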

3 Related work

The theoretical convergence of GAIL has been proved. Chen et al. (2020) revealed that the convergence of the minimax optimization in GAIL can be proved using stochastic first-order optimization algorithms. However, since there is a considerable gap between the theoretical convergence and the actual performance of GAIL, several modifications focus on sample efficiency and convergence in various environments. Kostrikov et al. (2019) addressed the reward bias introduced by the discriminator in GAIL, and pointed out that an unbiased reward can achieve better performance in environments with short episodes that require quick actions. However, this improvement does not consider long-term convergence or stability, and works in only a few environments. Peng et al. (2019) focused on the information given to the generator and the discriminator. They borrowed the idea of the information bottleneck (Slonim and Tishby 2000) to encode data into a latent space and constrain the information given to the generator and the discriminator. But this work restricts the problem to transfer learning and does not have universal application potential across IRL problems. Baram et al. (2017) tried to address the fragile convergence of GAIL by training the policy with the exact gradient of the discriminator, similar to the training of a GAN; they introduce a forward model and use Stochastic Value Gradient (Heess et al. 2015) as the base RL algorithm. Besides, Fu et al. (2017) and Geng et al. (2020) tried to recover the ground-truth reward under certain conditions; recovering such a reward function improves the transferability of the learned reward.

To the best of our knowledge, no prior work directly considers the imbalance between the generator and the discriminator and the resulting loss of crucial gradients, which are the key issues our work focuses on.

4 GAIL with variance regularization

In this section, we focus on two main issues of GAIL under the generator-discriminator framework and depict the difference between GAIL and GAN, which significantly influences performance.

4.1 Learning speed imbalance between generator and discriminator

In GAIL, the generator interacts with the discriminator in an adversarial manner. The discriminator tries to give low rewards to the generator by solving a classification problem. Meanwhile, the generator aims at obtaining higher rewards via reinforcement learning, e.g., policy gradient. One significant difference between GAN and GAIL is that in GAN both the generator and the discriminator are trained on a classification objective by supervised learning, whereas in GAIL the discriminator still uses supervised learning but the generator is trained by RL using rewards. However, the variance of the gradient estimate in reinforcement learning is much higher than in supervised learning (Baram et al. 2017), and the partiality or delay of the reward signal, together with the sequential decision making of an MDP in stochastic environments, limits the exploration efficiency of reinforcement learning. As a result, reinforcement learning learns more slowly than supervised learning, so the discriminator learns faster than the generator; this is where the learning speed imbalance between the generator and the discriminator comes from.

Fig. 1 A typical average given reward (Left) and reward standard error (Right) for each transition, compared between expert data and generated data

Such an imbalance influences the convergence of GAIL. At the beginning of training, since the discriminator is commonly initialized randomly, it cannot distinguish the expert data from the generated data at all. As training proceeds, the discriminator gradually gains the ability to separate the data, providing the generated data with low rewards and the expert data with high rewards. The generator updates its policy to obtain higher rewards and thus becomes more and more similar to the expert policy. However, with its higher learning speed, the discriminator eventually becomes able to separate the generated data from the expert data entirely. It can then recognize every state-action pair generated by the generator and give it a low reward, so the rewards the generator receives are low and show little variation. Figure 1 is a typical illustration of this problem, showing that the reward obtained on expert data and the reward obtained by the generator are not even in the same order of magnitude.

Such low-mean and low-variance rewards harm the training of the generator. To depict the harm, we take the loss function of TRPO (Schulman et al. 2015) as an example. The loss function of TRPO is \(J(\theta )={\mathbb {E}}_{s,a\sim \rho _\text {old}}\frac{\pi _\theta }{\pi _{\text {old}}}A^{\pi _\text {old}}\), where \(\pi _\text {old}\) denotes the sampling (behavior) policy and \(\rho _{\text {old}}\) is the distribution generated by \(\pi _\text {old}\). The gradient of \(J(\theta )\) w.r.t. \(\theta\) is proportional to \(A^{\pi _\text {old}}\), i.e., the advantage function. However, when the generator obtains low-mean and low-variance rewards from the discriminator, the return discrepancy between different actions at a state is small, because no matter what action the agent takes, the discriminator classifies it as fake data and gives the agent a low reward. Thus the advantage function, which measures the advantage of taking an action in a state, is small, and so is the policy gradient \(\nabla _\theta J(\theta )=\frac{\nabla _\theta \pi _\theta }{\pi _{\text {old}}}A^{\pi _{\text {old}}}\), which is proportional to the advantage function. Such a gradient reduction slows down the training of the generator, so the discriminator becomes even stronger than the generator. As training proceeds, the learning speed imbalance between them grows larger and larger; the learning of the generator becomes slower and slower and eventually gets stuck.
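The following toy NumPy calculation (our illustration, assuming a single state and a uniform policy) makes this concrete: when the discriminator scores every action with a nearly identical low reward, the advantages, and hence the policy gradient, collapse towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Q-values of the actions available at one state under two reward regimes.
# Early training: the discriminator still rewards some actions more than others.
q_informative = rng.uniform(0.5, 2.5, size=n_actions)
# Over-trained discriminator: every action is judged "fake" and gets a tiny reward.
q_collapsed = np.full(n_actions, 0.01) + rng.normal(0.0, 1e-3, size=n_actions)

for name, q in [("informative", q_informative), ("collapsed", q_collapsed)]:
    v = q.mean()            # V(s) = E_{a~pi}[Q(s, a)] under a uniform policy
    adv = q - v             # A(s, a) = Q(s, a) - V(s)
    # The policy gradient scales with these advantages (cf. the TRPO loss above).
    print(name, np.abs(adv).mean())
```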

Fig. 2 Left: Discriminator loss during GAIL training. Right: Average gradient of RL generator neurons

In Fig. 2, we present two curves showing the discriminator loss and the corresponding policy gradient size during GAIL training. At the beginning of training, due to the low-variance gradient estimate of supervised learning, the discriminator loss drops sharply to a small value; meanwhile, the gradient size of the generator also decreases rapidly. Nevertheless, the generator still possesses some learning ability, which increases the discriminator loss. However, the gradient size of the generator keeps decreasing in the subsequent episodes, and at around episode 280 the generator fails to compete with the discriminator. Finally, both the discriminator loss and the gradient size drop quickly, the learning abilities of the generator and the discriminator are completely out of balance, the generator almost stops learning, and the training of GAIL fails.

In the following subsection, we introduce GAIL-VR to fundamentally alleviate such an imbalance by increasing the advantage variance.

4.2 Solution: discriminator reward variance loss

RL algorithms rely on the difference between environment-offered rewards and the current value function, e.g., the advantage function in actor-critic algorithms. In most actor-critic algorithms, the empirical advantage function can be computed as:

$$\begin{aligned} A(s,a)=r(s,a) +\gamma V_\phi (s^{\prime })-V_\phi (s), \end{aligned}$$
(1)

where \(s^{\prime }\) is the next state and \(V_\phi\) is a neural network parameterized by \(\phi\). The target of \(V_\phi\) is to approximate the value function of \(\pi\). The optimization objective of \(V_\phi\) is:

$$\begin{aligned} \min \limits _{\phi }~ [{\hat{R}}(s,a)-V_\phi (s)]^2, \end{aligned}$$
(2)
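As a minimal PyTorch sketch of Eqs. (1) and (2), assuming a small MLP value network and random stand-in data (the architecture and learning rate are our own illustrative choices):

```python
import torch
import torch.nn as nn

obs_dim = 8                                   # illustrative state dimension
V = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(V.parameters(), lr=3e-4)

def empirical_advantage(V, s, s_next, r, gamma=0.99):
    # Eq. (1): A(s, a) = r(s, a) + gamma * V_phi(s') - V_phi(s)
    with torch.no_grad():
        return r + gamma * V(s_next).squeeze(-1) - V(s).squeeze(-1)

def value_loss(V, s, returns):
    # Eq. (2): [R_hat(s, a) - V_phi(s)]^2, averaged over a batch of states
    return ((returns - V(s).squeeze(-1)) ** 2).mean()

# One gradient step on random stand-in data (rollout samples in practice).
s, s_next = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
r, returns = torch.rand(32), torch.rand(32)
loss = value_loss(V, s, returns)
optimizer.zero_grad(); loss.backward(); optimizer.step()
adv = empirical_advantage(V, s, s_next, r)
```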

The value of Eq. (2) depends on the value network of the generator and on the cumulative discounted reward \({\hat{R}}(s,a)\), which is provided by the discriminator in GAIL. We aim to eliminate the gradient vanishing phenomenon by optimizing Eq. (2) over both the generator's value network and the discriminator:

$$\begin{aligned} \max \limits _{w} \min \limits _{\phi } \mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }[{\hat{R}}_w(s,a)-V_\phi (s)]^2, \end{aligned}$$
(3)

where \({\hat{R}}_w(s,a)=-\sum _{t=l}^T\gamma ^{t-l}\log (1-D_w(s_t,a_t))\) denotes the cumulative discounted reward given by the discriminator parameterized by \(w\), starting from the pair \((s,a)=(s_l,a_l)\). Compared with Eq. (2), a maximization operator over the discriminator parameters is added. Such an operator keeps the variance of the advantage values from collapsing to a small value, and thus keeps part of the advantages from falling to low values.
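As an illustration of how \({\hat{R}}_w\) enters the optimization, the sketch below (our own; `D` is any discriminator module, such as the one sketched in Sect. 2.2, mapping the state-action pairs of one trajectory to \(D_w(s_t,a_t)\)) computes the reward-to-go while keeping the computation graph attached to the discriminator parameters.

```python
import torch

def discounted_reward_to_go(D, states, actions, gamma=0.99, eps=1e-8):
    """R_hat_w(s_l, a_l) = - sum_{t >= l} gamma^(t - l) * log(1 - D_w(s_t, a_t)).

    Unlike the usual RL return, gradients here flow back into the
    discriminator parameters w, which the outer maximization in Eq. (3) needs.
    """
    per_step = -torch.log(1.0 - D(states, actions).squeeze(-1) + eps)  # shape (T,)
    rev = []
    running = torch.zeros(())
    for t in reversed(range(per_step.shape[0])):
        running = per_step[t] + gamma * running
        rev.append(running)
    return torch.stack(rev[::-1])
```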

To show the connection between the maximization operator and the maximization of the advantage variance, we decompose the loss of the value function to the summation of the variance of the advantage function and estimation bias of \(V_\phi\):

$$\begin{aligned} \begin{aligned}&\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }\left[ ({\hat{R}}(s,a)-V_\phi (s))^2\right] \\&\quad =\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }(V^\pi (s)-V_\phi (s))^2 +\sum _{t=0}^{\infty }\gamma ^{2t}\frac{\vert {\mathcal {D}}^\pi _t \vert }{\vert {\mathcal {D}}^\pi \vert } \mathop {\text {Var}}_{s,a \in {\mathcal {D}}^\pi _t}\left[ A^\pi (s,a)\right] \\&\quad \ge \sum _{t=0}^{\infty }\gamma ^{2t}\frac{\vert {\mathcal {D}}^\pi _t \vert }{\vert {\mathcal {D}}^\pi \vert } \mathop {\text {Var}}_{s,a \in {\mathcal {D}}^\pi _t}\left[ A^\pi (s,a)\right] , \end{aligned} \end{aligned}$$
(4)

where Var denotes the variance, \({\mathcal {D}}^\pi\) denotes the total data collected by \(\pi\), and \({\mathcal {D}}_t^\pi\) denotes the data collected starting at time step t (e.g., \(|{\mathcal {D}}_0^\pi |\) is the number of data points collected from the initial states). The equality in Eq. (4) holds iff \(\forall s \in {\mathcal {D}}^\pi , V^\pi (s)=V_\phi (s)\). From Eq. (4), we conclude that the value loss consists of two parts: (i) the difference between the value function of the current policy and the value network; (ii) the variance of the advantage function. Part (ii) does not depend on the value network, so the minimization over the value network parameters does not influence the advantage variance at all. As a result, we can obtain the advantage variance as follows: find a \(\phi ^\star\) that minimizes the value loss and regard the minimum value loss as the advantage variance. Ideally, \(\phi ^\star\) keeps the discrepancy between \(V^\pi\) and \(V_{\phi ^\star }\) small. That is:

$$\begin{aligned} \begin{aligned}&\mathop {\min }_{\phi }~\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }\left[ ({\hat{R}}(s,a)-V_{\phi }(s))^2\right] = \mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }\left[ ({\hat{R}}(s,a)-V_{\phi ^\star }(s))^2\right] \\&\quad =\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }(V^\pi (s)-V_{\phi ^\star }(s))^2 +\sum _{t=0}^{\infty }\gamma ^{2t}\frac{\vert {\mathcal {D}}^\pi _t \vert }{\vert {\mathcal {D}}^\pi \vert } \mathop {\text {Var}}_{s,a \in {\mathcal {D}}^\pi _t}\left[ A^\pi (s,a)\right] \\&\quad \approx \sum _{t=0}^{\infty }\gamma ^{2t}\frac{\vert {\mathcal {D}}^\pi _t \vert }{\vert {\mathcal {D}}^\pi \vert } \mathop {\text {Var}}_{s,a \in {\mathcal {D}}^\pi _t}\left[ A^\pi (s,a)\right] . \end{aligned} \end{aligned}$$
(5)

Thus, we (approximately) obtain the advantage variance by seeking a (sub-)optimal value network. In order to keep the discriminator from providing rewards that lead to a vanishing policy gradient, we optimize the discriminator to maximize the advantage variance:

$$\begin{aligned} \begin{aligned}&\mathop {\max }_{w}~\sum _{t=0}^{\infty }\gamma ^{2t}\frac{\vert {\mathcal {D}}^\pi _t \vert }{\vert {\mathcal {D}}^\pi \vert } \mathop {\text {Var}}_{s,a \in {\mathcal {D}}^\pi _t}\left[ A^\pi (s,a)\right] \\&\quad \approx \mathop {\max }_{w}~\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }\left[ ({\hat{R}}(s,a)-V_{\phi ^\star }(s))^2\right] \\&\quad = \max \limits _{w}~\min \limits _{\phi }\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }\left[ {\hat{R}}_w(s,a)-V_\phi (s)\right] ^2, \end{aligned} \end{aligned}$$
(6)

which is exactly Eq. (3).

In the original GAIL optimization objective of the discriminator, there is no constraint on the rewards given to the generator. Such an optimization goal only focuses on telling the difference between the agent and the expert, with no attention paid to the resulting reward distribution. We address this issue through the relationship between the discriminator loss, the reward, and the RL advantage function, using the results derived above.

4.3 Discriminator reward variance regularization algorithm

We define the value loss, i.e., the squared difference between the discounted return given by the discriminator and the value function estimate, as:

$$\begin{aligned} \begin{aligned} f(w,\phi )&= \mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }[{\hat{R}}_w(s,a)-V_\phi (s)]^2,\\ g(w)&=\arg \mathop {\min }_{\phi }~ f(w,\phi ) = \phi ^\star , \end{aligned} \end{aligned}$$
(7)

where f is the value loss and g(w) is the optimal parameter of the value function given a discriminator \(D_w\). Based on Eq. (7), we have the partial derivative:

$$\begin{aligned} f_\phi (w, g(w))=\left[ \frac{\partial f(w,\phi )}{\partial \phi }\right] _{\phi =g(w)}=0. \end{aligned}$$
(8)

By examining the gradient of \(f(w, g(w))\), we can find that

$$\begin{aligned} \nabla _w f(w,g(w)) = f_\phi (w, g(w))\nabla _wg(w) + f_w(w,g(w))=f_w(w,g(w)), \end{aligned}$$
(9)

where \(\nabla _w\) denotes the total derivative and \(f_w\) is the partial derivative of \(f(w,\phi )\) w.r.t. w: \(f_w(w,g(w))=\left[ \frac{\partial f(w,\phi )}{\partial w}\right] _{\phi =g(w)}\). Equation (9) tells us that, although \(\phi ^\star\) and w are highly related, we can ignore the gradient of \(\phi ^\star\) w.r.t. w when calculating \(\nabla _w f(w,g(w))\). This property makes the regularization practical, because \(\nabla _w \phi ^\star\) is hard to compute. We now have a way to optimize the generator and the discriminator iteratively:

First, we train the RL agent until the value network converges, thus obtaining \(\phi ^\star =g(w)\);

Then we compute the gradient of \(f(w, g(w))\) with respect to w: \(f_w(w,g(w))\);

By combining the Discriminator Reward Variance Loss with the GAIL loss, we can obtain our final regularized optimization objective:

$$\begin{aligned} \begin{aligned} \max \limits _{w} \mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^\pi }[\lambda ({\hat{R}}_w(s,a)-V_{\phi ^*}(s))^2 +\log (D_w(s,a))] \\+\mathop {{\mathbb {E}}}_{s,a \in {\mathcal {D}}^E}[\log (1-D_w(s,a))], \end{aligned} \end{aligned}$$
(10)

where \(\lambda\) is a regularization factor.
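Putting the pieces together, the following hedged PyTorch sketch implements the regularized discriminator objective of Eq. (10); it reuses the `discounted_reward_to_go` helper sketched in Sect. 4.2, and treating \(V_{\phi ^\star }\) as a constant (via `torch.no_grad`) is exactly the simplification licensed by Eq. (9). Function and argument names are ours.

```python
import torch

def gail_vr_discriminator_loss(D, V_star, traj_pi, batch_exp,
                               lam=0.1, gamma=0.99, eps=1e-8):
    """Negative of the objective in Eq. (10), to be minimized over w.

    `traj_pi` is one generated trajectory (states, actions); `batch_exp` holds
    expert state-action pairs.  `V_star` is the value network after the RL
    step, treated as a constant thanks to Eq. (9).  `lam` is lambda.
    """
    s_pi, a_pi = traj_pi
    s_exp, a_exp = batch_exp

    d_pi = D(s_pi, a_pi).squeeze(-1)
    d_exp = D(s_exp, a_exp).squeeze(-1)

    # R_hat_w along the generated trajectory (see the reward-to-go sketch above).
    returns_w = discounted_reward_to_go(D, s_pi, a_pi, gamma=gamma, eps=eps)
    with torch.no_grad():                 # phi* is held fixed during this step
        v_star = V_star(s_pi).squeeze(-1)

    variance_term = ((returns_w - v_star) ** 2).mean()
    gail_term = torch.log(d_pi + eps).mean() + torch.log(1.0 - d_exp + eps).mean()
    return -(lam * variance_term + gail_term)
```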

The training procedure of GAIL-VR is summarized in Algorithm 1. Note that the algorithm is widely applicable to almost all actor-critic RL algorithms with value functions, including A3C (Mnih et al. 2016), TRPO, and PPO (Schulman et al. 2017).

Algorithm 1 GAIL with variance regularization (GAIL-VR)
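Since Algorithm 1 is only rendered as a float here, the following Python sketch gives our reading of the GAIL-VR training loop under the two-step scheme above; `sample_rollout`, `ppo_update`, and `sample_batch` are hypothetical placeholders for standard components, while `gail_reward` and `gail_vr_discriminator_loss` refer to the sketches given earlier.

```python
import torch

def train_gail_vr(env, policy, V, D, expert_data, iterations=1000,
                  lam=0.1, gamma=0.99):
    """Illustrative GAIL-VR loop; sample_rollout, ppo_update and
    sample_batch are placeholders for standard components."""
    d_optimizer = torch.optim.Adam(D.parameters(), lr=3e-4)
    for _ in range(iterations):
        # 1) Roll out the current policy; rewards come from the discriminator.
        states, actions, next_states = sample_rollout(env, policy)
        rewards = gail_reward(D, states, actions)

        # 2) RL step: update the policy and train the value network until it
        #    (approximately) converges, yielding phi* = g(w).
        ppo_update(policy, V, states, actions, next_states, rewards, gamma)

        # 3) Discriminator step with the variance regularizer of Eq. (10),
        #    holding V (i.e., phi*) fixed inside the loss.
        batch_exp = sample_batch(expert_data)
        loss = gail_vr_discriminator_loss(D, V, (states, actions), batch_exp,
                                          lam=lam, gamma=gamma)
        d_optimizer.zero_grad(); loss.backward(); d_optimizer.step()
    return policy
```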

5 Empirical study

5.1 Experiment settings

We choose three kinds of mainstream Reinforcement Learning test environments: MuJoCo, Atari, and PyBullet.

We use four contenders in all the experiments. The first is GAIL, without any modifications or tricks. We then implement Wasserstein Adversarial Imitation Learning (W-GAIL), GAIL with gradient penalty (GAIL-GP; Gulrajani et al. 2017), and Model-based GAIL (MGAIL; Baram et al. 2017) as contenders. Note that we do not take the results of the original GAIL paper, since that paper tests GAIL on the MuJoCo-v0 environments, which are outdated; recent research mostly uses MuJoCo-v2 as test environments. The original paper's code no longer supports the v2 environments since it is based on Python 2. We therefore collect our own PPO expert trajectories on the v2 environments and implement GAIL ourselves. We have tested our implementation of GAIL on the v0 environments and obtained results similar to the original paper, which verifies our implementation. However, due to differences in expert trajectory data, environmental characteristics, convergence failures under some random seeds, and other factors, the results on MuJoCo-v2 shown in the next section may differ slightly; extensive hyper-parameter searches have been conducted and these are the best results within our reach.

The computing infrastructure and parameter settings are presented in Appendix B. The common hyper-parameters (e.g., network architecture, learning rate, seeds per environment) of each method are kept the same and without extra fine-tuning. However, for the original GAIL, the best hyper-parameters exceed the common search range on HalfCheetah-v2 and Walker2d-v2. In line with the principle of optimizing the control group's results as much as possible, extra fine-tuning of GAIL is conducted on these two environments. We provide the hyper-parameter search range and the final hyper-parameters in the appendix.

5.2 Results on algorithm performance

Figure 3 shows the test results on the MuJoCo environments. GAIL-VR obtains both better final performance (rated as an average reward) and better training speed than its counterparts in most of the environments. We list four typical plots showing the average reward and variance during the initial training process (usually around 250–500 training steps in MuJoCo environments).

We have also tested the rewards obtained in various games, compared with GAIL and W-GAIL. The comparison results are listed in Table 1.

The results show that GAIL-VR has a universally positive effect in most typical RL environments. It works in both low- and high-dimensional state/action spaces and is compatible with either state input or raw game image input. Such a performance improvement implies the excellent application potential of GAIL-VR. Besides, GAIL-VR only depends on an actor-critic RL algorithm and a generator-discriminator framework and does not require a specific environment, and the problem GAIL-VR tackles is universal in imitation learning with this framework. Moreover, the hyper-parameters used in the experiments are set to the values we commonly use, rather than fine-tuned for each environment. This also indicates that GAIL-VR is highly stable w.r.t. the hyper-parameters, making GAIL-VR a parameter-tuning-free method.

Table 1 Rewards obtained in each game using GAIL-VR and GAIL without modifications. The first three rows show the state and action space and the number of expert trajectories used; the bottom five rows show the rewards obtained by GAIL-VR and its counterparts.
Fig. 3 Reward curves of GAIL-VR and contender methods in MuJoCo environments. Shaded region indicates the standard deviation

Fig. 4 Performance of GAIL-VR in 4 MuJoCo environments compared to other methods. Performance is evaluated by average return and scaled

5.3 Results on stability

Stability with less data GAIL-VR is found to perform well with little expert data. From our experimental results and those of Ho and Ermon (2016), GAIL and its variants perform well with abundant expert data; however, when the data is insufficient, they may not. Figure 4 shows the test results with insufficient data. Our method significantly outperforms most of the GAIL variants.

The performance improvement of GAIL-VR may result from improved data efficiency obtained by highlighting the importance of critical expert action data. For each expert trajectory, there are always some state-action pairs that are critical to learning the whole policy. However, in previous works using the generator-discriminator framework, even if the agent takes actions that are very similar to the expert actions, minor flaws may cause the discriminator to learn a shortcut during training and recognize the action features of the agent, so it fails to give a higher reward. In our algorithm, we strengthen the identification of the key data regions (where the advantage variance is larger, indicating a more critical policy improvement), so that the discriminator enhances the reward variance in these regions and avoids the premature discriminator failure of giving uniformly low rewards.

Stability over long training periods It is quite common that, after training for a long time, the achieved reward decreases from its highest level. In such long-training scenarios, our method outperforms GAIL in preventing this decrease. Table 2 shows that GAIL-VR benefits from preventing the reward drop after long training, and we even record cases where the reward is still rising during long-term training, such as Reacher-v2. Note that we compress the counterpart results into one column because there is no statistically significant difference between the counterpart algorithms in when the performance decline happens or in its magnitude. We believe that this benefit can be explained by the prevention of the “low reward trap” mentioned in Sect. 4. By keeping the mean value and maximizing the variance, the discriminator is forced to give higher rewards to state-action pairs closer to the expert's, keeping the gradient alive throughout training. In addition, the lowered rewards become more focused on the bad transitions, making the distribution of the value function more favorable.

Table 2 Rewards ± standard deviation obtained through long training periods.
Fig. 5 Statistical frequency histogram of rewards given to the generator. The data is collected in the MuJoCo Hopper environment at timestep 1700. More data is available in the supporting materials

5.4 Results on our motivations

Our algorithm shifts the reward distribution to a better one. Figure 5 compares the distributions of given rewards under GAIL and GAIL-VR. Our algorithm makes the median reward much higher and enlarges the variance, confirming that GAIL-VR works as intended under our assumptions.

Fig. 6 Left: A typical discriminator loss trend comparison during training. Right: Average gradient comparison during training. The datasets and parameters are kept the same

GAIL-VR also successfully reverses the trend in which the discriminator loss (D-loss) declines and then never rises again. Figure 6 (Left) shows a typical set of D-loss training curves; apart from the algorithm used, the other parameters of the two training runs are kept the same. As the figure shows, our algorithm successfully makes the D-loss rise, which indicates that we succeed in preventing the gradient from vanishing.

In addition, we have tested the gradient of the generator. Figure 6 (Right) shows a typical training run on the HalfCheetah environment under the same parameter settings. GAIL-VR prevents the quick vanishing of the generator's gradient. Note that the figure is drawn in logarithmic coordinates, so the gradient enhancement is significant.

6 Conclusion

In this paper, we introduced GAIL-VR, a simple method to improve the convergence ability of GAIL, especially in complex environments. The principle behind GAIL-VR is to alter the role of the discriminator from competitor to teacher by optimizing the discriminator to maximize the variance of the advantage function. GAIL-VR can be applied to any actor-critic algorithm under the generator-discriminator framework. Empirical studies show that GAIL-VR improves convergence, boosts performance, and helps stabilize training. We also provide a new perspective on research into generator-discriminator frameworks. Our study shows that the balance between the generator and the discriminator can be easily broken, especially in complex environments, and provides an efficient way to tackle this issue.