Abstract

Reinforcement learning (RL) with sparse and deceptive rewards is challenging because nonzero rewards are rarely obtained, and hence the gradients computed by the agent are noisy and carry little useful learning signal. Recent work has demonstrated that using memory buffers of previous experiences can lead to a more efficient learning process. However, existing methods usually require these experiences to be successful and may overly exploit them, which can cause the agent to adopt suboptimal behaviors. This study develops an approach that exploits diverse past trajectories for faster and more efficient online RL, even if these trajectories are suboptimal or not highly rewarded. The proposed algorithm combines a policy improvement step with an additional policy exploration step by using offline demonstration data. The main contribution of this study is that, by regarding diverse past trajectories as guidance instead of imitating them, our method directs its policy to follow and expand past trajectories while still being able to learn without rewards and gradually approach optimality. Furthermore, a novel diversity measurement is introduced to maintain the diversity of the team of agents and regulate exploration. The proposed algorithm is evaluated on a series of discrete and continuous control tasks with sparse and deceptive rewards. The experimental results indicate that our algorithm significantly outperforms the baseline RL methods in terms of diverse exploration and avoiding local optima.

1. Introduction

In recent years, deep reinforcement learning has been demonstrated to effectively solve sequential decision-making problems in a wide range of application domains, such as playing computer and board games [1, 2], continuous control [3–5], and robot navigation [6]. Despite these success stories, reinforcement learning with sparse and deceptive rewards remains a challenging problem in the field of RL [7–9] because maintaining a good trade-off between exploration and exploitation becomes more intractable in tasks with sparse and deceptive rewards.

Optimizing with sparse feedback requires agents to reproduce past good trajectories efficiently while avoiding getting stuck in local optima. In tasks with large state spaces and sparse rewards, a desired positive reward can only be received after the agent continuously executes many appropriate actions, and hence the agent rarely collects highly rewarded trajectories. Meanwhile, the gradient-based parameter update of modern deep RL algorithms might result in catastrophic forgetting of past experiences because the update is incremental and slow and has a global impact on all parameters of the policy and the value function [10]. Therefore, the agent might suffer severe performance degradation when the ideal trajectories with the highest returns are rarely collected, making the policy optimization process unstable. Finally, agents can adopt suboptimal myopic behaviors and become stuck in local optima because they overly exploit past imperfect experiences and do not explore the state-action space systematically.

Such tasks with sparse and deceptive reward signals are common in real-world problems. Recently, many works have studied how making use of a nonparametric memory of past experiences improves policy learning in RL. Prioritized experience replay [11] proposes prioritizing past experiences before learning the policy parameters from them. Self-imitation learning [9, 12, 13] builds a memory buffer to store past good trajectories and can thus rapidly learn the right strategies from these past experiences when faced with a similar situation. Memory-augmented policy optimization [14] leverages a memory buffer of prior highly rewarded trajectories to reduce the variance of the policy gradient estimate. Episodic reinforcement learning [15] uses past good experiences stored in an episodic memory buffer to supervise an agent and force it to learn good strategies. Model-free episodic control [16] and neural episodic control [17] use episodic memory modules to estimate the state-action values. Diverse trajectory-conditioned self-imitation learning [10] proposes learning a novel trajectory-conditioned policy that follows and expands diverse trajectories in the memory buffer.

These existing works use nonparametric memories of past good experiences to rapidly latch onto successful strategies and improve the learning efficiency of the policy and value function. However, we must note that the exploitation of past good trajectories described in the abovementioned works might hurt the performance of the agent in tasks with sparse and deceptive reward functions. There are two main reasons that can cause the performance degradation of these algorithms in tasks with sparse and deceptive rewards. First, the past self-generated trajectories stored in the memory buffer are imperfect: they are not gold trajectories but highly rewarded trajectories collected by accident. Second, the RL agent usually limits its exploration to a small portion of the state-action space because of prior experience and network initialization [18]. In this way, the agent can easily generate trajectories leading to suboptimal goals. The exploitation of these successful but suboptimal trajectories with limited directions might cause the agent to learn myopic behaviors. This will limit the agent's exploration region and prevent the agent from discovering alternative strategies with higher returns.

In this study, we develop a practical RL algorithm that regards previous diverse trajectories as guidance in the sparse reward setting, even if these trajectories are suboptimal or not highly rewarded. Our critical insight is that we can utilize imperfect trajectories with or without sparse rewards to regulate the direction of policy optimization while preserving the diversity of agents through two steps. In the first policy improvement (PI) step, we develop a new method that exploits the self-generated guidance to enable the agent to reproduce diverse past trajectories efficiently, while encouraging agents to smoothly expand these trajectories and gradually visit underexplored regions of the state-action space. Specifically, our method guides agents to revisit the regions where past good trajectories are located by minimizing the distance between state representations of trajectories. Meanwhile, our method allows for flexibility in the action choices to help the agent choose different actions and visit novel states. In the second policy exploration (PE) step, we introduce a novel diversity measurement to drive the different agents of the team to reach diverse regions of the state space and maintain the diversity of the ensemble of agents. With this new diversity measurement, our algorithm does not have to maintain a set of autoencoders [19] and can prevent the agents from being stuck in local optima. Our main contributions are summarized as follows:
(1) We develop a novel two-step RL framework that makes better use of diverse self-generated demonstrations to promote learning performance in tasks with sparse and deceptive rewards.
(2) To the best of our knowledge, this is the first study that regards self-generated imperfect demonstration data as guidance and shows the importance of exploiting these previous experiences to indirectly drive exploration.
(3) We illustrate that by regarding the agent's self-generated demonstration trajectories as guidance, the agent can reproduce diverse past trajectories quickly and then smoothly move beyond them to obtain a more effective policy.
(4) A new diversity metric for the ensemble of agents is proposed to achieve deep exploration and avoid being stuck in local optima.
(5) Our method achieves superior performance over other state-of-the-art RL algorithms on several challenging physical control benchmarks with sparse and deceptive rewards in terms of diverse exploration and improved learning efficiency.

The rest of this article is organized as follows: Section 2 describes the progress of the related work. Section 3 briefly describes the preliminary knowledge for the article. Section 4 introduces our proposed method for reinforcement learning with sparse and deceptive rewards. Experimental results are presented in Section 5. Finally, we draw our conclusions in Section 6.

2. Related Work

2.1. Exploration and Exploitation

It is a long-standing and intractable challenge to balance exploration and exploitation in the field of RL. Exploration enables the agent to visit underexplored parts of the state-action space and collect trajectories with higher returns. Exploitation, on the contrary, encourages the agent to make use of what it already knows to maximize the expected return. A large body of work aims to improve the exploration ability of RL agents. Some works propose adding stochastic noise to the output actions [1, 3–5, 20] or to the parameters of the policy and value networks [21–23] to encourage exploration. Other works define intrinsic rewards to prompt the agent to visit underexplored parts of the state-action space [24–26]. Furthermore, several works introduce new optimization objectives that change the gradient update direction of the parameters [18, 27–29]. Another straightforward idea to expand the agent's exploratory area is to employ a team of agents that explore the environment collaboratively and share the collected experiences with each other [30–32]. In all these methods, although the agent can access underexplored areas thanks to randomness and artificial incentives, it is still difficult for the agent to escape local optima in long-horizon, sparse-reward tasks because the agent rarely collects trajectories with nonzero rewards in these hard-exploration environments. Our method maintains a separate memory buffer of past good trajectories for each agent in the team, and the agents share these past good trajectories with each other only to compute a diversity measurement.

2.2. Memory-Based RL

The existence of a memory buffer enables the agent to store and utilize past experiences to aid online RL training. Many prior works propose storing past good experiences in replay buffers with a prioritized replay mechanism to accelerate the training process [6, 11, 33]. Episodic reinforcement learning methods [15–17] memorize past good episodic experiences by maintaining and updating a look-up table and act upon these good experiences in the decision-making process. Self-imitation learning (SIL) methods [9, 12] train the agent to imitate highly rewarded trajectories with the SIL and GAIL objectives, respectively. In other previous works, the agent learns a range of diverse exploratory policies based on episodic memory [10, 34]. Unlike these previous methods, our method encourages the agent to visit the part of the state space where it can obtain higher rewards by computing the distance between the current trajectory and past good trajectories.

2.3. Imitation Learning (IL)

The goal of imitation learning is to train a policy by imitating demonstration data generated by human experts. Behavior cloning (BC) is a simple IL approach in which the unknown expert policy is estimated from demonstration data by supervised learning. BC methods, however, usually suffer from a severe distribution shift problem [35]. Inverse reinforcement learning (IRL) solves the forward RL problem by recovering the reward function from demonstration data [36, 37]. Generative adversarial imitation learning (GAIL) [38] formulates the IL problem as a distribution matching problem, which avoids estimating the reward function. All these IL methods rely on the availability of sufficient high-quality human demonstrations. In contrast, our method treats past good trajectories generated by the agent itself as demonstration data.

3. Preliminaries

In this study, we show that the MMD metric can be used as a distance constraint to prevent the agent from falling into local optima. Using the exterior penalty function method, we transform the constrained RL optimization problem into an unconstrained optimization problem, and the MMD distance can be regarded as a kind of intrinsic reward. Our method can be naturally combined with the hierarchical reinforcement learning algorithm, and the policy gradient can adjust pretrained skills and the high-level policy during the training phase.

3.1. Reinforcement Learning

We consider a discounted Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, in which $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a (discrete or continuous) action space, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, and $r(s, a)$ is the reward function, whose minimum and maximum values are assumed to be $R_{\min}$ and $R_{\max}$, respectively. Further, $\rho_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1]$ is a discount factor. A stochastic policy $\pi_\theta: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$, parametrized by $\theta$, maps the state space to the set of probability distributions over the action space $\mathcal{A}$. The state-action value function is defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s,\, a_0 = a\right]$.

In general, the objective of RL algorithms is to find an optimal policy that maximizes the expected discounted return:

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right].$$

We use $\tau = (s_0, a_0, s_1, a_1, \ldots)$ to denote the entire history of state and action pairs, where $s_0 \sim \rho_0$, $a_t \sim \pi_\theta(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.

Similar to [39], when $\gamma < 1$, we define the stationary state-visitation distribution for the policy $\pi_\theta$ by $d_{\pi_\theta}(s) = (1 - \gamma)\sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi_\theta)$, where the initial state $s_0 \sim \rho_0$. The expected discounted return can be rewritten as $J(\pi_\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s, a) \sim \rho_{\pi_\theta}}\left[r(s, a)\right]$, where $\rho_{\pi_\theta}(s, a) = d_{\pi_\theta}(s)\,\pi_\theta(a \mid s)$ is the state-action visitation distribution.

3.2. Policy Gradient Algorithms

This study is based on policy gradient RL algorithms, one of the two main classes of RL algorithms. Policy gradient algorithms use gradients to iteratively optimize the policy parameters of the agent. Here, we give a brief introduction to a well-known policy gradient algorithm: proximal policy optimization (PPO).

3.2.1. PPO

PPO [20] uses a clipped objective function to constrain the step size during an update and prevent drastic parameter changes, which leads to a more stable training process than other algorithms. The PPO agent adjusts the parameters $\theta$ of the policy by maximizing the following clipped objective function:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

where $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ represent the current policy and the old policy, respectively, $\hat{A}_t$ is an estimate of the advantage function, and $\epsilon$ is the clipping ratio, a hyperparameter that is determined empirically.
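As a concrete illustration, the clipped surrogate above can be computed from per-timestep log-probabilities and advantage estimates; the following minimal NumPy sketch uses illustrative array names and dummy data and is not the authors' implementation.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized), averaged over a batch."""
    ratio = np.exp(logp_new - logp_old)                       # r_t(theta)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clip(r_t, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Example usage with dummy per-timestep data
rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.1, size=128)
logp_new = logp_old + rng.normal(0.0, 0.05, size=128)
advantages = rng.normal(0.0, 1.0, size=128)
print(ppo_clip_objective(logp_new, logp_old, advantages))
```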

3.3. Maximum Mean Discrepancy

Maximum mean discrepancy (MMD) can be used to measure the difference (or similarity) between two probability distributions [29, 40–43]. Let $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ be two sets of samples drawn independently and identically from two distributions $p$ and $q$, which are defined on a nonempty compact metric space $\mathcal{X}$. Then, we can define the MMD as

$$\mathrm{MMD}(\mathcal{F}, X, Y) = \sup_{f \in \mathcal{F}}\left(\frac{1}{m}\sum_{i=1}^{m} f(x_i) - \frac{1}{n}\sum_{j=1}^{n} f(y_j)\right),$$

where $\mathcal{F}$ is a class of functions on $\mathcal{X}$. If $\mathcal{F}$ satisfies the condition that $\mathrm{MMD}(\mathcal{F}, p, q) = 0$ if and only if $p = q$, then the MMD is a metric on the space of probability distributions, measuring the discrepancy between any two distributions $p$ and $q$ [44].

What kind of function class $\mathcal{F}$ makes the MMD a metric? According to the literature [40], the space of bounded continuous functions on $\mathcal{X}$ satisfies the condition, but it is intractable to compute the MMD with finite samples in such a huge function class. Fortunately, when $\mathcal{F}$ is a reproducing kernel Hilbert space $\mathcal{H}$ defined by a kernel $k(\cdot, \cdot)$, it is enough to uniquely identify whether $p = q$ or not, and the MMD is tractable in this space:

$$\mathrm{MMD}^2(X, Y) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{i'=1}^{m} k(x_i, x_{i'}) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2}\sum_{j=1}^{n}\sum_{j'=1}^{n} k(y_j, y_{j'}).$$
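For reference, the kernel form of the MMD² estimator can be computed from finite samples as in the sketch below; the Gaussian (RBF) kernel and its bandwidth are assumptions chosen for illustration rather than the specific kernel used in this paper.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix k(a_i, b_j) for row-vector samples A, B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimator of MMD^2 between samples X ~ p and Y ~ q."""
    k_xx = rbf_kernel(X, X, bandwidth).mean()
    k_yy = rbf_kernel(Y, Y, bandwidth).mean()
    k_xy = rbf_kernel(X, Y, bandwidth).mean()
    return k_xx - 2.0 * k_xy + k_yy

# Example: two 2D point clouds (e.g., (x, y) positions along two trajectories)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(120, 2))
print(mmd_squared(X, Y, bandwidth=1.0))
```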

4. Proposed Approach

In this section, we formulate a novel RL framework named Policy Optimization with Soft self-generated guidance and diverse Exploration (POSE). The proposed method utilizes a team of agents to explore the environment simultaneously and encourages them to visit nonoverlapping areas of state spaces. Every agent in the team maintains a memory buffer storing past good trajectories, and these offline data can be regarded as guidance to enable the agents to revisit diverse regions in the state space where the agents can receive high rewards and drive deep exploration.

4.1. Overview of POSE

One feasible approach to achieving better exploration in challenging tasks with sparse and deceptive rewards is to simultaneously employ a team of agents and enforce them to explore different parts of the state-action space. In this way, diverse policies can be learned by the different agents of the team, which prevents the agents from being stuck in local optima.

As shown in Figure 1, our POSE method employs a team of agents to interact with the environment and generate many state-action sequences. Different from the multiagent RL setting [46], where all agents live in a shared environment and the action of an agent can affect other agents' states and action choices, in our design, each agent exists in an independent copy of the same environment and has no interaction with the other agents of the team when sampling data. In each training iteration, every agent of the team collects an on-policy training batch of trajectories in the environment. Meanwhile, we also maintain a replay buffer for each agent to store specific trajectories generated in previous rollouts. POSE is described in detail in Algorithm 1 of the Appendix. Next, we explain how to organize the trajectory buffer.

4.1.1. Organizing Trajectory Replay Memory for Exploration and Exploitation

We maintain a trajectory replay memory $\mathcal{D}^{(i)}$ for the $i$-th agent of the team. The number of trajectories in $\mathcal{D}^{(i)}$ is no more than $K$, and hence every trajectory $\tau$ stored in $\mathcal{D}^{(i)}$ is one of the top-$K$ trajectories ending with a similar final state embedding, where $|\tau|$ denotes the number of steps of the trajectory $\tau$. Different from the SIL method [9], in which only successful trajectories with a return above a certain threshold are eligible to be added to these memories, imperfect trajectories that are not highly rewarded or even do not reach any goal can also be considered as offline demonstrations in our method. For example, a trajectory may lie on the path to some goal even though it does not reach that goal. Furthermore, every replay memory only stores trajectories with similar state embeddings, which correspond to the same goal or state region. If the embedding of a new trajectory $\tau_{\mathrm{new}}$ is similar to those of the trajectories in the $i$-th agent's memory and this trajectory is better than the worst trajectory in this memory (i.e., it has a higher return, a shorter trajectory length, or a shorter distance to the goal), we replace that worst trajectory with the new entry $\tau_{\mathrm{new}}$.
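The bookkeeping described above can be sketched roughly as follows. This is an illustrative top-K buffer under assumed data structures (each trajectory stored together with its return, length, and final-state embedding); the class and field names are hypothetical and not the paper's implementation.

```python
import numpy as np

class TrajectoryMemory:
    """Top-K buffer of trajectories with similar final-state embeddings (illustrative sketch)."""

    def __init__(self, capacity=10, similarity_threshold=1.0):
        self.capacity = capacity
        self.similarity_threshold = similarity_threshold
        self.trajectories = []  # each entry: dict(embedding, ret, length, states, actions)

    def _is_similar(self, embedding):
        # The memory only accepts trajectories whose final-state embedding is close
        # to those already stored (same goal / state region).
        if not self.trajectories:
            return True
        dists = [np.linalg.norm(embedding - t["embedding"]) for t in self.trajectories]
        return min(dists) < self.similarity_threshold

    def _worse(self, a, b):
        # "Worse" means lower return, or longer trajectory when returns tie.
        return (a["ret"], -a["length"]) < (b["ret"], -b["length"])

    def maybe_add(self, traj):
        if not self._is_similar(traj["embedding"]):
            return False
        if len(self.trajectories) < self.capacity:
            self.trajectories.append(traj)
            return True
        worst = min(self.trajectories, key=lambda t: (t["ret"], -t["length"]))
        if self._worse(worst, traj):
            self.trajectories.remove(worst)
            self.trajectories.append(traj)
            return True
        return False

# Example: offer a dummy trajectory with its return, length, and final-state embedding
mem = TrajectoryMemory(capacity=3)
mem.maybe_add({"embedding": np.array([1.0, 2.0]), "ret": 4.0, "length": 37,
               "states": None, "actions": None})
```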

4.1.2. Guiding Agent to Reproduce Trajectory to State of Interest

To achieve better performance than existing RL methods in the sparse and episodic reward setting, we introduce a novel method that improves the data efficiency of RL algorithms and helps the agent reproduce previous trajectories in the memory buffer efficiently. We introduce a new distance measure that quantifies the difference between trajectories and then formulate an RL optimization problem with distance constraints by regarding diverse past demonstrations as guidance. Intuitively, POSE can be viewed as a simple method to encourage the agent to revisit the parts of the state space where past trajectories are located. We describe how to train the policy in detail in Section 4.2.

4.1.3. Improving Exploration by Generating Diverse Trajectories

Compared with typical distributed RL methods such as A3C [30] and IMPALA [32], our POSE method not only employs multiple agents to independently collect large amounts of trajectories from parallel environments but also uses a diverse exploration mechanism to ensure exploration efficiency. We propose a new diversity metric to drive the different agents of the team to reach diverse regions of the state space and maintain the diversity of the ensemble of agents. In our framework, each agent of the team is impelled to pay more attention to visiting states underexplored by the other agents. Consequently, this helps the agents of the team to explore the environment systematically and avoid being stuck in local optima.

4.2. Policy Improvement with Soft Self-Generated Guidance

We assign to a trajectory $\tau$ a domain-dependent behavior characterization $\phi(\tau)$ to describe its behavior. For example, in the MuJoCo maze tasks described in the benchmark [47], $\phi(\tau)$ can be as simple as a sequence of two-dimensional vectors, where each component of the sequence represents the agent's location at the corresponding timestep:

$$\phi(\tau) = \big(\langle x_0, y_0\rangle, \langle x_1, y_1\rangle, \ldots, \langle x_T, y_T\rangle\big).$$

The behavior characterization suggests that states or position information, not actions, are used to distinguish different trajectories, and a similar approach is adopted by [48, 49]. If needed, other behavior characterization functions can also be defined to adjust the focus of the distance measurement based on different aspects such as state visit, action choices, or both.

A particular distance between the current trajectory $\tau$ and a demonstration trajectory $\tau_g$ in the replay memory can be computed as follows:

$$D(\tau, \tau_g) = \mathrm{MMD}^2\big(\phi(\tau), \phi(\tau_g)\big).$$

Here, $\phi(\cdot)$ extracts the coordinate information corresponding to each state-action pair of the trajectory. Meanwhile, $\tau$ and $\tau_g$ are viewed as two deterministic policies in equation (6), and $\rho_\tau$ and $\rho_{\tau_g}$ are the state-action distributions induced by the deterministic policies $\tau$ and $\tau_g$, respectively. Furthermore, the distance between the current trajectory $\tau$ and the trajectory replay memory $\mathcal{D}^{(i)}$ is defined as follows:

$$D\big(\tau, \mathcal{D}^{(i)}\big) = \min_{\tau_g \in \mathcal{D}^{(i)}} D(\tau, \tau_g),$$

i.e., $D(\tau, \mathcal{D}^{(i)})$ is the minimum of the distances between $\tau$ and each trajectory in the replay memory $\mathcal{D}^{(i)}$.
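A small sketch of this trajectory-to-memory distance is given below; it reuses an RBF-kernel MMD² estimator as in the earlier sketch, and the kernel choice and bandwidth are again assumptions for illustration.

```python
import numpy as np

def rbf_mmd_squared(X, Y, bandwidth=1.0):
    """Biased RBF-kernel estimate of MMD^2 between two sample sets (rows are samples)."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-sq / (2.0 * bandwidth**2))
    return k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean()

def distance_to_memory(phi_tau, memory_phis, bandwidth=1.0):
    """D(tau, D): minimum MMD^2 between the current behavior characterization and stored ones."""
    return min(rbf_mmd_squared(phi_tau, phi_g, bandwidth) for phi_g in memory_phis)

# Example: 2D position sequences of a current rollout and two stored demonstrations
rng = np.random.default_rng(1)
phi_tau = rng.normal(0.0, 1.0, size=(50, 2))
memory_phis = [rng.normal(0.2, 1.0, size=(60, 2)), rng.normal(2.0, 1.0, size=(40, 2))]
print(distance_to_memory(phi_tau, memory_phis))
```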

Existing methods that maintain a memory buffer may overly exploit those good experience data, yet the trajectories collected during the training process can be imperfect. Excessive exploitation of imperfect demonstrations might lead to myopic behaviors and hurt performance in some cases. We therefore choose to update the parameters of the agent by regarding previous good trajectories as guidance rather than directly imitating these imperfect trajectory data. By imposing a distance constraint in the trajectory space, each agent of the team is encouraged to revisit the region where past good trajectories are located. In this way, the agent not only exploits what it already knows to maximize reward but also reduces the overuse of previous good trajectories. Furthermore, our method allows for flexibility in the action choices and enables the agent to smoothly move beyond past trajectories to find near-optimal policies. This intuition is based on the following observation:

Assumption 1. For any given bounded tolerance factor $d > 0$, there always exist trajectories with higher returns than the demonstration trajectory, and they stay within a region of radius $d$ around the demonstration trajectory, even when the demonstration trajectory is imperfect and generated by the agent interacting with the environment.

Based on this distance in the trajectory space, we define a new RL optimization problem with constraints for the $i$-th agent as follows:

$$\max_{\theta_i}\ \mathbb{E}_{\tau \sim \pi_{\theta_i}}\big[R(\tau)\big] \quad \text{s.t.} \quad D\big(\tau, \mathcal{D}^{(i)}\big) \le \delta, \quad \forall \tau \in \mathcal{B}^{(i)}.$$

Here, $\delta$ is a constant, and $\mathcal{D}^{(i)}$ is the replay memory of the $i$-th agent. $\mathcal{B}^{(i)}$ contains the trajectory data collected by the $i$-th agent at the current epoch, and $\tau$ represents a trajectory in the buffer $\mathcal{B}^{(i)}$.
From the perspective of policy optimization, using constraints better fits our problem setting for two reasons:
(1) Convergence. The constraint affects the policy update when there are trajectories that do not satisfy it. In this way, it directs the agent to generate trajectories that stay in the constraint domain defined by the distance on the trajectory space. According to Assumption 1, the more frequently the agent visits the state space around the demonstration trajectories, the more likely the agent is to produce trajectories with higher returns.
(2) Optimality. The agent's replay buffer containing previous specific trajectories is maintained by a dynamic update mechanism. Once the agent collects better trajectories with higher returns, shorter trajectory lengths, or shorter distances to the goal, the worst trajectories in the buffer are replaced. Consequently, compared with self-imitation learning methods, our method can leverage imperfect demonstration trajectories to guide the policy while eliminating their side effects in optimization, and thus works better with imperfect demonstration trajectories.

4.2.1. Optimization Process of Soft Self-Imitation RL Objective

This section describes how to efficiently optimize the RL objective with the distance constraint on the trajectory space. For simplicity, we omit the agent superscript $(i)$ for some symbols; these symbols have the same meaning as above unless otherwise specified.

First, using the Lagrange multiplier method, the optimization problem (9) can be converted into an unconstrained form:

$$\max_{\theta}\ L(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[R(\tau) - \lambda\, g\big(D(\tau, \mathcal{D})\big)\Big],$$

where $\lambda$ is a Lagrange multiplier, which is used to determine the effect of the constraint, and $g(D(\tau, \mathcal{D})) = D(\tau, \mathcal{D}) - \delta$ if $D(\tau, \mathcal{D}) > \delta$, else $g(D(\tau, \mathcal{D})) = 0$. Then, the gradient of the objective (8) with respect to the policy parameters $\theta$ is given by

$$\nabla_\theta L(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\nabla_\theta \log p_\theta(\tau)\,\Big(R(\tau) - \lambda\, g\big(D(\tau, \mathcal{D})\big)\Big)\Big],$$

where $p_\theta(\tau)$ is the distribution induced by $\pi_\theta$ over the trajectory space and represents the probability of the trajectory $\tau$. $p_\theta(\tau)$ can be expressed in terms of the environment dynamics model and the policy of the agent, i.e., $p_\theta(\tau) = \rho_0(s_0)\prod_{t} P(s_{t+1} \mid s_t, a_t)\,\pi_\theta(a_t \mid s_t)$. Therefore, the gradient of the score function of the trajectory distribution has the expression $\nabla_\theta \log p_\theta(\tau) = \sum_{t}\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, and it does not contain the environment dynamics model.
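As an illustration of this score-function gradient with the hinge penalty, the following hedged PyTorch sketch builds a surrogate loss whose gradient matches the expression above for a toy categorical policy; the function and variable names, the penalty form, and the dummy data are assumptions, not the paper's exact update.

```python
import torch

def penalized_pg_loss(logps, returns, mmd_dists, lam=1.0, delta=0.1):
    """Score-function surrogate whose gradient matches the penalized objective.

    logps:     list of tensors, log pi_theta(a_t|s_t) for each trajectory (requires grad).
    returns:   list of floats, R(tau) for each trajectory.
    mmd_dists: list of floats, D(tau, D) for each trajectory (treated as constants).
    """
    losses = []
    for logp, ret, dist in zip(logps, returns, mmd_dists):
        penalty = max(dist - delta, 0.0)        # hinge: active only if the constraint is violated
        weight = ret - lam * penalty            # R(tau) - lambda * g(D(tau, D))
        losses.append(-(logp.sum() * weight))   # negative sign: gradient ascent via minimization
    return torch.stack(losses).mean()

# Toy usage: a small categorical policy and two fake trajectories
policy = torch.nn.Linear(4, 3)

def traj_logp(states, actions):
    logits = policy(states)
    return torch.distributions.Categorical(logits=logits).log_prob(actions)

states = [torch.randn(5, 4), torch.randn(7, 4)]
actions = [torch.randint(0, 3, (5,)), torch.randint(0, 3, (7,))]
logps = [traj_logp(s, a) for s, a in zip(states, actions)]
loss = penalized_pg_loss(logps, returns=[1.0, 0.0], mmd_dists=[0.05, 0.4])
loss.backward()  # gradients w.r.t. the policy parameters
```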

4.3. Policy Exploration

As described in Section 4.2, if an agent has collected specific trajectories when interacting with the environment and stores them in the trajectory buffer, our method will regard these trajectories as guidance in the policy improvement step and direct the agent to revisit the regions where past good trajectories are located gradually. However, it might cause the agent to get stuck in local optima. To achieve better exploration performance in challenging tasks with sparse and deceptive rewards, we employ a group of heterogeneous agents to interact with the environment simultaneously. We hope to enable different agents on the team to reach diverse regions of the state space and drive deep exploration.

To maintain the diversity of the team, we first introduce a novel measurement of diversity. Considering the continuous control tasks that we focus on, we use the mean MMD distance between the different agents of the team as the diversity measurement. The mean MMD distance is computed from the current trajectories collected by the team and their mean trajectories, where the mean trajectory of an agent is generated by executing the mean of the agent's Gaussian action distribution at each timestep. Specifically, let $\bar{\tau}_i$ denote the mean trajectory of agent $i$, let $\Pi = \{\pi_1, \ldots, \pi_N\}$ denote the ensemble of agents we employ, and let $\{\bar{\tau}_1, \ldots, \bar{\tau}_N\}$ be the set of mean trajectories of all agents. The diversity measurement of an agent is calculated as in equation (12), where the pairwise distance term is defined as in equation (13) using the MMD between the behavior characterizations of the corresponding trajectories.

For discrete control tasks, we utilize the current optimal trajectories to compute the diversity measurement of the agent team $\Pi$, where the optimal trajectory of an agent is produced by performing the action with the highest value at each timestep. Finally, the objective function of the policy exploration step is given as

$$\max_{\theta_i}\ \mathrm{Div}\big(\pi_{\theta_i}, \Pi\big) \quad \text{s.t.} \quad \mathbb{E}_{s}\Big[D_{\mathrm{KL}}\big(\pi_{\theta_i^{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta_i}(\cdot \mid s)\big)\Big] \le \delta_{\mathrm{KL}},$$

where $\mathrm{Div}(\pi_{\theta_i}, \Pi)$ is the diversity measurement of equation (12).
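Since the exact form of equations (12) and (13) is given in the original paper, the sketch below is only one plausible instantiation of the diversity measurement, assumed here for illustration: the average MMD² between an agent's mean trajectory and those of its teammates.

```python
import numpy as np

def rbf_mmd_squared(X, Y, bandwidth=1.0):
    """Biased RBF-kernel estimate of MMD^2 between two sample sets (rows are samples)."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-sq / (2.0 * bandwidth**2))
    return k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean()

def diversity(agent_idx, mean_trajs, bandwidth=1.0):
    """Average MMD^2 between agent i's mean trajectory and the other agents' mean trajectories."""
    others = [t for j, t in enumerate(mean_trajs) if j != agent_idx]
    return float(np.mean([rbf_mmd_squared(mean_trajs[agent_idx], t, bandwidth) for t in others]))

# Example: three agents' mean trajectories as 2D position sequences
rng = np.random.default_rng(2)
mean_trajs = [rng.normal(loc, 0.5, size=(40, 2)) for loc in (0.0, 1.0, 2.0)]
print([round(diversity(i, mean_trajs), 3) for i in range(3)])
```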

Intuitively, the policy exploration step prevents the team of agents from being stuck in the same local optimum by driving the different agents to reach diverse regions of the state space. This is achieved by finding an ensemble of policies that maximizes the diversity measurement defined in equation (12). In the meantime, we require that every new policy $\pi_\theta$ after the policy update lies inside the trust region around the old policy $\pi_{\theta^{\mathrm{old}}}$, defined as $\{\theta : \mathbb{E}_{s}[D_{\mathrm{KL}}(\pi_{\theta^{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s))] \le \delta_{\mathrm{KL}}\}$, and hence the update avoids severe performance degradation.

4.3.1. Optimization Process for the Diversity RL Objective

To solve the constrained optimization problem, we propose to solve it approximately by linearizing around the current policies $\theta^{\mathrm{old}}$. The gradient of the diversity measurement in equation (14) with respect to the policy parameters can be easily calculated with the mean trajectories and the current rollouts. Denoting the gradient of the diversity measurement as $g_i$ and the Hessian matrix of the KL-divergence for agent $i$ as $H_i$, the linear approximation to equation (14) is as follows:

$$\max_{\theta_i}\ g_i^{\top}\big(\theta_i - \theta_i^{\mathrm{old}}\big) \quad \text{s.t.} \quad \frac{1}{2}\big(\theta_i - \theta_i^{\mathrm{old}}\big)^{\top} H_i \big(\theta_i - \theta_i^{\mathrm{old}}\big) \le \delta_{\mathrm{KL}}.$$

Then, we adopt the conjugate gradient method [4] to approximately compute $H_i^{-1} g_i$ and obtain the update direction. However, trust-region optimization can incur slow parameter updates, which reduces the sample efficiency and increases the computational cost. Therefore, in the early phase of the training process, we can use a first-order optimization method such as [20] to solve the optimization problem in equation (14).
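For completeness, the direction $H_i^{-1} g_i$ can be approximated with a standard conjugate gradient routine such as the sketch below; the Hessian-vector product callback is an assumed interface rather than the paper's code.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given a Hessian-vector product function hvp(v) = H @ v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x = 0 initially)
    p = g.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Example with an explicit symmetric positive-definite H
rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
H = A @ A.T + 6 * np.eye(6)
g = rng.normal(size=6)
x = conjugate_gradient(lambda v: H @ v, g, iters=50)
print(np.allclose(H @ x, g, atol=1e-5))
```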

5. Evaluation of Results

In this section, we present our experimental results and compare our method's performance with that of other baseline methods. In Section 5.1, we present an overview of the experimental setups used to evaluate our method. Sections 5.2 and 5.3 report the results in the different experimental environments, respectively.

5.1. Experimental Setup
5.1.1. Environments

(1) Grid World. To illustrate our method's effectiveness, we design a huge 2D grid world based on Gym [50] with two different settings: sparse rewards and deceptive rewards. At each timestep, the agent observes its coordinates relative to the starting point and chooses from four possible actions: move east, move south, move west, and move north. At the start of each episode, the agent starts from the bottom-left corner of the map, and an episode terminates immediately once a reward is collected. In the sparse reward setting, shown in Figure 2(a), there is only a single goal located at the top-right corner, and the agent is rewarded with 6 when reaching this goal. In the deceptive reward setting depicted in Figure 2(b), there is, in addition, a misleading goal with a reward of 2 in the upper-left room of the grid world that can lead to local optima.

(2) MuJoCo. For continuous control tasks, we also conduct experiments in environments based on the MuJoCo physics engine [51]. Two continuous robotic control tasks, the ant and swimmer mazes, are selected to evaluate the performance of our method and the baseline methods. In each task, the agent takes as policy input a vector of physical state containing the agent's joint angles and task-specific attributes, such as goals, walls, and sensor readings. The control policy then generates a vector of action values that the agent performs in the environment. We compare our method to previous algorithms in the following tasks:
(i) Swimmer maze: the swimmer is rewarded for reaching the goal positions in the maze shown in Figure 3(a). The agent obtains a suboptimal reward (+200) by arriving at the leftmost goal and is rewarded with 500 when it reaches the optimal goal at the rightmost end of the maze.
(ii) Ant maze: the ant is rewarded for arriving at the specified positions in the maze, as shown in Figure 3(b). The ant collects a small reward when reaching the goal at the bottom of the maze and maximizes its reward if it reaches the goal at the top of the maze.

5.1.2. Baseline Methods

The baseline methods used for performance comparison vary across tasks. For discrete and continuous control tasks, we compare our algorithm with the following baseline methods: (1) PPO [20]; (2) SAC [45]; (3) DPPO: distributed PPO [52]; (4) Div + A2C: A2C with a distance-measure regularization [27]; (5) PPO + SIL: PPO with self-imitation learning [9]; and (6) PPO + EXP: PPO with a count-based exploration bonus added to the reward function, where $N(\phi(s))$ is the number of times the state cluster under the state representation $\phi(s)$ is visited during training, and $\beta$ is a hyperparameter that controls the weight of the exploration bonus term. For each baseline, we adopt the parameters that produce the best performance during the parameter search, and not all baselines are adopted in each task.
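The exact count-based bonus used by PPO + EXP follows its original formulation; the sketch below shows one common choice, assumed here purely for illustration, in which the bonus decays with the inverse square root of the visit count.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)  # N(phi(s)): visits per state cluster

def augmented_reward(reward, state_cluster, beta=0.1):
    """Environment reward plus an assumed count-based bonus beta / sqrt(N(phi(s)))."""
    visit_counts[state_cluster] += 1
    bonus = beta / math.sqrt(visit_counts[state_cluster])
    return reward + bonus

# Example: revisiting the same state cluster yields a shrinking bonus
print([round(augmented_reward(0.0, (3, 4)), 3) for _ in range(4)])
```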

5.2. Performance in Huge Grid World

In this experiment, we evaluate the different methods in 2D mazes with two reward settings: sparse rewards and deceptive rewards. We consider five baseline methods in these experiments: A2C, PPO, PPO + SIL, PPO + EXP, and Div-A2C. The performance of each method is reported in terms of average-return and success-rate learning curves. All learning curves are averaged over 8 runs.

5.2.1. Sparse Reward

Figure 2(a) shows the sparse-reward grid world. In this maze environment with a discrete state-action space, we implement our method on top of the PPO [20] algorithm. In this experiment, we select A2C, vanilla PPO, PPO + SIL, Div-A2C, and PPO + EXP as the baselines. For each method, we adopt the settings that produce the highest performance during the hyperparameter search. The learning curves are presented in Figure 4. Compared with the other baseline methods, our method learns the optimal policy more quickly and achieves better performance in terms of average return and success rate in this task. Our method encourages the agent to visit underexplored regions of the state space and makes better use of the past good trajectories maintained in the memory buffers. Therefore, it prompts the agent to focus on the part of the state space from which the sparse rewards can be collected with a higher probability. The PPO and A2C agents have no specially designed exploration mechanism; hence, they reach the treasure and obtain the sparse reward only with very low probability and fail to learn the optimal policy leading to the treasure because of the low percentage of valid samples. During training, the PPO + EXP agent can explore the environment better and occasionally collects the treasure to achieve the best episode return. We also note that the Div-A2C method rarely encounters the sparse rewards. While the PPO + SIL agent can utilize past good trajectories, its success rates and average returns are lower than those of our method at the end of the training process.

5.2.2. Deceptive Reward

Figure 2(b) illustrates the deceptive-reward grid world. From Figure 5, we notice that the average rewards of the baseline methods increase slightly compared with those in the sparse-reward setting. These methods, except for PPO + EXP, can only achieve the suboptimal rewards by collecting the apples, and hence these agents are stuck with a suboptimal policy due to the deceptive rewards. In contrast, our method does not simply adopt myopic and suboptimal behaviors; it treats past good trajectories as guidance and allows the agent to reach diverse regions of the state space by improving upon past trajectories to generate new, better trajectories. Although PPO + EXP can reach the treasure and obtain the highest rewards, its learning process is considerably less stable, leading to inferior performance.

As shown in Figure 6, we also plot the state-visitation counts of all methods in the maze with deceptive rewards, which explicitly illustrate how the different agents explore the 2D grid-world environment. From the state-visitation count graphs, it can be seen that the four baseline approaches are either prone to falling into a local optimum or unable to explore the environment sufficiently to visit the goal with the larger reward. PPO + EXP can obtain the optimal reward from the treasure, but its intrinsic rewards cause it to spend considerable computation visiting meaningless regions of the state-action space. In contrast, Figure 6 shows that our POSE method is able to escape from the area of deceptive rewards, explore a significantly wider and farther region of the 2D grid world, and successfully arrive at the goal with the optimal reward without spending too much computation on insignificant regions of the state space.

5.3. Performance Comparison in MuJoCo Environments

We evaluate our method on continuous control tasks shown in Figure 3 with sparse and deceptive rewards based on the MuJoCo physical engine and similarly plot the in-training median scores in Figures 7 and 8. We consider four baseline methods in these experiments: SAC, PPO, PPO + SIL, and DPPO. The performance of each method is reported in terms of average return and success rate learning curves. Similar to the experiments in the discrete maze, all learning curves are averaged over 5 runs.

Compared with the baseline methods, our method learns considerably faster and obtains higher average returns and success rates. The average returns and success rates of POSE and the other baseline methods in the swimmer maze are usually higher than those in the ant maze because the dimensions of the state and action spaces of the swimmer are lower than those of the ant, and the swimmer will not trip over due to incorrect action inputs. POSE achieves a success rate of almost 100% in less than 200 epochs.

While PPO and PPO + SIL often adopt myopic behaviors and converge to suboptimal policies, POSE is able to escape the local optimum and find better strategies that obtain larger episode returns. We also compare our algorithm with the state-of-the-art RL methods SAC and DPPO. SAC is based on the maximum entropy RL framework, which trains a policy by maximizing a trade-off between expected return and entropy. Nevertheless, SAC does not achieve significantly better exploration than PPO in our experiments. It rarely encounters the optimal reward from the treasure and only occasionally gathers trajectories leading to the treasure in the swimmer maze. Thus, this off-policy method might forget past good experiences and fail to learn the optimal policy that achieves the best reward. DPPO can learn the policy leading to the optimal reward, but it learns more slowly and has a lower success rate. In contrast, our POSE method successfully generates trajectories that visit novel states and saves the trajectories with high returns in the buffer. The POSE agent replicates past good experiences by using the highly rewarded trajectories as guidance and learns the optimal policies with the fastest learning speed and the highest success rate.

6. Conclusion

In this study, we investigate how to design a practical RL algorithm for tasks where only sparse and deceptive feedback is provided. We propose a novel two-step policy optimization framework called POSE, which exploits diverse imperfect demonstrations for faster and more efficient online RL. By regarding diverse past trajectories as soft guidance, the agent can reproduce these trajectories easily and smoothly move beyond them to find near-optimal policies, which regulates the direction of policy improvement and accelerates learning. Furthermore, a novel diversity measure is introduced to drive each agent of the team to visit different underexplored regions of the state space and achieve deep exploration. Experimental results on physical control benchmarks demonstrate the effectiveness of our approach over other baseline methods in terms of efficient exploration and avoidance of local optima in tasks with long horizons and sparse or deceptive rewards.

Appendix

Algorithm training process. Algorithm 1 describes our method in detail. At each iteration, the algorithm is executed according to the framework shown in Figure 1, and the experiences generated by each agent in the environment are stored in their respective on-policy training batches. Then, we use the on-policy trajectory data to update the soft self-imitation learning batches as described in Section 4.1. Furthermore, we compute the advantage of the current policy and the distance between the current trajectory and the soft self-imitation replay buffer for each agent, and we use them to estimate the gradient of the objective in equation (11). Finally, we update the policy parameters with the gradient ascent algorithm and adapt the penalty factor $\lambda$ according to the MMD distance.

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author upon request.

Input: number of agents $N$, learning rate $\alpha$, an on-policy training buffer $\mathcal{B}^{(i)}$ for each agent, a highly rewarded trajectory buffer $\mathcal{D}^{(i)}$ for each agent, sequence length $T$, and number of epochs $E$.
(1)Initialize the policy weights $\theta_i$ of each agent.
(2)Initialize the prior good trajectory buffer $\mathcal{D}^{(i)}$ of each agent.
(3)for $n = 1$ to $E$ do
(4) Collect rollouts and store them in their own on-policy training batch $\mathcal{B}^{(i)}$ for each agent.
(5) Update the soft self-imitation training batches $\mathcal{D}^{(i)}$ for each agent.
(6) Compute advantage estimates $\hat{A}_t$ for each agent.
(7) Estimate the distance between current trajectories and the highly rewarded trajectories in the soft self-imitation replay buffer $\mathcal{D}^{(i)}$ for each agent.
(8) Estimate the gradient in equation (11) for each agent.
(9) Perform the policy improvement step for each agent.
(10) Estimate the diversity gradient $g_i$ and the KL Hessian $H_i$ for each agent.
(11) Perform the policy exploration step by updating the policy parameters according to equation (15).
(12)end for

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Guojian Wang: Conceptualization, Methodology, Software, Formal analysis, Writing–Original Draft. Faguo Wu: Writing–Review and Editing, Supervision. Xiao Zhang: Validation, Supervision, Funding acquisition. Jianxiang Liu: Formal analysis, Software, Visualization.

Acknowledgments

This work was supported by the National Key R&D Program of China (grant no. 2022ZD0116401).