Reinforcement Learning in Robotic Motion Planning by Combined Experience-based Planning and Self-Imitation Learning

High-quality and representative data is essential for both Imitation Learning (IL)- and Reinforcement Learning (RL)-based motion planning tasks. For real robots, it is challenging to collect enough qualified data, either as demonstrations for IL or as experiences for RL, due to safety considerations in environments with obstacles. We target this challenge by proposing the self-imitation learning by planning plus (SILP+) algorithm, which efficiently embeds experience-based planning into the learning architecture to mitigate the data-collection problem. The planner generates demonstrations based on successfully visited states from the current RL policy, and the policy improves by learning from these demonstrations. In this way, we relieve the demand for human expert operators to collect the demonstrations required by IL and improve the RL performance as well. Various experimental results show that SILP+ achieves better training efficiency and a higher, more stable success rate in complex motion planning tasks compared to several other methods. Extensive tests on physical robots illustrate the effectiveness of SILP+ in a physical setting.


Introduction
Motion planning is a fundamental module employed in many robotic platforms [1][2][3][4][5][6]. For manipulators, sampling-based motion planning (SBMP) methods, including Rapidly-exploring Random Tree (RRT) [7] and Probabilistic Roadmap (PRM) [8], have been widely used in recent decades. These methods can be easily implemented on robots with many degrees of freedom (DoF), as they approximate the collision-free space by sampling instead of depending on explicit geometric modeling of the collision and collision-free configuration spaces. In addition, they are probabilistically complete. However, the main disadvantage of sampling-based motion planning is its slow adaptability in dynamic environments. In traditional trajectory planning, the assumption is that all necessary task-space information is available at the start of planning and that the world does not change once movement starts. Driven by the growing demand for intelligent autonomous robots that react instantly in changing environments, there is increasing interest in neural motion planners (NMPs) [9][10][11], which approximate planners with neural networks, for example through imitation learning (IL) and reinforcement learning (RL). Instead of relying on a precise pre-planned path, such systems can use a more general policy for dealing with, e.g., unexpected obstacles, in order to adapt motion control online.
While the use of reinforcement learning is very attractive for training a robust robot-control policy for diverse and uncertain conditions, the training times can be extremely long and costly. A basic solution to the problem of lengthy exploration and sparse rewards is to use additional information, such as in imitation learning. In this study we propose another, related approach, which uses existing planning algorithms to improve the effectiveness of RL-based NMPs. This can be seen as partially supervised training within an RL framework. Lack of high-quality data is the primary barrier to training NMPs in high-DoF manipulation tasks, due to the massive number of diverse system states that neural-based IL and RL must encounter. Additionally, in IL, the main challenge is the lack of representative data concerning the boundary of obstacles when the task requires obstacle-avoiding behaviors [9]. In physical robots, exploring regions around obstacles to collect training data during RL is unsafe and impractical. The combination of IL and RL can potentially boost performance through improved, pertinent exploration, and it can accelerate convergence in RL by exploiting ''expert'' knowledge. In other approaches, the imitation learning scheme requires a separate, heavy data-preparation process to bootstrap the approach. This can be realized using non-human examples from a simulated planner [12], via demonstrations by a human user [13][14][15], or by gathering data in parallel from multiple robots [16] in an ensemble. It would be advantageous if a separate data-collection stage were not necessary, instead employing a process based solely on experience collected during online training.

https://doi.org/10.1016/j.robot.2023.104545 Received 9 February 2022; Received in revised form 24 August 2023; Accepted 26 September 2023
Considering the advantages and disadvantages of traditional and learning-based motion planning algorithms, we explore the integration of planning and learning in motion planning tasks to tackle the data-collection problem. On the learning side, RL makes decisions on selecting the next move in order to reach the goal configuration. As we know, RL is a process that balances exploration and exploitation. However, most experiences in the early training stage are not exploited efficiently when they are not involved in successful trials. Still, such experiences could facilitate obstacle-avoiding behaviors with collision-free biases. Therefore, we regard these episode-explored states as candidate collision-free nodes for graph-based motion planning algorithms. In this work, we use a PRM-based method for supportive, online demonstration generation during training. By planning on the candidate collision-free nodes, this method generates up-to-date demonstrations, episode by episode, and saves them in the demonstration experience replay buffer. Since the demonstrations are generated based on the experience from the RL policy, we regard the planning module as a form of experience-based planning. We exploit those demonstrations for imitation learning, which is similar to the idea of learning from good past experiences. We categorize this method as a self-imitation learning algorithm and call it self-imitation learning by planning. The self-imitation learning scheme guides the RL to learn from demonstrations automatically and continuously, without requiring a human expert for laborious data collection.
In this paper, we propose a new algorithm, called SILP+, which is an enhanced version of SILP [17]. Compared to SILP, SILP+: (1) proposes a Gaussian-process-guided exploration method to reduce undesirable collisions near obstacles. The reduced number of collisions improves safety during the exploration process and reduces the training time by 18%. The importance and details are explained in Section 3.5, and experimental results can be found in Section 4.4. (2) Analyzes the extrapolation error in actor-critic neural networks and proposes a reward-based filter that stabilizes the training process and helps increase the success rate. Motivations and differences are explained in Section 3.4, and the experimental results can be found in Section 4.5. (3) Provides a detailed and extensive analysis of different methods of dealing with collision failures during training. We discover that the RL agent learns better with positive feedback, but negative experience also helps. Details and discussion are given in Section 4.3.
In addition, we tested SILP+ in a pick-and-place task on a physical UR5e platform. The experimental results verify the robustness and adaptability of SILP+ in noisy and uncertain environments.
Although the current study mainly focuses on motion-planning tasks for manipulators, the empirical findings and analysis are of interest to the general study of motion planning and reinforcement learning for the following reasons. First, SILP+ combines planning and learning to improve the performance of motion planning; it represents a new way of utilizing the advantages of planning in a learning scheme while avoiding a high computational load. Second, the empirical results, the analysis of collision failures and the analysis of the extrapolation error in actor-critic networks provide helpful inspiration for designing high-performance RL architectures for motion-planning tasks.
The remainder of this paper is organized as follows. Section 2 introduces the background and related work on neural motion planners and learning from demonstration, followed by the proposed methodology, SILP+, in Section 3. In Section 4, experiments are conducted both in simulation and on the physical robot arm UR5e to analyze the efficiency and feasibility of the proposed SILP+ in motion planning tasks. Section 5 draws conclusions and envisions future work.

Neural motion planning
Methods using neural networks to assist or approximate motion planners have attracted increasing attention because of recent advances in deep learning and the demand for adaptable motion planners in changing environments. Recent work includes biasing sampling toward critical regions with machine learning techniques. For example, Zhang et al. [18] employed RL to learn the probability of rejecting a sampled state in order to bias the sampling toward less dangerous regions. Ichter et al. [19] predicted sampling nodes in critical regions on top of traditional SBMP methods using a conditional variational autoencoder. Francis et al. [20] proposed PRM-RL for long-range navigation, in which RL functions as a short-range local planner and also as a collision-prediction module for the high-level PRM planner. A similar framework can be seen in [21], which uses a regular global planner (RRT or A*) and an RL policy optimization method as the local planner. Besides, there are pure neural motion planners that approximate motion planners with neural networks, mapping states directly to paths or actions. Qureshi et al. [10] proposed MPNet as a neural motion planner and showed that it requires less planning time than traditional motion planners. Jurgenson et al. [9] adapted DDPG and learned collision and reward models for visual motion planning tasks, improving accuracy and planning time compared with SBMP methods. The combination of planning with RL has also been investigated. Benjamin et al. [22] proposed a hierarchical framework for long-horizon reaching tasks: planning at the high level to find subgoals in the replay buffer, and learning at the low level to control the robot to reach those subgoals. This demonstrated the efficiency of embedding planning in learning. However, they focused on 2D reaching tasks with no obstacles in the environment, and they planned over the whole replay buffer, which may involve heavy computation in a more complex task. In contrast, Xia et al. [23] used an SBMP planner at the low level, planning from the current state to subgoals, and trained an off-policy RL algorithm at the high level for subgoal generation.

Learning from demonstrations
Learning a task from scratch without prior knowledge is a daunting process; even human beings and animals rarely try to learn from scratch [24]. They utilize previous experiences and demonstrations from instructors to derive strategies for approaching a learning problem, which is called learning from demonstrations (LfD) or IL. LfD is widely used in learning-based robotic tasks, including helicopter maneuvering [25,26], mobile robot navigation [27,28], surgery [29,30], and manipulation [31,32]. However, LfD also has limitations caused by sparse or poor datasets. Firstly, the control error accumulates in behavior cloning when the agent encounters states that are unfamiliar or unseen in the demonstrations. Secondly, the policy's performance depends heavily on the quality of the demonstrations; the agent cannot perform better than the supervisor without additional information to help it improve [33,34].
One solution is the combination of RL and LfD, called reinforcement learning from demonstrations (RLfD) [35,36], which exploits the strengths of both sides and overcomes their shortcomings. The demonstrations are used to guide and improve RL policies, and the RL in turn provides feedback on the actions via the reward function and, through exploration, discovers policies better than the supervisor's [31,37,38]. Our SILP+ is similar to the work presented in [31], where demonstrations were stored in a demonstration replay buffer and embedded in an auxiliary behavior cloning loss to guide the learning. We modify the LfD framework by adding a planning module for demonstration generation, and these demonstrations are utilized for further self-imitation learning. Similar to SILP+, DAgger [39] also adopts the idea of online supervision, which gives on-time evaluation feedback for the encountered states by relabeling actions. However, unlike DAgger, which retrieves expert guidance at every single step, SILP+ plans over all states experienced within an episode and therefore discovers the core steps in the episode through global knowledge.

Self-Imitation Learning
The main idea of Self-Imitation Learning (SIL) [40] is to improve sample efficiency in RL by utilizing good decisions from the past. However, the quality of the method depends heavily on the RL exploration strategy: it is difficult for the RL policy to obtain informative steps without direct supervision in complex robotic tasks. Besides, SIL was initially designed for on-policy, discrete settings and does not extend straightforwardly to off-policy, continuous-action scenarios.
Recent work [41] targeted robotic applications with continuous action spaces and proposed ESIL, which combines hindsight with SIL such that the agent learns from good experience selected by Hindsight Experience Replay (HER) [42]. The major difference between ESIL and SILP+ is how they create good experiences: ESIL changes the goal based on HER to transform useless experience into positive feedback, while SILP+ uses rigid planners to convert ordinary trajectories into optimized successful paths. Therefore, SILP+ collects higher-quality experiences for SIL to learn from.

Leveraging prior experience
SILP+ can also be categorized as a data augmentation method, manipulating previous experience for better training. In this category, Weber et al. [43] learned the policy with data from imaginations, in which part of the training data was aggregated by rolling out the policy. The widely known HER [42] is also a method of utilizing prior experience; it generates informative data by regarding an experienced next state as the goal. However, the amount of useless data in HER increases as more successful experiences are gained, which impairs sample efficiency. In contrast, our method selects the most promising state as the next state, and the selected states in an episode can form a successful path that provides more informative data for RL training.

Methodology
The combination of RL and LfD has been used to tackle the problem of sample efficiency in RL.It is straightforward and efficient, but the preparation of expert demonstrations might not be easy in many situations.Therefore, a more practical approach is learning by utilizing the agent's informative experiences.In the context of RL, SIL is a technique that takes advantage of this idea and encourages the agent to learn from actions with higher rewards.However, this technique was proposed for discrete control tasks and faced the problem of lacking informative, positive experiences in high-dimensional continuous tasks, such as robotics motion control.
Our SILP+ combines the concepts of SIL and experience-based planning. Experience-based planning ensures qualified guidance by providing direct corrections on visited states using planning methods, while SIL guides the policy learning with good examples from the planning module. This section first formulates the motion planning task in an RL scheme, followed by an explanation of experience-based planning with PRM. Then, we introduce SIL in continuous control tasks and illustrate how planning is embedded with SIL to form SILP+. In addition, we propose a Gaussian-process-guided exploration method near collision regions to improve exploration efficiency. Finally, a model-based reward filter is employed to reduce the extrapolation error in actor-critic RL.

RL for motion planning
We formulate our motion planning task as follows. Let $\mathcal{W}$ represent the world space, and let the set of obstacles in the world space be denoted by $\mathcal{O}$. The configuration space (i.e., the $\mathcal{C}$-space) is denoted by $\mathcal{C}$, in which a configuration of the robot is denoted by $q \in \mathcal{C}$. The forward kinematics $\phi : \mathcal{C} \to \mathcal{W}$ maps a robot configuration in $\mathcal{C}$ to the world space $\mathcal{W}$. If $\phi(q)$ in $\mathcal{W}$ intersects the obstacle set $\mathcal{O}$, then $q$ is in the collision configuration space $\mathcal{C}_{obs}$. The collision-free configuration space is then defined by $\mathcal{C}_{free} = \mathcal{C} \setminus \mathcal{C}_{obs}$. Given the collision-free starting configuration $q_0 \in \mathcal{C}_{free}$ and the goal configuration $q_g \in \mathcal{C}_{free}$, the motion planning task is to find a path that starts from $q_0$ and ends at $q_g$ while avoiding the collision configuration space $\mathcal{C}_{obs}$. This task can be formulated as a Markov Decision Process (MDP), which can be solved in the reinforcement learning framework.
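As a concrete illustration of these definitions, the sketch below implements a toy collision-free check. The forward kinematics, link lengths and box obstacle are hypothetical stand-ins (a planar two-link arm), not the UR5e model used in the paper.

```python
import numpy as np

def forward_kinematics(q):
    """Toy planar two-link arm: maps joint angles q to the end-effector
    position in the world space W (assumed link lengths)."""
    l1, l2 = 0.5, 0.4
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def in_obstacle(p, box_min, box_max):
    """True if the world-space point p lies inside an axis-aligned box
    obstacle, i.e. phi(q) is in the obstacle set O."""
    return bool(np.all(p >= box_min) and np.all(p <= box_max))

def is_collision_free(q, box_min, box_max):
    """q belongs to C_free iff phi(q) does not hit the obstacle set."""
    return not in_obstacle(forward_kinematics(q), box_min, box_max)
```

A configuration is classified point-wise here; a real planner would also check intermediate points along the arm's links.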
We embed our motion planning task in an episodic off-policy RL framework, in which the environment and task are randomly sampled within the workspace at each episode [44]. The goal of the agent is to maximize the expected accumulated future return $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ from the current step $t$, with a discount factor $\gamma \in [0, 1]$ weighting the future importance. Each policy $\pi$ has a corresponding action-value function $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$, representing the expected return under policy $\pi$ after taking action $a$ in state $s$. Following policy $\pi$, $Q^{\pi}$ can be computed by the Bellman equation $Q^{\pi}(s, a) = \mathbb{E}_{s' \sim \rho}\!\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\!\left[Q^{\pi}(s', a')\right]\right]$, where $\mathcal{A}$ represents the action space and $\rho$ is the state distribution.
Let $Q^{*}(s, a)$ be the optimal action-value function. RL algorithms aim to find an optimal policy $\pi^{*}$ such that $Q^{\pi^{*}}(s, a) = Q^{*}(s, a)$ for all states and actions. The learned policy should predict the action at every step, guiding the manipulator to reach the goal while avoiding collisions. The detailed formulation is described below.
• States: A feature vector describes the continuous state, including the robot's proprioception and the obstacle and goal information in the environment. We restrict the orientation of the gripper to point downward, orthogonal to the table, so three of the six joints of our UR5e platform are active in the learning process.
• Actions: Each action is a vector $a \in [-1, 1]^3$ representing the relative position change of the first three joints; the corresponding joint-angle change is scaled by 0.125 rad. To ensure the action always lies within the defined action space $\mathcal{A}$, the RL policy network uses tanh as its output activation, which bounds each component to $[-1, 1]$. Besides, when an action would take the state out of the workspace, a random action is selected instead; this also helps the agent learn the boundaries of the workspace.
• Rewards: A success is reached when the Euclidean distance between the end-effector and the goal satisfies $dis(ee, g) < err$, where $err$ controls the reach accuracy. Given the current state and the taken action, if the next state is not collision-free, a severe punishment is given by a negative reward $r = -10$. If the next state results in a success, we encourage such behavior by setting the reward $r = 1$. In other cases, $r = -dis(ee, g)$ penalizes a long traveling distance. An episode terminates when the predefined maximum number of steps or a success is reached.
Since the goal and obstacles are static within each episode, the state transition function $f$ is deterministic. Given the state $s_t$ and action $a_t$, the next state $s_{t+1}$ can be calculated by the function $f$ under the position controller: $s_{t+1} = f(s_t, a_t)$. We make the natural assumption that the goal states are reachable.
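The reward and termination rules above can be sketched as a small function. The threshold value `ERR` is an assumed placeholder, since the numeric value of $err$ is not stated here.

```python
import numpy as np

ERR = 0.05          # reach-accuracy threshold err (assumed value)
R_COLLISION = -10.0 # punishment for a collision
R_SUCCESS = 1.0     # reward for reaching the goal

def reward(ee_next, goal, collided):
    """Reward rule from the Rewards bullet: -10 on collision, +1 when the
    end-effector is within err of the goal, otherwise the negative
    end-effector-to-goal distance."""
    if collided:
        return R_COLLISION
    dist = float(np.linalg.norm(ee_next - goal))
    if dist < ERR:
        return R_SUCCESS
    return -dist
```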
Since the transition function is known, our off-policy reinforcement learning framework can also be regarded as model-based. However, unlike traditional model-based RL, which directly involves the model in policy optimization and decision-making, we mainly use the model to generate demonstrations that help our ''model-free'' RL agent learn better from its own experience. Besides, whereas most model-based RL depends heavily on model accuracy to avoid suboptimal performance, our method focuses more on the model-free exploration process and thus has a lighter dependence on model accuracy. The model we use serves mainly for demonstration generation, and the self-imitation learning module helps the robot learn only from good experiences.

Experience-based planning with PRM
In motion planning, in order to respond to new requests efficiently, past solutions are stored in memory and can be retrieved and repaired for later use to speed up the planning process. This strategy is called experience-based planning [5,45,46], where the experiences represent previous solutions. In the context of RL, however, experiences refer to the past discrete decisions in the Markov Decision Process: how the policy acts in a specific state and what the next state is under that action. Although most of these experiences are not part of the solutions, especially at the beginning of the exploration process, they still provide information that helps in understanding the environment and the task. In this study, we generate demonstrations for SIL by planning on these experiences from the current policy.
Many graph-search-based planning approaches could be used in SILP+ to plan paths; here, we propose a PRM-based path planner. PRM is a multi-query SBMP algorithm, which exploits the fact that in SBMP it is cheap to check whether a single robot configuration is in free space. The roadmap contains nodes and edges, in which a node represents a specific location and an edge corresponds to a path connecting two nodes. After the roadmap has been generated, planning queries can be answered by connecting the user-defined initial and goal configurations.
The basic PRM is designed for static path planning applications and realizes multi-query path planning by constructing a global roadmap. In our environment, the location of the obstacle is randomly selected within a defined workspace in each episode, making the basic PRM inefficient, as the roadmap would need to be reconstructed every episode. Hence, we build a directed graph on the set of visited collision-free states, denoted by $V_{free}$, in each episode. Each node in $V_{free}$ corresponds to a state in the MDP. These states are already collision-checked, as they are selected from the RL experience; the node collision-checking process can therefore be omitted, which improves the efficiency of the algorithm. We denote the edge between nodes $v_i$ and $v_j$ as $e_{i \to j}$. Edges that are longer than a threshold $d$ or intersect with obstacles are ignored. Finally, the graph is formulated as $G = \{V, E\}$, where $v \in V_{free}$ and $E$ represent the nodes and edges in the graph, respectively, and edges lying in the edge-collision space $E_{obs}$ are excluded. Once we have constructed a network of nodes and edges in the robot's configuration space with PRM, we use the A* algorithm [47] to determine the shortest paths between the start nodes and the end nodes. Here, A* functions as a local planner to extract paths in each episode, in which the end effector's positions are used to calculate the heuristic function. The goal of the local planner is to select the next node that yields the least value of the heuristic function. The adapted PRM-based path planning algorithm is illustrated in Fig. 1 and proceeds as follows:
1. Collision-free node construction: instead of randomly generating configurations in the workspace and selecting the collision-free ones as candidate nodes, we directly select the collision-free states from the experiences as candidate nodes, each corresponding to a state in the MDP. The selected nodes are shown in Fig. 1(1). This step is also explained in Algorithm 1.
2. Start and goal configuration sampling: in the RL training process, the real goal often cannot be reached, especially at the beginning of training. Therefore, we randomly select $m$ start-goal pairs from the filtered collision-free nodes $V_{free}$. For the purpose of illustration, we use $m = 1$; the start and goal nodes are shown in Fig. 1(2).
The cost of a candidate neighbor $n$ is defined as $f(n) = h(n) + g(n)$, where $h(n)$ is the cost from the current node $v_c$ to $n$ and $g(n)$ is the cost from $n$ to the goal; both use the Euclidean distance as the cost value.
6. Select the next node: the node with the least cost is selected as the next start node, as shown in Fig. 1(6), and is appended to the candidate path nodes $V_p$. The edge connecting the current node and the next node is processed further when converting the path nodes $V_p$ into demonstrations. Starting from the current node and repeating steps (3) to (6), the algorithm terminates when the final goal has been reached, the maximal running time has elapsed, or no available neighbor can be found.
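A minimal sketch of the per-episode neighbor selection is given below, assuming precomputed collision-free nodes and an edge collision-checker `edge_free` (a hypothetical helper). It uses the greedy least-cost rule f(n) = h(n) + g(n) from the step description rather than a full A* implementation.

```python
import numpy as np

def plan_on_experience(nodes, start_idx, goal_idx, d_max, edge_free):
    """Greedy best-first walk over visited collision-free states: from the
    current node, score each unvisited neighbor n within distance d_max
    whose edge is collision-free by f(n) = h(n) + g(n) (Euclidean costs),
    then step to the cheapest until the goal is reached or no neighbor
    remains. Returns the list of node indices, or None on failure."""
    path = [start_idx]
    current = start_idx
    visited = {start_idx}
    goal = nodes[goal_idx]
    while current != goal_idx:
        best, best_cost = None, np.inf
        for n in range(len(nodes)):
            if n in visited or not edge_free(current, n):
                continue
            h = np.linalg.norm(nodes[current] - nodes[n])  # cost to n
            if h > d_max:
                continue
            g = np.linalg.norm(nodes[n] - goal)            # cost n -> goal
            if h + g < best_cost:
                best, best_cost = n, h + g
        if best is None:
            return None  # no available neighbors
        path.append(best)
        visited.add(best)
        current = best
    return path
```

Unlike full A*, this greedy walk does not backtrack, so it trades optimality guarantees for per-episode speed.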

Online generation of demonstrations
In order to learn from demonstrations with RL, we need to convert the path nodes $V_p = \{v_0, \ldots, v_n\}$ to the MDP format for further imitation learning. Before proceeding, we make the following assumptions: (1) the forward kinematics $\phi$ is given, so the end effector's position is predictable from the robot's joint values; (2) the inverse model $f^{-1}$ is given, so the action $a_t = f^{-1}(s_t, s_{t+1})$ that controls the agent from one state to the next is accessible. Algorithm 2 gives the pseudo-code for online demonstration generation, in which the objective is to convert the state-based nodes $V_p$ into $(s, a, s', r, d)$ tuples and save them in a demonstration replay buffer $B_d$. The conversion iterates over the successive nodes $s$ and $s'$ and calculates the action $a$ using the inverse model $f^{-1}$. Based on $(s, a, s')$, the reward $r$ and the Boolean flag $d$ can be calculated. However, the predicted action may lie outside the action space $\mathcal{A}$. In this case, we insert an extra node between the states at half the value of $a$, and the new next state is calculated by $f$. This process repeats until the action lies within the action space $\mathcal{A}$. We save the constructed demonstrations in $B_d$ for policy learning.
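Under the stated assumptions, the conversion with midpoint insertion might look like the following sketch. Here `inverse_model` simply takes the relative joint change as the action, and `reward_fn` is an assumed callable; both are illustrative stand-ins for the paper's $f^{-1}$ and reward definition.

```python
import numpy as np

A_MAX = 1.0  # per-dimension bound of the action space [-1, 1]^3

def inverse_model(s, s_next):
    """Assumed inverse model f^-1: the relative joint change is the action."""
    return s_next - s

def nodes_to_demos(path, reward_fn):
    """Convert path nodes into (s, a, s', r, done) tuples. When the action
    between successive nodes leaves the action space, an intermediate node
    at half the action is inserted and the check repeats."""
    demos = []
    for k in range(len(path) - 1):
        segments = [(path[k], path[k + 1])]
        while segments:
            s, s_next = segments.pop(0)
            a = inverse_model(s, s_next)
            if np.max(np.abs(a)) > A_MAX:
                mid = s + a / 2.0  # insert an extra node halfway
                segments = [(s, mid), (mid, s_next)] + segments
                continue
            done = (k == len(path) - 2) and bool(np.allclose(s_next, path[-1]))
            demos.append((s, a, s_next, reward_fn(s, a, s_next), done))
    return demos
```

For a segment whose required action is three times the bound, the refinement yields four in-bounds sub-steps.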
The overall structure of SILP+ is depicted in Fig. 2. At the end of each training episode, we obtain the visited states, as shown in the left sub-figure. These experiences are stored in the interaction experience replay buffer. Based on the visited states, we plan paths using the PRM-based planner; the result is shown as the directed dotted lines in the right sub-figure. These paths are then converted into demonstration tuples and saved in the demonstration replay buffer for further imitation learning.

Self-imitation learning
Self-imitation learning is an RL method that encourages actions whose returns were higher than expected [40]. It was shown to improve the performance of actor-critic methods in several discrete control tasks. One of the challenges in SIL is that the task becomes difficult when exploration performs poorly: random exploration may never generate a good experience within a reasonable time, and the policy cannot benefit from imitating good experience when there is none. To this end, we propose to combine SIL with experience-based planning, called SILP+, in the context of motion planning. The planning module provides demonstrations based on the visited experience, while SIL pushes the policy update towards the demonstrations.
The method of embedding demonstrations into SIL is adapted from [31], where demonstrations are stored in a separate replay buffer $B_d$ and imitated through a behavior cloning loss
$$L_{BC} = \sum_{i=1}^{N_d} \left\| \pi_{\theta}(s_i) - a_i \right\|^2,$$
where $a_i$ and $s_i$ are the action and state sampled from buffer $B_d$, and $\theta$ represents the learning parameters of the policy. The policy imitates the good choices from the demonstrations by adding the behavior cloning loss to the objective $J$, weighted with hyperparameters $\lambda_1$ and $\lambda_2$. The updated objective $J'$ is shown in Eq. (4):
$$J' = \lambda_1 J - \lambda_2 L_{BC}. \tag{4}$$
In order to avoid learning from imperfect demonstrations, a Q filter is employed in SILP [17] to prevent adding the behavior cloning loss when the policy's action is better than the action from the demonstrations. However, experience-replay-based off-policy RL methods, such as DDPG and SAC, are prone to extrapolation error in the Q-function approximation: they struggle to learn when the data differs from the current policy's data distribution. For example, updating the target Q values in DDPG and SAC involves the next state $s'$ and the action $a'$ from the target policy. This risks unreliable Q-value estimates, as these state-action pairs are likely to be unfamiliar to the policy and absent from the replay buffer. Fujimoto et al. [48] demonstrated that actor-critic algorithms deteriorate when the data is uncorrelated, and the value estimates produced by the Q-network diverge. With these considerations, the Q filter used in [31] could introduce extrapolation error and incorrectly filter the action-gap-guided RL training.
Since we use planning as the source of demonstrations, it is natural to utilize models to form a more reliable filter. To this end, we replace the Q filter with a predicted-reward filter: instead of comparing state-action values between the policy and the demonstrations, we predict the rewards $r$ of the actions from the policy and from the demonstrations. The objective in (4) is changed to
$$J' = \lambda_1 J - \lambda_2 L'_{BC}, \tag{5}$$
where
$$L'_{BC} = \sum_{i=1}^{N_d} \left\| \pi_{\theta}(s_i) - a_i \right\|^2 \, \mathbb{1}\!\left[ r(s_i, a_i) > r(s_i, \pi_{\theta}(s_i)) \right], \tag{6}$$
and
$$r(s_i, a_i) \text{ is computed from } (s_i, a_i, s_{i+1}) \text{ with the reward definition of Section 3.1}, \tag{7}$$
where $(s_i, a_i, s_{i+1}, r_i)$ is one of the MDP tuples sampled from $B_d$. For computing $r(s_i, \pi_{\theta}(s_i))$ in (6), we use (7) with $s_{i+1}$ replaced by $f(s_i, \pi_{\theta}(s_i))$; correspondingly, the next state $s_{i+1}$ is predicted by the transition model $f$.
While larger values of the $n$-step return $n$ can potentially yield better performance [49] by capturing longer-term dependencies, they often require more training time due to the increased number of time steps to consider. Given the complexity of our problem domain and the objectives of our study, we opted for the one-step return $n = 1$ to strike a balance between performance and training efficiency; however, it is straightforward to use $n > 1$.
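A minimal sketch of the predicted-reward filter applied to the behavior cloning loss is given below, with `policy` and `reward_model` as assumed callables; in the method itself, the reward model would combine the transition function $f$ with the known reward definition.

```python
import numpy as np

def bc_loss_reward_filtered(policy, reward_model, states, demo_actions):
    """Behavior cloning loss with the predicted-reward filter: the squared
    action error at state s_i contributes only when the demonstrated action
    a_i is predicted to earn more reward than the policy's own action."""
    loss = 0.0
    for s, a_demo in zip(states, demo_actions):
        a_pi = policy(s)
        # Keep the term only if the demonstration still beats the policy.
        if reward_model(s, a_demo) > reward_model(s, a_pi):
            loss += float(np.sum((a_pi - a_demo) ** 2))
    return loss
```

In a deep-RL implementation the indicator would be computed batch-wise and the loss added to the actor objective with weight lambda_2.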

Gaussian process-guided exploration
In SILP [17], when a collision happens during training, the agent goes one step back to the pre-collision state and randomly selects another action to continue. Under this strategy, we observed that the policy is likely to get stuck selecting random collision-free actions near the obstacles until it runs out of steps. This issue is due to a limitation of random sampling, which needs an impractical number of samples to cover the whole space; the agent therefore easily exhausts the predefined steps before it encounters a valid action. To tackle this problem, we propose Gaussian-process-guided exploration to improve the exploration quality near the collision regions. The main idea is to model the reward landscape based on the experiences collected when the collision happens, and to select the action with the most promising reward.
We use Gaussian process regression to approximate the function $f_r$ that maps experienced states to rewards. First, we assume $f_r$ is distributed as a Gaussian process, $f_r \sim \mathcal{GP}(0, k(x, x'))$, with zero prior mean and covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The output $y(x)$ of the function $f_r$ at input $x$ can be written as $y(x) = f_r(x) + \epsilon$, with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$. After collecting $n$ observations, denoted by $D_n = \{x_1, \ldots, x_n, y_1, \ldots, y_n\}$, the predictive distribution at a location $x$ is Gaussian with mean
$$\mu_n(x) = k_n(x)^{\top} \left(K_n + \sigma_n^2 I\right)^{-1} y_n$$
and variance
$$\sigma_n^2(x) = k(x, x) - k_n(x)^{\top} \left(K_n + \sigma_n^2 I\right)^{-1} k_n(x),$$
where the entries of the vector $k_n(x) \in \mathbb{R}^n$ are $[k_n(x)]_i = k(x_i, x)$, the entries of the Gram matrix $K_n \in \mathbb{R}^{n \times n}$ are $[K_n]_{i,j} = k(x_i, x_j)$, and the entries of the vector of observations $y_n \in \mathbb{R}^n$ are $[y_n]_i = y_i$. When there is a collision during training, we collect all of the states and rewards in the episode and fit them as the inputs and outputs of $f_r$. Then we use Monte Carlo planning to select the intended action: first, $N$ random state configurations are sampled; then, rewards are predicted for these configurations using the learned model $f_r$; next, the rewards are scaled into sampling probabilities within $[0, 1]$ that sum to 1; finally, one of the configurations is drawn under the scaled probabilities as the desired next state, and the corresponding action is retrieved. The kernel we use in the Gaussian process regression is Matern 5/2 [51]. An example of the fitted reward distribution is depicted in Fig. 3: the experienced collision regions have lower expected rewards, so there is only a small probability of choosing the next action in their vicinity.
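The predictive mean and the Monte Carlo selection step can be sketched directly from the formulas above. The lengthscale and noise values are assumed hyperparameters, and a plain linear solve stands in for a tuned GP library.

```python
import numpy as np

def matern52(x1, x2, ell=1.0):
    """Matern 5/2 kernel with an assumed lengthscale ell."""
    r = np.linalg.norm(x1 - x2) / ell
    c = np.sqrt(5.0) * r
    return (1.0 + c + 5.0 * r ** 2 / 3.0) * np.exp(-c)

def gp_predict_mean(X, y, x_star, noise=1e-4):
    """Predictive mean mu_n(x*) = k_n(x*)^T (K_n + sigma^2 I)^{-1} y_n."""
    n = len(X)
    K = np.array([[matern52(X[i], X[j]) for j in range(n)] for i in range(n)])
    k_star = np.array([matern52(X[i], x_star) for i in range(n)])
    alpha = np.linalg.solve(K + noise * np.eye(n), y)
    return float(k_star @ alpha)

def sample_next_state(X, y, candidates, rng):
    """Monte Carlo step: predict rewards at candidate states, scale them to
    probabilities that sum to 1, and draw one candidate as the next state."""
    mu = np.array([gp_predict_mean(X, y, c) for c in candidates])
    p = mu - mu.min() + 1e-9  # shift to non-negative, avoid an all-zero vector
    p = p / p.sum()           # probabilities sum to 1
    return candidates[rng.choice(len(candidates), p=p)]
```

States near low-reward (collision) observations receive small probabilities, so the drawn next state tends to avoid those regions.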

Experiments
We conducted experiments in both simulation and real-robot settings to answer the following questions: (1) Under the same environment and task settings, does SILP+ perform better than other baseline algorithms in terms of success rate and sample efficiency? (2) Does SILP+, which additionally includes an online demonstration-generation step, increase the computational burden and lead to slower training? (3) How do collisions affect the learning performance? (4) How do the Gaussian-guided exploration and extrapolation errors affect the performance? (5) Does the policy learned in simulation transfer well to a physical robot where noise and uncertainty exist?

Task and training setup
We used Gazebo with the ODE physics engine as the simulator for training policies, in which a 6-DoF UR5e robot arm equipped with a Robotiq 2F-140 gripper accomplishes long-horizon motion planning tasks. The workspace of the end-effector is restricted to x ∈ [0, 0.8] m, y ∈ [−0.3, 0.8] m, z ∈ [0, 0.6] m to simplify the tasks and avoid unnecessary collisions. A box with a width of 0.2 m and a height of 0.3 m is used as an obstacle in the task. The obstacle's position (its center of mass) is limited to x ∈ [0.3, 0.7] m, y ∈ [0.1, 0.4] m, z = 0.15 m (see the purple region in Fig. 4(a)). The initial arm pose and the goal pose are restricted to the reachable workspace of the end-effector, as mentioned before. In order to balance the number of collision and non-collision interactions, we restrict the initial pose, goal pose and obstacle position to satisfy d2 > d1 > d3, where d1 is the Euclidean distance between the initial end-effector position and the obstacle position, d2 is the Euclidean distance between the initial end-effector position and the goal position, and d3 is the Euclidean distance between the obstacle position and the goal position. This means that the robot needs to learn a generalized policy to reach the target from different directions while avoiding the obstacle. The relative positions are projected in 2D and depicted in Fig. 4(b).
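The distance constraint d2 > d1 > d3 can be enforced by simple rejection sampling over the workspace bounds given above. The sketch below is illustrative; the helper name and the use of uniform sampling are assumptions not stated in the paper.

```python
import numpy as np

# Workspace and obstacle bounds from the task setup (x, y, z in meters)
WORKSPACE = np.array([[0.0, 0.8], [-0.3, 0.8], [0.0, 0.6]])
OBSTACLE_REGION = np.array([[0.3, 0.7], [0.1, 0.4], [0.15, 0.15]])

def sample_episode_config(rng=None):
    """Rejection-sample initial end-effector, goal and obstacle positions
    until the distance constraint d2 > d1 > d3 is satisfied."""
    rng = np.random.default_rng(rng)
    while True:
        init = rng.uniform(WORKSPACE[:, 0], WORKSPACE[:, 1])
        goal = rng.uniform(WORKSPACE[:, 0], WORKSPACE[:, 1])
        obstacle = rng.uniform(OBSTACLE_REGION[:, 0], OBSTACLE_REGION[:, 1])
        d1 = np.linalg.norm(init - obstacle)   # initial pose -> obstacle
        d2 = np.linalg.norm(init - goal)       # initial pose -> goal
        d3 = np.linalg.norm(obstacle - goal)   # obstacle -> goal
        if d2 > d1 > d3:
            return init, goal, obstacle
```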

Baseline comparison
We employed two state-of-the-art off-policy RL algorithms, deep deterministic policy gradient (DDPG) [52] and soft actor-critic (SAC) [53,54], as our basic baselines to evaluate SILP+. We call the resulting methods DDPG-SILP+ and SAC-SILP+, respectively. The design of the neural networks and hyperparameters for DDPG and SAC, as well as the training details, can be found in [17]. The procedure of SILP+ is described in detail in Section 3.2, above.
We compared several existing and adapted methods with the proposed SILP+ approach. Performance indicators are (a) the success rate, i.e., the number of successful task attempts divided by the total number of attempts, and (b) the training time for the following thirteen methods and method combinations. For HER, the number of imagined goals is four for each visited state. The DDPG-Demon and SAC-Demon methods inject demonstrations into the regular experience replay buffer. As in SILP+, those demonstrations come from online PRM planning. The difference between SILP+ and Demon is that Demon does not use a behavior-cloning-loss-based SIL to specifically learn from good experience; instead, Demon is essentially a data augmentation method. PRM-0.1 and PRM-1 were implemented in MoveIt under planning-time limits of 0.1 s and 1 s, respectively. We used 10 demonstrations to train the BC model. The demonstrations in BC-SAC-SILP+ were collected by the well-trained SAC-SILP+ policies, and the demonstrations in BC-PRM were collected by PRM through MoveIt with the planning time limited to 1 s.
For the DDPG- and SAC-based methods, we set the training length to 1K epochs, where an epoch represents a fixed number of parameter-update iterations and contains ten episodes. The results are summarized in Table 1. The training time is the wall time for the defined 1K epochs, summarized over three policies trained with different seeds. The final success rates are measured over the same three trained policies, with each policy tested 1K times. The planning time is the accumulated time for rolling out the learned policy, also summarized over the three trained policies, with each policy rolled out 1K times. Note that the PRM methods, as online planners, do not have training time. The training time for the BC-based methods comprises the data collection and policy training time.
From the success rate column, we observe that SAC-SILP+ achieved the highest success rate (0.973) and the lowest standard deviation (0.002) among the compared methods. The success rates of the DDPG-based methods are lower than those of the SAC-based methods, but SILP+ can boost DDPG's performance to the level of the SAC variants, as the data for DDPG-SILP+ and SAC-SILP+ show. In addition, the traditional PRM methods performed worse than our SILP+ methods, although increased planning time could slightly improve their performance. Not surprisingly, the BC methods depend heavily on the quality of the expert demonstrations. The quality of the SAC-SILP+ demonstrations is better than that of PRM's, because the paths planned by PRM can be partly beyond the workspace or suboptimal due to singularities. As a result, BC with SAC-SILP+ demonstrations and BC with PRM demonstrations achieved success rates of 0.967 and 0.464, respectively, under the same test configurations.
From the training time column, we find that SAC-Demon took the least training time, but SAC-SILP+ used a similar amount of training time with the lowest standard deviation. SAC-SILP+ and DDPG-SILP+ achieved 28% and 8% less training time compared to plain SAC and DDPG, respectively. Although the planning module intertwines with the learning process, it adds no substantial computational load to the main program. This is due to the following facts: (1) the planning process is based on the visited states and is accelerated by eliminating collision-checking on the candidate nodes; (2) the most computationally expensive step is the interaction with the environment, and the planner accounts for only a small proportion of the total computation.
In addition, the HER-related methods are less stable than the others in our task, as they have the highest standard deviation among the compared algorithms. For the BC-related methods, we calculated the training time as the sum of the data collection and BC model training time. The training time for the BC methods is much lower than for the RL-based methods; however, this assumes that an expert is given before BC training, and access to such an expert also takes time and effort. In terms of planning time, the learning-based methods take similar amounts of time to roll out their policies. The situation for the PRM methods is different, as we can set the desired planning time for the planner: the longer the time we allocate, the higher the success rate we can expect. We found that if we limit the planning time of PRM to 0.1 s, the same level of planning time as our learning-based methods, the mean success rate is 0.749, much lower than the results of DDPG-SILP+ or SAC-SILP+. From this perspective, SILP+ is superior to the traditional PRM method.
We also compared the success rates during training, as shown in Fig. 5. The curves indicate that our SILP+-related methods DDPG-SILP+ and SAC-SILP+ perform better than the other methods from the very beginning of learning. The improvements slow down after 100 epochs, but the advantage of both methods remains significant until the end of training.
The baseline comparisons above verified the higher success rates and better training efficiency of SILP+ compared to the other algorithms, including SILP. The main contributions of SILP+ over SILP are the Gaussian-process-guided exploration near obstacles, the reward-based filter in the self-imitation learning framework, and the learning strategy in response to collisions. While the difference in success rate between SAC-SILP+ (0.973) and SAC-SILP (0.944) may appear small, it is essential to consider the context: since the success rate is bounded by 1, each incremental improvement becomes increasingly significant as performance approaches the maximum. In terms of efficiency, the average training time of SILP+ improved by 18.2% compared to SILP. This significant reduction in training time is a clear indicator of the superior performance of SILP+ over SILP.
In the next subsections, we further investigate how each of these elements contributes to better performance and training efficiency in SILP+.

Collision types comparison
It is commonplace to encounter failures during the learning process in motion planning tasks. Herein we consider scenarios where failing is undesirable but not catastrophic, such as touching or moving fragile objects or immobile obstacles. In such scenarios, failures still provide a valuable source of information, which is normally handled by user-defined penalties in reward functions. Yet, designing effective reward functions to avoid failures is difficult and usually requires domain knowledge. Besides designing a suitable reward function, the way collision states are dealt with in the RL framework also plays a vital role in the learning performance. We consider the following three methods for dealing with collisions:
• type-0 (early reset on collisions): The algorithm terminates the episode when a collision happens [55,56] and punishes the collision in the reward function [56,57].
• type-1 (continue as if nothing happened): The algorithm continues the episode with another random but collision-free action; collision experiences are skipped and not added to the experience replay buffer for training.
• type-2 (learn from collisions and successes): The algorithm continues the episode with another random but collision-free action and adds all of the collision experiences to the experience replay buffer for policy training.
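The three strategies differ only in how an episode proceeds after a collision and in what enters the replay buffer, which can be summarized in a few lines. The function and its arguments below are hypothetical, for illustration only.

```python
def handle_collision(mode, collision_transitions, replay_buffer, retry_action):
    """Sketch of the three collision-handling strategies (hypothetical API).

    Returns the action to continue the episode with, or None to signal that
    the episode should terminate (type-0, early reset).
    """
    if mode == "type-0":
        # Early reset: terminate the episode; the collision itself is
        # punished through the reward function.
        return None
    if mode == "type-2":
        # Learn from collisions: keep all collision transitions for training.
        replay_buffer.extend(collision_transitions)
    # type-1 and type-2 both continue with a random collision-free action;
    # type-1 simply discards the collision transitions.
    return retry_action
```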
We recorded the success rates during training with three methods on pure SAC (type-0, type-1, type-2) and SAC-SILP+ (type-0-SILP+, type-1-SILP+, type-2-SILP+) and illustrated them in Fig. 6.The training time, the number of collisions, accumulated steps during training, and the success rates on trained policies (average with three seeds on 1K episodes) are summarized in Table 2.
The difference between type-0 and type-1 is that the latter does not terminate the episode when a collision happens; instead, type-1 randomly selects collision-free actions to continue the episode. This results in longer episodes and training time: as shown in Table 2, the training time of type-1 increased by 70.3% and 78.5% for SAC and SAC-SILP+, respectively, compared to type-0. In addition, although type-1 adds more experience and information near the obstacles, the success rate of type-1 (0.423) in SAC is much lower than that of type-0 (0.819). The declined performance can be attributed to the scarcity of random exploration near obstacles: in continuous space, these explorations are inadequate to help the policy gain a generalized understanding of the environment and instead confuse the policy. However, the situation is different when SILP+ gets involved; all three types benefit from SILP+ in success rate and training time. The success rate of type-1 in particular more than doubles under SAC-SILP+, from 0.423 to 0.942. Our interpretation is that the randomly explored states near the obstacle enrich the nodes in PRM and help SILP+ plan better-distributed demonstrations, thus alleviating the scarce-experience problem near obstacles.
The difference between type-1 and type-2 is that type-2 uses the collision experiences to update its policy while type-1 does not. From the success rates in Table 2, we see that type-2 performs better than type-1 in both the SAC and SAC-SILP+ settings. Accordingly, the training time decreases, especially for the SAC methods. We conclude that failure experiences can boost performance, which is also reflected in the comparison between type-0 and type-2.
From the analysis above, one observes that the positive feedback from the planned paths in SILP+ also helps with learning. However, which information has more impact on the results: negative feedback from failures or positive feedback from the planned demonstrations? The answer can be found by comparing the success rates between type-0 and type-2 (from 0.819 to 0.865) and between type-0 and type-0-SILP+ (from 0.819 to 0.925). The improvements from negative and positive feedback are 0.046 and 0.106, respectively. We therefore interpret that the positive feedback from LfD is more important than the negative feedback in training an NMP.
Here, we take a high success rate as our objective, so we selected type-2 to deal with collisions during the training process. However, if efficiency has a higher priority, type-0 is also a good option, as it achieves comparable performance under the SILP+ algorithm while requiring less training time.

Gaussian-process-guided exploration
In this subsection, we conducted an ablation experiment to investigate the effect of the Gaussian-process-guided exploration. We compared SAC-SILP+ with the Gaussian-process-guided exploration module against SAC-SILP+ without this module, which we call with-GP and without-GP, respectively. The success rate, training time and number of collisions during training are summarized in Table 3. From the data, we can see that the final success rates are the same, but the number of collisions for with-GP decreased to around 20% of without-GP. Accordingly, the training time was shortened by more than 1 h. The greatly decreased collision count can be beneficial for safety-sensitive applications, especially in the robotics field.

Extrapolation error reduction with reward filter
During the training process, if SILP+ is steadily improving, the number of actions in the demonstration buffer that perform better than the policy should decrease. The comparison between the demonstrations and the policy was done with a Q filter, as used in [31] and [17]; the result is embedded in a behavior cloning loss that contributes to the RL gradient updates. However, the Q filter can also trigger extrapolation errors and result in unstable and unreliable training. In this part, we conducted an ablation experiment to investigate how extrapolation error occurs in the Q filter and affects the training performance, and how the proposed reward-based filter can alleviate its impact.
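As an illustration of the filtering idea, a reward-based filter restricts the behavior-cloning loss to the demonstration actions that achieved a higher reward than the policy. The sketch below uses a plain MSE loss on NumPy arrays; the function name and loss form are assumptions, and the Q-filter variant would replace the reward comparison with a comparison of critic values Q(s, a).

```python
import numpy as np

def bc_loss_with_reward_filter(policy_actions, demo_actions,
                               policy_rewards, demo_rewards):
    """Behavior-cloning loss over only those demonstration actions that
    achieved a higher reward than the current policy (illustrative sketch).

    Returns the mean squared error on the filtered pairs and the number of
    demonstration actions that passed the filter.
    """
    mask = np.asarray(demo_rewards) > np.asarray(policy_rewards)
    if not mask.any():
        return 0.0, 0  # nothing left to imitate from
    diff = np.asarray(policy_actions)[mask] - np.asarray(demo_actions)[mask]
    loss = float(np.mean(np.sum(diff ** 2, axis=-1)))
    return loss, int(mask.sum())
```

The second return value corresponds to the quantity #N tracked in Fig. 7: as the policy improves, fewer demonstration actions pass the filter and the imitation term shrinks.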
In the experiments, we selected SAC-SILP+ as the algorithm with the reward filter and replaced the reward filter with a Q filter to form the comparison algorithm, called SAC-SILP+ with Q filter. First, we recorded in Fig. 7 the number of actions in the demonstration buffer that performed better than the actions from the policy during training. In the figure, the magenta and blue curves represent SAC-SILP+ with the Q filter and SILP+ with our reward filter, respectively. From the curves, one can see that while the average Q-filter effect is better at a high number of epochs, it also yields a large variance over the whole training process. In addition, the effect was unstable before epoch 100, as one can observe from the downward and upward variations in the curve. In contrast, SILP+ with our reward filter (blue line) has a much more stable curve with lower variance. Furthermore, we notice that with our method, the number of demonstration actions that perform better than the policy becomes smaller than with the Q filter after around epoch 50, which means the policy needs to imitate the expert less while depending more on itself in SILP+. The results shown in Table 4 indicate that the extrapolation error in the Q value not only results in unstable training but also deteriorates the success rate and training efficiency.

Actual robot experiments
We designed several experiments on a physical UR5e robot mounted on a table (Universal Robots, S/N 20195501237). First, we compare the performance of the best-performing policy, SAC-SILP+, in simulation and on the real robot. Then we perform an additional comparison between SAC-SILP+ and the next-best contender, SAC-Demon, on the real robot. In the SAC-SILP+ test, we used three different conditions to check the effectiveness of the simulation-trained policy when deployed on a pick-and-place task in a physical context. The task is shown in Fig. 8, in which three objects are randomly placed on the table, taking care not to put them far outside the trained workspace. The big cardboard box (0.2 m × 0.2 m × 0.3 m) represents the obstacle that the robot should avoid during the movement; the small cube (0.07 m × 0.07 m × 0.07 m) is the object that needs to be picked up and placed in the blue basket (0.15 m × 0.15 m × 0.1 m). The blue basket is the goal, functioning as a container for the cube. The pick-and-place task consists of the following four steps:
• Object recognition and localization: The perception system is based on the ArUco marker method [58,59], which recognizes the markers and calculates their poses relative to the camera. Based on these marker poses, we compute the objects' locations relative to the base of the robot arm.
• Pick up: The robot moves to the position of the cube with the gripper's center located 20 cm above the center of the cube in the Z direction; then the robot moves 10 cm down to prepare for the grasping action and place the cube inside the gripper; finally, the gripper closes half-way to grasp the cube and moves 10 cm up to pick the cube up.
• Goal reaching: The learned policy guides the robot to move towards the goal without colliding with the obstacle.
• Place the object: Once the goal has been successfully reached, the gripper drops the cube into the basket.
The process terminates if there is a collision or the policy runs out of pre-defined steps.
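The pick-up step with its 20 cm approach and 10 cm descend/lift offsets might look as follows; `robot.move_to` and `robot.close_gripper` are hypothetical interfaces, not the actual UR5e driver API.

```python
def pick_up(robot, cube_pos, approach=0.20, descend=0.10):
    """Sketch of the pick-up step with the offsets used in the experiment
    (all distances in meters; `robot` is a hypothetical interface)."""
    x, y, z = cube_pos
    # Hover 20 cm above the cube, then descend 10 cm to enclose it.
    robot.move_to((x, y, z + approach))
    robot.move_to((x, y, z + approach - descend))
    # Close the gripper half-way to grasp the cube.
    robot.close_gripper(fraction=0.5)
    # Lift the cube 10 cm back up to the hover height.
    robot.move_to((x, y, z + approach))
```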
Note that in the simulation training, the required positional accuracy for successful reaching is 0.05 m. Here, we use the reaching algorithm in a pick-and-place task; therefore, taking the cube's size into account, we choose a basket larger than 0.05 m as the goal indicator. Still, the required accuracy is nearly the same as in simulation. The initial states of the robot, obstacle, cube and basket affect the task's difficulty level and hence the policy's performance. To fully test the performance of the algorithm on the real robot, we categorized the test scenarios into three types: Free (easiest), Block (normal) and Change (difficult). In the Free task, the obstacle does not block the path from the to-be-grasped object to the basket, and the robot can move straight to the goal without considering the obstacle. In the Block task, the obstacle is located between the robot's initial pose and the goal and forms a barrier to the robot's motion. The most difficult one is Change, in which, instead of a fixed goal and a fixed obstacle, the goal and/or obstacle move to other locations during the execution of the policy. The difficulty level increases from Free to Block and then to Change. We first tested each scenario for 50 trials in the real environment and saved the trajectories of the end-effector and the perceived poses of the goal, obstacle and objects. Then, we repeated the real-world experiments in Gazebo using the same trained policy.
Fig. 8. Pick-and-place task with the physical robot and objects. The black-and-white square markers attached to the robot and objects are used by the perception system for localization.
Fig. 9. Success rates when SAC-SILP+ is applied on a real UR5e robot compared to the simulation environment. Please note that the origin of the y-axis is not zero but 0.70.
For the comparison between SAC-SILP+ and the next-best contender (SAC-Demon), we used 10 different scenarios with different obstacles, goals, and initial robot poses. The experimental setting is mostly the same as in the first experiment, except that this experiment focuses on a reaching task instead of a pick-and-place task. We ran the two simulator-trained policies for ten trials in each scenario and recorded the number of successes; at the end of the testing, each policy had been tested 100 times.

Actual UR5e robot -experimental results
For the first SILP+ experiment, the success rates in the physical and simulated environments are compared in Fig. 9. The success rates decrease as the difficulty level increases in both the simulation and the physical environment, and the Sim2Real gap becomes more noticeable in more difficult scenarios. For the Free scenario, there were three failure trials in simulation and four in the real test. The three failures common to simulation and the real trials were due to the obstacle's position lying outside the trained workspace. The additional failure in the real environment was a collision with the basket, which we attribute to the different experimental settings in simulation and on the actual robot: the policy was trained to reach a point in simulation, but we applied it to a pick-and-place application in the physical world. In the Block scenario, there were four failure cases in simulation and seven in the real test. Among the four common failures, three were due to the obstacle and/or goal being placed outside the trained workspace; the remaining one can be attributed to the algorithm's limitations. There were three more failures in the real experiments than in the simulations: two were due to collisions and one to the cube dropping outside the basket.
Possible reasons include measurement errors in the perception system and the different task settings between the simulation and the physical experiment. In the Change situation, five failures were found in both the simulation and the real experiments. Interestingly, there were two failures in simulation that did not occur in the real test, due to the complexity of the dynamic environment: the simulation was designed to replicate the real-robot experiments, but the timing and speed of the goal and/or obstacle changes were hard to replicate, which caused delays and collisions in simulation. In addition, there were seven failures in the physical experiments that did not happen in the simulations, most of them due to measurement errors in the perception system. In general, the causes of failure include locations outside the familiar workspace, measurement errors, algorithm performance and time delays in the simulation, with external factors (e.g., the environmental setting and sensor noise) playing the most important role.
For the second experiment, we summarize the results in Table 5.
The comparison between our proposed method and the next-best contender over 100 additional trials in 10 scenarios yielded a success rate of 0.90 for SAC-SILP+ and 0.75 for SAC-Demon (significant at p < 0.001) on the real robot. In the simulated version of the test, the difference between SAC-SILP+ and SAC-Demon is small (0.90 vs. 0.87) compared to the results in Table 1, which are based on 1000 trials. Failures, e.g., in 'scenario 5', are characterized by a discrepancy between training and testing conditions. Based on these extensive empirical analyses, we conclude that the results of SAC-SILP+ are good, and that the small Sim2Real performance differences should be solvable with better-matched experimental settings and improved perception systems.
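The paper does not state which significance test was used; as an illustration, a pooled two-proportion z-test on 90/100 vs. 75/100 successes can be computed with the standard library alone.

```python
from math import erf, sqrt

def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided pooled two-proportion z-test; returns (z, p)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# 100 trials per method: SAC-SILP+ succeeded 90 times, SAC-Demon 75 times
z, p = two_proportion_z_test(90, 100, 75, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```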
The edited videos of the two physical-robot experiments can be accessed via footnotes 1 and 2, respectively.

Conclusion
In this article, we have proposed SILP+ to relieve humans from collecting diverse demonstrations in goal-conditioned motion planning tasks. Guided by self-imitation learning that utilizes demonstrations planned from past experience, we train a neural motion planner that generalizes substantially better than planners learned directly with off-policy RL, behavior cloning or hindsight experience replay. The experimental results show that the way collision failures are dealt with significantly determines the performance. Both positive and negative guidance can boost performance, especially positive demonstrations. However, hesitating and/or cyclic moves around obstacles deteriorate the training process by confusing the agent, since pertinent information is lacking. Furthermore, we verified that the proposed Gaussian-process-based exploration near obstacles accelerates training by reducing unnecessary collisions, resulting in a shortened training time. Finally, we analyzed the extrapolation error in the actor-critic neural networks and found that it leads to unstable training and affects the learning efficiency and success rate; this problem was solved by using a dedicated reward filter, yielding improved results.
We have explored a new way of embedding planning in the learning framework, while not adding much extra computation burden on the training process.The principle behind it may inspire the motion planning and reinforcement learning communities to design robust and efficient NMPs.In addition, the analysis and discussions on the collision solutions and extrapolation error reduction method could enhance studies related to safe and stable RL.The beneficial effect of the SILP+ method was confirmed for two common RL frameworks in current use, i.e., SAC [53] and DDPG [52], with a clear advantage for the SAC-SILP+ combination.
Although we have tested SILP+ with a position controller on a UR5e robot arm, SILP+ can also be used with other controllers and other robotic platforms, provided an action model can be obtained to extract demonstrations in MDP format. Since the planning and learning modules in SILP+ are closely intertwined, there is a strong mutual influence: for example, a better exploration technique in RL will not only benefit RL but also generate better nodes for planning and further improve the quality of the behavior demonstrations. In future work, we will investigate how other planning techniques, such as MPC [60,61], could guide exploration to critical regions during RL training and generate more informative experiences for further demonstration planning in SILP+.

Fig. 3. Fig. 4.
Fig. 3. An example of modeling the reward function with a Gaussian process, in three views of the 3D joint space: the green points represent visited collision-free nodes and the black crossed points represent collision nodes. The color levels represent the modeled reward values, which increase from red (penalty) to blue (high reward). The axes are in radians. The modeled reward distribution can guide the RL policy to explore regions with higher rewards. Please note that the positions of the points are based on a 2D projection of the 3D joint space. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Success rates during training; the data is recorded every 20 epochs. The solid line represents the mean value and the transparent region represents the standard deviation range under three randomly chosen seeds.

Fig. 6. Success rates for the different ways of dealing with collisions during training; the data is recorded every 20 epochs. The solid curve is the mean value, and the transparent region is the standard deviation under three randomly chosen seeds. Type-0, type-1 and type-2 are the different methods for dealing with collisions; combined with SILP+, they become type-0-SILP+, type-1-SILP+ and type-2-SILP+.

Fig. 7. Number of actions (#N) needed from demonstrations to perform better than the RL policy during training, for two filtering methods (low is good). In SAC-SILP+ (dark blue curve), the criterion for comparison is based on the rewards. In SAC-SILP+ with Q filter (light magenta curve), action-state values from the critic are used as the criterion. The curves are the mean values over three training sessions with different seeds, and the semi-transparent band represents the variance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Then, we set the start node as the current node and append it to the candidate path nodes.
3. Candidate neighbor selection: To determine the candidate neighbors of the current node, we use the Euclidean distance as the metric and choose all nodes not yet on the candidate path whose distance to the current node is below a threshold, such as the three neighbor nodes in Fig. 1(3). The selection of this distance threshold is explained in Appendix A. We put these nodes into a neighbor set; if no neighbors exist, we stop the planning process and continue with RL learning.
4. Collision checking on edges: We check the edges between the current node and the candidate neighbors using our collision-checking module, which is explained in Appendix B, and remove candidate neighbors whose edges are in collision. As shown in Fig. 1(4), the edge to one of the neighbors intersects with the obstacle, so that neighbor is removed from the neighbor set. The edge collision check is based on subdivision: intermediate linear interpolations of configurations are sampled at a pre-defined step size, and the edge between two nodes is collision-free if all of the intermediate configurations are verified to be collision-free.
5. Local planning with A-star: We compute the cost of each candidate neighbor in the task space using heuristic functions.
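The subdivision-based edge check and the distance-based neighbor selection described in the planning steps above can be sketched as follows; the function names, the default step size and the exclusion of nodes already on the candidate path are illustrative assumptions.

```python
import numpy as np

def edge_collision_free(q_a, q_b, in_collision, step_size=0.05):
    """Subdivision check of the straight edge between configurations q_a and
    q_b: sample intermediate linear interpolations at `step_size` resolution
    and declare the edge free only if every sample passes the check."""
    q_a, q_b = np.asarray(q_a, float), np.asarray(q_b, float)
    n_steps = max(2, int(np.ceil(np.linalg.norm(q_b - q_a) / step_size)) + 1)
    for t in np.linspace(0.0, 1.0, n_steps):
        if in_collision((1.0 - t) * q_a + t * q_b):
            return False
    return True

def candidate_neighbors(node, nodes, path_nodes, radius):
    """Neighbors of `node`: all nodes not already on the candidate path whose
    Euclidean distance to `node` is positive and below `radius`."""
    node = np.asarray(node, float)
    on_path = {tuple(p) for p in path_nodes}
    return [q for q in nodes
            if tuple(q) not in on_path
            and 0.0 < np.linalg.norm(np.asarray(q, float) - node) <= radius]
```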

Table 1
Success rate, training time, and planning time of SILP+ compared with other methods. Results are collected in simulation and averaged: each method was randomly initialized three times, trained, and then tested on 1000 reaching trials.

Table 2
Success rates and training time under different collision methods (simulation).

Table 4
Success rate and training time for two filter types (simulation).

Table 5
Sim2Real success rate for SAC-SILP+ and SAC-Demon in a reaching task.