Collision-free path planning for welding manipulator via hybrid algorithm of deep reinforcement learning and inverse kinematics

In actual welding scenarios, an effective path planner is needed to find a collision-free path in the configuration space for the welding manipulator with obstacles around. However, as a state-of-the-art method, the sampling-based planner only satisfies the probability completeness and its computational complexity is sensitive with state dimension. In this paper, we propose a path planner for welding manipulators based on deep reinforcement learning for solving path planning problems in high-dimensional continuous state and action spaces. Compared with the sampling-based method, it is more robust and is less sensitive with state dimension. In detail, to improve the learning efficiency, we introduce the inverse kinematics module to provide prior knowledge while a gain module is also designed to avoid the local optimal policy, we integrate them into the training algorithm. To evaluate our proposed planning algorithm in multiple dimensions, we conducted multiple sets of path planning experiments for welding manipulators. The results show that our method not only improves the convergence performance but also is superior in terms of optimality and robustness of planning compared with most other planning algorithms.


Introduction
Welding tasks exist in various industrial manufacturing processes. Especially in the shipbuilding process, welding workload accounts for 40% of the total workload. Shipbuilding, known as a labor-intensive industry, requires a considerable number of skilled technicians to weld in enclosed and hazardous surroundings. Thanks to the development of industrial robotics, automated welding is gradually replacing manual welding by means of equipping welding manipulators on the shipbuilding production line [1,2]. Nonetheless, path planning for a specific welding task such as spot welding or arc welding, which is often completed by skilled welding technicians with an offline programming teach pendant, is B Tao Wang wangtao_cps@gdut.edu.cn Jie Zhong zhongjie@mail2.gdut.edu.cn Lianglun Cheng llcheng@gdut.edu.cn 1 School of Automation, Guangdong University of Technology, Guangzhou, China still challenging automated welding. When dealing with a complex welding task operated on different workpieces with various shapes, it is rather tedious teaching on-line or programming off-line due to the existent of multifarious pose switching of the welding torch which demands many path planning queries. In view of the demand for flexible and high-efficiency welding, a reliable, safe, and automated path planning technology serving for pose switching of the welding torch is urgently needed.
In the field of welding, the term path planning may be formulated as a constrained Traveling Salesman Problem (TSP) [3] while in this paper we study a new method to find an elegant collision-free path in configuration space of the welding manipulator which connects the initial configuration and the target configuration [4].
Most existing path planning methods for robots focus on their performance in one domain they pursue. The samplingbased method is one of the most popular path planning methods owing to its probabilistic completeness. The samplingbased planning algorithms mainly include the multi-query algorithm Probabilistic Roadmap Method(PRM) [5,6] and the single-query method Rapidly exploring Random Trees (RRT) [7,8]. The algorithms mentioned above can almost always find a feasible path (i.e., a sequence of valid states) when sampling in low-dimension state space like a 2-D or 3-D map for wheeled mobile robots [9,10] and Unmanned Aerial Vehicle [11]. However, when searching for a feasible path in high-dimension state space like the configuration space of a six-DOF welding manipulator, the computational complexity of the sampling-based method shows an exponential growth trend due to the dimensional explosion of sampling space. To better apply the sampling-based algorithms to the high-dimension path planning(e.g. manipulator planning), many improved methods have been proposed. In [12], a new variation path planning method based on PRM and lazy-PRM for industrial manipulators was proposed to reduce the time spent on the roadmap construction. Cao et. al. adopted the idea of target gravity to accelerate the path search speed in [13] and in [14] an improved sampling-based method using learned critical sources and local sampling is developed to go through narrow passages more quickly. Despite the fact that those methods increase the planning efficiency and inherit the superiority of traditional RRT algorithm, the sampling-based planning method is actually a probabilistic search-based offline method with uncertainty and the planning time may tend to infinity in a complex environment especially a narrow passage.
One of the alternative methods for high-dimension manipulator path planning is based on Artificial Potential Field (APF) which is an online method proposed in [15]. Compared with the sampling-based search algorithm mentioned before, the APF algorithm plans the same path each time for the same task which is meaningful in industrial scenario [16] while both of the two algorithms' performance suffers from finding a path in high-dimension state space. Furthermore, it should be noted that the APF algorithm, as a local planner, may converge to a local optimal solution and lacks completeness. In view of the aforementioned characteristics, scholars have proposed many improved APF algorithms for path planning in high-dimension state space such as [17,18]. Another path planning method treats the path planning problem as a constrained optimal control problem and solves it by a nonlinear model predictive controller (MPC) [19], but this method is model dependent and requires computing power and memory resources. What is even more unacceptable is that it is a local planning algorithm as APF does.
With the development of machine learning and deep learning, the learning-based method becomes an elegant alternative method for robot path planning [20]. For example, the Motion Planning Networks(MPNet) based on deep learning [21] aims to train a model that is able to generate an end-to-end collision-free path in the obstacle space for the given start and goal configurations of a robot, yet the representation of the whole workspace with obstacles is needed and its completeness depends on the replanning method(e.g. RRT). Nevertheless, the existing learning-based planning methods often rely on traditional search algorithms (such as RRT) or optimal control algorithms, so that they inevitably inherit the disadvantages of basic algorithms. In addition, a large amount of sample data is needed, and sampling efficiency needs to be improved urgently. There is still a need to balance exploration and utilization to obtain better performance.
In recent years, owing to the amazing performance of various robot controllers or planners based on Deep Reinforcement Learning (Deep-RL) as described in [22][23][24][25][26], many researchers turn to Deep-RL which shows excellent prospects for solving path planning problems in high-dimensional continuous state and action spaces. At the same time, deep reinforcement learning itself has also received more attention from researchers [27][28][29]. Xue et al. [30] proposed an obstacle avoidance path planning method based on a Deep-RL algorithm applied in discretized actions, they introduced sub-goals to guide the mobile robot to the final goal. In [31] a continuous action space Deep-RL algorithm was used for navigation of a mobile robot in simulated environments and the authors believe that the continuous Deep-RL algorithm is an effective collision-free decision-maker for a mobile vehicle.
Compared with the mobile robots, a welding manipulator with high-dimension continuous action space seems to be not that easy to handle by Deep-RL algorithms directly due to the tremendous exploration space and the sampling inefficiency. The exploration-exploitation trade-off is at the heart of reinforcement learning which restricts the convergence of the algorithm thus the balance of exploration and exploitation must be considered when doing path planning using Deep-RL. In [32] a dynamic obstacle avoidance approach for robot manipulator base on a continuous Deep-RL algorithm [33] was proposed and a well-designed reward function was presented to avoid the sparse reward and increase useful exploitation. In Sangiovanni et al. [34], presented a hybrid control method that combines the Deep-RL policy and the PRM policy (i.e. a sampling-based method) to achieve collision-free path planning for anthropomorphic robot manipulators. Hua et al. [35] proposed a path planner for a redundant robot in a narrow duct based on the Deep Deterministic Policy Gradient (DDPG) algorithm [36], this method reduces the exploration space by decoupling the selfmotion from the motion for obstacle avoidance; however, the manipulator must be redundant. It can be seen from the related work above that most scholars try to improve the convergence performance of reinforcement learning by adopting models which is inherent in the environment. This is indeed an area worth continuing to study. Aside from reinforcement learning itself, it is possible to mine information from the environment so that the agent could have a certain innate wisdom.
In this paper, we propose an informed learning-based path planner for a 6-DOF welding manipulator in the semi-closed and narrow environment which is common in real welding scenarios, see in Fig. 1. The model-free Deep-RL algorithm DDPG is adopted to train an agent that is able to navigate the welding manipulator in configuration space so that the welding torch can reach the target position in the workspace finally without any collision on the path. For the purpose of speeding up the convergence, we provide the agent the informed knowledge by introducing the inverse kinematics of the welding manipulator. To avoid the agent's excessive exploitation due to the prior knowledge and encourage exploration for obstacle avoidance, we introduce a hybrid method of Deep-RL and inverse kinematics of robot manipulator which will be discussed later in detail. The main contributes of this paper are as follows: -A well-designed interactive environment with high fidelity in simulation scenario is developed. Our modeling method provides a new physical modeling solution based on CoppeliaSim as a substitute for gym or mujoco for Deep-RL algorithms. -We utilize the inverse kinematics based action as the prior knowledge to reduce the unnecessary exploration in the state-action space thus the learning speed is accelerated. -To avoid excessive exploitation caused by the presence of prior knowledge based on inverse kinematics, the effect of the inverse kinematics based action is adjusted dynamically by introducing the gain module. -We design a method for accelerating deep reinforcement learning training process by combining a inverse kinematics module, which provides researchers with a new idea that is different from improving the deep reinforcement learning algorithm itself.
The rest of this paper is structured as follows. The theoretical background is introduced in the next section followed by our modified method in the subsequent section. Then we demonstrate the relevant elegant performance of the proposed method with several experiment results. At last, we provides the conclusion.

Background
In this section, we give an overview of the background theories for the proposed Deep-RL-based collision-free path planner, including the kinematics modeling of a welding manipulator used in this research, the Sequential Decision-Making model, and the model-free Deep-RL algorithm: DDPG.

Kinematics modeling
The kinematics of a manipulator mainly studies the displacement relationship, velocity relationship, and acceleration relationship between different links or joints. In this research, we should model the kinematics of the used six-DOF welding manipulator as shown in Fig. 1. Denavit-Hartenberg (D-H) [37] representation has been widely used in kinematic modeling and analysis of series manipulator and has become the standard method for modeling kinematics. Therefore, the detailed standard D-H parameters are shown in Table  1. Using the D-H parameters, the Homogeneous Transformation Matrix (HTM) of the last link relative to the base link can be obtained as where the q = [q 1 , q 2 , . . . , q 6 ] ∈ R 6 is the joint angle variable drawn from the Configuration Space (C-space), represents the HTM of ith link relative to the one of the previous links denoted as follows: The last link is usually called an end flange, and an endeffector with a reference frame is fixed on it, e.g., a welding torch. In practice, the task is executed by the end-effector in Cartesian space, and the end-effector's HTM refer to the base link can be computed by 0 e T (q) = 0 6 T (q) 6 e T , for simplicity, both sides of the equation can be written as where x e = [ p e φ e ] T ∈ R 6 is a pose, with p e being the three-dimensional position and φ e the orientation (e.g. Euler angles) in the Cartesian space.
Most of the motion specified by the task is defined in Cartesian space, thus it is inevitable to map the end-effector's Cartesian motion to the motion in joints in C-space. In robotics, given the end-effector's Cartesian velocityẋ e relative to itself, the corresponding velocity of joints' anglesq is as follows: where J (q) is a 6 × 6 Jacobian matrix for a 6-DOF manipulator, and it can be a known matrix once the joint position q Fig. 1 The considered 6-DOF welding manipulator in the semi-closed and narrow environment simulated in a robot simulation platform CoppeliaSim is given [38]. It is worth noting that the pseudo-inverse of the matrix J (q) † is computed due to the existence of singularity. The inverse kinematics (IK) problem of manipulator is to find a correct and proper joint configuration q drawn from the C-space which satisfies the equation x * e = fkine(q), the term x * e represents the target pose of end-effector.

Algorithm 1 Pseudo-inverse-based method for IK problem
Input: x * e : target pose; q init : initial guess; dt: sampling period; Output: solution q * 1: initial q ← q init ; 2: repeat 3: compute current pose x e ← fkine(q); 4: compute pose differential Δx e ← x e x * e ; 5: compute the joint velocityq inv ← J (q) † Δx e ; 6: update the joint position q ← q +q inv dt; 7: until ( Δx e < ) 8: return q * ← q In general, the traditional solutions to the inverse kinematics problem include closed-form solutions and numerical solutions. The closed-form solutions consist of algebraic and geometric methods, they solve the problem faster than numerical solutions yet they depend on the model of manipulator and lack generalization ability. On the other hand, the numerical solutions do not rely on the model of manipulator which mainly refer to iterative methods.
Among all of the iterative methods, the algorithm based on the Pseudo Inverse of the Jacobian matrix [39] and the one based on Damped Least-Squares(DLS) [40] are most popular, both of them can converge to an optimal or approximately optimal solution q * based on an initial guess q init . Unfortunately, sometimes these methods may not converge to a definite solution due to the presence of singularity or the initial guess is not proper enough. From the perspective of optimization theory, the Pseudo Inverse method is faster than the DLS method and can be implemented readily in practice.
In this research, the pseudo-inverse-based method (see in Algorithm 1) is considered to find a iteration direction in the C-space which minimizes the error between x * e and current pose x e . In Line 4, we use the difference of pose Δx e instead of the velocityẋ e in Eq. 4, and thenq inv is obtained in Line 5. The Δx e is computed as follows: where the R e ∈ R 3×3 is the rotation matrix form of the orientation vector φ e . The function vex(·) can be seen in [41] which aims to compute the increment of R * e relative to R e .

The Markov decision process
In a wide sense of the word, Reinforcement Learning (RL) solves the sequential decision problems by finding the decision sequence which maximizes the expected cumulative return. A Markov Decision Process (MDP) is a typical model Fig. 2 The interaction between the agent and the environment [42] of sequential decision problem which is described by a tuple (S, A, π, P, r , γ, H ). In this tuple, S is a complete set of all existing states, A is a set of all actions, both of them can be discrete or continuous. The policy π can be divided into a stochastic one a ∼ π(·|s) and a deterministic one a = μ(s), where a ∈ A, s ∈ S. The state transition probability function P : S ×A×S → [0, 1] denotes the dynamics of the environment: the probability of reaching s t+1 after taking action a t at the environment state s t . In some cases, the state transition function can be deterministic.
The reward function r = R(s t , a t ), which is a scalar obtained from the environment when taking action a t at observing a state s t , implies the goal of a RL problem. Until now, we can use the MDP and π to generate sequence as follows: where H ∈ N + is a time or step horizon of an episode. It is also called a trajectory which is generated by the interaction between the agent and the environment, see in Fig. 2. The agent mainly plays the role of a policy π while the environment define the state transition model and output the reward signal.
The discounted return at time step t is defined as follows: where the term γ ∈ [0, 1] denotes the discount factor. To evaluate the performance of the policy π , we can compute the expectation of the total reward in one episode: where the policy and the environment models are stochastic. In general, the RL algorithms aim to find the optimal or asymptotically optimal policy π * θ (·|s) with parameters θ . In recent years, scholars use deep neural networks (DNN) to Algorithm 2 Deep Deterministic policy gradient 1: Initialize: The predicted actor network μ(s) and critic network Q(s, a) with weights θ μ and θ Q 2: Initialize: The target actor networkμ(s) and critic net-workQ(s, a) with weightsθ μ ← θ μ andθ Q ← θ Q 3: Initialize: The replay buffer R 4: for episode=1,N do 5: Initialize a random process N for exploration 6: Reset the environment and observe the s 0 7: for t=0,T-1 do 8: Select an action a t = μ(s t ) + N 9: Execute a t at the environment 10: Get the reward r t and next state s t+1 11: Store the transition (s t , a t , r t , s t+1 ) in R 12: Sample a minibatch of M transitions in R: Set Update network Q(s, a) by minimize the loss: Update network μ(s) with the sampled policy gradient:

16:
Soft update target networks end for 18: end for represent the parameters of policy and therefore the iteration of policy is carried out by training the policy network weights. This method is called Deep Reinforcement Learning (Deep-RL) which interests more and more researchers all over the world from different domains [43] such as robotics.

Deep deterministic policy gradient
In this subsection, we focus on one of the state-of-the-art Deep-RL algorithm: deep deterministic policy gradient algorithm (DDPG) which was proposed in [36] and it is shown in Algorithm 2. This algorithm is suitable for training an agent or policy to control a robot such as a 6-DOF welding manipulator to accomplish a specify task with position or torque, because its policy can output an action that is high dimensional and continuous.
The motivation of DDPG is actually to allow Deep Qlearning Network [44] to expand to continuous action domain with a deterministic actor. Like the algorithm DQN, a pioneering work that combines reinforcement learning and the deep learning, DDPG trains an action-value function network Q π (s, a) to approximate the Bellman function: where the deterministic policy π is a = μ(s) which maps the current state to a specific action. The action-value function network is trained by minimize the loss function: where the label y t is given by in which the action-value network Q s, a|θ Q and the policy network μ s|θ μ represents the target network, respectively. The purpose of introducing target network is to make the training process more stable.
On the other hand, the DDPG algorithm utilize the actorcritic framework-based DPG algorithm [45] to maintain the policy μ(s), and in this framework, the action-value network is called critic network while the policy network is called actor network. The deterministic policy gradient is proved to be as follows: which represents the performance of the actor network, we can update the actor network using this equation.
An experience replay buffer which is designed for offpolicy updating denoted as R is used to store the transition tuple (s t , a t , r t , s t+1 ) see in Algorithm 2 Line 11. The predicted network θ μ and θ Q are updated using Eqs. 11 and 9, respectively, with the sampled minibatch experience replay transitions. To train the target network more stably, a soft update technic as Algorithm 2 Line 16 shows is employed.
Due to the determinacy, the actor may coverage to a local optimum, and this is harmful to the agent to explore for higher reward in the environment. Thus, in Algorithm 2 Line 8, a stochastic process noise, Ornstein-Uhlenbeck, is added to the actor's action to enhance exploration and protect the robot in the environment at the same time.

Methodology
In this section, we introduce in detail the hybrid collision-free path planning method of Deep-RL and inverse kinematics for the 6-DOF welding manipulator.

The interaction environment
To solve the path planning problem with the Deep-RL framework, we must convert the primal problem to a sequential decision problem, namely the Markov decision process (MDP). To this end, we need to define the interface between the agent and the environment in Fig. 2.

State space
In the real world, the state set may be extremely massive or even infinite, so enumerating all the states is unrealistic. In practice, we usually select some important characteristics as the state information. It should be noted that the state information of Deep-RL represents the environmental information perceived by the agent and implies the influence brought about by the actions of that agent. State information is the basis for the agent to make decisions and evaluate its long-term benefits, namely accumulative reward, and the quality of the state design directly determines the convergence, convergence speed, and final performance of the Deep-RL algorithm.
Task analysis is the soul of state design. In this research, the goal of the collision-free path planning task is to train an agent that drives the joints of the welding manipulator without any collision to minimize the distance between the end-effector of the welding manipulator and its target position. In the light of the given task, the state observed from the environment at time step t is designed as a stacked vector: where q ∈ R 6 is the joint position of the welding manipulator, and p e ∈ R 3 is the end-effector reference frame's Cartesian position relative to the target frame. The scalar d tar denotes the Euclidean distance between the end-effector reference frame's origin and the origin of the target frame. The last element collision is a Boolean that indicates whether the collision occurs between the links of the welding manipulator and obstacles in the environment, it can be obtained by the CoppeliaSim API function. It is obvious that the dimension of the state vector is where dim(·) means the dimension of a vector.

Action space
When applying Deep-RL algorithm to an actual problem, perhaps the most intuitive part is the definition of the action space A. In this research, the actor in the environment is a 6-DOF welding manipulator; thus, the control method of the agent for the manipulator can be velocity control mode intuitively as follows: whereq ∈ R 6 is the target angular velocity within the limitation of each rotation joint driven by a servo motor. The dimension of the action space is dim(a t ) = 6 (15)

Reward signal
Contrary to the definition of action space mentioned aforehand, the reward signal, which implies the goal of the specified problem, determines whether the agent can finally learn the desired skills, and directly affects the convergence speed and final performance of the algorithm as the state space does. After considering our problem, we define the reward signal consists three parts. The first part is designed for the end-effector approaching a specific target position. We utilize a nonlinear piecewise function to compute this part of the reward: where d tar is the Euclidean distance between the end-effector and its target position, the parameter δ is the turning point of the function. The popular explanation for r 1 is the longer the distance d tar , the smaller the reward. The second part is designed for the purpose of avoiding collision, so in this part, we give a large penalty according to the current collision condition.
where C o ∈ R + is a constant by which we can regulate the penalty. The collision state collision is a Boolean as mentioned in Eq. 12. The third part is relative to the norm of the current actioṅ q as shown in Eq. 18, that means we expect the agent to take smaller actions for safety.
Finally, we can construct the total reward signal as follows: where λ i ∈ R + denotes the constant scaling factor which is designed to make the contribution of each reward component to the final reward at a close level, so that each target reward can be taken into account by the agent to the same extent.

The training process
Although the Deep-RL algorithm DDPG as shown in Algorithm 2 has achieved elegant results on many robotic operation tasks in Mujoco environment such as reacherObstacle in which the agent is required to move a 5-DOF manipulator to the target position without collision, the complexity of learning mapping relationship between state space S and action space A in the path planning problem is so high that it requires amazing training time steps and even millions of steps are commonplace. To be honest, training a policy model for a path planning problem from scratch using the DDPG algorithm is irrational. Thus, in this paper, we aim to speed up the DDPG training process for the collision-free path planning problem via a hybrid method of DDPG and inverse kinematics of the welding manipulator as shown in Fig. 3.
The path planning problem for a welding manipulator actually contains two sub-problems, one is moving the endeffector to its target position while the other is avoiding collision during the moving process. If the collision avoidance is not in consideration, moving the end-effector to its target position can be solved by a sequential dynamic model shown in Eq. 20.
where dt is the sampling period,q inv is obtained as Line 5, Algorithm 1 shows. Due to the presence of obstacles, we need to correct thė q inv generated by the inverse kinematics module to avoid collision. To this end, we might as well use the action a t selected by the policy network of DDPG to share the task of avoiding obstacles as Eq. 21 shows.
In this way, there is no need for the DDPG agent to learn from scratch since the inverse kinematics module will constraint the search space between the start and goal region as well as guide the welding manipulator move to the target position so that the agent's responsibility is just to correct the output of inverse kinematics module for collision avoidance. In other words, the invalid exploration which discourage the welding torch moving to the target position is lessened, thus the search space is reduced significantly which is beneficial to speed up the training process of the agent.
However, we find that when training with dynamic model Eq. 21, the excessive exploitation of the inverse kinematics module leads the agent to overlook the objective of collision avoidance and the policy converge to local a optimum. We blame this failure on the changeless usage ofq inv generated by the inverse kinematics module. It is the fixedq inv that Algorithm 3 Hybrid of DDPG and inverse kinematics 1: Initialize: actor networks and critic networks 2: Initialize: replay buffer R 3: for episode=1,N do 4: Initialize a random process N for exploration 5: Reset the environment and observe the s 0 6: for t=0,T-1 do 7: Select an action a t = μ(s t ) + N 8: Get G(t)q inv as correction of a t 9: update the environment using Equation 23 10: Get the reward r t and next state s t+1 11: Store the transition (s t , a t , r t , s t+1 ) in R 12: Sample a minibatch of M transitions in R 13: Update actor network and critic network 14: Soft update target networks 15: end for 16: end for restricts the exploration of the agent for collision avoidance. Thus we add a time-varying gain module to encourage exploration in the early time of one episode with a relatively large growth rate and increase the exploitation in the later time by applying a lower growth rate when the manipulator ought to be near the goal position. The gain module is an increasing nonlinear function ln(x) of time step t as shown in Eq. 22.
where t = 0, . . . , T −1 and the hybrid dynamic model which balances the exploitation and exploration is as follows: The integral training process of our collision-free path planning method is described in Algorithm 3.

Experiments and analysis
In this section, we evaluate the performance of our proposed hybrid method through several path planning experimental results in the simulation system. We also prove the progress of our method through comparative analysis.

Experimental system description
The simulation experiment system is roughly divided into three parts. The first part is the interaction environment which mainly includes a welding manipulator with obstacle around it as shown in Fig. 1 and we model them in a well-known open-source robot simulator CoppeliaSim with high fidelity. The simulator CoppeliaSim provides remote interface functions for collision detection between specific robot entity and obstacles, thus we can determine the Boolean collision in Eq. 17 which indicates whether any collision occurs when the welding manipulator is at a certain joint angle q . In general, this part, as a server application, provides an interface for the exchange of state information and action input as shown in Fig. 2. When receiving the action information, namely the joints' angular velocityq ∈ R 6 predicted by the agent, the environment server application will update the current state s t of the welding manipulator as described in Eq. 12 to its transition state s t+1 , and the joints' angle is updated with Eq. 23, where dt is set to 0.05 in second. To make the simulation closer to the real situation, we must constraint the action a t oṙ q in a bounded range:q min ≤q ≤q max . In the meanwhile the joint angle q should meet the joint limit requirements which are very crucial to the safety of the welding manipulator, otherwise, the links of the welding manipulator may collide with each other. After the update of the state, the acquisition of the instant reward signal instead of a sparse one is pivotal. The constant parameters which formulate the reward signal are instantiated as follows: the δ in Eq. 16 is set to 0.05, the C o in Eq. 17 is set to 200, the λ i in Eq. 19 set to λ 1 = 2000, λ 2 = 1, λ 3 = 1, respectively. It needs to be pointed out that these parameters are carefully picked up through a large number of experiments. Different parameters will directly affect the convergence of the algorithm.
The second part is the DDPG agent represented by four fully connected deep neural networks constructed by a famous deep learning framework Pytorch and their respective update method for internal neural network weights. All of those networks are trained by an Adam optimizer using a certain number (Mini-batch size M) of transitions data sampled Fig. 4 The relation chart of the simulation system from the replay buffer R at each time step. All of the aforementioned four networks including two actor networks and two critic networks have two hidden layers respectively. The actor networks approximate the deterministic policy function a t = μ(s t ); thus, the dimension of the input layer equals to the state dimension dim(s t ) while the output layer equals to dim(a t ). On the other hand, the critic networks approximate the action-value function Q(s t , a t ) thus the dimension of the input layer equals to the sum of the action dimension dim(a t ) and the state dimension dim(s t ) while the dimension of the output layer equals to one because the action value is a real scalar. The choice of key parameters for the DDPG agent is shown in Table 2.
The third part is the inverse kinematics module. We perform the inverse kinematics calculation according to the current position of joints q to obtainq inv at each time step. This part will be executed in Matlab. More specifically, we set up a pure kinematic model of the welding manipulator synchronize with the one in simulator CoppeliaSim, and the instantaneous Jacobian matrix as shown in Eq. 4 is computed at each time step which is used to obtainq inv .
These three parts communicate through remote Application Programming Interface (API) or engine as shown in Fig.  4. The whole training phase is implemented on a PC (CPU 3.6 GHz, RAM 16 GB).

Experimental results and analysis
In various algorithms of path planning, the proposed learningbased path planning algorithm can be classified as a global and single query path planning method. To validate the effectiveness of the proposed method, we test the learning-based planner with three different path planning tasks. Each task was specified with an initial position q s ∈ R 6 drawn from

Performance evaluation
We evaluate the performance of our method on the three single-query test tasks from different domains. First of all, it is known that the objective of the Deep-RL algorithm is to maximize the cumulative reward in one episode with finite time steps and, therefore, it is necessary to analyze the trend of the reward curve or the learning curve vs training episode number which reflects whether the target deterministic policy model has converged as well as the learning efficiency. We show the learning curves of the three test tasks, respectively, in Fig. 5. For each specified path planning task, we train the policy model using different gain module noted as G(t). When G(t) = 0, it means that the inverse kinematics module is removed and the agent should learn from scratch which is abandoned in our research. To exploit the inverse   kinematics module during the training phase, we need a nonzero G(t); thus, we adopt G(t) = ln(t + 1) as shown in Eq. 22. To reveal the progress of the proposed time-varying gain module, we also adopt a fixed gain module G(t) = 2 as contradistinction. After comparison, we find that different selection of gain module G(t) leads to significantly different learning curve when dealing with the same task. Our method which adopts a time-varying gain module G(t) = ln(t + 1) achieved the best performance in terms of the convergence as the red curves in Fig. 5 show. The red curves in Fig. 5a-c finally converge without exception to a near-zero level from which we can infer that the learned agents have accomplished the specified tasks in the light of our reward design mechanism. On the other hand, when G(t) = 0 or G(t) = 2, the learning curves fail to converge like the red curves do in the same range of training episodes; therefore, it can be seen that our method has significantly improved convergence performance and convergence speed which is a credit to the introduction of the proposed inverse kinematics module and gain module.
Second, we need to confirm whether the final deterministic policy model mentioned above can be used to generate a collision-free path that is a solution to the respective specified task. The collision-free path in configuration space of the welding manipulator can be obtained indirectly by way of sampling a trajectory s 0 , a 0 , s 1 , . . . , a T , s T using the final policy model and then construct the solution q i , i = 0, 1, . . . , T from {s i } where s 0 is equivalent to the state when the welding manipulator is at the configuration of q s . Consistent with the previous assumption, the policy model trained by our method described in Algorithm 3 which employs both the gain module G(t) = ln(t + 1) and inverse kinematic module can generate a unique collision-free path in the configuration space which fulfills respective task requirements. We render the path intuitively and succinctly in the simulator CoppeliaSim as Fig. 6 shows.
In addition to collision avoidance, the path should ensure that the end-effector of the welding manipulator is at its target position p * e finally. Although all of the three models trained by different G(t) can generate a collision-free path for respective task, only the model trained by G(t) = ln(t + 1) is able to move the end-effector to its target position. To intuitively reflect the performance of the models in the control of end-effector's position, we plot the distance between the end-effector and its target Cartesian position vs the path steps as Fig. 7 shows. It is clear that all of the red curves that represent our method eventually drop to zero or near-zero while other curves fail to drop as the red ones do, the second half of these failed curves fluctuates around a number significantly greater than zero, which shows that those learned policy fails to convergence and is trapped in local optima. On the other hand, as shown in Fig. 8, obviously, only our proposed training method can move the end-effector to its target position. However, the policy learned by other training methods (e.g. G(t) = 2 or G(t) = 0) controls the manipulator arm to gradually move toward the non-target direction and the movement step is gradually reduced, which can also indicate that these paths fall into the local optimum. The above description indicates that only the policy trained by our method can successfully complete all given tasks including collision avoidance and moving the end-effector to its target position.

Comparison with other methods
In this paper, we introduce an alternative collision-free path planning algorithm for industry manipulators. As we all know, for the path planning problem in high-dimensional sampling space, e.g. the configuration space of industry manipulators, the sampling-based method is one of the most efficient algorithms. To demonstrate the progress of our algorithm, the three experimental path planning tasks which have been used to test our algorithm are now used to query several staple sampling-based single-query path planning algorithm, i.e. RRT-Connect [46], RRT [7], RRTStar [47], and BiTRRT [48]. We utilize the open-source motion planning library "The Open Motion Planning Library"(OMPL) [49] to implement the above planning algorithm on the three test tasks.
We select three performance indicators to evaluate the performance of different algorithms from different dimensions. The first performance indicator is the success rate for a single-query path planning task in a specified number of times and the other two indicators aim to measure path length in configuration space and workspace, namely Cartesian space. Generally speaking, the success rate of path planning means the robustness of the algorithm while the length of the planned path represents the optimality of the path. In Fig. 9, we reveal our algorithm's performance through comparison with other algorithms.
First, since the path generated by the converged model is deterministic and successful each time, the success rate of our path planner based on the policy model is 100%. However, in line with expectations, the success rate of samplingbased planning algorithms is relatively low, and the highest success rate is only just about 70% which is got by the RRT-Connect algorithm. This is obviously not allowed in industrial scenarios, so our algorithm has unique advantages over those sampling-based planning algorithms. Second, as we all know, from the perspective of path length, only optimal RRT algorithms such as RRTStar [47] can find the optimal or asymptotical optimal path, but they may require a huge number of samples to achieve such optimality. our algorithm achieves the same performance as BiTRRT does which is better than any other algorithms, In other words, the path generated by a well-trained policy using our method can be comparable to an optimal sampling-based path planning algorithm.

Conclusion and future work
In this research, we propose a hybrid algorithm of DDPG and Inverse kinematics as an alternative global and single-query path planning algorithm for industry welding manipulator. Simulation results in high-fidelity virtual scenario show that our improvements to the original algorithm DDPG have greatly increased its convergence performance and convergence speed and the well-trained policy model is able to generate a collision-free path for a specified task. In industry robot systems like automatic welding robot systems that require extremely high stability and safety, our algorithm may be competitive comparing with other sampling-based algorithms because the sampling-based algorithm can not guarantee 100% success rate of planning. Furthermore, our algorithm can also produce path planning results comparable to those of optimality or progressive optimality algorithms in terms of path length. In summary, our algorithm provides an alternative solution for teaching-free automatic path planning of actual welding robots or other industrial robots.
Despite the progress of the proposed algorithm, we still have lots of future work to do. We always remember that the model-free deep reinforcement learning algorithm such as DDPG in this paper is born with the characteristics of low sample efficiency. Although we promote the sample efficiency greatly in this research, we will still go on to develop a more elegant method to accelerate the convergence of the Deep-RL algorithm for path planning. On the other hand, we also wish our algorithm can be applied to real-world robots to complete sim-to-real tasks.