Neurodynamics Adaptive Reward and Action for Hand-to-Eye Calibration With Deep Reinforcement Learning

Calibration performed by a robotic manipulator is crucial in the field of industrial intelligent production, as it ensures precise and accurate measurements. In this paper, we present a new method for addressing the hand-to-eye calibration problem using deep reinforcement learning. Our proposed algorithm utilizes an actor-critic framework and incorporates neurodynamics adaptive reward and action functions, which allow for better convergence, reduce the dependence on the initial value, and overcome the local convergence issues of traditional deep reinforcement learning methods. Additionally, we introduce a step-wise mechanism under the guidance of the attention mechanism, and zero stability, to handle the complexity of the calibration task in challenging environments. A number of experiments were conducted to demonstrate the validity of the proposed algorithm. The experimental results show that our proposed algorithm can achieve a nearly 100% success rate after the training phase. Additionally, we compared our proposed algorithm with other widely used methods, such as deep deterministic policy gradient (DDPG) and soft actor-critic (SAC), to further demonstrate its effectiveness.


I. INTRODUCTION
Robotic manipulators are the most widely used automated mechanical devices in the field of robotics technology, finding applications in medical treatment [1], industrial manufacturing [2], the military [3], semiconductor manufacturing [4], and space exploration [5]. They are particularly useful for performing dangerous tasks such as carrying heavy objects and handling hazardous materials. Moreover, robotic manipulators guarantee speed, efficiency, and lower production costs in industrial settings.
Although manipulators have been widely promoted and applied in today's industrial field, the continuous advancement of intelligent manufacturing keeps raising the requirements placed on them. For instance, industrial robots from manufacturers
like ABB in Switzerland [6] and Kuka in Germany [7] can achieve precise motion control. However, most of these robotic manipulator applications are based on fixed assignments and motion trajectories and lack adaptability to dynamic changes in the target, which is gradually making such manipulators obsolete. Therefore, to change this situation, equipping robotic manipulators with self-learning ability has become one of the ways to realize intelligent robotic manipulators.
In this paper, we propose a deep reinforcement learning (DRL) method for solving calibration tasks [8]. Generally, DRL is not used for ordinary linear or nonlinear optimization problems because of computational efficiency and effectiveness concerns. However, one advantage of deep reinforcement learning is that its training proceeds from scratch. Unlike other machine learning paradigms, such as supervised and unsupervised learning [9], reinforcement learning obtains its data from interaction with the environment, so its ultimate goal is not to find patterns in a fixed dataset but to enable the agent to maximize the reward it receives from the environment. Moreover, this way of learning is closer to how learning happens in real life and is, in this sense, more intelligent than other machine learning methods. Therefore, to realize automatic and intelligent calibration by the robotic manipulator, we apply the deep reinforcement learning (DRL) method to robotic manipulator control.
Through experiments, we discovered that a static reward function and action function setting is not optimal for the complex calibration environment, as the agent cannot accurately identify the target and achieve the expected calibration result. To alleviate this problem, we propose a neurodynamics adaptive reward function and action function setup as a core component of our deep reinforcement learning algorithm. It is inspired by brain-like mechanisms and zero stability [10], [11], since we want the robotic manipulator to behave like a human and act based on its own experience. On this basis, inspired by the information-processing bottleneck studied in cognitive science, often referred to as the attention mechanism, we further design a step-wise reward and action function based on the neurodynamics adaptive mechanism. Multiple sets of experimental results on the V-rep platform show that the dynamic adaptation method can effectively solve the calibration problem in high-dimensional target domains. Additionally, we conducted a comparative analysis with other well-known and widely used deep reinforcement learning algorithms, namely DDPG and SAC, to further demonstrate the effectiveness of our proposed algorithm. Consequently, the success of our algorithm design can be viewed as a prerequisite for advancing the application of deep reinforcement learning in practical scenarios involving the UR-10 robotic manipulator.
The contributions of our study can be summarized as follows:
• We propose a neurodynamics adaptive reward function and action function to replace the static reward function and action function setting.
• To achieve zero stability, we further design a step-wise reward and action function based on the neurodynamics adaptive reward and action functions.
The rest of this article is organized as follows. Section II presents the background of deep reinforcement learning and the classical hand-to-eye calibration method. Section III formulates the hand-to-eye calibration task and analyzes the performance of static deep reinforcement learning. Section IV describes the proposed algorithm architecture for calibration. Section V describes the experimental setting and results. Section VI compares two different reward and action function settings and provides a comparative analysis with the DDPG and SAC algorithms. Finally, the main findings and future work are summarized in Section VII.

II. PRELIMINARY
In this section, we recall deep reinforcement learning and review the classical hand-to-eye calibration method.

A. DEEP REINFORCEMENT LEARNING
Reinforcement learning (RL) is a subfield of machine learning that allows agents to interact with the environment without supervision. The purpose of reinforcement learning is to maximize the cumulative reward and obtain optimal behavior so that the agent can fully understand the environment. Almost all reinforcement learning algorithms can be formulated as a Markov decision process (MDP) [12], the mathematical model agents use to evaluate decisions, in which outcomes are chosen randomly from a probability distribution. A Markov decision process is a tuple (S, A, P_M(s_{t+1} | s_t, a_t), R, γ, π), where S is the state space, A is the action space, P_M(s_{t+1} | s_t, a_t) is the probability of transitioning from the state-action pair (s_t, a_t) to the new state s_{t+1}, R is the reward function, which is determined by the state-action pair (s, a), π(a | s) is the probability of choosing action a given the state s, and γ is the discount factor, which emphasizes the importance of the current reward and weakens the influence of future rewards on the current state. At each step of the Markov decision process, the agent seeks the best possible action with respect to a policy π.
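Using the symbols defined above, the standard discounted objective implied by this MDP tuple can be stated compactly (a textbook formulation, included here only for reference):

```latex
% Discounted-return objective over the MDP (S, A, P_M, R, \gamma, \pi).
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right],
\qquad a_t \sim \pi(\cdot \mid s_t),\quad
       s_{t+1} \sim P_M(\cdot \mid s_t, a_t).
```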
Similar to reinforcement learning, deep reinforcement learning is still formulated as a Markov decision process, but it uses a neural network to output the action probabilities instead of selecting actions directly from π(a | s). One of the essential elements of deep reinforcement learning is experience replay. Because consecutive samples collected by the algorithm are highly correlated, while a deep neural network requires its training data to be independent and identically distributed, the collected data cannot be fed to the network directly. The experience replay mechanism therefore builds a memory that stores the training samples; once the memory is full, a small batch of samples is drawn at random as input to the neural network, which outputs the action-selection probabilities.
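As an illustration of the experience replay mechanism described above (a minimal sketch, not code from this paper; the capacity and batch size are placeholder values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory that stores transitions and returns random mini-batches,
    breaking the temporal correlation between consecutive simulation steps."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling approximates i.i.d. data for the neural network.
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```

Storing transitions with `store` and training only on `sample` batches is what breaks the correlation between consecutive steps.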

B. CLASSICAL HAND-TO-EYE CALIBRATION
FIGURE 1. Checkerboard and camera calibration. The red circle on the checkerboard represents the calibration target point.

Hand-eye calibration, that is, calculating the transformation between the robot end-effector and the world coordinate system [22], is a critical prerequisite for realizing robot hand-to-eye coordination [23], and it is also key to improving the accuracy of robot control. In the hand-eye calibration task, the eye-to-hand configuration is a common approach. In eye-to-hand systems, the camera is fixed outside the robotic manipulator and does not move with it. The most common use of this type of calibration system is on factory production lines, where most of the objects to be detected are fixed in certain areas. Although this approach has a wide range of applications, it can cause a sizeable relative positioning error when the camera is far away from the robotic manipulator, which limits the scope of the perception operation.
Traditionally, the most direct method for addressing hand-to-eye calibration is to use high-precision measurement equipment to establish a world coordinate system relating the camera and the robotic manipulator and then solve for the transformation matrix. Although high-precision measurement equipment gives high calibration accuracy, it may not meet the needs of fast, labor-saving, and low-cost production in industrial applications: precision measuring equipment is usually expensive, manual operation is cumbersome, and the required calibration time is long. Alternatively, a cheaper solution is to use calibration objects, such as a checkerboard plate (as shown in Fig. 1). Nonetheless, regardless of which method is used for hand-to-eye calibration, one must solve the relation

T^{end}_{base_1} T^{base_1}_{camera_1} T^{camera_1}_{object} = T^{end}_{base_2} T^{base_2}_{camera_2} T^{camera_2}_{object},   (1)

where T^{end}_{base_1} and T^{end}_{base_2} represent the transformation matrices between the robotic manipulator base and the end effector, T^{base_1}_{camera_1} and T^{base_2}_{camera_2} represent the transformation matrices between the robotic manipulator base and the camera, and T^{camera_1}_{object} and T^{camera_2}_{object} represent the transformation matrices between the camera and the calibration object (as shown in Fig. 2). This relation is obtained by moving the robotic manipulator, which holds the calibration object, between any two poses. In this equation, the only two unknown matrices are T^{base_1}_{camera_1} and T^{base_2}_{camera_2}, since T^{end}_{base_1} and T^{end}_{base_2} can be calculated from the coordinate system of the robotic manipulator itself, and T^{camera_1}_{object} and T^{camera_2}_{object} can be calculated from the coordinate system of the camera itself.
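To make the geometry behind relation (1) concrete, the following numpy sketch (illustrative only; all transforms are synthetic and the helper names are ours) builds two consistent robot and camera poses and checks that the end-effector-to-object transform recovered through either pose is identical, which is exactly what the relation asserts:

```python
import numpy as np

def transform(rz, t):
    """Homogeneous transform: rotation about z by rz (rad) plus translation t."""
    c, s = np.cos(rz), np.sin(rz)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

# Fixed, unknown-in-practice quantities: the camera pose in the base frame and
# the (constant) pose of the calibration object in the end-effector frame.
T_base_camera = transform(0.3, [1.0, 0.2, 0.5])
T_end_object  = transform(0.1, [0.0, 0.0, 0.15])

for rz, t in [(0.5, [0.4, 0.1, 0.6]), (-0.8, [0.2, -0.3, 0.7])]:   # two robot poses
    T_base_end = transform(rz, t)                                   # from robot kinematics
    # What the camera would measure for this pose (synthesized for the demo):
    T_camera_object = np.linalg.inv(T_base_camera) @ T_base_end @ T_end_object
    # Chain end -> base -> camera -> object; it must equal T_end_object for every pose.
    recovered = np.linalg.inv(T_base_end) @ T_base_camera @ T_camera_object
    assert np.allclose(recovered, T_end_object)

print("end-to-object transform is identical for both poses, as relation (1) requires")
```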
Traditionally, there are several approaches to solving the above equation. For example, Park and Martin proposed using Lie theory on the Euclidean group to solve for the unknown transformation matrix [24]. A dual quaternion method has been used in hand-to-eye calibration [25]. A novel metric on SE(3) has been proposed for optimization by Qiu, Wang and Kermani [26]. Chen and Huang [27] used an integrated two-pose calibration method to calculate the parameters of both eyes and reduce the model error. Zheng and Yang [28] created a new system to solve for the unknown transformation matrix. Although these solutions were very advanced in finding the unknown transformation matrix at the time, they have gradually become unable to meet the requirements of modern industry due to their limitations, which will be discussed in detail in later sections. Therefore, there is an urgent need for the industry to find a new, efficient, and accurate hand-to-eye calibration method.

III. TASKS FORMULATION
In this section, we describe the problems of the classical hand-to-eye calibration method. We also analyze the performance of static deep reinforcement learning algorithms for solving hand-to-eye calibration.

A. PROBLEM OF CLASSICAL HAND-TO-EYE CALIBRATION METHOD
While the classical hand-to-eye calibration method is a widely used technique for determining the transformation between a robotic end-effector and an external camera, it still has limitations. Some common problems associated with the classical calibration methods are as follows.
• Expensive experimental equipment: In traditional calibration methods, some of the equipment is very expensive, and such methods are not suitable for situations where the motion parameters are unknown or uncontrollable.
• Time-consuming [29]: The classical hand-to-eye calibration method requires the collection of a large number of calibration data points. This is especially challenging in situations where frequent re-calibration is required, such as in robotics applications where the robot may move to different positions.
• Measurement errors [30]: The classical hand-to-eye calibration method is sensitive to measurement errors, which can arise from various factors such as noise in the collected data or inaccuracies in the geometry of the calibration object. Even small errors in the calibration measurements can lead to significant errors in the estimated transformation matrix, resulting in low accuracy and stability.
• Dependence on a calibration object [31]: The classical methods rely on the use of calibration objects with known geometry to obtain the necessary measurements. This can be a limitation in situations where calibration objects are not available or not easily accessible.
• Limited flexibility: The classical approach relies on a fixed geometric relationship between the end-effector and the camera, which limits its flexibility. For example, if the camera is mounted on a moving platform, the relationship between the end-effector and the camera can vary over time, which can lead to errors in the estimated transformation matrix.

B. PERFORMANCE OF STATIC DEEP REINFORCEMENT LEARNING
In order to meet the current demand for intelligent robotic manipulators in industry and tackle the problems of the classical hand-to-eye calibration method, we attempt to use deep reinforcement learning to establish an accurate conversion relationship through simulation training and to simplify the calibration process.
In deep reinforcement learning, one of the important factors reflecting the efficiency of an algorithm is the reward function. Generally, the reward is a numerical value that provides feedback on the agent's action in the previous step or over the whole process. We name this reward setting mechanism the static reward mechanism. In practice, many mature deep reinforcement learning algorithms have successfully used this particular reward setting to guide the agent in the environment, for example AlphaGo [32] and Atari gaming [13]; these environments treat the reward as a signal indicating whether the agent has accomplished the task or not. For our calibration task, since a piercing tool is attached to the end-effector of the robotic manipulator, the robotic end pose remains relatively static and the tip of the piercing tool remains perpendicular to the plane. Therefore, when designing the reward function, we only use the distance between the tip of the piercing tool and the target point (TTD) and do not set an extra reward for the robotic end pose. The TTD and the reward can be written as

TTD = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2},   (2)

R = \begin{cases} n, & \text{TTD} \le d \\ m, & \text{otherwise}, \end{cases}   (3)

where (x_1, y_1, z_1) is the coordinate of the current needle position, (x_2, y_2, z_2) is the coordinate of the target position, and R is short for the reward signal. If the distance TTD is less than or equal to d, the agent has successfully reached the target and the reward is n; otherwise, the reward remains m during the simulation.
In addition to the reward function, the action function is another factor that ensures the agent moves toward the target point in this experiment. Similar to the static reward function, the action function can also be static, formulated as

A = A_c + \omega \, N(0, 1),   (4)

where A_c is determined by the probability distribution output by the actor neural network (discussed in a later section), N(0, 1) is the normal distribution with zero mean and unit standard deviation, whose noise allows the agent to explore the environment more during the simulation, and ω is a factor taking values in [0, 1] that keeps the noise contribution within a limited range. During the simulation, we set d to 5 mm as the condition for judging the success of the task, so the reward is 2 when the TTD is less than or equal to 5 mm and remains 0 otherwise. It turns out (as shown in Fig. 3) that the agent cannot achieve the desired result even in a one-dimensional target space. We conclude that these static reward and action functions are not sufficient to guide the agent's behavior in the environment. Therefore, this paper focuses on modifying the action function and reward function to improve the average training accuracy, and we explain the details in later sections.
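The static reward (3) and static action (4) can be summarized in a few lines; the sketch below is illustrative only, with d, n, m, and ω left as free parameters (the simulation described above uses d = 5 mm, n = 2, m = 0):

```python
import numpy as np

def static_reward(needle_pos, target_pos, d=0.005, n=2.0, m=0.0):
    """Static reward: n when the tip-target distance (TTD) is within d, else m."""
    ttd = np.linalg.norm(np.asarray(needle_pos) - np.asarray(target_pos))
    return n if ttd <= d else m

def static_action(a_c, omega=0.1):
    """Static action: actor output plus scaled Gaussian exploration noise."""
    return np.asarray(a_c) + omega * np.random.normal(0.0, 1.0, size=np.shape(a_c))
```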

IV. PROPOSED ALGORITHM ARCHITECTURE FOR CALIBRATION
In this section, we illustrate the architecture of the proposed calibration algorithm in four parts, ranging from the data sampling method used in calibration to the parameter choices of our algorithm.

A. DATA SAMPLING METHOD
As the sampling-based policy gradient algorithm takes too long to collect data, and the previously sampled data are no longer usable once the policy parameters are updated, we work toward an algorithm that can reuse the collected data to update the network. In addition, the algorithm requires preprocessing of the collected data to correct errors caused by different data distributions. We also need to add certain constraints so that the preprocessed data do not cause excessive variance when updating the policy. Meanwhile, we need the sampling network and the updated network to be the same network, because this improves training speed.
In order to design a model that meets these requirements, we use importance sampling to correct the errors caused by different data distributions and employ the clipping method to constrain policy iteration (as shown in Fig. 4). We chose not to use experience pooling because it introduces additional complexity and potential performance issues; it can also lead to off-policy correction issues, which are computationally expensive and require additional hyperparameters. The clipping method updates the policy based on the most recent experiences gathered during training, avoiding these issues and providing a simple and effective way to limit the policy update step and improve stability during training. To do that, we use the following objective:

J(\theta', \theta) = \mathbb{E}_t \left[ \min\!\left( \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} A_t,\; \mathrm{clip}\!\left(\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right) A_t \right) \right],   (7)

where J(\theta', \theta) represents the objective loss function, A is the advantage function, and \epsilon is the numerical clipping size. Equation (7) means that when the advantage function A is larger than 0, the agent should increase the probability of selecting that action under the probability distribution \pi_{\theta'}(a_t | s_t) while limiting the magnitude of the increase. The advantage of the clipping method is that it keeps the policy \pi_{\theta'} from straying far from \pi_{\theta}: when the action output by the actor network moves in a good direction, the clipping method limits excessive updates in that direction, and when the action is not good, it limits excessive updates in the wrong direction. In this way, the swing range of the robotic manipulator is kept from being too large or too small, and the learning speed is increased.
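A compact sketch of this clipped importance-sampling objective (our illustration of the mechanism in equation (7), not the authors' implementation) could look as follows in PyTorch:

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped importance-sampling objective: the probability ratio between the
    updated policy and the sampling policy is clipped to [1 - eps, 1 + eps]."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximizing the objective is implemented as minimizing its negative.
    return -torch.min(unclipped, clipped).mean()
```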

B. ACTOR-CRITIC MODEL FOR HAND-TO-EYE CALIBRATION
With the importance sampling method used to preprocess the data, the calibration algorithm can be described as follows. The proposed algorithm contains two different neural networks; both are fully connected and together form an actor-critic [33] structure. Each network plays a different role: one computes the probability distribution of the output action, and the other computes and evaluates the value generated after the agent performs the action.
As described in Fig. 5, we use a 2-dimensional simulation as an example to illustrate the structure of our algorithm. The actor model contains three hidden layers, each with 128 neurons. The first two hidden layers use the ReLU activation, and the last hidden layer uses the Hardswish activation. The actor network takes as its state the coordinates of the end-effector of the robotic manipulator and the coordinates of the last action, and it outputs the probability of action selection. For each state, the actor model infers the possible 3D coordinates of the target point and interacts with the environment. Once the environment receives the action signal, it outputs the relative position of the next target point and the needle tip, which together with the 3D coordinates of the end of the robotic manipulator forms the next state. Therefore, in the hand-to-eye calibration task, each state is independent. Furthermore, because the traditional tip target distance (TTD) obtained by matrix transformation is replaced by the direct output of the actor network in the deep reinforcement learning algorithm, the hand-to-eye calibration process is simplified. Moreover, the sequence of trajectories used to construct the environment is pre-collected to facilitate training of the hand-to-eye calibration model and is stored separately in the form of an array, called the ''stored memory''.
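Based on the layer description above, one plausible PyTorch sketch of the actor network is shown below; the input and output dimensions and the Gaussian head are assumptions made for illustration, not details taken from the paper:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three hidden layers of 128 units (ReLU, ReLU, Hardswish), followed by a
    head that parameterizes the action distribution."""

    def __init__(self, state_dim=6, action_dim=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.Hardswish(),
        )
        self.mean = nn.Linear(128, action_dim)            # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())
```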
The structure of the critic model is similar to that of the actor model. The difference is that it takes the last state of each episode and all states of each episode from the stored memory separately as input, and it outputs the cumulative reward and the state value, which can be described as

G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},   (8)

V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right],   (9)

where G_t is the cumulative reward starting at time step t, V_{\pi}(s) is adopted to measure the value of the current state s, and Q_{\pi}(s, a) represents the expected return obtained by executing a specific action a in the current state s while following the current policy \pi.
This information is then used to obtain accurate gradients for computing the critic loss and the advantage function, which guide the learning process of the actor model. In other words, both the critic loss and the advantage function can be obtained from equations (8) and (9), and the advantage function can be described as

A_t = G_t - V_{\pi}(s_t).

Viewed as a whole, the actor-critic model provides the information from the environment, including the probability distribution of actions, the trajectory between the end-effector of the robotic manipulator and the target points, and the coordinates of the end effector corresponding to the current state of the robot arm; evaluating each action selection in this way is relatively efficient for the hand-to-eye calibration task. We verify this assumption by running different simulation scenarios, which are explained in a later section.
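For illustration, the cumulative reward (8) and a simple advantage estimate of the kind described above could be computed as follows (a sketch under the assumption of Monte-Carlo returns, with names of our own choosing):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage(returns, state_values):
    """Advantage estimate: cumulative reward minus the critic's state value."""
    return np.asarray(returns) - np.asarray(state_values)
```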

C. NEURODYNAMICS ADAPTIVE REWARD AND ACTION SETTING
Generally, the reward signal is 0 in most Markov decision process (MDP) settings. For our hand-to-eye calibration task, this traditional reward setting may lead to the sparse-reward problem, in which the agent has difficulty obtaining positive rewards during exploration, resulting in a failure to learn. An intuitive way to alleviate sparse rewards is to reward the agent, beyond the terminal reward, whenever it takes a step toward the goal. This method is called reward shaping [34] and was initially proposed by Andrew Y. Ng in 1999. In addition, building on the zeroing discretized neurodynamics method [35], [36], to make the reward shaping more adaptable to our task and to achieve zero stability [10], [11], we design a reward function with neurodynamics adaptive properties inspired by Euler difference methods.
Based on this, we set the reward function as

R = k \,(\mathrm{TTD}_{\text{previous}} - \mathrm{TTD}_{\text{current}}),

where TTD_current is the distance between the needle position and the target position in the current step, and TTD_previous is the distance between the target position and the needle position in the previous step. The purpose of this equation is to ensure that the reward reflects whether the agent has learned the environment based on the action selected in the previous step. In other words, this modified reward function carries information about whether the robotic manipulator is moving toward the target position, so the agent can choose an action based on the result of the current state instead of selecting actions as if it were in the initial state at every step. More important in this equation is the introduction of the amplification factor k, which allows the agent to identify the target faster. Since the neurodynamics method inspired by the Euler formula is zeroing stable, as determined by the root properties of its characteristic polynomial, the deep reinforcement learning induced by the neurodynamics adaptive reward function is also zeroing stable. More details of this second simulation are given in Section V.
In addition to reward shaping, we design an action function with the same neurodynamics adaptive properties, in which P_end, the coordinate position of the end pose of the robotic manipulator at each step, is updated using the action A_c output by the current actor network, scaled by a magnification factor λ that reduces the action by different amounts according to the change in the TTD. In this way, the agent at least has a sense of where it should go to reach the target. More details of this simulation are given in Section V. In simple tasks, a single-step neurodynamics adaptive reward setting and action setting are sufficient for the policy to converge. For the 3D hand-to-eye calibration task, however, this single-step neurodynamics adaptive reward may be ineffective and time-consuming and may not meet the efficiency requirements of industrial applications. Therefore, we take a further step based on the single-step neurodynamics adaptive reward and action settings, called the neurodynamics adaptive step-wise settings (DASS), where a, b, c are the distance conditions in the calibration task, δ_1, δ_2, δ_3, δ_4 are the magnification parameters of the reward function, and φ_1, φ_2, φ_3, φ_4 are the reduction parameters of the action function. Meanwhile, we formulate a rule to regulate the parameters: as the TTD gradually decreases, the adaptive parameter of the reward function should increase accordingly, and the adaptive parameter of the action function should decrease accordingly.
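Since the exact step-wise coefficients are not reproduced here, the sketch below only illustrates the regulation rule just stated; the thresholds and the monotone gain/scale schedules are placeholder values of our own, not the paper's settings:

```python
def stepwise_coefficients(ttd, thresholds=(0.173, 0.0173, 0.00173),
                          reward_gains=(1.0, 2.0, 4.0, 8.0),
                          action_scales=(1.0, 0.5, 0.25, 0.1)):
    """Pick the reward amplification and action reduction for the current TTD:
    as the TTD shrinks, the reward gain grows and the action scale shrinks."""
    a, b, c = thresholds
    if ttd > a:
        i = 0
    elif ttd > b:
        i = 1
    elif ttd > c:
        i = 2
    else:
        i = 3
    return reward_gains[i], action_scales[i]

def dass_step(ttd_previous, ttd_current, actor_output):
    """Difference-based shaped reward plus a scaled action, following the rule above."""
    gain, scale = stepwise_coefficients(ttd_current)
    reward = gain * (ttd_previous - ttd_current)   # positive when moving toward the target
    action = scale * actor_output
    return reward, action
```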
On the basis of the importance sampling and clipping method, the actor-critic model, and the neurodynamics adaptive reward and action setting described above, an actor-critic algorithm based on deep reinforcement learning is proposed for the hand-to-eye calibration problem, which is termed the neurodynamics adaptive deep reinforcement learning algorithm.

D. STATES INFORMATION SETTING
When solving calibration tasks using deep reinforcement learning, one critical component is the state information setting. In general, the state of a robotic manipulator includes its joint angles, velocities, and positions. For calibration tasks, additional information such as the camera image, the target position, and the distance between the manipulator end effector and the target point should also be taken into account. However, adding too many variables can lead to instability, while neglecting important state information can lead to poor performance. Therefore, selecting the appropriate set of state variables is crucial for achieving a balance between performance and stability in calibration tasks using deep reinforcement learning.
In the proposed algorithms, the state information setting includes the coordinate error between the target point and the actual position of the end-effector, and the coordinates of the previous action taken by the agent. This setting offers several benefits to the training process. Firstly, it is less susceptible to noise, which can cause fluctuations in the state space and result in unstable training. Secondly, by including the coordinate error between the target point and the current position of the end effector, the agent can track its progress towards the goal and adjust its actions accordingly. This allows for a more efficient and effective learning process, as the agent can make corrections based on its current state, rather than relying solely on the initial target position. Thirdly, by including the coordinates of the actions the agent performed in the previous step, the agent can learn from past experience and avoid repeating the same mistakes. Overall, this state information setting contributes to a more stable and accurate calibration process.
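A minimal sketch of this state construction (function and variable names are ours, chosen only for illustration):

```python
import numpy as np

def build_state(target_pos, end_effector_pos, previous_action):
    """State = coordinate error between target and end effector, concatenated
    with the coordinates of the action taken in the previous step."""
    error = np.asarray(target_pos) - np.asarray(end_effector_pos)
    return np.concatenate([error, np.asarray(previous_action)])
```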

V. SIMULATION AND RESULT ANALYSIS
In this section, we present the results of applying deep reinforcement learning to a UR-10 robotic manipulator via the simulation platform V-rep. Our experiment was conducted on the V-rep platform using both Remote API clients and Embedded scripts to control the robotic manipulator. We chose the UR-10 manipulator with an attached piercing tool. In this experiment, the goal is to find the coordinate transformation relationship between the robotic manipulator's end-effector and the world coordinate system; in other words, we are solving the hand-to-eye calibration problem using the deep reinforcement learning method. Accordingly, we use a third-party independent camera only to observe the needle tip and target point coordinates. In the V-rep platform, we use inverse kinematics (IK) [37] to control the UR-10 robotic arm, which means that obtaining the world coordinates of every joint angle is not necessary. Instead, as long as the coordinates of the needle tip of the piercing tool can be obtained through the third-party independent camera, we can deduce the coordinates of each joint angle.

A. SIMULATION PLATFORM
Simulation is a very important tool for algorithm verification and can reduce the cost of learning. For this purpose, the virtual robot experimentation platform, known as V-Rep [38], is very useful. The V-rep platform provides various robots, including a 7-DoF manipulator, the UR5, the UR10, vehicles, the Dobot Magician, and so on, which users can choose to perform their tasks. Meanwhile, V-Rep allows the user to use various programming tools to run the simulation, such as Remote API clients, Add-ons, Plug-ins, Embedded scripts, and ROS nodes. In this paper, we only use Remote API clients and Embedded scripts.
The user manual describes that Remote API clients allow V-Rep to interact with an external entity. This external entity can be any hardware, and its remote functions can be written in other programming languages, such as Matlab, Python, or Java. On the other hand, an embedded script is a script embedded in a model, allowing users to write the central simulation commands within the V-rep platform; the scripting language used is Lua [39]. This main script is the central control of the simulation. With these two powerful programming tools, we can import deep reinforcement learning via Remote API clients and start our simulation using the UR10 manipulator. Unlike the traditional hand-to-eye calibration system [40], the purpose of our simulation tasks is to enable the robotic manipulator to learn how to recognize and reach the targets by itself. To achieve that, we use deep reinforcement learning, as described in the previous section, as our approach to constructing a hand-to-eye calibration system.
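For reference, a minimal Remote API client of the kind described above might look like the sketch below; the module name and the scene object names are assumptions (depending on the V-REP release, the legacy Python binding is imported as `vrep` or `sim`):

```python
import sim  # legacy remote API binding shipped with V-REP (older releases name it vrep)

client_id = sim.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
if client_id == -1:
    raise RuntimeError('could not connect to the V-REP remote API server')

# 'tip' and 'target' are placeholder object names for the piercing-tool tip
# and the calibration target point in the scene.
_, tip_handle = sim.simxGetObjectHandle(client_id, 'tip', sim.simx_opmode_blocking)
_, target_handle = sim.simxGetObjectHandle(client_id, 'target', sim.simx_opmode_blocking)

# Absolute (world-frame) positions: -1 means "relative to the world frame".
_, tip_pos = sim.simxGetObjectPosition(client_id, tip_handle, -1, sim.simx_opmode_blocking)
_, target_pos = sim.simxGetObjectPosition(client_id, target_handle, -1, sim.simx_opmode_blocking)

sim.simxFinish(client_id)
```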

B. PARAMETER CHOICE OF SIMULATION
For simulation purposes, we carefully designed the parameters (as shown in Table 1), since poorly chosen parameters can lead the experiment in the wrong direction. All parameters are the same in the three different experiments except for the state and action dimensions, since these are determined by the specific experiment.

C. SIMULATION SCENARIO
We performed three experiments based on different target selections to verify that our deep reinforcement learning algorithm with the modified reward and action functions is suitable for hand-to-eye tasks. In the simulation process, instead of recording the success rate of each episode, we calculated the average success rate over every 20 episodes for one- or two-dimensional target selection and over every 50 episodes for three-dimensional target selection, and plotted it as a line graph. According to our neurodynamics adaptive step-wise setting rules, we set the reward function and action function used in the simulation based on equations (14) and (15), with the reward value and the agent's movement determined by the current tip target distance (TTD), in centimeters: the reward continues to grow as the tip target distance shrinks, while the agent's movement decreases as the distance shrinks. In this way, we improve not only the training speed and stability but also the accuracy and safety.
In addition, when setting the constraints of the step-wise reward function, we refer to the maximum range of motion of the manipulator within the last meter of three-dimensional space. In a cube with a side length of 1 m, the diagonal length is 1.73 m, so the longest possible trajectory of the distance between the end effector and the target point is 1.73 m. Therefore, we set the initial constraint of the step-wise reward function to start at 0.173 m and decrease in multiples as the manipulator gets closer to the target point.

1) SCENARIO ONE
We start with one-dimensional target point selection, meaning only the x coordinate changes throughout the experiments while the y and z coordinates are fixed. In other words, we set y = 0.34412, z = 1.2896, and x ∈ [−0.519489, 0.580511]. As shown in Fig. 6, the graph represents the average success rate per 20 episodes during the training simulation. The graph shows that it takes approximately 9160 episodes of training for the policy to converge. To verify that this is an acceptable result, we also conducted a comparison experiment, which is discussed in a later section. In the graph, the red line represents the training result without smoothing, and the blue line represents the result after smoothing with a mean noise-reduction technique; smoothing is added to reduce the volatility during training, and reduced volatility means a relatively smooth curve. For consistency of the simulation results, we applied the same procedure to the rest of the simulations.
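The mean noise-reduction smoothing mentioned above can be read as a simple moving average; the sketch below reflects that interpretation, with the window size as a placeholder:

```python
import numpy as np

def moving_average(values, window=10):
    """Smooth a success-rate curve by averaging over a sliding window."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')
```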

2) SCENARIO TWO
Secondly, we performed a slightly more complicated experiment in which the target point space is two-dimensional: x and z are sampled uniformly, with x ∈ [−0.751959, −0.151959], z ∈ [0.98946, 1.58946], and y = 0.34487. As shown in Fig. 7, it takes approximately 10000 episodes of training for the policy to converge, at which point the tip target distance (TTD) between the needle and the target points is within 0.1 cm; the whole training process lasted about 12 hours. As in scenario one, we also ran an evaluation test to verify whether the new reward and action setting is more efficient than the original setting, which is discussed later.

3) SCENARIO THREE
FIGURE 7. Average accuracy synthesized by the neurodynamics adaptive deep reinforcement learning algorithm using reward function (16) and action function (15) for the two-dimensional hand-to-eye calibration problem.

Thirdly, we performed an even more complex simulation in which the target point space is three-dimensional. The reason we carefully designed such a target point interval is to account for the limited range of motion of the manipulator; at the same time, we considered how to operate the arms of the manipulator without them colliding with each other. The results in Fig. 8 show that it takes approximately 25000 episodes of training for the algorithm to converge, and we trained continuously for 24 hours so that the robotic manipulator reaches the target position accurately and maintains a stable state.

VI. COMPARISON ANALYSIS
We evaluated and compared the stability of the average accuracy and reward across different settings of the action and reward functions in different dimensional target spaces. Additionally, we compared the results of our proposed algorithm with the DDPG and SAC algorithms to verify that our actor-critic based algorithm is better suited to the hand-to-eye calibration task. This comparative analysis provides further evidence and support for the effectiveness of our proposed algorithm. By comparing our approach with these established methods, we were able to assess its performance, evaluate its advantages, and demonstrate its superiority in addressing the calibration problem. The results of this comparative analysis highlight the unique contributions and advancements offered by our proposed algorithm in comparison with existing techniques.

FIGURE 9. Average accuracy synthesized by the neurodynamics adaptive deep reinforcement learning algorithm using reward function (16) and action function (15) and by the static deep reinforcement learning algorithm using reward function (3) and action function (4) for the hand-to-eye calibration problem in three different dimensional spaces. The top two graphs represent the average accuracy of the calibration tasks in the 1-dimensional target space, the middle two graphs in the 2-dimensional target space, and the bottom two graphs in the 3-dimensional target space.

A. AVERAGE ACCURACY AND RETURN COMPARISON FOR TWO DIFFERENT ACTION AND REWARD SETTING
As shown in Fig. 9, the average accuracy of the new action and reward function settings, represented by the three graphs on the left, is better than that of the original settings, represented by the three graphs on the right. Moreover, compared with the old reward and action function setting, the training process with the new setting is smoother; that is, the fluctuation between training sets is not as large as with the old setting, which fails to reach the target position.
From an average episode return perspective, the average episode return should remain within a specific range. In our case, once the tip target distance (TTD) is less than or equal to the pre-designed range, we consider the agent to have done its job. Therefore, the reward should remain around 7, with a maximum of 9. As can be seen from Fig. 10, the agent achieves this well in the simulation with the new reward and action function setting. However, with the old setting, the average reward keeps rising, which means that the agent cannot reach the goal before the end of each episode, causing erroneous reward stacking.
Remark: We conducted a descriptive statistical analysis on the different target dimensional spaces and found that our proposed algorithm with the DASS setting outperforms the old reward and action function setting, as shown in Table 2. It should be noted that the data used for this analysis include the training phase from scratch, so the average accuracy was close to 100%. Nonetheless, our proposed algorithm exhibits superior performance compared with the older setting, demonstrating the effectiveness of our approach.

B. AVERAGE ACCURACY COMPARISON ANALYSIS WITH DDPG ALGORITHMS
In this subsection, we conducted a simulation using the deep deterministic policy gradient (DDPG) algorithm to determine whether our proposed algorithm is better suited to the hand-to-eye calibration task. We comparatively evaluated the performance of the two algorithms under identical parameter settings and identical reward and action function specifications. Our analysis focused on comparing the average accuracy of the algorithms across varying target dimensional spaces. The results of our simulation are presented in Fig. 11 and Table 3, which show the average accuracy of both algorithms across different target dimensional spaces. From a training perspective, our findings suggest that although the DDPG algorithm can successfully accomplish the hand-to-eye calibration task in one- and two-dimensional spaces, its training process is highly unstable. As depicted in Fig. 11, when agents were trained on the one-dimensional task, after the 14,000th iteration the agent appeared to lose what it had learned, causing the average success rate to drop to almost zero. Agents faced the same issue when training in the two-dimensional space after the 5,000th iteration. This sudden drop in performance could be attributed to a phenomenon known as catastrophic forgetting [41], where the agent forgets previously learned information as it learns new information. One possible reason behind this phenomenon is that the agent may mistakenly believe that the robotic manipulator is trapped in a local optimum, leading to fluctuations in training and affecting its stability and efficiency.
Moreover, when the DDPG algorithm is used to train the agent in the three-dimensional target space, it encounters difficulty in successfully identifying the target, leading to more severe instances of repeated training compared with the one- and two-dimensional spaces. This can be attributed to the increased complexity of the task in three-dimensional space and the greater number of potential target positions, which present a challenge for the DDPG algorithm.
In addition, from the perspective of training efficiency, even in the one- or two-dimensional target space the DDPG agent needs at least two consecutive days of training to achieve the desired results, and even after training continuously for three days in the three-dimensional target space it cannot achieve the desired effect. Such training efficiency offers no practical advantage for the robotic manipulator in real applications. In contrast, our proposed algorithm can successfully accomplish the task in less than one day in the complex three-dimensional space. Although it takes almost a day of training for the algorithm to complete the autonomous calibration task, we believe that future intelligent robotic manipulators should adopt this actor-critic based deep reinforcement learning algorithm. After the robotic arm is well trained and fully aware of the environment, it can complete any calibration task in a very short time. This approach has significant advantages over traditional calibration methods, as it does not require manual intervention and can be used for various types of robotic manipulators, reducing costs and improving efficiency.
Overall, these findings highlight the limitations of the DDPG algorithm for the hand-to-eye calibration task and suggest that our proposed algorithm is more effective for this task.

C. AVERAGE RETURN COMPARISON ANALYSIS WITH DDPG ALGORITHMS
From the perspective of the average return of agent training, our proposed algorithm shows better performance than the DDPG algorithm. As presented in Fig. 12, our proposed algorithm has a higher average return than the DDPG algorithm across different dimensional target spaces. In particular, the analysis of the training data in different dimensional target spaces, shown in Table 3, demonstrates that our proposed algorithm has a higher average return than the DDPG algorithm. Furthermore, our proposed algorithm shows more consistent performance across trials when the variance of the average return is compared. Therefore, our proposed algorithm has more stable performance than the DDPG algorithm, indicating that it is more effective in training the agent for the hand-to-eye calibration task.

D. AVERAGE ACCURACY COMPARISON ANALYSIS WITH SAC ALGORITHMS
To further prove the effectiveness and superiority of our proposed algorithm, we conducted an additional simulation using the soft actor-critic (SAC) algorithm [42], which has proven effective in the robotics field. This simulation uses identical parameter settings, reward function specifications, and action function specifications to compare the performance of our algorithm and SAC. Through this comparative assessment of the two algorithms, we provide further evidence of the feasibility and advantage of our proposed algorithm for the hand-to-eye calibration task.

FIGURE 11. Average accuracy comparison with the DDPG algorithm. The top two graphs represent the average accuracy of the two algorithms in the 1-dimensional target space, the middle two graphs in the 2-dimensional target space, and the bottom two graphs in the 3-dimensional target space.

FIGURE 12. Average return comparison with the DDPG algorithm. The top two graphs represent the average return of the two algorithms in the 1-dimensional target space, the middle two graphs in the 2-dimensional target space, and the bottom two graphs in the 3-dimensional target space.
As shown in Fig. 13, although the SAC algorithm can quickly complete the calibration task in the one-dimensional space, it suffers from the same problem observed with the DDPG algorithm: the agent appears to lose what it has learned, causing the average success rate to drop to almost zero after the 900th iteration. Moreover, when the SAC algorithm is used to train the agent in the two- and three-dimensional target spaces, it encounters difficulty in successfully identifying the target, leading to more severe instances of repeated training compared with the one-dimensional space. This can be attributed to the increased complexity of the task in two- and three-dimensional space and the greater number of potential target positions, which present a challenge for the SAC algorithm. In addition, as shown in Table 4, our proposed algorithm has more consistent performance across trials when the variance of the average return is compared. Therefore, our proposed algorithm has more stable performance than the SAC algorithm, indicating that it is more effective in training the agent for the hand-to-eye calibration task.

VII. CONCLUSION
In this paper, we propose a novel method for addressing the hand-to-eye calibration problem using deep reinforcement learning. Our proposed algorithm utilizes an actor-critic framework and incorporates neurodynamics adaptive reward and action functions, which allow for better convergence, reduce the dependence on the initial value, and overcome the local convergence issues of traditional deep reinforcement learning methods. Additionally, we introduce a step-wise mechanism under the guidance of the attention mechanism, and zero stability to handle the complexity of the calibration task in challenging environments. We have conducted several simulations to demonstrate the validity of our proposed algorithm, and the results show that the agent can achieve nearly 100% accuracy after the learning phase with step-wise neurodynamics adaptive reward and action function settings. Furthermore, we have compared our proposed algorithm with DDPG and SAC algorithms through additional simulations, which further prove its effectiveness and superiority.
For future research on the intelligent robotic manipulator, several capabilities still need to be improved. Firstly, the training time of the hand-to-eye calibration task needs to be shortened to keep pace with increasingly advanced industrial intelligent production technology. Secondly, the agent should be able to adjust its parameters promptly when the camera position changes suddenly, so as to quickly find the new relative position relationship between the robotic manipulator and the camera, allowing the calibration training process to still succeed in a very short time.
MENGFEI YU received the B.S. degree in information and computing science from Jiangxi Normal University, Nanchang, China, in 2020. He is currently pursuing the master's degree in computational mathematics with the South China University of Technology, Guangzhou. His research interests include robotics, reinforcement learning, and data mining. DELU ZENG received the bachelor's degree in applied mathematics and the Ph.D. degree in signal and information processing from the South China University of Technology (SCUT), Guangzhou, in June 2003 and June 2010, respectively. He has been a Visiting Scholar with Columbia University, the University of Waterloo, and the University of Oulu. He is currently a Full Professor with the School of Electronic and Information Engineering, SCUT. His research interests include statistics learning, image and speech processing, computational intelligence, machine learning, fitting and approximation and their applications to communications, and industrial intelligence.