Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators

Abstract: To address the issues of unstable, non-uniform and inefficient motion trajectories in traditional manipulator systems, this paper proposes an improved whale optimization algorithm for time-optimal trajectory planning. First, an inertia weight factor is introduced into the surrounding-prey and bubble-net attack formulas of the whale optimization algorithm, and its value is controlled using reinforcement learning techniques to enhance the global search capability of the algorithm. Additionally, the variable neighborhood search algorithm is incorporated to improve the local optimization capability. The proposed algorithm is compared with several commonly used optimization algorithms, demonstrating its superior performance. Finally, the proposed algorithm is employed for trajectory planning and is shown to produce smooth, continuous manipulator trajectories and achieve higher work efficiency.


Introduction
Manipulators are multi-degree-of-freedom robots capable of autonomous operation and task execution. They have been utilized in fields including manufacturing, medical care and aerospace [1]. In manufacturing, manipulators streamline production, handle materials and ensure consistent quality. In medical care, they enable precise and minimally invasive surgeries, leading to faster recovery and improved outcomes. Aerospace benefits from manipulators for assembling and maintaining components in challenging environments. However, as industrial sophistication and job requirements continue to increase, the performance requirements for manipulators in various industries are becoming increasingly stringent. As a result, many experts and scholars have dedicated considerable time and effort to researching issues such as trajectory planning, path planning [2] and tracking control [3] of manipulators [4].
An important aspect of manipulator design is trajectory planning. It holds the key to minimizing operation time, reducing energy consumption and maximizing productivity. In manufacturing, optimized trajectories can streamline production processes and improve overall efficiency. In medical applications, precise trajectory planning allows for minimally invasive procedures with enhanced patient safety. Similarly, in aerospace, accurate trajectory planning ensures smooth and agile movements in challenging environments. Trajectory planning can be divided into multi-objective and single-objective trajectory planning. Single-objective trajectory planning is mainly concerned with time, energy [5] and impact [6], while multi-objective trajectory planning combines multiple single-objective goals to suit different working environments [7,8]. Time-optimal trajectory planning is a crucial focus of current research due to its profound impact on manipulator performance. By enabling manipulators to complete tasks in the shortest possible time, this optimization technique significantly improves work efficiency, leading to enhanced productivity and reduced operational costs. With industries seeking streamlined processes and faster task execution, time-optimal trajectory planning plays a pivotal role in maximizing the potential of manipulators, making it a critical area of exploration and innovation in the field.
The paper [9] proposes an adaptive cuckoo algorithm, which has good convergence speed and search ability, and combines it with a quintic B-spline curve to obtain a smooth time-optimal trajectory. The paper [10] combines the original teaching-learning-based optimization algorithm with the variable neighborhood search (VNS) algorithm to improve its ability to escape from local optima, and combines it with a quintic B-spline curve to obtain a time-optimal trajectory for the manipulator. The paper [11] proposes a local chaotic particle swarm optimization (PSO) algorithm, which solves the problem of premature convergence into local optima in the traditional particle swarm algorithm, and combines it with a piecewise polynomial interpolation function to generate a time-optimal trajectory. The paper [12] proposes an improved sparrow search algorithm, which uses tent chaotic mapping to optimize the generation of the initial population and an adaptive step factor to give the algorithm good convergence behavior, finally obtaining a good operating trajectory.
In 2016, Mirjalili proposed a novel intelligent optimization algorithm known as the whale optimization algorithm (WOA). Compared with other optimization algorithms such as PSO, cuckoo search and the genetic algorithm, the WOA has the advantages of fast convergence speed, a simple structure and high convergence accuracy. These features make it an ideal choice for time-optimal trajectory planning in manipulators. The WOA exhibits rapid convergence, allowing the discovery of global optima within a limited number of iterations, thus reducing computation time. Additionally, its high accuracy ensures that planned trajectories closely approximate optimal solutions. In the context of time-optimal trajectory planning, precise trajectories are crucial for efficient manipulator motion. By improving the WOA, we can effectively address challenges in time-optimal trajectory planning, leading to improved motion efficiency and better alignment with industrial application requirements. The paper [13] proposed an improved whale optimization algorithm (IWOA), which designed dynamic inertia weights for two behaviors by improving the contraction-expansion mechanism and the spiral updating mechanism, thus enhancing the search ability of the algorithm. However, it was observed that in later stages the algorithm tended to get trapped in local optima. In paper [14], a multi-strategy whale optimization algorithm (MSWOA) was proposed, which incorporated adaptive weights, Lévy flight and evolutionary population dynamics to enhance the algorithm's search capability. However, it was found that the algorithm failed to converge to the global optimum on some test functions. The paper [15] proposed a modified whale optimization algorithm (MWOA) that employs probabilistic prey selection and adjusts the initialization of the population and the search strategy during the development phase to reduce the likelihood of getting trapped in local optima, thereby enhancing the algorithm's robustness.
Nevertheless, the algorithm exhibits a relatively high time complexity while tackling optimization problems. Although all of these algorithms have achieved good results, they may not perform well in some target optimization problems.
Therefore, this study presents an enhanced version of the whale optimization algorithm (RLVWOA) that combines reinforcement learning and the VNS algorithm. First, an inertia weight is designed for the surrounding-prey and bubble-net attack behaviors of whales, and the weight value is controlled using the Q-learning and SARSA algorithms so that each generation of the population obtains a suitable inertia weight, thereby enhancing the global search capability of the algorithm. Then, combined with the VNS algorithm, the local search capability of the algorithm is improved through continuous neighborhood search. Compared to the standard WOA, the RLVWOA can adaptively control the surrounding-prey and bubble-net attack behaviors and, with the assistance of the VNS algorithm, can effectively escape from local optima, thereby achieving robust search capabilities. Finally, the RLVWOA is used in conjunction with a quintic non-uniform rational B-spline (NURBS) curve to perform time-optimal trajectory planning for the manipulator and its feasibility is verified in MATLAB.
The primary contribution of this study lies in the development of the RLVWOA algorithm, which innovatively integrates reinforcement learning algorithm and VNS algorithm. This integration leads to substantial performance improvements and presents an enhanced solution for the time-optimal trajectory planning problem in manipulators. The proposed enhancements significantly accelerate convergence and optimize the algorithm's capabilities, while mitigating the risk of getting trapped in local optima, thereby facilitating the discovery of more efficient trajectory paths. Consequently, this paper introduces a novel method for manipulator trajectory planning, leading to heightened work efficiency and smoother operations and exhibiting promising prospects for widespread application across various industries, encompassing manufacturing, medical care and aerospace.
The subsequent sections of this paper are organized as follows: Section 2 introduces the basic concepts of NURBS interpolation. Section 3 provides an overview of WOA, reinforcement learning and VNS algorithms. In Section 4, the proposed method for improving the WOA is described and a comparison between the RLVWOA and other commonly used single-objective algorithms is conducted on test functions. Section 5 focuses on the modeling of the PUMA560 robotic arm and compares the trajectory planning results obtained using the RLVWOA and traditional single-objective algorithms. The final section highlights the contribution of this study and suggests potential directions for future work.

Basic concepts of NURBS curves
NURBS interpolation is a widely used curve- and surface-fitting technique that is also common in manipulator trajectory planning. Compared with traditional B-spline curves, NURBS curves have greater flexibility and accuracy and can better fit complex curve shapes. Based on the mathematical model of control points and nodes, NURBS interpolation can generate smooth and continuous trajectories. By optimizing the weights of the control points and the distribution of the nodes, optimal manipulator trajectory planning can be achieved, thereby improving the accuracy and efficiency of the manipulator. Using NURBS interpolation for trajectory planning helps solve complex manipulator motion problems while also improving the reliability and stability of the manipulator. A k-th degree NURBS curve can be expressed as a piecewise rational polynomial function [16], as shown in Eq (1).
Here, the weight factors of the NURBS curve are denoted by ω, di is the i-th control vertex of the NURBS curve, k is the degree of the NURBS curve, x is the parameter of the NURBS curve and Ni,k(x) is the basis function of the k-th degree NURBS curve. Ni,k(x) can be obtained by the Cox-de Boor recursion formula from the knot vector X = [x0, x1, ..., xn+k, xn+k+1], as shown in Eqs (2) and (3), where 0/0 is defined as 0 [17].
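Since Eqs (2) and (3) are not reproduced in this excerpt, the Cox-de Boor recursion they refer to can be sketched as follows (a minimal illustration; the function name and plain-list knot representation are ours, and the 0/0 := 0 convention from the text is handled by checking the denominators):

```python
def bspline_basis(i, k, x, knots):
    """Cox-de Boor recursion for the B-spline basis N_{i,k}(x).
    k is the degree; 0/0 terms are defined as 0, as in the text."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0 if left_den == 0 else \
        (x - knots[i]) / left_den * bspline_basis(i, k - 1, x, knots)
    right = 0.0 if right_den == 0 else \
        (knots[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, x, knots)
    return left + right
```

On any knot span the non-zero basis functions sum to one (partition of unity), which is a quick way to check the recursion.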

The quintic NURBS curve matrix
The derivative of a NURBS curve of degree k is given by Eq (5) [19]. According to Eq (6), when solving for NURBS curves with n + 1 unknowns, four boundary conditions must be added to ensure a unique solution to the equation system. Therefore, according to the actual motion conditions of the manipulator, the following four boundary conditions are added, as shown in Eq (7), where v and a represent the angular velocity and angular acceleration of the manipulator. Substituting Eq (7) into Eq (1), the matrix equation for solving all control points can be obtained, as shown in Eq (8). The joint motion trajectory angle curves of the manipulator can then be obtained using Eq (1). By using Eq (5) to compute the derivatives of the curve equation up to the third order, the angular velocity, angular acceleration and angular jerk curves for each joint can be acquired.

Whale optimization algorithm
In 2016, Mirjalili et al. proposed the WOA, which is a recently developed metaheuristic search algorithm. The authors studied and analyzed the optimization ability of WOA from different perspectives such as structure and mathematical models. Experimental results showed that WOA not only has strong search ability and positive feedback, but also can achieve global optimization [20].
The most remarkable feature of humpback whales is their sociality. Typically, a group of about six humpback whales searches for prey and confirms the target's position. Other whales in the group approach the prey through encircling contraction and spiral contraction and eventually succeed in catching the prey at the appropriate time. The algorithm consists of the following three stages:
(1) Surrounding prey. It is assumed that the optimal solution corresponds to the position of the target prey in the WOA. Each whale updates its position relative to the target position using Eqs (9) and (10). In these two equations, X*(t) represents the best position, X(t) represents the present position and t represents the present iteration. A and C are adjustment factors, defined as: where rand1 and rand2 are random values uniformly distributed between 0 and 1 and a is a decreasing factor that is gradually reduced from 2 to 0, represented as: In this equation, tmax represents the maximum number of iterations.
(2) Bubble-net attack. In the WOA, the bubble-net attack is divided into the contraction-encirclement mechanism and the spiral updating mechanism. The contraction-encirclement mechanism uses the same formula as surrounding the prey, but with the range of A changed from [-a, a] to [-1, 1]. The spiral updating mechanism is represented by Eq (14), where l is a random number between -1 and 1, the constant b defines the logarithmic spiral shape and Dq represents the distance between the whale and the prey, which is expressed by Eq (15).

Assuming that a whale chooses between the contraction-encirclement and spiral updating mechanisms with equal probability (50%) during the hunting of a target prey, the position update is given by Eq (16).
(3) Searching for prey. The whale decides whether to use the contraction-encirclement mechanism or the search-for-prey mechanism based on the magnitude of parameter A. When |A| ≥ 1, the whale cannot obtain the optimal position of the prey and therefore searches for the target randomly within its range, as expressed in Eqs (17) and (18).
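Because Eqs (9)-(18) are not reproduced in this excerpt, the three stages above can be sketched from the standard WOA formulation (Mirjalili, 2016); all function and variable names here are illustrative:

```python
import math
import random

def woa_update(X, X_best, population, t, t_max, b=1.0):
    """One position update of the standard WOA for a whale at X."""
    a = 2 - 2 * t / t_max                 # decreasing factor, 2 -> 0
    A = 2 * a * random.random() - a       # adjustment factor A
    C = 2 * random.random()               # adjustment factor C
    p = random.random()                   # 50/50 mechanism choice
    if p < 0.5:
        if abs(A) < 1:
            # surrounding prey / contraction-encirclement around the best
            D = [abs(C * xb - x) for xb, x in zip(X_best, X)]
            return [xb - A * d for xb, d in zip(X_best, D)]
        # |A| >= 1: search for prey around a randomly chosen whale
        X_rand = random.choice(population)
        D = [abs(C * xr - x) for xr, x in zip(X_rand, X)]
        return [xr - A * d for xr, d in zip(X_rand, D)]
    # spiral updating mechanism with logarithmic spiral constant b
    l = random.uniform(-1, 1)
    Dq = [abs(xb - x) for xb, x in zip(X_best, X)]
    return [d * math.exp(b * l) * math.cos(2 * math.pi * l) + xb
            for d, xb in zip(Dq, X_best)]
```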

Reinforcement learning algorithm
Reinforcement learning was first proposed by Minsky in 1954 and mainly consists of agent, environment, state, action and reward components [21].
Reinforcement learning is a biologically inspired class of machine learning algorithms that aims to learn, through experimentation within the possible state-action pairs, a mapping from states to actions that maximizes the cumulative reward [22]. In reinforcement learning, an agent interacts with its environment by exploring and making decisions based on the present state. The agent first observes the current state St, then makes an action decision at based on the perceived state. The environment changes its state from St to St+1 in response to the agent's action and returns a reward (or punishment) signal rt to the agent. The agent adjusts its action decisions based on the reward feedback from the environment and trains itself to maximize current and future rewards. This process is called a Markov decision process; its basic principle is shown in Figure 1. Q-learning and SARSA are both value-based reinforcement learning algorithms. Their goal is to find the optimal policy by learning and optimizing the value function. Q-learning is an off-policy algorithm based on a greedy strategy, which learns the optimal value function by updating state-action pairs. At each time step, the agent observes the current state and selects the next action based on the current policy function and value function. The agent then observes the next state St+1 and receives the corresponding immediate reward rt. SARSA, on the other hand, is an on-policy algorithm, which selects the next action and learns based on the current state and policy function. SARSA's learning process is therefore continuous and constantly updated, allowing it to dynamically adapt to changes in the environment [23].
Specifically, the value function update formulas for Q-learning and SARSA are shown in Eqs (19) and (20), where Q(s, a) represents the value function of taking action a in state s, α is the learning rate, γ is the discount factor, r is the immediate reward and the max over a' denotes taking the maximum value among all possible actions a' in the next state S'.
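The two updates differ only in their bootstrap term, which a short sketch makes concrete (illustrative names; the tabular Q is stored as a nested list indexed by state then action):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update, Eq (19): bootstrap with the max over next actions."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update, Eq (20): bootstrap with the action actually taken."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```

Q-learning bootstraps with the best next action regardless of what the policy does, while SARSA bootstraps with the action the policy actually selects.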

Variable neighborhood search algorithm
The VNS algorithm is a heuristic optimization algorithm based on neighborhood search that can effectively solve many complex optimization problems. The original proposal of the algorithm can be attributed to Mladenovic and Hansen. It has gained extensive utilization in subsequent research endeavors [24]. The principle of the VNS algorithm is to search on different neighborhood structures and gradually approach the optimal solution by continuously expanding or reducing the neighborhood structure. During the search process, the VNS algorithm jumps out of local optimal solutions and seeks better solutions.
The main steps of the VNS algorithm are as follows: Step 1. Initialization: Randomly generate an initial solution and set the initial neighborhood structure.
Step 2. Neighborhood structure: Generate new solutions by changing the current neighborhood structure. In each neighborhood structure, define a set of operations, such as insertion, deletion, exchange, etc., to generate new solutions.
Step 3. Neighborhood search: Search in the current neighborhood structure to find the best solution. If a better solution is found, go to Step 4. Otherwise, go to Step 5.
Step 4. Neighborhood expansion: Expand the neighborhood structure to better search for possible solutions.
Step 5. Neighborhood contraction: Contract the neighborhood structure to better search for possible solutions.
Step 6. Convergence check: Check if the algorithm has converged. If not, go back to Step 2. Otherwise, output the optimal solution.
The core idea of the VNS algorithm is to continuously expand and contract the neighborhood structure to better search for possible solutions. In each neighborhood structure, a set of operations is defined and the best solution is selected based on a greedy strategy.
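The steps above can be condensed into a minimal sketch (illustrative, for a minimization problem; `neighborhoods` is assumed to be an ordered list of move operators, from smallest to largest):

```python
def vns(f, x0, neighborhoods, max_iter=100):
    """Basic VNS: shake in neighborhood k, keep improvements, otherwise
    move to the next (larger) neighborhood; restart at k = 0 on success."""
    best, best_val = x0, f(x0)
    for _ in range(max_iter):
        k = 0
        while k < len(neighborhoods):
            candidate = neighborhoods[k](best)
            val = f(candidate)
            if val < best_val:              # improvement: contract to k = 0
                best, best_val = candidate, val
                k = 0
            else:                           # no improvement: expand
                k += 1
    return best, best_val
```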

Improved algorithm
The three behaviors of the WOA have a crucial impact on finding the optimal position, while the value of the inertia weight also plays a vital role in the optimization and search capability of the algorithm. The IWOA with dynamic inertia weight proposed in paper [13] introduces an inertia weight value in the surrounding prey and bubble-net attack behaviors, as shown in Eqs (21) and (22). Although this accelerates the convergence speed and improves the convergence capability of the algorithm, the inertia weight value is simply linearly decreased based on the current iteration, which may not be suitable for the current population. Therefore, this paper improves the IWOA algorithm by using reinforcement learning to optimize the control of the inertia weight value, making it more suitable for the current population and enhancing the convergence speed and optimization capability of the algorithm. Additionally, the VNS algorithm is introduced to improve the local search capability of the algorithm and obtain better optimal solutions.

The design of the Q-table
The initial Q-table is a zero matrix of size m × n, where m is the number of states and n is the number of actions. When the environment and actions change, the Q-table is updated according to Eqs (19) and (20), as shown in Eq (23).
According to the results in [25], the SARSA algorithm has a faster convergence rate, while Q-learning has better overall performance. Moreover, [23] has verified that combining the SARSA and Q-learning algorithms yields better convergence. The algorithm presented in this study utilizes both Q-learning and SARSA, but employs them at separate stages, as illustrated in Eq (24), where tmax represents the total number of iterations.

The design of the states
To ensure that the WOA obtains better optimization capability and faster convergence speed with appropriate inertia weight values, the state design of the reinforcement learning algorithm needs to be considered. The design of the state should take into account the convergence, diversity and balance of the WOA. Therefore, the following aspects are taken into account in the design of the state: In these equations, t represents the iteration number of the algorithm, f(xi^t) represents the fitness function value of the i-th individual in the t-th iteration and Ct represents the ratio of the sum of fitness values of all individuals in the t-th iteration to that in the initial iteration, which reflects the convergence of the algorithm. Dt represents the ratio of the maximum fitness value of the t-th generation to that of the first generation, which reflects the diversity of the algorithm. Bt represents the ratio of the mean value to the standard deviation of each generation, which reflects the balance of the population in each generation. Equation (28) calculates the state value of each generation as a weighted sum. Considering the importance of the convergence and diversity of the algorithm, ω1 and ω2 are set to 0.35 and ω3 is set to 0.3.
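Assuming the prose definitions above (Eqs (25)-(28) are not reproduced in this excerpt, and whether the standard deviation is the sample or population version is our assumption), the state value could be computed as:

```python
import statistics

def state_value(fitness_t, fitness_0, w=(0.35, 0.35, 0.3)):
    """State S_t built from the three indicators described in the text.
    fitness_t / fitness_0: fitness values of generation t / generation 0."""
    C_t = sum(fitness_t) / sum(fitness_0)    # convergence indicator
    D_t = max(fitness_t) / max(fitness_0)    # diversity indicator
    B_t = statistics.mean(fitness_t) / statistics.stdev(fitness_t)  # balance
    return w[0] * C_t + w[1] * D_t + w[2] * B_t
```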

The design of the actions
An action refers to the agent's response, which is determined by the present state. With each successive population iteration, the agent selects a suitable inertia weight value based on the environment. Larger values of ω may cause the algorithm to be trapped in a local optimum, while smaller values may impair the algorithm's global search ability. Therefore, ω is defined by 10 actions covering (0, 1): the first action, a1, generates a random number from (0.0, 0.1), the second action, a2, generates a random number from (0.1, 0.2) and so on. The detailed action values are shown in Table 1.
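The ten-interval action set can be illustrated as follows (a sketch only; Table 1 itself is not reproduced here, and the function name is ours):

```python
import random

def sample_omega(action_index):
    """Action a_{i+1} draws omega uniformly from [0.1*i, 0.1*(i+1)),
    i = 0..9, matching the ten intervals described for Table 1."""
    lo = 0.1 * action_index
    return random.uniform(lo, lo + 0.1)
```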

The design of the rewards
The agent does not choose actions arbitrarily, but selects the appropriate action based on the Q-table and the current state, in order to obtain more positive feedback. Designing a reward function as shown in Eq (29) can simultaneously take into account the convergence, diversity and balance of the algorithm, improving its search capability. The goal of this paper is to minimize the function value, and the smaller the state value, the better the performance of the algorithm. Therefore, when St-1 is greater than St, the reward is positive; otherwise it is negative.

Action selection strategy
When the algorithm starts, the values in the Q-table are initialized to zero, which means the agent has no experience to rely on and must explore and learn through experience. By continuously investigating unknown environments, the agent gains more experience and learns valuable knowledge to inform its actions. The ε-greedy strategy is a method that balances exploration and exploitation, as shown in Eq (30).
Here, ε represents the greedy rate and k is a randomly generated number within the range of 0 to 1. When ε ≥ k, the agent chooses the action that maximizes the Q value, also known as the greedy strategy. When ε < k, exploration is performed and a random action is chosen.
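A minimal sketch of Eq (30), keeping the paper's convention that the greedy branch is taken when ε ≥ k (function and variable names are ours):

```python
import random

def epsilon_greedy(Q_row, epsilon):
    """Select an action from one row of the Q-table: greedy with
    probability epsilon, random exploration otherwise."""
    k = random.random()
    if epsilon >= k:
        return max(range(len(Q_row)), key=lambda a: Q_row[a])  # greedy
    return random.randrange(len(Q_row))                        # explore
```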

The design of neighborhoods
The objective of this paper is to minimize the optimization problem. Therefore, the VNS is designed to expedite the discovery of the global minimum by exploring various neighborhoods. The three neighborhoods are designed as follows:
1) Randomly choose a variable and reduce its value by a certain amount.
2) Randomly choose a variable and multiply it by a randomly generated number within the range of 0 to 1.
3) Randomly select two variables and swap their positions.
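The three neighborhood moves could be sketched as follows (the reduction amount `step` in the first move is our placeholder, since the text only says "a certain amount"):

```python
import random

def shrink(x, step=0.1):
    """Neighborhood 1: reduce one randomly chosen variable by a fixed amount."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] -= step
    return y

def scale(x):
    """Neighborhood 2: multiply one variable by a random factor in [0, 1)."""
    y = list(x)
    i = random.randrange(len(y))
    y[i] *= random.random()
    return y

def swap(x):
    """Neighborhood 3: swap the positions of two randomly chosen variables."""
    y = list(x)
    i, j = random.sample(range(len(y)), 2)
    y[i], y[j] = y[j], y[i]
    return y
```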

The algorithm processes
Combining the reinforcement learning algorithm, the VNS algorithm and the WOA requires considering the reward, state, action and action selection strategy. The WOA is treated as the environment, the state S is calculated based on Eq (28) and at each iteration St is updated to St+1. The learning component comprises the agent and the reward r. The entire procedure can be divided into four sequential steps. To begin with, the agent obtains the environment state St for the t-th iteration, then chooses an action based on Eq (30) and adjusts the ω value. The WOA then iterates using the updated ω. After completing one iteration, the environment state transitions from St to St+1. Lastly, the reward r is calculated based on Eq (29) and the Q-table value is updated by Eq (19) or Eq (20). After t iterations, the agent selects the optimal ω for the current state based on prior exploration experience. The algorithm flowchart of the RLVWOA is shown in Figure 2.

Comparative validation
To verify the feasibility of the RLVWOA, twenty standard benchmark functions were selected for testing [26], as shown in Table 2, and the RLVWOA was compared with the reptile search algorithm (RSA) [27], snake optimization (SO) [28], WOA, IWOA, MSWOA and MWOA. To ensure the fairness of the experiment, all algorithms were run on the same computer with a population size of N = 30, dimension D = 30 and number of iterations tmax = 300; the other parameter settings for each algorithm are shown in Table 3. Each test function was run 30 times with each algorithm. The comparative results are shown in Table 4 and the running time of each algorithm is shown in Table 5. The highlighted entries denote the algorithms that achieved the best performance on each test function. According to the results in Tables 4 and 5, although the RLVWOA exhibits a longer running time and fails to converge to the theoretical optimal values on some test functions such as F5, F6 and F9, it demonstrates relatively better convergence accuracy and attains the best mean ranking. For the sake of brevity, this paper only presents the convergence figures of F1, F4, F9, F12, F17 and F20, which include two unimodal, two multimodal and two fixed-dimension test functions. To make these figures more intuitive, we use the same initial population and set tmax = 50.
As shown in Figure 3, although the RLVWOA requires more running time, it demonstrates better convergence performance, converging faster than the other algorithms. This fully demonstrates that the RLVWOA, which combines the reinforcement learning algorithm and the VNS algorithm, can effectively remedy the unstable optimization performance of the WOA.

Model establishment
The problem of time-optimal trajectory planning for manipulators can be cast as a constrained optimization problem of finding a minimum value. It relies heavily on the algorithm's search capability to navigate the vast solution space and identify the optimal trajectory that minimizes the completion time while satisfying the constraints imposed by the manipulator's dynamics and task requirements. The efficiency of the optimization algorithm plays a pivotal role in achieving time-optimal solutions, ensuring the manipulator's swift and precise execution of tasks in various industrial applications. To facilitate understanding and avoid the need to learn about different manipulator structures, this paper uses the common PUMA560 manipulator as the model for trajectory planning. Its modified D-H parameters and kinematic constraints are shown in Tables 6 and 7, respectively. The goal of this paper is to find the time-optimal trajectory for the manipulator; therefore, the fitness function of the algorithm is defined as depicted in Eq (31), where f denotes the overall execution duration of the manipulator and ti represents the time to reach the i-th path point.
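A hedged sketch of the Eq (31) objective, treating the decision variables as the time intervals between consecutive path points and leaving the kinematic-constraint check (the limits of Table 7) as a user-supplied predicate; all names here are illustrative, not the paper's:

```python
def fitness(intervals, penalty=1e6, constraints_ok=lambda h: True):
    """Total travel time as the sum of the time intervals between
    consecutive path points, with a penalty for infeasible candidates."""
    total = sum(intervals)
    if any(h <= 0 for h in intervals) or not constraints_ok(intervals):
        return total + penalty          # penalize infeasible candidates
    return total
```

A penalty term is one common way to keep a metaheuristic like the RLVWOA inside the feasible region without modifying its update equations.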
Based on the data in Tables 6 and 7, the selected path points that satisfy the kinematic constraints are shown in Table 8. Substituting these path points into Eq (1) and Eq (31), the time-optimal trajectory planning for the manipulator is conducted.

Trajectory planning
The time-optimal trajectory planning for the manipulator is conducted using the RLVWOA. To further validate the performance of the algorithm, the RSA, SO, WOA, IWOA, MSWOA and MWOA algorithms are also applied to the trajectory planning of the manipulator. Each algorithm uses the same number of iterations tmax = 300 and population size N = 30, while the other parameters are taken from Table 3. The specific results are shown in Table 9, where the results obtained by the RLVWOA are highlighted in bold. The convergence comparison is shown in Figure 4. Based on the data in Table 9 and Figure 4, it can be observed that the RLVWOA achieves superior results in terms of obtaining the shortest running trajectory for the manipulator: compared to the standard WOA, it achieves a reduction of 39.39% and compared to the other improved WOAs it achieves a reduction of at least 11.51%. Additionally, the RLVWOA demonstrates faster convergence speed, further validating the superior search capability of the proposed algorithm. The trajectory planning plot is depicted in Figure 5.
According to Figure 5, all curves are uniform, continuous and devoid of any abrupt changes. Furthermore, they adhere to the kinematic constraints outlined in Table 7. Therefore, it can be concluded that the RLVWOA is capable of obtaining a superior time-optimal trajectory.

Conclusions
This paper proposes an improved algorithm, the RLVWOA, which combines reinforcement learning to enhance global search capability and introduces the VNS algorithm to improve local search capability. A comparison with other algorithms demonstrates the superior performance of the RLVWOA. Subsequently, the RLVWOA is employed in conjunction with the quintic NURBS curve for trajectory planning of the manipulator. The result is a smooth, uniform and continuous trajectory, which outperforms the results obtained by other optimization algorithms in terms of reducing the manipulator's operation time.
The main contribution of this paper is the proposal of an improved algorithm, the RLVWOA, which exhibits superior search capability compared to other algorithms. However, there are still some issues to be addressed in future work. This paper combines only tabular reinforcement learning algorithms; it would be worthwhile to explore deep reinforcement learning algorithms such as the deep Q-network (DQN), deep deterministic policy gradient (DDPG) and twin delayed deep deterministic policy gradient (TD3) algorithms as potential alternatives. Additionally, while the introduction of the VNS algorithm has improved the search capability, it has also significantly increased the algorithm's runtime. Future work could involve designing more suitable neighborhoods or adding termination thresholds to control the runtime of the algorithm.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.