Bio-Inspired Optimization Algorithm Associated with Reinforcement Learning for Multi-Objective Operating Planning in Radioactive Environment

This paper addresses the multi-objective operating planning problem in radioactive environments. First, a more realistic radiation dose model is constructed that accounts for the difficulty level of the task at each operating point. Based on this model, the multi-objective operating planning problem is converted to a variant of the traveling salesman problem (VTSP). Second, a novel combinatorial algorithm framework, the hyper-parameter adaptive genetic algorithm (HPAGA), is proposed; it integrates bio-inspired optimization with reinforcement learning, allowing the hyper-parameters of the GA to be adjusted adaptively so that optimal solutions are obtained efficiently. Third, comparative studies demonstrate the superior performance of the proposed HPAGA against classical evolutionary algorithms on various TSP instances. Additionally, a case study in a simulated radioactive environment suggests potential applications of HPAGA.


Introduction
Nuclear energy has been widely applied in various developed countries, as well as in several developing countries, including China [1]. Consequently, a growing number of humans, robots, and other agents are employed to operate nuclear facilities, which increases the risk of nuclear exposure [2]. Although protective equipment can shield agents from much of the radiation dose, working in a radioactive environment remains harmful to human health and to the stability and reliability of robots [3]. Therefore, for the path planning problem in a radiation environment, one of the crucial goals is to provide an optimal path that traverses all the operating points with the lowest cumulative radiation dose [4]. Note that this traversing issue is defined as a multi-objective operating planning problem, which is distinct from a multi-objective optimization problem.
In overhauling or accident-response scenarios, people or robots must traverse all the operating points and then return to the origin. Therefore, determining an operating sequence with the minimal radiation dose, namely the multi-objective operating planning problem, is also important for the path planning process. Note that this issue is similar to the standard traveling salesman problem (TSP). Wang et al. proposed an improved particle swarm optimization combined with a chaos optimization algorithm to reduce the effective radiation dose when agents traverse all the nodes [5]. Xie et al. combined an improved ACO algorithm with a chaos optimization algorithm to solve the multi-objective inspection path-planning problem [6]. Although both methods are effective for radiation path planning, the multi-objective operating planning problem can be modeled more realistically by taking the task difficulty at each operating point, i.e., the operating time, into consideration. Compared to the classic TSP, the cost between two operating points is then not a simple Euclidean distance but a compound metric combining the cumulative dose and the consumed operating time. Therefore, the multi-objective operating planning problem can be modeled as a variant of the traveling salesman problem (VTSP).
This paper aims to solve the multi-objective operating planning problem, whose primary part is path planning in a radiation environment with multiple operating points of different difficulty levels and multiple radiation sources of different dose rates. Further, a modified genetic algorithm (GA) associated with reinforcement learning (RL), namely the hyper-parameter adaptive genetic algorithm (HPAGA), is provided to solve the radiation VTSP more efficiently. In practical terms, the proposed methodology protects people and robots from excessive radiation doses, which is of considerable importance as the construction of the nuclear power industry continues to develop rapidly.
There are three primary contributions, listed as follows:
1. A more complicated multi-objective operating planning problem model in the radiation environment is constructed compared to [6]. Specifically, this model considers the operating difficulty level at each operating point, ignored entirely in [6], which influences the time to complete each operating task and hence the cumulative radiation dose. Therefore, the newly constructed model is closer to engineering practice.
2. A combinatorial algorithm framework consisting of a bio-inspired optimization algorithm and reinforcement learning is provided, where the hyper-parameters of the GA, including the crossover probability, mutation probability, and population size, are adjusted by RL during the iterative process in order to solve the VTSP more efficiently.
3. Comparative tests between the proposed HPAGA and several classical evolutionary computing algorithms on TSP instances of diverse scales are conducted to demonstrate the superior performance of the proposed hybrid algorithm.
The rest of this paper is organized as follows: Section 2 gives a brief overview of the related work. The model of the multi-objective operating planning problem in the radiation environment is constructed in Section 3. The combinatorial algorithm framework is described in Section 4. A series of comparative experiments between the proposed method and other classical methods is recorded in Section 5. Besides, a case study on a simulated nuclear facilities inspection task is conducted in Section 6. Finally, the conclusion and future work are presented in Section 7.

Related Work
Recently, plentiful path planning and operating planning methods have been proposed for radiation environments to minimize cumulative radiation doses during overhauling or accident-response stages [7]. Graph searching, a typical method for path planning, has been employed in radioactive environments. Liu et al. proposed an A* algorithm to plan a walking path with a minimum dose. Similarly, several sampling-based exploration methods have been utilized in path planning to reduce the radiation dose [8]. Chao et al. proposed a grid-based rapidly exploring random tree star (RRT*) method to protect workers from nuclear exposure as much as possible [9]. Evolutionary computing algorithms and their variants are also widely used for this issue. For instance, Zhang et al. proposed a hybrid algorithm consisting of an improved ant colony optimization (ACO), the A* algorithm, and particle swarm optimization [2,10]. Meanwhile, Lee et al. provided a conflict-based search approach for multiple agents to find their respective optimal paths in the radiation environment [11]. The aforementioned methods aim at finding an optimal path from a start point to a destination point, neglecting possible multiple operating points.
Different from the aforementioned planning issues in the radiation environment, this paper focuses on the multi-objective operating planning problem, which is regarded as a VTSP. Note that the TSP is a typical combinatorial optimization problem and is NP-hard [12]. Algorithms for the TSP can be roughly classified into three categories, i.e., exact algorithms, heuristic algorithms, and bio-inspired optimization algorithms [13]. Applegate et al. proposed the Concorde algorithm by modeling the TSP as a mixed-integer programming problem solved with a branch-and-cut algorithm [14]; to the best of our knowledge, it is one of the best exact solvers [15]. Meanwhile, LKH-3 is a state-of-the-art heuristic algorithm for the TSP, which combines local search with k-opt operators to reduce the exploration space [16]. However, both exact solvers and heuristic methods are time-consuming to obtain satisfactory solutions. In contrast, bio-inspired optimization algorithms, as representative approximate algorithms, can obtain acceptable TSP solutions within a short running time; examples include the GA [17,18], the wolf search algorithm [19], and the rat swarm optimizer [20]. Among them, the GA is a popular optimization technique that mimics the process of natural selection [21]. However, it is difficult to effectively set up its hyper-parameters, including the crossover probability, mutation probability, and population size [22]. Recently, several hybrid algorithms combining evolutionary computing with reinforcement learning have been proposed to solve NP-hard problems [23,24]. Inspired by this idea [25], reinforcement learning is employed in this paper to adjust the hyper-parameters of the GA according to the fitness of the population, so as to speed up convergence and avoid local minima.

Radiation Dose Model
In the radioactive environment, suppose that there are N radiation sources R_k with different dose rates, represented by Dr(R_k), located in the XOY plane as shown in Figure 1. The radiation dose rate derived from each radiation source is inversely proportional to the square of the distance. Therefore, the dose rate at a certain point P_i exposed to multiple radiation sources is obtained as

D(P_i) = Σ_{k=1}^{N} Dr(R_k) / |P_i R_k|^2,

where |P_i R_k| denotes the distance between points P_i and R_k. The cumulative dose, which is related to the exposure time, is the crucial cause of harm to people and robots. With respect to the multi-objective operating planning problem in the radioactive environment, the cumulative dose between two operating points P_i and P_k consists of two primary parts, namely the locomotion cumulative dose and the operating stay cumulative dose:

C_r(P_i, P_k) = C_rl(P_i, P_k) + C_ro(P_k),

where C_rl(P_i, P_k) denotes the locomotion cumulative dose between P_i and P_k, and C_ro(P_k) denotes the operating stay cumulative dose at P_k. The radiation dose rate map with six radiation sources is intuitively illustrated in Figure 2. Concretely, the locomotion cumulative dose is accumulated while moving from one operating point to the next; sampling the segment at n equipartition points M_j (Figure 3), it can be calculated by

C_rl(P_i, P_k) = Σ_{j=1}^{n} D(M_j) · |P_i P_k| / (n v),

where n is the resolution factor representing the number of equipartition points and v denotes the velocity of the agent, which is a constant in this paper. Meanwhile, the operating stay cumulative dose is derived by

C_ro(P_k) = D(P_k) · T_s(P_k),

where T_s(P_k) represents the time spent operating at P_k, which is related to the difficulty of the operating task. Note that this radiation dose model is more complex than that of [6], since the operating difficulty is taken into consideration when computing the cumulative dose.
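The dose model above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names, the representation of sources as ((x, y), rate) pairs, and the midpoint sampling of the equipartition points are our assumptions.

```python
import math

def dose_rate(point, sources):
    """Dose rate at `point` under the inverse-square law:
    D(P) = sum_k Dr(R_k) / |P R_k|^2, with `sources` a list of
    ((x, y), rate) pairs."""
    x, y = point
    return sum(rate / ((x - sx) ** 2 + (y - sy) ** 2)
               for (sx, sy), rate in sources)

def cumulative_dose(p_i, p_k, sources, t_stay, n=10, v=1.0):
    """C_r(P_i, P_k) = C_rl(P_i, P_k) + C_ro(P_k): the locomotion dose,
    sampled at n equipartition points along the straight segment,
    plus the operating stay dose at the destination."""
    dx, dy = p_k[0] - p_i[0], p_k[1] - p_i[1]
    seg_time = math.hypot(dx, dy) / (n * v)   # travel time per sub-segment
    c_rl = 0.0
    for j in range(n):
        t = (j + 0.5) / n                     # midpoint of sub-segment j
        c_rl += dose_rate((p_i[0] + t * dx, p_i[1] + t * dy), sources) * seg_time
    c_ro = dose_rate(p_k, sources) * t_stay   # stay dose at p_k
    return c_rl + c_ro
```

Because the stay dose belongs to the destination point, a cost matrix built from this function is asymmetric whenever stay times differ, which is exactly what makes the case study later in the paper an asymmetric VTSP.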

VTSP Formulation
In this paper, the multi-objective operating planning problem in the radiation environment is modeled as a variant TSP, where the Euclidean distance between any two nodes is replaced by the cumulative radiation dose. Similar to the typical TSP, the purpose is to find a traversing sequence of operating points with the minimum cumulative radiation dose, where the agent launches from the origin, passes through every operating point exactly once, and finally returns to the origin.
Suppose that there are K operating points {P_1, P_2, ..., P_K} in the radioactive scenario. The traversing sequence is defined as

Γ = {B_o, P_(1), P_(2), ..., P_(K), B_o},

where B_o denotes the origin point and P_(i) is the i-th visited operating point. Then, the total cumulative radiation dose over the whole process is described as

C_t(Γ) = Σ_{i=0}^{K} C_r(Γ_i, Γ_{i+1}),

where C_t(Γ) denotes the total cumulative dose of a certain sequence Γ, with Γ_0 = Γ_{K+1} = B_o. Furthermore, the optimal sequence with the minimal cumulative dose is obtained by

Γ* = argmin_Γ C_t(Γ),

where exchanging the order of the operating points drives the total cumulative dose toward the optimum. So far, the radiation dose model for the multi-objective operating planning problem has been formulated. In what follows, the proposed HPAGA is introduced to solve this VTSP effectively.
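For tiny instances, the formulation above can be checked by exhaustive search. The sketch below uses our own helper names and indexing convention (indices 1..K for operating points, 0 for B_o) to evaluate C_t(Γ) and the argmin over all orders:

```python
from itertools import permutations

def tour_cost(sequence, cost):
    """Total cumulative dose C_t for one traversing sequence.
    `cost[i][j]` is the cumulative dose from point i to point j
    (index 0 is the origin B_o); the matrix may be asymmetric."""
    tour = [0, *sequence, 0]              # launch from and return to B_o
    return sum(cost[a][b] for a, b in zip(tour, tour[1:]))

def brute_force_optimum(k, cost):
    """argmin over all K! operating orders -- only viable for tiny K."""
    return min(permutations(range(1, k + 1)),
               key=lambda s: tour_cost(s, cost))
```

For realistic K the factorial search space is exactly why the paper turns to HPAGA instead of enumeration.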

Algorithm Framework
HPAGA is a combinatorial optimization algorithm based on the genetic algorithm and reinforcement learning, which can be utilized to solve TSP and VTSP problems. It mainly consists of two parts, i.e., a GA and RL based on Q-learning. Specifically, the hybrid algorithm possesses satisfactory search capability by virtue of the evolutionary pattern of the genetic algorithm, while reinforcement learning dynamically adjusts three crucial hyper-parameters of the GA: the crossover rate, the mutation rate, and the population size. This adaptive mechanism allows HPAGA to find the optimal path more quickly and effectively during the search process. The proposed algorithm framework is shown in Figure 4. There are three sub-agents, namely the crossover agent, the mutation agent, and the population agent, which are responsible for adjusting the crossover rate P_c, the mutation rate P_m, and the population size Pop of the GA, respectively. The reinforcement learning process of HPAGA can be divided into five steps:
• Step 1: The agent obtains the current state S_t from the GA by computing the population fitness in a designed way. The formulation of the state space is detailed below.
• Step 2: HPAGA selects and executes the corresponding action according to the action selection policy in reinforcement learning, adjusting the crossover rate, mutation rate, and population size of the current GA.
• Step 3: Execute the GA with the updated crossover rate, mutation rate, and population size to reach the new state S_{t+1}.
• Step 4: Calculate the reward R_{t+1} for the transition from state S_t to state S_{t+1}. The reward estimation method is introduced below.
• Step 5: Update the knowledge of the agent according to the states S_t and S_{t+1}, the reward R_{t+1}, and the action selection policy via Q-learning.
Through a certain number of reinforcement learning iterations, continuously obtaining states, executing actions, receiving reward feedback, and improving its policies, HPAGA optimizes the crossover rate, mutation rate, and population size based on past learning experience to improve the efficiency of the GA.

Genetic Algorithm
GA imitates the process of selection, crossover, and mutation in biological evolution, and searches different solutions through continuous evolution to find the individual with the highest fitness.
Each individual of the VTSP is an operating point sequence

ξ_i = {B_o, P_(1), P_(2), ..., P_(K), B_o},  i = 1, 2, ..., Pop,

where B_o represents the starting point, P_(i) denotes an operating point, and Pop is the population size. The initial population is generated randomly by the initialization module, and each individual represents a feasible operating route obtained by randomly shuffling the operating point order. This process ensures that the population contains a considerable number of random routes, providing abundant individuals for the subsequent optimization process.
The objective of the VTSP is to find the operating sequence with the lowest cumulative dose for the human or robot. The fitness is determined by the cumulative dose of each individual; specifically, the fitness f(ξ_i) is the reciprocal of the total cumulative dose of the individual:

f(ξ_i) = 1 / C_t(ξ_i).

It is significant to choose an effective crossover operator when solving the VTSP. Following reference [26], the sequential constructive crossover (SCX) operator is utilized to improve the traditional GA. The advantage of the SCX operator is that the generated offspring relatively retain the high-quality information of the parents, such as superior operating point orders and lower cumulative doses, which reduces the possibility of generating unreasonable offspring paths.
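A minimal sketch of the SCX idea follows; this is our simplification of the operator in [26], with our own helper names, `cost` standing for the cumulative-dose matrix, and ties broken toward the first parent. Each step extends the offspring with the cheaper of the two parents' next unvisited cities:

```python
def scx(parent1, parent2, cost):
    """Sequential constructive crossover (sketch after [26]).
    Parents are tours over the same city set; `cost[i][j]` is the
    cumulative dose from i to j. Builds one offspring by choosing,
    between each parent's next legitimate (unvisited) city, the one
    cheaper to reach from the current city."""
    def next_legit(parent, current, visited):
        # first unvisited city appearing after `current` in this parent
        i = parent.index(current) if current in parent else -1
        for city in parent[i + 1:]:
            if city not in visited:
                return city
        # fall back to the first unvisited city in sequential order
        for city in sorted(set(parent1) - visited):
            return city
        return None

    offspring = [parent1[0]]
    visited = {parent1[0]}
    while len(offspring) < len(parent1):
        p = offspring[-1]
        a = next_legit(parent1, p, visited)
        b = next_legit(parent2, p, visited)
        nxt = a if cost[p][a] <= cost[p][b] else b
        offspring.append(nxt)
        visited.add(nxt)
    return offspring
```

The visited-set bookkeeping is what guarantees a valid permutation, i.e., no operating point is dropped or duplicated in the offspring route.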

Multi-Parameter Adaptive Reinforcement Learning
The reinforcement learning algorithm based on Q-learning is a value-based learning method, which aims to enable agents to learn how to make optimal behavioral decisions in specific environments.The Q-learning algorithm mainly includes several key concepts, i.e., Q-value table, state, action, reward, and policy.
The Q-value table records the Q-values learned by the agent, where each row represents a state, each column represents an action, and all values are initialized to zero. The Q-value represents the benefit of selecting the corresponding action in the current state. It is updated based on the current state S_t, the next state S_{t+1}, the selected action A_t, the next prospective action A_{t+1}, and the next reward R_{t+1}:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_{A_{t+1}} Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)],

where Q(S_t, A_t) represents the Q-value of selecting action A_t in state S_t, α is the learning rate, R_{t+1} is the reward obtained in the transition from state S_t to state S_{t+1}, γ is the discount factor, and max_{A_{t+1}} Q(S_{t+1}, A_{t+1}) is the maximum Q-value in the row of state S_{t+1} of the Q-value table.
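The update rule can be written as a one-step helper. This is a sketch using a dict-based Q-table keyed by (state, action); the function name is ours, and the default α and γ are taken from the experimental settings reported later in the paper.

```python
def q_update(Q, s_t, a_t, r_next, s_next, actions, alpha=0.75, gamma=0.2):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    `Q` is a dict keyed by (state, action); unseen entries default to 0."""
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    old = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = old + alpha * (r_next + gamma * best_next - old)
    return Q[(s_t, a_t)]
```

In HPAGA each of the three sub-agents (crossover, mutation, population) would keep its own such table over its own action set.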
With respect to the proposed HPAGA, the state S_t of the agent consists of three factors: the relative fitness of the current population's best individual S_{t,1}, the relative average fitness of the population S_{t,2}, and the relative diversity of the population S_{t,3}. The state for HPAGA is then defined as the weighted combination

S_t = ω_1 S_{t,1} + ω_2 S_{t,2} + ω_3 S_{t,3},

where the three sub-states are measured relative to the initial generation. Here, ξ_i^1 represents the i-th individual of the initial generation, ξ_i^p denotes the i-th individual of the p-th generation, ξ^1 represents all individuals of the initial generation, ξ^p represents all individuals of the p-th generation, Pop_p is the population size of the p-th generation, and Pop_1 is that of the initial generation. Besides, ω_1, ω_2, and ω_3 are positive weights that adjust the importance of the three fitness factors and satisfy ω_1 + ω_2 + ω_3 = 1. In the proposed HPAGA, the weights are set to 0.4, 0.3, and 0.3, respectively.
According to the aforementioned state calculation, the state space is continuous. To ensure a constructible Q-table and a satisfactory convergence speed, the state space is deliberately converted to a discrete one. Concretely, the state space is divided into a certain number of intervals; if the value of S_t falls into an interval, S_t is assigned the characteristic value of that interval. For instance, with the state space divided into 20 intervals: when S_t ∈ [0, 0.05], S_t ← s(1); when S_t ∈ (0.05, 0.1], S_t ← s(2); and so on, until S_t ∈ (0.95, +∞), where S_t ← s(20).
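The interval mapping can be sketched as follows. The helper is hypothetical; the label format s(i) follows the text, and values beyond the last interval edge are clamped into s(20).

```python
def discretize_state(s, width=0.05, n_bins=20):
    """Map the continuous weighted state S_t onto one of n_bins interval
    labels s(1)..s(n_bins); values beyond the last edge fall into the
    final bin, matching the [0.95, +inf) interval in the text."""
    idx = min(int(s / width) + 1, n_bins)
    return f"s({idx})"
```

With the labels as Q-table row keys, the continuous fitness-based state becomes a finite state space over which tabular Q-learning can operate.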
With respect to the action space, the ranges of the crossover rate, mutation rate, and population size are divided into a certain number of intervals so as to construct discrete actions for each agent. The crossover rate ranges from 0.4 to 0.9, the mutation rate from 0.01 to 0.21, and the population size from 50 to 500. Note that the number of intervals can be chosen according to the performance of the algorithm or from experience.
The state transition reward function is designed specifically for each reinforcement learning agent based on the best individual fitness and the population's average fitness. The reward R_{t+1,cross} for the crossover agent and the reward R_{t+1,mutation} for the mutation agent are constructed from the changes in these fitness measures between successive generations, and the reward for the population agent is a weighted combination of the two: R_{t+1,population} = 0.5 R_{t+1,cross} + 0.5 R_{t+1,mutation}.
In this paper, the ϵ-greedy strategy is adopted to select actions. The agent exploits, i.e., selects the action with the best Q-value based on known information, with probability ϵ_o, and explores, i.e., selects a random action, with probability 1 − ϵ_o. The action selection strategy is

π(S_t) = argmax_A Q(S_t, A) with probability ϵ_o, or a random action with probability 1 − ϵ_o,

where ϵ_o ∈ (0, 1) is a threshold value.
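The paper's convention (exploit with probability ϵ_o, explore with probability 1 − ϵ_o, consistent with the greedy rate of 0.85 used in the experiments) can be sketched as follows; the helper name and dict-based Q-table are our assumptions:

```python
import random

def select_action(Q, state, actions, eps_o=0.85, rng=random):
    """Epsilon-greedy selection following the paper's convention:
    exploit the best-Q action with probability eps_o, otherwise
    explore with a uniformly random action. Note this is the inverse
    of the more common convention of exploring with probability eps."""
    if rng.random() < eps_o:
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    return rng.choice(actions)
```

Passing an explicit `rng` makes the exploration reproducible with a seeded `random.Random` instance, which matches the fixed-seed protocol of the ablation experiments.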

Experimental Results
In this section, experiments on different conventional TSP instances are conducted to verify the superiority of the proposed HPAGA.

Experimental Setup
The test instances in this study are chosen from the widely used TSP instance library TSPLIB [27]. To demonstrate the effectiveness of the algorithm on datasets of different scales, six instances, namely att48, berlin52, st70, eil76, gr96, and eil101, are selected; all of them use the two-dimensional Euclidean distance metric. With respect to the software and hardware configurations, Python 3.7.16 is employed, and the experimental machine has an Intel Core i5-9300H processor, 8 GB of RAM, and the Windows 10 operating system.
An overly large population size results in an unmanageable computational load, while a too-small population may suffer from insufficient diversity. To strike a balance, the initial population size for this task is set to 1000. Too low a crossover rate hinders the proper inheritance of beneficial genes, whereas an excessively high mutation rate can compromise population quality. Consequently, based on empirical observations, the initial crossover rate is set to 0.65 and the initial mutation rate to 0.1. Drawing from reference [28], the reinforcement learning parameters are set to a learning rate of 0.75, a discount rate of 0.2, and a greedy rate of 0.85, aiming at a synergy between exploration and exploitation for effective learning.

Ablation Experiment
To verify the effectiveness of HPAGA in adjusting different hyper-parameters of the GA, an ablation experiment is conducted. A comparative study is executed among HPAGA, HPAGA_c (dynamically adjusting only the crossover rate), HPAGA_m (dynamically adjusting only the mutation rate), HPAGA_p (dynamically adjusting only the population size), HPAGA_cm (dynamically adjusting both the crossover and mutation rates), and GA (without RL). Each method runs 30 independent epochs with 1000 generations per epoch on four of the selected instances. To ensure a fair comparison, the initial population for each dataset is generated with the same random seed.
Table 1 shows the results of each method on the four TSP instances. The entries Best, Worst, and Mean represent the minimum, maximum, and average cost of the traveling salesman over the 30 independent epochs of each algorithm, respectively, and Std is the standard deviation over these 30 epochs. Num_c is the number of crossover operations and Num_m the number of mutation operations of the corresponding algorithm. The numbers within parentheses below the instance names in Table 1 represent the known optimal distances. Figure 5 shows the convergence curves of the best solutions obtained by the six algorithms on the four TSP datasets over 1000 generations in 30 independent epochs. The discussion of the ablation study covers five aspects:
1. Comparing HPAGA_c with GA, HPAGA_c obtains lower average costs than GA on all four instances, with fewer crossover operations. This indicates that dynamically adjusting the crossover rate alone can propagate superior genes and improve the overall fitness of the population, thus enhancing the performance of the GA.
2. Comparing HPAGA_m with GA on the four instances, HPAGA_m achieves lower minimum costs than GA on the att48, berlin52, and eil101 instances, with fewer mutation operations. However, on the st70 instance, HPAGA_m's minimum and average costs are worse than GA's. This implies that dynamically adjusting the mutation rate alone can increase population diversity and enhance performance, but may also have negative effects owing to the influence of mutated individuals in the population.
3. Comparing HPAGA_p with GA, HPAGA_p acquires lower minimum and average costs than GA on all instances, which demonstrates that the population size agent is effective in improving the classical GA.
4. Examining HPAGA_cm, it realizes lower minimum and average costs than GA, with fewer crossover and mutation operations. Compared to HPAGA_m, HPAGA_cm reaches a better balance while dynamically adjusting both crossover and mutation rates, promoting population diversity and mitigating the potential negative effects of mutated individuals by propagating superior genes.
5. Among all the compared algorithms, HPAGA achieves the best performance in most indicators, including the lowest costs and the smallest standard deviations. Figure 5 also shows that HPAGA has the fastest convergence speed.
The ablation study adheres to the principle of variable control: the GA backbones in the experiment have equivalent performance in solving the TSP, so the improvements can be attributed to the RL component. According to the ablation experiment, it is concluded that, with a fixed population size, dynamically adjusting the crossover and mutation rates via the reinforcement learning strategy helps the hybrid algorithm obtain better results than the classical GA with fewer genetic operations. When the population size is also adjusted dynamically, the additional crossover and mutation agents help HPAGA achieve comparable or better results than HPAGA_p with fewer genetic operations in the majority of instances. In summary, the comprehensive dynamic adjustment mechanism of HPAGA is the most effective, significantly improving the performance and stability of the GA. As shown in Figure 6, the path computed by the proposed HPAGA is feasible and close to optimal.

Comparative Analysis
To verify the performance of the HPAGA algorithm, a comparative analysis of optimization performance is conducted against several approximate algorithms, including ACO, particle swarm optimization (PSO), the black hole algorithm (BH), and the dragonfly algorithm (DA). The comparative results are listed in Table 2. Note that the best solutions of the comparative algorithms are sourced from [29], while their configurations are recorded in [30,31]. In addition, HPAGA is evaluated on the larger-scale kroA200 dataset, where the fitness is still decreasing at the end of the run (Figure 7). In the future, more effective learning techniques will be investigated to improve the capability of solving large-scale problems. Noticeably, the proposed HPAGA might not be the best performer among all optimization algorithms, but it introduces a novel and valuable hybrid concept to enhance existing algorithms.

Case Study in Simulated Radioactive Scenario
A case study in a simulated radioactive environment is conducted to demonstrate the feasibility of the proposed HPAGA for the multi-objective operating planning problem. The configuration of the simulated environment is illustrated in Figure 8. Suppose that there are five radiation sources R_1 to R_5 with radiation dose rates of 1576 µSv/h, 240 µSv/h, 610 µSv/h, 1016 µSv/h, and 1550 µSv/h, respectively, dispersedly located at the coordinates (54, 186), (47, 73), (101, 97), (99, 142), and (193, 129). The contour lines connect positions with the same radiation dose rate. The number of operating points is set to 20. Different from [6], the operating difficulty of each operating point is taken into consideration, measured by the number of hours consumed at each point. Besides, B_o at (0, 0) is the starting point. The parameters of the twenty operating points are listed in Table 3. A cumulative dose matrix is defined to describe the cumulative dose between any two points; the value of each element is computed according to the cumulative dose model in Section 3. On account of the operating difficulty, the cumulative dose matrix is asymmetric, so the case study becomes an asymmetric VTSP. HPAGA is utilized to solve this asymmetric VTSP; the search for the optimal operating sequence over the generations is exhibited in Figure 9. After fewer than 240 generations, the algorithm converges to an optimal solution. The results of this simulated case study demonstrate the effectiveness of the proposed HPAGA in solving the multi-objective operating planning problem in the radioactive environment.

Conclusions and Future Work
This paper introduces a novel multi-objective operating planning model for radioactive environments, in which the difficulty level at each operating point affects the operating time and hence the cumulative radiation dose. For this newly designed radiation dose model, a hybrid algorithm framework is proposed that integrates bio-inspired optimization with reinforcement learning, enabling the dynamic adjustment of GA hyper-parameters for efficient VTSP solutions. Notably, comparative studies showcase the superior performance of HPAGA against classical evolutionary algorithms on various TSP cases. Furthermore, the case study in the simulated radioactive environment implies the application prospects of HPAGA.
In the future, more efficient learning techniques for the RL part and fresh ideas for hybrid algorithms will be investigated. Besides, the improved algorithm will be applied to intelligent robots in real-world nuclear scenarios.

Figure 1. A certain point is affected by multiple radiation sources.

Figure 3. The computing method for the cumulative radiation dose between two points.

Figure 5. Convergence curves of the six methods in the ablation experiments.

Figure 6. The minimum-cost path obtained by the HPAGA method in 30 experimental trials.

Figure 7. Convergence curves of GA and HPAGA for kroA200.

Figure 8. The configuration of the simulated radioactive environment.
Figure 9. The search procedure for the optimal operating sequence with increasing generations.

Table 1. The results of the ablation experiment.

Table 3. The configuration parameters of the operating points. Pos. denotes the position of each operating point; CT, in hours, represents the time consumed at each point.