A Period Training Method for Heterogeneous UUV Dynamic Task Allocation

Abstract: In the dynamic task allocation of unmanned underwater vehicles (UUVs), task schemes must be quickly reallocated to respond to emergencies. The most common heuristic allocation methods use predesigned optimization rules to obtain a solution iteratively, which is time-consuming. To quickly assign tasks to heterogeneous UUVs, we propose a novel task allocation algorithm based on multi-agent reinforcement learning (MARL) and a period training method (PTM). The PTM optimizes the parameters of the MARL model in different training environments, improving the algorithm's robustness. The simulation results show that the proposed method can effectively allocate tasks to different UUVs within a few seconds and reallocate the schemes in real time to deal with emergencies.


Introduction
With the rapid development of UUV technologies, multiple UUVs can cooperatively perform various complicated tasks, such as target localization, photogrammetry, and cooperative search coverage [1][2][3]. Usually, multiple heterogeneous UUVs equipped with various sensors are allocated to search irregular task areas, where the UUVs may break down due to unpredictable threats and many new search tasks may need to be conducted. To deal with such emergencies, each UUV is required to adaptively reallocate its task set, resulting in a multi-UUV search coverage task allocation problem in dynamic environments.
To solve the UUV task allocation problem, many heuristic algorithms have been investigated in recent years. To deal with large-scale geospatial search problems, Kuhlman et al. use a novel iterative greedy method to plan the search paths of UUVs, and the algorithm scales well to the deployment of large groups of agents [3]. In [4], a new technique based on receiver operating characteristic (ROC) analysis is used to centrally handle the UUV task allocation problem, and the path planning problem is solved using a distributed model. To address the dynamic UUV task allocation problem, Sun et al. design a tri-level optimization method to plan safe paths in severe ocean environments, in which UUVs have limited detection ranges and are required to respond to emergencies in real time [5]. To guarantee the effectiveness of scheme allocation, heuristic algorithms commonly require a certain number of optimization iterations to generate, update, and select solutions, leading to high time consumption in dynamic task environments.
Multi-agent reinforcement learning (MARL) provides a novel fast-solving framework for multi-agent task planning problems [6,7]. MARL formulates problems as sequential Markov decision processes, in which intelligent agents collaboratively complete tasks in a shared environment [8,9]. In terms of the control and obstacle avoidance of UUVs, Fang et al. use a decentralized MARL training framework and improve the original multi-agent generative adversarial imitation learning method by introducing a new random selection and updating method [10]. In [11], the researchers construct multiple reward functions to adjust the action distribution of UUVs, which quickly allocates task sequences in ocean current environments. MARL has achieved many significant results in different UUV planning problems and therefore has great potential in UUV coordination task allocation. However, real task scenarios involve dynamic changes due to emergencies. To ensure the completion of tasks, MARL must be able to reallocate task schemes in different task environments. A changing number of UUVs reduces the generalization performance of a MARL algorithm and may lead to model invalidation [11,12].
In this paper, we propose a novel MARL algorithm with an attention mechanism (MARLAM) and a period training method (PTM) to quickly assign tasks to heterogeneous UUVs. The designed MARL model employs an encoder-decoder framework with an attention mechanism to adaptively allocate many irregular task areas to heterogeneous UUVs. In addition, the PTM uses periodically changing training conditions to optimize the model parameters, which enables the MARL model to achieve better scalability after the training process. The experiments demonstrate that the proposed MARLAM with the designed period training method (MARLAM-PTM) can use a single trained MARL model to quickly allocate task schemes in different task environments.

Scenario Description of the Dynamic UUV Allocation Problem
Figure 1 is an illustration of the UUV search coverage task allocation problem with emergencies. Several irregular task areas are randomly distributed in a two-dimensional environment. Heterogeneous UUVs, shown in different colors, have different velocities and search ranges. Each UUV travels sequentially to its allocated tasks and searches the areas along designed coverage search paths. All UUVs start from the same base and return to it after finishing their tasks. The allocated schemes need to minimize the total distance traveled by the UUVs between tasks and the overall time consumed searching the irregular areas. As shown in Figure 1, the search coverage path of each UUV needs to be elaborately designed according to the shape and size of the irregular task areas. In addition, two types of emergency are considered here, i.e., broken UUVs and new search tasks. When UUVs suffer emergencies, the optimization algorithm needs to reallocate tasks quickly and adaptively.

Objective Functions and Constraints
The heterogeneous UUV group is set as U = [U_1, U_2, ..., U_N]^T, where N is the number of UUVs. [x_i, y_i], v_i, and r_i represent the location coordinates, velocity, and search range of UUV i, respectively. T = [T_1, T_2, ..., T_M]^T denotes the irregular task areas, where M is the number of tasks. The UUVs are expected to search all irregular task areas with minimum traveling distances and search times. Accordingly, the objective function f is defined over d(T_m, T_n), the Euclidean distance between the locations of task m and task n, and the binary decision variable δ_{i,m,n}. We define δ_{i,m,n} as 1 if task m and task n are allocated to UUV i, and 0 otherwise. t(U_i, T_m) = L_{i,m}/v_i denotes the search time of UUV i in task m, where L_{i,m} and v_i are the coverage search path of UUV i on task m and the search velocity of UUV i, respectively. As shown in Figure 2a,b, L_{i,m} needs to be specially designed according to the UUV's search range and the task area's shape and size [4]. For the convenience of calculating the coverage search path, a maximum bounding rectangle is employed to approximate the irregular task areas.
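The objective function itself did not survive in this copy; a plausible reconstruction from the surrounding definitions (the index ranges, the use of index 0 for the base, and the exact grouping are assumptions) is:

```latex
% Plausible reconstruction (assumption): minimize total inter-task travel
% distance plus total search time, with index 0 denoting the base.
\min_{\delta}\; f \;=\; \sum_{i=1}^{N}\sum_{m=0}^{M}\sum_{n=0}^{M}
  \delta_{i,m,n}\,\big[\, d(T_m,\,T_n) \;+\; t(U_i,\,T_n) \,\big],
\qquad
t(U_i,\,T_m) \;=\; \frac{L_{i,m}}{v_i}
```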

In Figure 2a, the irregular task area is approximately represented by its bounding rectangle. Figure 2b shows an illustration of the UUV spiral search strategy, from which the coverage search path L_{i,m} can be calculated, where l_m and w_m are the length and the width of the bounding rectangle, respectively. To decrease the search time of the allocated schemes, UUVs with faster search velocities and wider search ranges tend to be assigned more tasks, leading to an uneven allocation of schemes among the UUVs. To avoid this situation, a maximum task load of the UUVs is imposed, where ceil(·) is the ceiling function, which guarantees that the UUVs are not overloaded.
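As a concrete illustration of the simplified area model, the sketch below computes an approximate spiral coverage path length and the resulting search time from a task's bounding rectangle. The path-length formula (rectangle area divided by the 2·r swath width) is an assumption inferred from the description, not the paper's exact expression.

```python
# Hypothetical sketch: approximate spiral-coverage path length and search
# time for one task under the bounding-rectangle model. The l*w / (2*r)
# formula is an assumption (area divided by swath width), not the paper's
# exact equation.

def coverage_path_length(length, width, search_range):
    # A spiral pass with swath width 2*r covers the l*w rectangle in
    # roughly l*w / (2*r) travel distance.
    return (length * width) / (2.0 * search_range)

def search_time(length, width, search_range, velocity):
    # t(U_i, T_m) = L_{i,m} / v_i
    return coverage_path_length(length, width, search_range) / velocity
```

For a 4 x 2 rectangle and a UUV with search range 0.5, the approximate coverage path is 8 units long; a faster UUV reduces only the time term, not the path length, which is why velocity and range enter the allocation separately.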

MARL with the Attention Mechanism and Period Training Method
To solve the UUV task allocation problem with emergencies, we propose a novel MARL algorithm with an attention mechanism. The algorithm consists of two parts: an encoder with deep feature extraction networks and a decoder based on the attention allocation mechanism. First, the encoder contains two linear projection networks and one self-attention network [13][14][15][16], which are used to extract the high-dimensional features of the UUV and task data. Then, the obtained high-dimensional features are used in the decoder, where we introduce the attention mechanism to allocate task sets to each UUV sequentially. Finally, we design a period training method to optimize the parameters of the encoder and decoder.

Encoder with Deep Feature Extraction Networks
It is noted that the data dimensions differ between the UUV data U_i and the task data T_m. We use two linear projection networks to unify the dimensions of U_i and T_m before the high-dimensional embedding process, where h_{T_m} and h_{U_i} are the low-level features of task m and UUV i, respectively, with the same dimension dim, and W_1, b_1, W_2, and b_2 are the parameters of the two linear projection networks. Then, we use a simple attention model (SAM) to extract the high-level features [15], where H_{U_i} and H_{T_m} are the high-level features of UUV i and task m, respectively, and [·, ·, ·]^T is the transposition operator.
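A plausible form of the projection and embedding equations, reconstructed from the definitions above (the exact SAM formulation is an assumption), is:

```latex
% Hedged reconstruction of the linear projections and SAM embedding:
h_{T_m} = W_1 T_m + b_1, \qquad h_{U_i} = W_2 U_i + b_2, \qquad
h_{T_m},\, h_{U_i} \in \mathbb{R}^{\mathrm{dim}},
\\[4pt]
[H_{U_1}, \cdots, H_{U_N}, H_{T_1}, \cdots, H_{T_M}]^{T}
  = \mathrm{SAM}\big([h_{U_1}, \cdots, h_{U_N}, h_{T_1}, \cdots, h_{T_M}]^{T}\big)
```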

Decoder with Attention Mechanisms
The decoder sequentially allocates the tasks for each UUV until all tasks are finished. During the solving process, the model needs the high-level feature vectors extracted from the encoder to solve the task lists. The flow chart of the decoding process can be seen in Figure 3.
The decoder uses an attention mechanism to output the task m_i^t for UUV i at a specific time step t. We aim to use one trained MARL model to solve problems with different numbers of UUVs and tasks. This means that the dimension of the data embedding in the decoder must remain constant in changing task environments. Therefore, combining the high-level feature vectors, the context embedding c_i^t of UUV i at step t is constructed by concatenation, where [·, ·, ·] is the horizontal concatenation operator. The dimension of c_i^t is (5 · dim), which remains unchanged with differing numbers of UUVs and tasks. A single-head attention layer in the decoder is used to calculate the embedding c_{i,m}^t of UUV i for task m, which denotes the matching degree between UUV i and task m at decoding step t.
where W_3 and W_4 are the parameters of the linear projection networks. From c_{i,m}^t, the probability p_{θ|i,m}^t that UUV i selects task m at time step t is calculated, where θ is the total parameter vector of the encoder and the decoder. According to p_{θ|i,m}^t and the maximum task load constraint in (4), the decoder circularly outputs the tasks for each UUV as the solution until all tasks are selected.
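The attention and selection equations plausibly follow the standard single-head form used in attention-based allocation models; the tanh clipping constant C and the masking of already-selected tasks are assumptions, reconstructed from the surrounding description:

```latex
% Hedged reconstruction of the decoder attention and selection probability,
% with \mathcal{M}_t the set of tasks not yet allocated at step t:
c^{t}_{i,m} \;=\; C \tanh\!\Big(
  \frac{(W_3\, c^{t}_{i})^{T} (W_4\, H_{T_m})}{\sqrt{\mathrm{dim}}}\Big),
\qquad
p^{t}_{\theta \mid i,m} \;=\;
  \frac{\exp\big(c^{t}_{i,m}\big)}
       {\sum_{n \in \mathcal{M}_t} \exp\big(c^{t}_{i,n}\big)}
```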
Note that none of the parameters in the encoder and decoder depend on the numbers of UUVs and tasks. Hence, we can apply one MARLAM model to different numbers of UUVs and tasks without retraining.

Period Training Method
The total parameter vector θ is optimized with the proposed PTM. To ensure the convergence of a MARL model, the number of agents usually needs to be fixed during training. In the dynamic task allocation problem, however, UUVs may suffer unpredictable events and break down. A changing number of UUVs significantly reduces the performance of the MARL model [11,12]. To adapt to this situation, the trained model needs to reallocate tasks when the number of UUVs changes. Figure 4 is an illustration of the PTM training framework. Different from the common MARL training algorithm, PTM uses data with different numbers of UUVs to construct periodically changing training environments. In addition, an improved actor-critic framework is introduced in the PTM. Herein, we redesign the parameter exchange rules between the actor network and the critic network to achieve convergence of the MARL model. The total parameter vector θ is optimized using the policy gradient algorithm [14], where L represents the training loss, and π and π̄ denote the solutions obtained by the actor network and the critic network, respectively. To ensure the convergence of the MARL model under periodically changing training conditions, we design a greedy update method to control the parameter replacement between the critic network and the actor network. After finishing an epoch training step, PTM randomly samples the test data {U, T}_1, ..., {U, T}_L from not only the current epoch but also the other epochs with different numbers of UUVs. The parameter replacement rule of the PTM compares the two networks on these samples, where θ_c and θ_a are the parameters of the critic network and the actor network, respectively. According to (12), PTM replaces the parameters of the critic network with the actor network parameters when the results of the actor network are better than those of the critic network. Owing to the periodically changing training conditions and the greedy parameter replacement rule, the critic network maintains the best model parameters.
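The training loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the scalar "parameters", the cost callables, and the fixed step size are stand-ins for the actor/critic networks and the policy-gradient update.

```python
import random

# Hypothetical sketch of the period training method (PTM). The UUV count
# cycles between training conditions each epoch, and the critic parameters
# are replaced by the actor parameters only when the actor scores better
# on test data sampled across epochs (the greedy replacement rule).

def ptm_train(actor_cost, critic_cost, uuv_period=(3, 4), epochs=6, seed=0):
    rng = random.Random(seed)
    theta_a = theta_c = 1.0           # stand-ins for network parameter vectors
    for epoch in range(epochs):
        n_uuvs = uuv_period[epoch % len(uuv_period)]  # periodic training condition
        theta_a -= 0.1                # placeholder for one policy-gradient step
        # Sample test environments from the current and earlier conditions.
        samples = [(n_uuvs, rng.random()) for _ in range(4)]
        if actor_cost(theta_a, samples) < critic_cost(theta_c, samples):
            theta_c = theta_a         # greedy parameter replacement
    return theta_c
```

Because the replacement is one-directional and conditional on improvement, the critic retains the best parameters seen so far even as the training condition changes each epoch.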

Simulation Experiment
In the simulation experiments, we use two typical task allocation algorithms for comparison, i.e., the modified two-part wolf pack search algorithm (MTWPS) [17] and dynamic discrete pigeon-inspired optimization (DDPIO) [18]. In addition, we use the traditional actor-critic training method to train another MARL model; the two MARL models have the same network construction except for the training method. The number of tasks in the training condition is 40. The number of UUVs in the common actor-critic method is fixed at four, while the number of UUVs in the PTM changes periodically between three and four. The simulation data are characterized as scalar values for the convenience of the tests.

Dynamic Case Settings with Emergencies
All UUVs start from the same base, whose coordinates are (12.5, 12.5), and return to the base when all their tasks are finished. We set three cases with emergencies in simulation experiments.

1.
Case 1: We randomly generate 40 irregular task areas as the initial settings. Four heterogeneous UUVs with different velocities and search ranges need to conduct these tasks with preplanned orders and search coverage paths. The original task situation is shown in Figure 5a. In Figure 5a,b, the irregular blue areas are the task areas that need to be searched, the red rectangles are the corresponding approximate area models, and the black star represents the base. The Euclidean distance and the search time of the UUVs in all areas can be approximately calculated using the simplified area model, which is a fast and efficient way of evaluating allocated task schemes.


2.
Case 2: We consider that the UUVs find a group of new task areas while performing case 1. As before, the new task areas are approximated using bounding rectangles. To search the new task areas, the algorithms need to reallocate the task schemes in real time.

3.
Case 3: Following case 2, we assume that UUV 4 is broken and cannot search for any tasks. To deal with this emergency, the remaining unfinished tasks are quickly assigned to the other UUVs according to objective function (1) and constraint (4).

Figure 6 shows the allocation results of MARLAM-PTM in the three cases. The values and times in Table 1 represent the values of the objective function in Equation (1) and the running times of the different methods, respectively. Figure 6a displays the allocation result of case 1. The maximum task load of the UUVs in case 1 can be calculated with the constraint given in (4), i.e., 14 = ceil(40/3). As shown in Figure 6a, the numbers of allocated tasks for the four UUVs are 9, 8, 9, and 14, respectively, so the scheme allocation of MARLAM-PTM meets the maximum task load constraint. In case 1, the UUVs conduct their allocated task sets in the planned order and search them using the spiral search strategy. After completing the tasks, all UUVs return to the base. As shown in Table 1, the proposed MARLAM-PTM exhibits a fast solution speed due to the new solving framework.

Figure 6b demonstrates the scheme allocation of MARLAM-PTM in case 2, where the numbers of tasks and training conditions are different. Some task areas have already been searched and are connected by broken lines, and the brown dots represent the new search tasks. In Figure 6b, the new search tasks are mainly located in the left corner of the task environment, near the original allocated tasks of UUV 2. In the reallocation of schemes, most of the new tasks are assigned to UUV 2 in real time, ensuring a reduction in the overall traveling distances of the UUVs. As shown in the results of case 2, MARLAM-PTM adaptively reallocates tasks to each UUV and ensures that all new task areas can be searched.

As shown in Figure 6c, in case 3, the numbers of UUVs and tasks are both different from the training conditions of the two MARL models. During the process of carrying out tasks, UUV 4 is broken, and the unfinished tasks need to be reallocated to the other UUVs. The locations of the unfinished tasks are mainly near the task sets of UUV 3, and Figure 6c shows that most of them are reallocated to UUV 3. It is noteworthy that, in case 3, the MARLAM-PTM model obtains a lower objective value than the MARLAM model. The periodically changing training conditions and the greedy parameter replacement rule in the PTM enable the model to maintain robustness in solving problems with different numbers of UUVs.
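The case-1 load check above can be reproduced in a few lines. The ceil(M/(N-1)) form of the bound is inferred from the paper's worked example 14 = ceil(40/3) with M = 40 tasks and N = 4 UUVs.

```python
import math

# Checking case 1's maximum-task-load constraint: with M = 40 tasks and
# N = 4 UUVs, the bound is ceil(40 / 3) = 14, matching the paper. The
# M / (N - 1) form of the bound is an inference from that example.

def max_task_load(n_tasks, n_uuvs):
    return math.ceil(n_tasks / (n_uuvs - 1))

allocation = [9, 8, 9, 14]        # case-1 task counts for the four UUVs
bound = max_task_load(40, 4)      # -> 14
assert sum(allocation) == 40
assert all(k <= bound for k in allocation)
```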

Simulation Experimental Results
To test the robustness and scalability of MARLAM-PTM in various situations, we add four cases, each of which contains three different task allocation problems. As shown in Table 2 and Figures 7-10, the additional tests have different settings. It is worth noting that we use the same trained model in all four cases, which have different test conditions. According to the results of cases 4 and 5, the two MARL models exhibit stable problem-solving capabilities when the number of tasks increases, and they yield similar task allocation results across all six problems. Previous studies have shown that MARL models possess strong scalability with respect to the number of tasks [13]. Conversely, in cases 6 and 7, the performance of MARLAM is affected by a decrease in the number of UUVs. From the results of MARLAM-PTM in cases 6 and 7, the proposed algorithm utilizes the PTM training model, which cyclically alters the number of UUVs, and the greedy rule of parameter replacement in Equation (11) ensures the convergence and robustness of the proposed algorithm across different numbers of UUVs. As shown in the results of all cases, this approach achieves optimal performance with a constantly changing number of UUVs.

Conclusions
In this paper, a new MARL algorithm with a period training method is proposed to solve the heterogeneous UUV dynamic task allocation problem with emergencies. The proposed MARLAM algorithm can quickly allocate tasks to each UUV. The dimension of the data embedding in the MARL model remains constant, which ensures that the algorithm can solve problems with different numbers of UUVs and allocate tasks with a single trained model. In addition, the designed PTM uses periodically changing training conditions and the greedy parameter replacement rule to improve the scalability of the MARLAM model. Based on the simulation test results, the proposed method can quickly allocate different task areas to heterogeneous UUVs in different task environments.

Author Contributions: Resources, S.G.; validation, L.Z. and J.X.; writing-original draft preparation, X.W. and K.Y.; writing-review and editing and supervision, S.G. and S.B.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported, in part, by the National Natural Science Foundation of China (grant number 61871307) and the Fundamental Research Funds for the Central Universities (JB210207).