Heterogeneous mission planning for a single unmanned aerial vehicle (UAV) with attention-based deep reinforcement learning

Large-scale and complex mission environments require unmanned aerial vehicles (UAVs) to deal with various types of missions while considering their operational and dynamic constraints. This article proposes a deep learning-based heterogeneous mission planning algorithm for a single UAV. We first formulate a heterogeneous mission planning problem as a vehicle routing problem (VRP). Then, we solve it using an attention-based deep reinforcement learning approach. Attention-based neural networks are utilized because they are computationally efficient at processing the sequence data of the VRP. As the input to the attention-based neural networks, a unified feature representation of heterogeneous missions is introduced, which encodes different types of missions into same-sized vectors. In addition, a masking strategy is introduced to handle the resource constraint (e.g., flight time) of the UAV. Simulation results show that the proposed approach has significantly faster computation time than other baseline algorithms while maintaining relatively good performance.


INTRODUCTION
Recently, mission environments such as disaster management and logistics services have become larger and more complex. The goals of these large-scale missions can be achieved more safely and quickly using unmanned aerial vehicles (UAVs) (Shakhatreh et al., 2019; Grzybowski, Latos & Czyba, 2020; Kim et al., 2021). Because allocating and scheduling a large number of mission tasks to UAVs is highly complex, planning these tasks manually takes a human operator a long time without any guarantee of optimal performance. Both solution quality and computation time significantly impact the success rate of rescue in disaster management and the profits of companies in logistics services (Atyabi, MahmoudZadeh & Nefti-Meziani, 2018). Therefore, autonomous mission planning algorithms need to be developed to solve these problems rapidly and efficiently.
Mission planning problems of a UAV can be represented as vehicle routing problems (VRPs). The VRP has many variants, such as the distance-constrained VRP (Karaoglan, Atalay & Kucukkoc, 2020) and the multi-trip VRP (Paradiso et al., 2020), each of which typically requires a tailored solution algorithm. Kool, van Hoof & Welling (2019) utilized a Transformer-style (Vaswani et al., 2017) neural network model that replaces the RNN structure of the Pointer network with a multi-head attention (MHA) network. The model was proposed to solve various types of routing problems thanks to its flexibility, and the authors showed that it outperforms several heuristic algorithms and Pointer network-based models in terms of both solution quality and computation time; hence, we build our approach upon this MHA network.
The proposed approach in this study particularly considers two characteristics of the UAV: the capability to handle heterogeneous missions and the flight-time constraint. As UAV technology develops, even a single UAV can carry out several tasks simultaneously or sequentially, such as delivering extra payloads, visiting specific locations to take images of landmarks, or flying over large areas to gather information. With such heterogeneous tasks, the cost of completing each pair of tasks differs depending on the order of task completion. For instance, different path lengths for delivery or different radii of coverage areas make the cost matrix of the VRP asymmetric. Another characteristic of a UAV is its limited flight time due to fuel/battery capacity; this constraint forces the UAV to refuel/recharge at the depot and then resume its work. Therefore, the mission planning problem of a UAV in this article is represented as the multi-trip asymmetric distance-constrained VRP (MTAD-VRP), a variation of the VRP. Heterogeneous mission planning under these characteristics further increases the complexity of the VRP.
It is worth noting that there are a few studies on heterogeneous mission planning problems for UAVs using heuristic optimization algorithms. Zhu et al. (2018) formulated the heterogeneous mission planning problem for UAV reconnaissance as a multiple Dubins travelling salesman problem (MDTSP), a type of VRP, and proposed a genetic algorithm-based approach to solve it. Chen, Nan & Yang (2019) considered an additional time-window constraint and formulated the heterogeneous mission planning problem as a multi-objective, multi-constraint nonlinear optimization problem, then utilized a search-based algorithm for optimization. Gao, Wu & Ai (2021) proposed an ant colony-based algorithm for minimizing the weighted sum of the total UAV fuel consumption and the task execution time, and compared its performance with other ant colony-based algorithms through numerical simulations. However, to the best of our knowledge, it is difficult to find heterogeneous mission planning based on reinforcement learning approaches. As mentioned earlier, reinforcement learning-based approaches are expected to provide superior performance compared with heuristic algorithms in terms of computation time and optimality. Besides, the aforementioned works consider only single-trip problems, which significantly limits the capability of the UAV.
To this end, this study proposes an attention-based reinforcement learning algorithm for heterogeneous mission planning for a single UAV. We first formulate the heterogeneous mission planning problem as the MTAD-VRP, expressed in a graph form so that VRP solvers can be applied. Considering a realistic complex mission environment and the characteristics of the UAV, we use a reinforcement learning approach with an attention-based neural network model to solve the problem, exploiting its fast computation time and flexibility. Although existing learning-based algorithms can deal with various routing problems, most of them consider only homogeneous inputs (Vinyals, Fortunato & Jaitly, 2015; Bello et al., 2017; Kool, van Hoof & Welling, 2019). Thus, we introduce a unified mission representation for network inputs that contains the information of heterogeneous missions. We then design a masking strategy to handle the flight-time constraint required to complete all tasks in a mission and to aid the training process of reinforcement learning. The proposed algorithm uses an MHA-based model architecture for better computational efficiency than RNN-based architectures while preventing the vanishing-gradient effect when dealing with long data sequences. Furthermore, the MHA-based model has a permutation-invariant property that allows it to learn a robust strategy regardless of the input permutation. The REINFORCE algorithm (Sutton et al., 2000) with a baseline updates the model stably by reducing the variance of the gradient estimates. To validate the feasibility and performance of the proposed approach, we perform numerical simulations and compare the results with state-of-the-art open-source heuristic algorithms.

PROBLEM DEFINITION
This study considers visiting, coverage, and delivery as heterogeneous missions. Here, visiting means capturing an image of a landmark building, coverage means gathering information over a large area with a spiral flight pattern, and delivery means picking up and placing a package. Figure 1 illustrates the heterogeneous missions. Note that, if needed, more mission types could readily be incorporated into the problem thanks to the flexibility of the learning approach.
Our purpose is to complete all the given heterogeneous missions while minimizing the flight time of a single UAV dispatched from the depot. The flight-time budget constraint should be satisfied, and the UAV is allowed to return to the depot for recharging. Figure 2 shows a sample mission scenario in a 2-D view, where the black square is the depot, blue squares are visiting mission spots, circles are coverage mission areas, and a pair of magenta diamonds with a cyan arrow is a delivery mission with a specific direction.
To formulate the heterogeneous mission planning problem mathematically as a VRP, we abstract the problem into a graph instance. The mission graph G = (V, E) consists of k nodes (v_1, v_2, ..., v_k) ∈ V and edges (e_12, e_21, ..., e_k1, e_1k) ∈ E. Each node represents the features of a mission, and the value of each edge is the travel-time cost, which depends on the mission types. The solution of the problem is a sequence of node indices π = (n_1, ..., n_T), where n_t ∈ ℕ and 1 ≤ n_t ≤ k. The total travel-time cost L, which is the objective of the problem, is the sum of the edge values between consecutive selected nodes plus the cost of returning to the depot:

L(π) = Σ_{t=1}^{T−1} e_{n_t, n_{t+1}} + e_{n_T, n_1}.   (1)

Assuming that the UAV flies at a constant velocity, the travel-time cost between missions is determined by the total distance the UAV needs to fly; we ignore the time for recharging and loading packages for simplicity. The type of the destination mission affects the cost calculation as

c_xv = d,   (2)
c_xc = d + S,   (3)
c_xd = d + l,   (4)

where S = πr²/w, and c_xv, c_xc, and c_xd are the travel costs from the source mission point x to a visiting, coverage, and delivery mission point, respectively. Here d is the distance between missions, S is the length of the spiral path covering the area, w is the sensing range of the UAV, r is the radius of the coverage area, and l is the length of the delivery path. The cost of returning to the depot is computed as c_xv, treating the depot as a visiting mission point. Figure 3 shows the conversion of a mission instance into the graph representation. Additionally, we consider the limited flight time of the UAV for safe mission completion. Typically, the UAV can be recharged at its base station, which corresponds to the depot in the VRP; thus, we allow the UAV to be recharged by revisiting the depot. Figure 4 illustrates the recharging event graphically.
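As an illustration, the type-dependent costs in Eqs. (2)–(4) can be sketched as below. The mission dictionary fields and the default sensing range `w` are our illustrative assumptions, not from the paper; the key property preserved is that delivery missions make the cost matrix asymmetric, since the UAV leaves a delivery at its place-position.

```python
import math

def spiral_length(r, w):
    """Length of the spiral path covering a disk of radius r with
    sensing range w: S = pi * r**2 / w, as in Eq. (3)."""
    return math.pi * r ** 2 / w

def travel_cost(src, dst, w=0.05):
    """Asymmetric travel-time cost from mission `src` to mission `dst`.

    Missions are dicts (field names are illustrative):
      visiting: {'type': 'visit',    'pos': (x, y)}
      coverage: {'type': 'coverage', 'pos': (x, y), 'r': radius}
      delivery: {'type': 'delivery', 'pick': (x, y), 'place': (x, y)}
    The cost of leaving a mission starts from where that mission ends
    (e.g. the place-position of a delivery), which makes costs asymmetric."""
    def end_point(m):
        return m['place'] if m['type'] == 'delivery' else m['pos']
    def start_point(m):
        return m['pick'] if m['type'] == 'delivery' else m['pos']

    sx, sy = end_point(src)
    tx, ty = start_point(dst)
    d = math.hypot(tx - sx, ty - sy)          # straight-line flight distance

    if dst['type'] == 'visit':                # c_xv = d, Eq. (2)
        return d
    if dst['type'] == 'coverage':             # c_xc = d + S, Eq. (3)
        return d + spiral_length(dst['r'], w)
    # delivery: fly to the pick-up, then traverse the delivery leg of
    # length l, so c_xd = d + l, Eq. (4)
    px, py = dst['pick']; qx, qy = dst['place']
    return d + math.hypot(qx - px, qy - py)
```

Returning to the depot is then `travel_cost(current, {'type': 'visit', 'pos': depot})`, matching the text's treatment of the depot as a visiting point.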

ATTENTION BASED DEEP REINFORCEMENT LEARNING
In this section, we first propose a unified feature representation to deal with heterogeneous missions. Then, we present a masking strategy to handle the flight-time constraint. We introduce the neural network model and the reinforcement learning algorithm that solve the heterogeneous mission planning problem using these methods. The neural network model consists of an encoder and a decoder with the attention mechanism for sequential data. The REINFORCE algorithm (Sutton et al., 2000), a reinforcement learning algorithm, is used to optimize the neural network.

Unified feature representation
We propose the unified feature representation v = (x1, y1) ‖ (x2, y2) ‖ A ‖ I_Type, combining the spatial information of heterogeneous missions with an indicator of each mission type, where (x1, y1) is the critical position of the mission, (x2, y2) is the end position, A is the area information of the mission (e.g., the radius of a coverage area), and I_Type is the mission-type indicator.
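A minimal sketch of this encoding is given below. The one-hot ordering of the three mission types and the choice of a zero area term for non-coverage missions are our assumptions; the paper only specifies that every mission is mapped to a vector of the same size.

```python
import numpy as np

# One-hot ordering is an assumption; the paper does not specify it.
TYPES = ['visit', 'coverage', 'delivery']

def encode_mission(m):
    """Unified feature vector v = (x1, y1) || (x2, y2) || A || I_Type.

    (x1, y1): critical position (mission point, or pick-up for delivery)
    (x2, y2): end position (equal to (x1, y1) unless the mission moves the UAV)
    A:        area term (coverage radius here; zero otherwise -- an assumption)
    I_Type:   one-hot mission-type indicator
    """
    if m['type'] == 'delivery':
        x1, y1 = m['pick']
        x2, y2 = m['place']
    else:
        x1, y1 = m['pos']
        x2, y2 = m['pos']
    area = m.get('r', 0.0)
    one_hot = [1.0 if m['type'] == t else 0.0 for t in TYPES]
    return np.array([x1, y1, x2, y2, area] + one_hot)

missions = [
    {'type': 'visit',    'pos': (0.2, 0.8)},
    {'type': 'coverage', 'pos': (0.5, 0.5), 'r': 0.06},
    {'type': 'delivery', 'pick': (0.1, 0.1), 'place': (0.15, 0.05)},
]
X = np.stack([encode_mission(m) for m in missions])  # shape (3, 8)
```

Because every mission type produces the same 8-dimensional vector, a single encoder can consume heterogeneous missions without type-specific branches.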

Masking strategy
The masking strategy generates a mask M to prevent selecting invalid actions in reinforcement learning. The mask consists of the completion mask M_C = (m_c,1, m_c,2, ..., m_c,k) for already completed missions and the time-limitation mask M_T = (m_t,1, m_t,2, ..., m_t,k). The time-limitation mask M_T masks mission j when the flight time of completing the mission and then returning to the depot exceeds the remaining flight time of the UAV:

m_t,j = 1 if T_j + T_Return,j > T_Remain, and m_t,j = 0 otherwise,   (5)

where T_j is the time to complete mission j from the current mission, T_Return,j is the flight time of returning to the depot after completing mission j, and T_Remain is the remaining flight time of the UAV. The masking strategy generates the mask M = M_C | M_T, where the operator | denotes the element-wise logical 'or' operation. Figure 5 shows an example of the masking strategy. Each circle in the figure represents an arbitrary mission; the completion mask M_C and the time-limitation mask M_T are represented as red circles and purple circles, respectively. The agent in the example can only select unmasked missions to complete, or the depot to return to. If every element of M_T is masked before the whole mission is finished, the UAV is forced to return to the depot for refueling/recharging. When the UAV arrives at the depot, its remaining flight time is reset, every element of M_T is recalculated by Eq. (5), and the UAV continues with the subsequent tasks. Note that recharging time is not explicitly considered in the cost; we assume that recharging can be done quickly by replacing the battery with a new one. However, if needed, the recharging cost could easily be included in the optimization problem formulation.
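The combined mask M = M_C | M_T and the forced-return rule can be sketched as follows; the function signature is illustrative and the convention that True means "may not be selected" is our choice.

```python
import numpy as np

def build_mask(completed, t_complete, t_return, t_remain):
    """Combined mask M = M_C | M_T (True = mission may NOT be selected).

    completed:  boolean array, missions already done (M_C)
    t_complete: time to finish each mission from the current one (T_j)
    t_return:   flight time back to the depot after each mission (T_Return,j)
    t_remain:   remaining flight-time budget (T_Remain)
    """
    m_c = np.asarray(completed, dtype=bool)
    # Eq. (5): mask mission j when T_j + T_Return,j exceeds T_Remain
    m_t = np.asarray(t_complete) + np.asarray(t_return) > t_remain
    return m_c | m_t

# Forced return: if every mission is masked, the agent must pick the depot,
# after which t_remain is reset to the full budget and the mask recomputed.
```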

Model architecture
The neural network model p_θ, which approximates the VRP solver, is parameterized with θ.
The policy of the model can be represented as the probability p_θ(π | s), where s = (v_1, v_2, ..., v_k) is the set of mission nodes given as input to the model and π = (n_1, n_2, ..., n_k) is the output of the model, a permutation of the indices of s. With the chain rule, the probability can be factorized as

p_θ(π | s) = Π_{t=1}^{k} p_θ(π_t | s, π_{1:t−1}),   (6)

where π_t is the output at step t ∈ {1, ..., k} and π_{1:t−1} is the partial sequence of π. We utilize the Transformer-style model architecture of Kool, van Hoof & Welling (2019) to approximate Eq. (6). The model takes a set of mission node data as input to the encoder and outputs the solution sequence with the decoder while satisfying the constraints. The encoder is implemented with multi-head attention (MHA) layers (Vaswani et al., 2017) and generates an embedding of each input element, (h_e1, h_e2, ..., h_ek), where each embedding captures the relationship with all other elements among the input mission nodes. To generate the embeddings, the MHA layer uses query Q_i = w_q v_i, key K_i = w_k v_i, and value V_i = w_v v_i vectors, where w_q, w_k, and w_v are linear layers projecting the mission node features. The attention mechanism infers the relationship between queries and keys by calculating the attention score

u_ij = Q_iᵀ K_j / √d,   (7)

where u_ij is the attention score and d is the embedding size. The attention score represents the similarity between Q_i and K_j. Using the attention scores, the embedding h_ei of v_i and the context vector h_c are calculated as

h_ei = Σ_j softmax_j(u_ij) V_j,   h_c = (1/k) Σ_i h_ei,   (8)

i.e., each embedding is an attention-weighted sum of the value vectors, and the context vector is the mean of the embeddings. Note that the attention computation is parallelized across the heads of the MHA layer. In this study, the encoder consists of three MHA layers with eight heads and an embedding size of 128. Figure 6 shows the embedding process of the encoder.
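The attention computation of Eqs. (7)–(8) can be sketched in plain NumPy. A single head is shown for clarity (a real MHA layer splits the embedding across eight heads and concatenates), and the projection sizes are illustrative. The sketch also demonstrates the permutation property mentioned in the introduction: permuting the input nodes permutes the embeddings identically.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 5, 16                        # number of mission nodes, embedding size

def attention(X, w_q, w_k, w_v):
    """One attention head: u_ij = Q_i . K_j / sqrt(d) as in Eq. (7),
    then h_i = sum_j softmax_j(u_ij) V_j as in Eq. (8)."""
    Q, K, V = X @ w_q, X @ w_k, X @ w_v
    u = Q @ K.T / np.sqrt(d)                      # attention scores u_ij
    a = np.exp(u - u.max(axis=1, keepdims=True))  # numerically stable softmax
    a /= a.sum(axis=1, keepdims=True)
    return a @ V                                  # embeddings h_e

X = rng.normal(size=(k, 8))                       # unified mission features
w_q, w_k, w_v = (rng.normal(size=(8, d)) for _ in range(3))
H = attention(X, w_q, w_k, w_v)                   # per-node embeddings, shape (k, d)
h_c = H.mean(axis=0)                              # context vector: mean embedding
```

Because the projections are applied row-wise and the softmax mixes rows symmetrically, reordering the rows of `X` simply reorders the rows of `H`, which is the permutation-invariance property exploited by the model.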
At each decoding step t, the decoder embeds the outputs of the encoder into (h'_e1, h'_e2, ..., h'_ek) with an MHA layer. Then, the decoder selects the next node of the solution with the attention mechanism described in Vinyals, Fortunato & Jaitly (2015). In this case, the query vector Q' is constructed from the context vector h_c, the partial-solution information π_{1:t}, abstracted as (h_e,n_1, h_e,n_t) with n_t ∈ π_{1:t}, and the remaining flight-time budget T_Remain, to account for the flight-time constraint. Note that the partial-solution information is initialized as (0, 0) before the first mission is selected and then updated to (h_e,n_1, h_e,n_t), where h_e,n_1 is the embedding of the first solution node and h_e,n_t is the embedding of the last solution node. This is because the agent of the VRP only needs to consider the uncompleted missions with respect to the last completed mission, regardless of the other completed missions (Kool, van Hoof & Welling, 2019). The probability of selecting each mission node is then obtained from the attention scores between the embeddings (h'_e1, h'_e2, ..., h'_ek), Q', and the mask M from the masking strategy as

u'_i = Q'ᵀ h'_ei / √d'  if node i is unmasked, and u'_i = −∞ otherwise,   (9)
a' = softmax(u'),   (10)

where u'_i, d', and a' are the attention score, the embedding size of the decoder, and the probability of selecting each mission, respectively. The next solution node n_t is sampled from the probability distribution a' and appended to π_{1:t−1} to construct π_{1:t}. After selecting the next node, T_Remain is reduced by the completion time of the selected mission, and the partial-solution information is updated with π_{1:t}. In this study, the decoder consists of one MHA layer with a single head and an embedding size of 128. Figure 7 shows an example of the decoding steps.
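One masked decoding step, Eqs. (9)–(10), can be sketched as follows; how Q' is assembled from h_c, the first/last node embeddings, and T_Remain is abstracted into a single query vector here.

```python
import numpy as np

def select_next(H, q, mask):
    """One decoding step: masked attention over node embeddings.

    H:    (k, d') node embeddings h'_e from the decoder's MHA layer
    q:    (d',) query Q' built from context, first/last solution nodes, T_Remain
    mask: (k,) boolean, True = node may not be selected (M = M_C | M_T)
    Returns a', the sampling distribution over nodes, Eqs. (9)-(10)."""
    d = H.shape[1]
    u = H @ q / np.sqrt(d)           # attention scores u'_i
    u = np.where(mask, -np.inf, u)   # masked nodes get score -inf ...
    a = np.exp(u - u[~mask].max())   # ... hence probability zero after softmax
    return a / a.sum()

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 8))
q = rng.normal(size=8)
mask = np.array([False, True, False, False])
a = select_next(H, q, mask)          # sample the next node from a'
```

During training the next node is sampled from `a'`; at inference a greedy argmax over `a'` can be used instead.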

REINFORCE with baseline
To update the neural network model, we use the REINFORCE algorithm (Sutton et al., 2000). In the Markov decision process (MDP) tuple ⟨s, a, r, p⟩ for reinforcement learning, the state s is the mission state, the action a is the mission selected by the agent policy p_θ(a|s), which is the neural network model parameterized with θ, the reward r is the cost in Eq. (1), and the transition probability p(s'|s, a) gives the next state s' after selecting a in s. Note that the transition is deterministic in this work.
Since the REINFORCE algorithm produces high-variance gradient estimates, which can make training converge extremely slowly, a baseline b is utilized to reduce the variance. The parameters θ are updated with the policy gradient method as

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} (L(π_i) − b) ∇_θ log p_θ(π_i | s_i),   θ ← θ − α ∇_θ J(θ),   (11)

where N is the batch size, L(π_i) is the total cost of the i-th sampled solution, α is the learning rate, and b is the baseline. Note that the baseline b in this study is a moving average of the cost during training (Kool, van Hoof & Welling, 2019). The proposed algorithm is trained on 1,280,000 mission instances with a batch size of 512, 100 epochs, and a learning rate of 1e−4 with the Adam optimizer (Kingma & Ba, 2015).
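To make the update rule concrete, the following toy sketch applies REINFORCE with an exponential moving-average baseline to a three-action problem with fixed per-action costs. It is not the paper's training loop; the softmax policy, costs, and hyperparameters are illustrative, but the gradient and baseline updates follow Eq. (11) directly (note ∇_θ log p(a) for a softmax policy is onehot(a) − p).

```python
import numpy as np

rng = np.random.default_rng(0)
costs = np.array([3.0, 1.0, 2.0])      # toy per-action route costs; action 1 is best
theta = np.zeros(3)                    # policy logits
baseline, alpha, beta = 0.0, 0.05, 0.9 # moving-average baseline b, lr, decay

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(5000):
    p = softmax(theta)
    a = rng.choice(3, p=p)                       # sample a "solution"
    L = costs[a] + rng.normal(scale=0.1)         # noisy observed cost L(pi)
    grad_logp = -p                               # grad log p(a) = onehot(a) - p
    grad_logp[a] += 1.0
    theta -= alpha * (L - baseline) * grad_logp  # descend: minimize expected cost
    baseline = beta * baseline + (1 - beta) * L  # moving-average baseline b

best = int(np.argmax(softmax(theta)))            # policy should favor action 1
```

Without the baseline, the term (L − b) would be the raw cost L, whose large positive magnitude inflates the gradient variance; subtracting the running average centers it near zero, which is exactly the stabilization the text describes.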

NUMERICAL SIMULATIONS
This section provides comprehensive simulation results showing the performance of the proposed approach. Every simulation is run on an NVIDIA GeForce RTX 2080 GPU, an Intel(R) i9-9900KF CPU, and 64 GB RAM. The neural network is trained with varying numbers of missions in the range (3, 30). The position of every mission is generated randomly on a (0, 1)-scaled two-dimensional (2-D) map with a uniform distribution. The radius of each coverage mission is generated randomly in the range (0.04, 0.08) with a uniform distribution. The place-position of each delivery mission is offset from the pick-position in the range (−0.1, 0.1) with a uniform distribution. The position of the depot is the origin without loss of generality. The velocity of the UAV is 1 and the flight-time budget is 6. We compare our algorithm (termed Transformer-RL) with Google OR-Tools (https://developers.google.com/optimization/routing/vrp), a state-of-the-art solver for combinatorial optimization problems. We configured the solver in two ways. The first baseline algorithm (OR-Type1) solves the given mission as a distance-limited VRP with a single vehicle. OR-Type1 generates a single route per iteration within the flight-time budget. We assign a penalty cost to uncompleted missions to prevent generating an empty route; the penalty makes the algorithm generate a shorter route while satisfying the flight-time limitation. OR-Type1 then plans iteratively until every mission is completed. The second baseline algorithm (OR-Type2) solves the mission instance as a distance-limited VRP in which the number of available vehicles equals the number of missions. This assumption avoids solving the problem iteratively, unlike OR-Type1; thus, OR-Type2 generates, at once, multiple routes that complete every mission while deciding the desirable number of vehicles to utilize.
We also compared our algorithm with a simple greedy algorithm and a Pointer network-based reinforcement learning algorithm (PointerNet-RL) (Bello et al., 2017). The greedy algorithm selects the next mission with the lowest cost from the current mission node while satisfying the flight-time limitation, and PointerNet-RL uses an RNN (Vinyals, Fortunato & Jaitly, 2015) as the neural network structure instead of the attention mechanism used in this study. Figure 8 visualizes the sample solution and total cost of each algorithm; the return path to the depot of each route is represented with a black dashed line. In Fig. 8, OR-Type2 generates the best solution with the lowest cost, and the reinforcement learning-based algorithms outperform OR-Type1 and the greedy algorithm. OR-Type1 generates each route by completing as many missions as possible per trip, and the greedy algorithm produces the largest number of routes, which is inefficient due to its myopic strategy.
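The myopic greedy baseline described above can be sketched as follows. The exact tie-breaking and recharge rules of the paper's baseline are not specified, so this is one plausible reading: always take the cheapest feasible unvisited mission and return to the depot when nothing fits in the remaining budget.

```python
import numpy as np

def greedy_plan(cost, to_depot, budget):
    """Myopic baseline: from the current node, always take the cheapest
    feasible unvisited mission; return to the depot (node 0) to recharge
    when nothing fits in the remaining budget.

    cost:     (k, k) asymmetric travel-time matrix, node 0 = depot
    to_depot: (k,) time to return to the depot from each node
    budget:   flight-time budget per trip
    Returns (routes, total_cost)."""
    k = cost.shape[0]
    done = np.zeros(k, dtype=bool); done[0] = True
    routes, total = [], 0.0
    cur, remain, route = 0, budget, [0]
    while not done.all():
        feasible = [j for j in range(k)
                    if not done[j] and cost[cur, j] + to_depot[j] <= remain]
        if not feasible:                          # nothing fits: recharge
            total += to_depot[cur]
            routes.append(route + [0])
            cur, remain, route = 0, budget, [0]
            continue
        j = min(feasible, key=lambda j: cost[cur, j])   # myopic choice
        total += cost[cur, j]; remain -= cost[cur, j]
        done[j] = True; route.append(j); cur = j
    total += to_depot[cur]                        # final return to the depot
    routes.append(route + [0])
    return routes, total
```

Because each step ignores everything beyond the immediate cheapest move, the planner tends to strand distant missions on their own trips, which is exactly the many-route, high-cost behavior reported for the greedy baseline.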
The total cost of the solution defined in Eq. (1) and the computation time are used to measure the performance of the algorithms. We use 10,000 mission instance samples to test the performance with varying numbers of missions in the range (3, 30). Figure 9 provides the results of the performance analysis. OR-Type2 shows the best performance in terms of total cost, while Transformer-RL performs similarly. Figure 10 shows more clearly that Transformer-RL outperforms PointerNet-RL, through a cost-gap analysis with respect to OR-Type2. Figure 11 provides the computation time of each algorithm. The computation time of OR-Type2, the best algorithm in terms of cost, grows exponentially with the scale of the mission. In contrast, Transformer-RL and PointerNet-RL show significantly faster computation times than the other algorithms. The greedy algorithm also computes quickly, but it has the worst cost performance. Table 1 summarizes the statistical results for selected numbers of missions: cost performance, performance gap, and computation time.

CONCLUSIONS AND FUTURE WORK
In this article, we proposed an algorithm for planning heterogeneous missions for a single UAV. We formulated the mission planning problem as a vehicle routing problem, for which various solution methods exist. We used an attention-based deep reinforcement learning approach, expecting fast computation time and sufficiently good performance. The numerical experiments show that the proposed algorithm offers a reasonable trade-off between performance and computation time. However, as the proposed algorithm considers a deterministic mission environment and a single UAV, our future work will address the uncertainty of the mission environment, such as the effect of weather conditions, and the operation of multiple UAVs with multi-agent reinforcement learning approaches.