A Bio-Inspired Decision-Making Method of UAV Swarm for Attack-Defense Confrontation via Multi-Agent Reinforcement Learning

The unmanned aerial vehicle (UAV) swarm is regarded as having a significant role in modern warfare, and the demand for UAV swarms capable of attack-defense confrontation is urgent. Existing decision-making methods for UAV swarm confrontation, such as multi-agent reinforcement learning (MARL), suffer from an exponential increase in training time as the size of the swarm increases. Inspired by group hunting behavior in nature, this paper presents a new bio-inspired decision-making method for UAV swarms in attack-defense confrontation via MARL. Firstly, a UAV swarm decision-making framework for confrontation based on a grouping mechanism is established. Secondly, a bio-inspired action space is designed, and a dense reward is added to the reward function to accelerate the convergence of training. Finally, numerical experiments are conducted to evaluate the performance of our method. The results show that the proposed method can be applied to a swarm of 12 UAVs, and when the maximum acceleration of the enemy UAV is within 2.5 times ours, the swarm intercepts the enemy with a success rate above 91%.


Introduction
With the development and maturity of unmanned aerial vehicle (UAV) flight control technology, the platform performance and intelligence level of UAVs are constantly improving. Therefore, the UAV is widely used in the military field and has become more and more significant in modern warfare [1][2][3]. Through collaboration among UAVs, the UAV swarm consisting of multiple UAVs can overcome the limitations of a single UAV in perception and execution and complete complex tasks [4][5][6][7][8][9], such as dynamic task allocation, collaborative reconnaissance, and attack-defense confrontation. Among these tasks, the method for attack-defense confrontation is highly valued as an emerging military technique that requires that the UAV make proper decisions autonomously according to the situation. The need for a UAV swarm with high-level confrontation intelligence is urgent.
This paper focuses on the attack-defense confrontation of a UAV swarm. Generally, in an attack-defense confrontation, the UAV swarm competes against a certain number of enemies with a certain level of intelligence to maximize their respective benefits. The objective of the UAV swarm mainly consists of two parts: destroying the enemy in a limited amount of time and protecting the base from the enemy's invasion. The existing decision-making methods for attack-defense confrontation include matrix game methods, differential game methods, and expert system methods. The main contributions of this paper are as follows:
1. This paper proposes a bio-inspired decision-making method for UAV swarms for attack-defense confrontation via MARL. Traditional MARL methods suffer from an exponential increase in training time as the swarm size increases. To overcome this problem, the main idea of our method is to make the strategy trained for a small-sized UAV group applicable to a large-scale UAV swarm. Inspired by the phenomenon that predators hunt for prey in small groups, we propose a grouping mechanism, which divides the swarm into two types of groups. Through the grouping mechanism, interference between groups is avoided, so the strategy trained for small groups can be applied to a large-scale swarm, and the scalability of the UAV swarm is increased.
2. To prevent the strategy from being stuck in a local optimum during training, a bio-inspired action space is designed. Inspired by group hunting behavior in nature, we abstracted six types of actions that have a clear interactive effect. Compared with a standard action space, the bio-inspired action space improves the success rate of the confrontation. Furthermore, as it is hard for the strategy to converge under a sparse reward, we design four types of dense rewards evaluating the status of the mission to accelerate the convergence of the strategy. The results show that an effective strategy can be obtained after adding the dense rewards.
3. Numerical experiments are conducted to evaluate our method. The results show that our method can obtain effective strategies and take advantage of the UAV swarm. The success rate of the confrontation is increased, and through cooperation the UAV swarm can intercept an enemy that is faster than itself.
This paper is organized as follows: In Section 2, the attack-defense confrontation problem is formulated, and the preliminaries are introduced. In Section 3, the decision-making method of the UAV swarm for attack-defense confrontation is introduced in detail, including the framework, the grouping mechanism, and the design of MARL. In Section 4, the experiment results are presented, and the performance of our method is evaluated. In Section 5, the contribution of this paper is summarized, and future work is presented.

Attack-Defense Confrontation Problem
In this paper, the attack-defense confrontation problem is formulated as follows: As Figure 1 shows, it is assumed that our base has detected an enemy UAV approaching. To protect our base, k UAVs are launched to intercept the enemy UAV. The objective of the enemy UAV is to approach our base while evading our UAVs. If our base is within the detection range of the enemy UAV, our base is considered exposed, and the interception mission fails. Considering that the enemy UAV may take countermeasures, such as radar and infrared countermeasures, to defend itself, an attack from a single UAV is not 100% effective. Therefore, in this paper, only if the enemy UAV is within the attack range of four of our UAVs at the same time is it considered that our UAVs cooperate to launch a saturation attack. In this case, the enemy UAV is considered destroyed, and the interception mission succeeds.
Figure 1. Attack-defense confrontation problem.

As Figure 2 shows, the success conditions of the interception mission are defined as follows:

‖p_i(t_suc) − p_enemy(t_suc)‖ ≤ ρ_atk, ∃U = {u_1, u_2, u_3, u_4} ⊆ {1, 2, . . . , k}, ∀i ∈ U (1)

‖p_base − p_enemy(t)‖ > ρ_det, ∀t ≤ t_suc (2)

t_suc ≤ t_max (3)

where p_i represents the position of the i-th UAV, p_enemy represents the position of the enemy UAV, ρ_atk represents the attack range of our UAVs, and U represents a set containing a certain 4 of the k UAVs. Each element u in set U represents a UAV, p_base represents the position of our base, and ρ_det represents the detection range of the enemy UAV. Equation (1) represents that the enemy UAV is within the attack range of 4 of our UAVs at t_suc. Equation (2) represents that our base is not exposed before t_suc. Equation (3) represents that our UAVs should accomplish the interception mission within a limited time t_max.
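The success and failure conditions above can be checked in simulation with a minimal sketch; the array shapes and function names are our own illustration, not part of the paper:

```python
import numpy as np

def mission_succeeded(p_uavs, p_enemy, rho_atk):
    """Equation (1): the enemy UAV is inside the attack range of at
    least 4 of our k UAVs at the same time step (saturation attack)."""
    dists = np.linalg.norm(p_uavs - p_enemy, axis=1)
    return int((dists <= rho_atk).sum()) >= 4

def base_exposed(p_base, p_enemy, rho_det):
    """Failure condition (violation of Equation (2)): our base lies
    within the enemy UAV's detection range."""
    return np.linalg.norm(p_base - p_enemy) <= rho_det
```

Here `p_uavs` is a (k, 2) array of UAV positions; the other arguments are 2D position vectors and scalar ranges.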

Dynamics Model of the UAV
The UAV is assumed to be a mass point in a two-dimensional plane. The dynamics model of our UAVs is expressed as follows:

dp_i/dt = v_i, dv_i/dt = a_i − λv_i

where dp_i/dt represents the derivative of p_i, i.e., the velocity of the i-th UAV; v_i represents the velocity of the i-th UAV; dv_i/dt represents the derivative of v_i, i.e., the acceleration of the i-th UAV; a_i represents the control input of the i-th UAV; and λ represents the linear drag coefficient of the UAV. Limited by the performance of the UAV, the magnitudes of the velocity and acceleration of the UAV should meet certain constraints:

‖v_i‖ ≤ v_max, ‖a_i‖ ≤ a_max

where v_max and a_max are the velocity limit constant and the acceleration limit constant, respectively. Similarly, the dynamics model of the enemy UAV is expressed as follows:

dp_enemy/dt = v_enemy, dv_enemy/dt = a_enemy − λv_enemy
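The point-mass dynamics and the velocity/acceleration limits can be sketched as a single semi-implicit Euler step; the time step `dt` and the function names are illustrative assumptions:

```python
import numpy as np

def clip_norm(vec, max_norm):
    """Scale a vector down so its magnitude obeys the limit constant."""
    n = np.linalg.norm(vec)
    return vec if n <= max_norm else vec * (max_norm / n)

def step_uav(p, v, a_cmd, lam, v_max, a_max, dt):
    """One semi-implicit Euler step of the point-mass model:
    dp/dt = v, dv/dt = a - lam * v, with |v| <= v_max, |a| <= a_max."""
    a = clip_norm(a_cmd, a_max)          # enforce the acceleration limit
    v = clip_norm(v + (a - lam * v) * dt, v_max)  # enforce the velocity limit
    p = p + v * dt
    return p, v
```

The same step applies to the enemy UAV with its own limits v_enemy_max and a_enemy_max.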
where dp_enemy/dt represents the derivative of p_enemy, i.e., the velocity of the enemy UAV; v_enemy represents the velocity of the enemy UAV; dv_enemy/dt represents the derivative of v_enemy, i.e., the acceleration of the enemy UAV; a_enemy represents the control input of the enemy UAV; and λ represents the linear drag coefficient of the UAV.
The magnitudes of the velocity and acceleration of the enemy UAV should also meet certain constraints:

‖v_enemy‖ ≤ v_enemy_max, ‖a_enemy‖ ≤ a_enemy_max

where v_enemy_max and a_enemy_max are the velocity limit constant and the acceleration limit constant, respectively.

Movement Strategy of Enemy UAV
In an attack-defense confrontation problem, the objective of the enemy UAV is to approach our base as close as possible while keeping as far away as possible from our UAVs. To make the enemy UAV move autonomously, we design the enemy UAV's movement strategy based on the artificial potential field method. The basic idea is to assume that the enemy UAV is subject to an attractive force generated by our base and repulsive forces generated by our UAVs. The enemy UAV moves in a certain direction according to the combined force.
The control input a_enemy of the enemy UAV is expressed as follows:

a_enemy = f(p_base, p_enemy) + Σ_{i=1}^{k} g(p_i, p_enemy)

where f(p_base, p_enemy) represents the attractive force and g(p_i, p_enemy) represents the repulsive force. The attractive force can be calculated as follows:

f(p_base, p_enemy) = a_enemy_max (p_base − p_enemy)/‖p_base − p_enemy‖ (12)

The magnitude of the attractive force is constant, so the enemy UAV will move towards our base even if it is far from it. When the enemy UAV is far from our UAVs, it is not necessary to change its movement direction. Therefore, only if the distance between the enemy UAV and one of our UAVs is smaller than ρ_det will the magnitude of the repulsive force be large enough to affect the movement direction of the enemy UAV.
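The artificial-potential-field strategy can be sketched as follows. The attractive force follows Equation (12); the repulsive term is a hypothetical form chosen only to match the description (zero beyond ρ_det, growing as a UAV gets closer), since the paper's exact formula is not shown here:

```python
import numpy as np

def attractive(p_base, p_enemy, a_max):
    """Equation (12): constant-magnitude pull towards our base."""
    d = p_base - p_enemy
    return a_max * d / np.linalg.norm(d)

def repulsive(p_uav, p_enemy, a_max, rho_det):
    """Hypothetical repulsion: zero beyond rho_det, growing without
    bound as the UAV approaches (the paper only specifies this
    qualitative behaviour)."""
    d = p_enemy - p_uav
    dist = np.linalg.norm(d)
    if dist >= rho_det:
        return np.zeros_like(d)
    return a_max * (rho_det / dist - 1.0) * d / dist

def enemy_control(p_base, p_enemy, p_uavs, a_max, rho_det):
    """Combined force, clipped to the enemy's acceleration limit."""
    f = attractive(p_base, p_enemy, a_max)
    for p in p_uavs:
        f = f + repulsive(p, p_enemy, a_max, rho_det)
    n = np.linalg.norm(f)
    return f if n <= a_max else f * (a_max / n)
```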

Multi-Agent Reinforcement Learning
Reinforcement learning (RL) is a method that enables an agent to learn the optimal behavior strategy through interactions with the environment and is suitable for solving decision-making problems.
Multi-agent reinforcement learning (MARL) is an extension of RL in multi-agent systems. Typically, MARL algorithms adopt a framework of centralized training and decentralized execution (CTDE) [17,18]. The CTDE framework of MARL is shown in Figure 3.

There are two types of neural networks in the CTDE framework: actor networks and critic networks. The input of the actor network is the local observation of agent i, denoted by o_i, and the output of the actor network is the action for agent i to execute, denoted by a_i. The input of the critic network is the joint state s = (o_1, o_2, . . . , o_n) consisting of all local observations together with the joint action a_t = (a_1, . . . , a_n), and the output of the critic network is the state-action value. At time step t, every agent selects its action independently according to its actor network. After the joint action a_t is executed, the joint state s_t is updated, and the reward r(s_t, a_t) received by all agents is used to train the actor and critic networks.
The critic network parameterized by φ is trained by minimizing

L(φ) = E[(Q_φ(s_t, a_t) − y)^2]

where L(φ) represents the loss function of the critic network parameterized by φ; s_t represents the joint state at time step t; a_t represents the joint action at time step t; Q_φ(s_t, a_t) represents the output of the critic network; and y represents the expected output of the critic network:

y = r(s_t, a_t) + γQ_φ(s_{t+1}, a_{t+1})

where r(s_t, a_t) represents the reward for executing the action a_t in the state s_t, and γ is a discount coefficient. The actor network parameterized by µ is updated according to

∇_µ J(µ) = E[∇_µ log π(a_t^i | o_t^i)(Q_φ(s_t, a_t) − b(s_t, a_t))]

where J(µ) represents the objective function of the actor network parameterized by µ; π(a_t^i | o_t^i) represents the output of the actor network, which is the probability for agent i to execute the action a_t^i given the local observation o_t^i; and b(s_t, a_t) represents the baseline of the state-action value.
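The quantities that enter these updates can be sketched in NumPy; this is a schematic of the target, loss, and advantage computations, not the full training loop, and the function names are our own:

```python
import numpy as np

def critic_targets(rewards, q_next, gamma):
    """Expected critic output y = r(s_t, a_t) + gamma * Q(s_{t+1}, a_{t+1})
    for a batch of transitions."""
    return rewards + gamma * q_next

def critic_loss(q_pred, y):
    """L(phi): mean squared error between Q_phi(s_t, a_t) and y."""
    return float(np.mean((q_pred - y) ** 2))

def actor_advantage(q, baseline):
    """Advantage Q(s_t, a_t) - b(s_t, a_t) that weights log pi in the
    policy-gradient update of the actor."""
    return q - baseline
```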

Framework
In an attack-defense confrontation problem, our UAVs should decide how to move to intercept the enemy UAV. Inspired by the predatory behavior of pack hunters in nature, we propose a bio-inspired decision-making method for UAV swarms for attack-defense confrontation. We divide our UAVs into attack groups and backup groups according to the grouping mechanism. The attack group directly engages with the enemy UAV and learns its movement strategy via multi-agent reinforcement learning. The backup groups adjust their formation according to the position of the enemy UAV and stand ready to engage. The framework of the decision-making method of the UAV swarm for attack-defense confrontation is shown in Figure 4.

Figure 4. Framework of the decision-making method of the UAV swarm for attack-defense confrontation.


Grouping Mechanism
Based on a dataset of observations of wolves hunting elk in Yellowstone National Park, MacNulty suggests that the relationship between hunting success and group size is nonlinear [19]. When the group size is small, hunting success increases as the group size increases. However, hunting success peaks at a small group size and levels off when the group size exceeds 4. The reason for this phenomenon is that individuals in a small group cooperate better and their abilities are fully exhibited, while in a large group, individuals interfere with each other and some individuals cannot contribute to the hunt.
Similarly, when the group size of the UAV swarm is large, our UAVs interfere with each other, making it difficult to intercept the enemy UAV. Therefore, as shown in Figure 5, our UAV swarm is divided into several groups, and the area is divided into several zones. Every group is composed of four UAVs and is distributed in different zones. If the enemy UAV enters a zone, the UAV group in the zone becomes the attack group, and other UAV groups become the backup groups.
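The zone-based role assignment can be sketched as follows; the vertical-strip zoning in `zone_of` is a hypothetical example, since the paper does not specify the zone geometry, and both function names are our own:

```python
def assign_roles(group_zones, enemy_zone):
    """Return 'attack' for the group whose zone contains the enemy UAV
    and 'backup' for every other group."""
    return ['attack' if z == enemy_zone else 'backup' for z in group_zones]

def zone_of(p, zone_width):
    """Hypothetical zoning: vertical strips of equal width along x.
    p is a 2D position (x, y)."""
    return int(p[0] // zone_width)
```

When the enemy crosses into another strip, re-running `assign_roles` hands the attack role to the group stationed there, so the previous attack group stops pursuing.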
The attack group intercepts the enemy via MARL, which is presented in detail in Section 3.3. If the enemy UAV moves to another zone, the UAV group stops pursuing to prevent interfering with other UAV groups.
The backup groups should adjust their positions dynamically according to the positions of our base and the enemy UAV. As shown in Figure 6, assume that the current position of our base is p_base = (x_b, y_b), the current position of the enemy is p_enemy = (x_e, y_e), and the current position of the formation center of the UAV group is p_center = (x_c, y_c). The expected position of the formation center of the UAV group, p_c^e = (x_c^e, y_c^e), should be on the line between our base and the enemy UAV.
We design a discrete-time proportional-derivative (PD) controller to control the movement of the UAVs in the backup groups. The control input a_i(t) at time t for the i-th UAV can be determined as follows:

a_i(t) = k_p e_i(t) + k_d (e_i(t) − e_i(t − T_s))/T_s

where e_i(t) denotes the error between the expected and current positions of the i-th UAV, and k_p = 2.5, k_d = 2.2, and T_s = 0.2 s are parameters of the PD controller.
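A minimal sketch of the discrete PD law with the paper's gains; defining the error as the difference between the expected and current positions is the standard form and is assumed here:

```python
import numpy as np

KP, KD, TS = 2.5, 2.2, 0.2  # PD gains and sample time from the paper

def pd_control(p_expected, p_current, prev_error):
    """Discrete PD controller for a backup-group UAV:
    a = kp * e(t) + kd * (e(t) - e(t - Ts)) / Ts,
    with e(t) = p_expected - p_current."""
    error = p_expected - p_current
    a = KP * error + KD * (error - prev_error) / TS
    return a, error  # return the error so the caller can feed it back
```

The caller stores the returned `error` and passes it as `prev_error` on the next control step.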

Design of MARL
The attack group is trained to intercept the enemy UAV based on MARL. Therefore, the elements of MARL, including action space, state space, and reward function, should be designed, respectively.

Bio-Inspired Action Space
Many predators in nature hunt in groups for prey that is faster or larger than themselves. Similarly, in an attack-defense confrontation, our UAVs are predators, and the enemy UAV is the prey. Inspired by the hunting behavior of herd predators in nature, a bio-inspired action space is proposed. The bio-inspired action space contains two types of interaction: interaction between enemy UAVs and our UAVs and interaction among our UAVs.

(1) Interaction between Enemy UAVs and Our UAVs
MacNulty summarized the ethogram of large-carnivore predatory behavior by observing wolves in Yellowstone National Park [20]. He proposed that predatory behavior can be divided into six phases: search, approach, watch, attack-group, attack-individual, and capture. This paper focuses on the three main phases of group hunting behavior: approach, watch, and attack-individual, and abstracts these three phases into three types of action.
Approach. As shown in Figure 7, when our UAV and the enemy UAV are far apart, our UAV takes approaching action to quickly decrease the distance to the enemy UAV for performing the interception mission.
The control input of the i-th UAV for the approach action can be calculated as follows:

a_i = a_max (p_enemy − p_i)/‖p_enemy − p_i‖

Watch. As shown in Figure 8, when our UAV is not within the detection range of the enemy UAV, it takes the watching action to keep its distance from the enemy UAV and avoid causing the enemy UAV to escape. During this phase, our UAVs encircle the enemy UAV in preparation for the next phase of the interception mission. When our UAV takes this action, it moves clockwise or counter-clockwise with the enemy UAV as the center of the circle. As shown in Figure 9, taking clockwise motion as an example, the control input of the i-th UAV can be calculated as follows:

a_i = a_max R(θ)e_r, a_r = v_t^2/‖p_enemy − p_i‖, θ = arccos(a_r/a_max)

where v_t represents the tangential velocity of our UAV relative to the enemy UAV; e_t represents the unit vector perpendicular to the line from the position of our UAV to the position of the enemy UAV; a_r represents the centripetal acceleration corresponding to the tangential velocity; θ represents the angle between the direction of the control input of our UAV and the direction of the line connecting the enemy UAV and our UAV; R(θ) represents the rotation matrix; and e_r represents the unit vector in the direction of the line from the position of our UAV to the position of the enemy UAV. Similarly, the control input of counter-clockwise motion can be calculated with the rotation reversed, i.e., a_i = a_max R(−θ)e_r.

Attack-individual. As shown in Figure 10, similar to the harassment of a wolf pack, our UAVs induce the enemy UAV to move in a certain direction by constantly alternating between attack and retreat. In the process, our UAVs shrink the size of the encirclement, eventually achieving the capture of the enemy UAV. It is noted that the direction of the control input during our UAV's attack and retreat is not along the direction of the line connecting our UAV and the enemy UAV but rather towards the predicted future position of the enemy UAV.
The control input of attack can be calculated as follows:

a_i = a_max (p̂_enemy − p_i)/‖p̂_enemy − p_i‖ (27)

The control input of retreat can be calculated as follows:

a_i = −a_max (p̂_enemy − p_i)/‖p̂_enemy − p_i‖ (28)

where p̂_enemy in Equations (27) and (28) represents the predicted future position of the enemy UAV, which can be calculated as follows:

p̂_enemy = p_enemy + λ_d ‖p_enemy − p_i‖ v_enemy (29)

where λ_d represents the prediction coefficient. The larger the prediction coefficient, the more distant the predicted future position. Additionally, it can be seen that the predicted future position is related to the speed of the enemy UAV and the distance between the enemy UAV and our UAV. This is because the greater the speed of the enemy UAV or the greater the distance between the enemy UAV and our UAV, the greater the offset required to intercept, and the greater the distance between the predicted future position and the current position.
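The attack/retreat inputs and the predicted future position can be sketched as follows; the exact form of the prediction offset is our reading of the description (an offset along the enemy's velocity, proportional to the enemy's speed and to the UAV-enemy distance):

```python
import numpy as np

def predicted_position(p_enemy, v_enemy, p_uav, lam_d):
    """Assumed form of the prediction: the offset grows with the
    enemy's speed and with the distance between enemy and UAV."""
    dist = np.linalg.norm(p_enemy - p_uav)
    return p_enemy + lam_d * dist * v_enemy

def attack_input(p_uav, p_hat, a_max):
    """Accelerate towards the predicted future position of the enemy."""
    d = p_hat - p_uav
    return a_max * d / np.linalg.norm(d)

def retreat_input(p_uav, p_hat, a_max):
    """Accelerate away from the predicted future position."""
    return -attack_input(p_uav, p_hat, a_max)
```

Alternating between `attack_input` and `retreat_input` while the group tightens the ring around the enemy reproduces the harassment pattern described above.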
(2) Interaction among Our UAVs In this paper, interaction among our UAVs is abstracted into three types of action: separation, alignment, and cohesion.
Separation. As shown in Figure 11, our UAVs take separation actions to prevent collisions between each other.
The control input of the i-th UAV for separation can be calculated as follows:

a_i = a_max Σ_{j≠i} w_j (p_i − p_j)/‖p_i − p_j‖

where w_j denotes the weighting factor, which grows as the j-th UAV gets closer and can be calculated as follows:

w_j = ‖p_i − p_j‖^(−1) / Σ_{l≠i} ‖p_i − p_l‖^(−1)

Alignment. As shown in Figure 12, our UAVs take the alignment action to keep each other at a certain distance and achieve group movement. The control input of the i-th UAV can be calculated as follows:

a_i = a_max (v_avg − v_i)/‖v_avg − v_i‖

where v_avg denotes the average velocity of the other UAVs, which can be calculated as follows:

v_avg = (1/(n − 1)) Σ_{j≠i} v_j

Cohesion. As shown in Figure 13, our UAVs take the cohesion action to approach each other and facilitate mutual support. The control input of the i-th UAV can be calculated as follows:

a_i = a_max (p_avg − p_i)/‖p_avg − p_i‖

where p_avg denotes the average position of the other UAVs, which can be calculated as follows:

p_avg = (1/(n − 1)) Σ_{j≠i} p_j

(3) Action Space. The action space of our UAVs contains nine actions: approach, watch (clockwise), watch (counter-clockwise), attack-individual (attack), attack-individual (retreat), separation, alignment, cohesion, and void. Each action corresponds to a control input, and the control input for void is 0.
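The three flocking-style actions can be sketched as follows; the inverse-distance weighting in `separation` is an assumption, since the paper gives only the qualitative role of w_j:

```python
import numpy as np

def separation(p_i, others, a_max):
    """Push away from neighbours; closer neighbours get larger weights
    (assumed normalized inverse-distance weighting)."""
    dists = [np.linalg.norm(p_i - p_j) for p_j in others]
    inv = np.array([1.0 / d for d in dists])
    w = inv / inv.sum()
    direction = sum(wj * (p_i - p_j) / dj
                    for wj, p_j, dj in zip(w, others, dists))
    return a_max * direction / np.linalg.norm(direction)

def alignment(v_i, other_vels, a_max):
    """Steer towards the average velocity v_avg of the other UAVs."""
    v_avg = np.mean(other_vels, axis=0)
    d = v_avg - v_i
    return a_max * d / np.linalg.norm(d)

def cohesion(p_i, others, a_max):
    """Steer towards the average position p_avg of the other UAVs."""
    p_avg = np.mean(others, axis=0)
    d = p_avg - p_i
    return a_max * d / np.linalg.norm(d)
```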

State Space
The local observation o_i of the i-th UAV consists of information from three parts: the enemy UAV, our base, and the other UAVs. Specifically, o_i can be expressed as follows:

o_i = (p_rel_enemy, v_rel_enemy, p_rel_base, p_rel_{j,i}, . . .)

where p_rel_enemy and v_rel_enemy represent the relative position and the relative velocity of the enemy UAV, respectively; p_rel_base represents the relative position of our base; and p_rel_{j,i} represents the relative position of the j-th UAV.
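Assembling o_i can be sketched as a concatenation of relative quantities; taking the enemy's velocity relative to the UAV's own velocity is our assumption, as the paper does not spell out the reference frame:

```python
import numpy as np

def local_observation(p_i, v_i, p_enemy, v_enemy, p_base, p_others):
    """Concatenate the quantities forming o_i: the enemy's relative
    position and velocity, the base's relative position, and the
    relative positions of the other UAVs in the group."""
    parts = [p_enemy - p_i, v_enemy - v_i, p_base - p_i]
    parts += [p_j - p_i for p_j in p_others]
    return np.concatenate(parts)
```

For a four-UAV attack group in 2D, each UAV has three teammates, giving a 12-dimensional observation vector.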

Reward Function
In MARL, a score score_suc is usually determined based on the success of the task and used as the reward r for training.
However, the biggest problem with such a setup is that the rewards are too sparse. Especially when the task is hard to accomplish, the agents cannot obtain rewards in a short time, and it is difficult to evaluate the quality of the current strategy. The update direction of the strategy then becomes random, making the algorithm difficult to converge. To solve this problem, this paper modifies the reward function by adding prior knowledge: by evaluating the current status, dense rewards are added to induce the agents to update the strategy in the direction of the superior status.
Considering that our UAVs need to approach the enemy UAV to within a certain distance to perform the interception mission, a status evaluation function score_dis related to the distance to the enemy UAV is added. The function value remains constant when the distance is smaller than ρ_atk, and it decreases gradually to 0 as the distance increases. Furthermore, the function is smooth, bounded, and differentiable in its domain, which facilitates the training of the neural network and avoids gradient explosion.
Additionally, to avoid the enemy UAV escaping in the direction opposite from our UAVs, our UAVs should be scattered around the enemy and intercept it from different directions. So, a status evaluation function score_encircle related to the dispersion of our UAVs is added, where θ_i represents the angle between the line connecting the i-th UAV and the enemy and the line connecting its counter-clockwise neighboring UAV and the enemy, as shown in Figure 14, and σ represents the standard deviation of these angles.
Meanwhile, since the main goal of the interception mission is to prevent the enemy from approaching our base, the closer the enemy is to our base, the greater the threat to our base. A status evaluation function score_base related to the distance to our base is added. Additionally, in the early period of training, it is easy for the enemy to invade our base. To update the strategy of our UAVs for hindering the enemy, a time reward function score_time is added. Therefore, the modified reward function for training is expressed as follows:

r = ω_s score_suc + ω_d score_dis + ω_e score_encircle + ω_b score_base + ω_t score_time (48)

where ω_s = 10, ω_d = 2, ω_e = 3, ω_b = 3, and ω_t = 1 are weighting factors. The weight parameters in (48) were selected empirically: the greater the contribution of a function to the interception mission, the greater its weight.
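The modified reward of Equation (48) is a weighted sum that can be sketched directly; the individual score functions are passed in as arguments, since their closed forms are described only qualitatively here:

```python
# Weighting factors from Equation (48)
W_S, W_D, W_E, W_B, W_T = 10.0, 2.0, 3.0, 3.0, 1.0

def reward(score_suc, score_dis, score_encircle, score_base, score_time):
    """Weighted sum of the sparse success score and the four dense
    status-evaluation scores (Equation (48))."""
    return (W_S * score_suc + W_D * score_dis + W_E * score_encircle
            + W_B * score_base + W_T * score_time)
```

Because ω_s dominates the other weights, the success score still drives the strategy once the dense terms have guided early training.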

Numerical Experiments
In this section, the strategy of the attack group is trained, and the strategy is applied to a swarm of 12 UAVs according to the grouping mechanism. Numerical experiments with enemies with different maximum accelerations are executed to test the performance of our method.

Experiment Setup
The experiment environment is built using Unity's ML-Agents Toolkit. As shown in Figure 15, the training environment is 100 m long and 100 m wide. The circle on the left represents our base. The four squares represent four UAVs of the attack group. The circle on the right represents the enemy UAV. Parameters of the environment are listed in Table 1. The training parameters of MARL are listed in Table 2.

As Figure 16 shows, the 12 UAVs are divided into 3 groups, and the environment is 175 m long and 100 m wide.

Performance Analysis
To validate the bio-inspired action space in our method, the success rates of the method with bio-inspired action space and the original action space in the training process are compared. The original action space contains five actions: up, down, left, right, and void. The curves of the success rates are shown in Figure 17, and the final success rates after 45,000 episodes of training are listed in Table 3.


Table 3. Final success rates after 45,000 episodes of training.

Method                      Final Success Rate
Original Action Space       89%
Bio-Inspired Action Space   97%

It can be seen that both curves converged after 45,000 episodes of training. The curve with the bio-inspired action space grew slowly in the early period of training but grew rapidly later, and the success rate eventually remained at 97%. The curve with the original action space grew rapidly in the early period of training but grew slowly after 24,000 episodes, and the success rate eventually remained at 89%. This shows that the bio-inspired action space can avoid being stuck in a local optimum and increases the final success rate. Compared to the original action space, the bio-inspired action space contains more types of actions, resulting in slow growth of the success rate in the early period. However, these actions have a clear interactive effect on both our UAVs and the enemy UAV, which facilitates the update of the strategy in a better direction.
After the strategy of the attack group is obtained, the success rate of the attack group against enemies with different maximum accelerations is evaluated. The results are shown in Figure 18 and Table 4.


The strategy is applied to a swarm of 12 UAVs, and the success rate against enemies with different maximum accelerations is obtained. The results are shown in Figure 19 and Table 5.

Figure 19. Success rates of the UAV swarm against the enemy with different maximum accelerations.

It can be seen that the success rate decreases as the maximum acceleration of the enemy UAV increases. Compared to the success rate of the attack group, the success rate of the UAV swarm is higher: the success rate against enemies with 3 times our maximum acceleration increased from 2% to 53%. This shows that the grouping mechanism of our method can take advantage of the UAV swarm and increase the success rate. When the enemy's maximum acceleration is within 2.5 times ours, our UAV swarm can intercept the enemy well, and the success rate is above 91%.
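The sweep reported in Figure 19 amounts to a Monte Carlo estimate of the success rate at each acceleration ratio. A minimal sketch, assuming a user-supplied `run_episode(ratio)` simulator (hypothetical here, stubbed for illustration):

```python
def estimate_success_rate(run_episode, ratio, n_episodes=1000):
    """Fraction of episodes in which the swarm intercepts an enemy whose
    maximum acceleration is `ratio` times ours."""
    successes = sum(run_episode(ratio) for _ in range(n_episodes))
    return successes / n_episodes

# Stub simulator for illustration only: always succeeds while the ratio
# stays within 2.5, mimicking the trend reported in Table 5.
def stub_episode(ratio):
    return ratio <= 2.5

print(estimate_success_rate(stub_episode, 2.5, n_episodes=100))  # 1.0
```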

Demonstration of Attack-Defense Confrontation
In this subsection, the process of the interception mission performed by the attack group and the UAV swarm is recorded. Figures 20 and 21 show how the attack group intercepts an enemy UAV. The maximum acceleration of the enemy is 0.45 m·s⁻², the maximum speed of the enemy is 1.5 m·s⁻¹, the maximum acceleration of our UAVs is 0.3 m·s⁻², and the maximum speed of our UAVs is 1.0 m·s⁻¹.
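The kinematic limits quoted above (for our UAVs, a_max = 0.3 m·s⁻² and v_max = 1.0 m·s⁻¹) can be enforced with a clamped 2D point-mass update. This integrator is a sketch of the simplified dynamics the experiments assume, not the exact simulation code; the time step `dt` is an assumption.

```python
import math

def step(pos, vel, accel_cmd, a_max, v_max, dt=0.1):
    """Advance a 2D point-mass UAV one step, clamping the commanded
    acceleration to a_max and the resulting speed to v_max."""
    a_norm = math.hypot(*accel_cmd)
    if a_norm > a_max:
        accel_cmd = (accel_cmd[0] * a_max / a_norm, accel_cmd[1] * a_max / a_norm)
    vel = (vel[0] + accel_cmd[0] * dt, vel[1] + accel_cmd[1] * dt)
    speed = math.hypot(*vel)
    if speed > v_max:
        vel = (vel[0] * v_max / speed, vel[1] * v_max / speed)
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt), vel

# Full-throttle along x from rest: speed saturates at v_max = 1.0 m/s.
pos, vel = (0.0, 0.0), (0.0, 0.0)
for _ in range(100):  # 10 s at dt = 0.1 s
    pos, vel = step(pos, vel, (1.0, 0.0), a_max=0.3, v_max=1.0)
print(round(math.hypot(*vel), 2))  # 1.0
```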
When the episode begins, the attack group approaches the enemy UAV to perform the interception mission. At t = 12 s, the speeds of our UAVs show a large difference. The speed of UAV 1 and UAV 4 is about 0.9 m·s⁻¹, faster than the speed of UAV 2 and UAV 3, which is about 0.75 m·s⁻¹. Thus, our UAVs form a U-shaped formation, which helps prevent the enemy from escaping. At t = 26 s, the enemy is within the attack range of all four of our UAVs, and the interception mission is successful.

Figures 22 and 23 show how the UAV swarm intercepts an enemy UAV. Twelve UAVs are divided into three groups.
Group 1 consists of UAVs 1 to 4, Group 2 consists of UAVs 5 to 8, and Group 3 consists of UAVs 9 to 12. The maximum acceleration of the enemy is 0.75 m·s⁻², the maximum speed of the enemy is 2.5 m·s⁻¹, the maximum acceleration of our UAVs is 0.3 m·s⁻², and the maximum speed of our UAVs is 1.0 m·s⁻¹. When the episode begins, Group 3 approaches the enemy UAV, and Groups 1 and 2 adjust their positions in their zones. From t = 20.9 s to t = 36.6 s, the enemy UAV, with the advantage of higher performance, accelerates to a higher speed to avoid the interception, breaks through the defense line formed by Group 3, and enters the zone of Group 2. Group 2 forms a U-shaped formation at t = 41.9 s and eventually intercepts the enemy UAV at t = 47.1 s.
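The grouping mechanism used here (UAVs 1 to 4, 5 to 8, and 9 to 12) is a fixed partition, which can be sketched as below. The equal-width zone lookup along the 175 m arena is an illustrative assumption about how a zone is matched to the enemy's position, not the paper's exact layout.

```python
def make_groups(n_uavs, group_size):
    """Partition UAV ids 1..n_uavs into consecutive fixed groups."""
    ids = list(range(1, n_uavs + 1))
    return [ids[i:i + group_size] for i in range(0, n_uavs, group_size)]

def zone_of(x, arena_length=175.0, n_zones=3):
    """0-based index of the zone containing position x; equal-width
    zones are an assumption for illustration."""
    width = arena_length / n_zones
    return min(int(x // width), n_zones - 1)

groups = make_groups(12, 4)
print(groups[2])        # [9, 10, 11, 12]
print(zone_of(150.0))   # 2 -> the enemy starts in the rightmost zone
```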
Although the enemy UAV broke through the defense line formed by Group 3 in the above process, Group 3 still played the role of hindering the enemy UAV and bought enough time for Group 2 to dynamically adjust its position. By the time the enemy UAV entered the zone of Group 2, Group 2 had already moved to a suitable position, so it was able to quickly form an interception formation and intercept the enemy.

Conclusions
This paper proposes a decision-making method for UAV swarms for attack-defense confrontation via MARL. For traditional MARL methods, the training time increases exponentially as the swarm size increases. Inspired by the phenomenon that many predators in nature hunt in small groups, our method abstracts a grouping mechanism to fully utilize the capability of the UAV swarm and mitigate interference between UAVs. The confrontation strategy is first obtained by training a group of four UAVs. Then, according to the proposed grouping mechanism, the strategy is applied to a larger-scale swarm. Therefore, even if the swarm size increases, the training time remains the same. Furthermore, to prevent the strategy from being stuck in a local optimum during training, six types of actions that have a clear interactive effect are generalized from hunting behavior. Several experiments are conducted to evaluate the performance of our method. The results show that when the maximum acceleration of the enemy UAV is within 2.5 times ours, a swarm of 12 UAVs can intercept the enemy well, and the success rate is above 91%. In addition, the grouping mechanism takes advantage of the UAV swarm and increases the success rate, and the method with the bio-inspired action space achieves a higher success rate than the method with the original action space.
In this work, it is assumed that all UAVs are restricted to a 2D plane and that the UAV can obtain information about other UAVs without delay. Current work has mainly validated the effectiveness of our method on a simplified model. For future work, we will use a more precise dynamics model of UAVs and consider more constraints. Additionally, our method will be applied in a real-world flight experiment to demonstrate its feasibility.