Solving large-scale multi-agent tasks via transfer learning with dynamic state representation

Many research results have emerged in the past decade regarding multi-agent reinforcement learning. These include the successful application of asynchronous advantage actor-critic, double deep Q-network and other algorithms in multi-agent environments, and the representative multi-agent training method based on the classical centralized training distributed execution algorithm QMIX. However, in a large-scale multi-agent environment, training becomes a major challenge due to the exponential growth of the state-action space. In this article, we design a training scheme that moves from small-scale multi-agent training to large-scale multi-agent training. We use transfer learning to let the training of large-scale agent tasks reuse the knowledge accumulated by training small-scale agent tasks. We achieve policy transfer between tasks with different numbers of agents by designing a new dynamic state representation network, which uses a self-attention mechanism to capture and represent the local observations of agents. The dynamic state representation network makes it possible to expand the policy model from tasks with a few agents (4 or 10 agents) to tasks with large-scale agents (16 or 50 agents). Furthermore, we conducted experiments in the well-known real-time strategy game StarCraft II and on the multi-agent research platform MAgent, and we also ran unmanned aerial vehicle (UAV) trajectory planning simulations. Experimental results show that our approach not only reduces the time consumption of large-scale agent training tasks but also improves the final training performance.


Introduction
Reinforcement learning (RL) has received increasing attention from the artificial intelligence (AI) research community in recent years. Deep reinforcement learning (DRL) 1 in single-agent tasks is a practical framework for solving decision-making tasks at a human level 2 by training a dynamic agent that interacts with the environment. Cooperative multi-agent reinforcement learning (MARL) is a more complicated problem in the RL field due to the exponential growth of decision dimensionality. 3 The approach encourages multiple agents to achieve a goal through credit assignment, 4 and it has a solid link to many real-world problems, such as performing well in multi-player video games 5 and traffic light control. 6 However, there are many challenges in MARL, where agents must interact with each other in a shared environment. 7 Furthermore, in a large-scale multi-agent task, the dynamic environment becomes more complicated and even unsolvable. 8 Transfer learning (TL) is an efficient way to solve RL problems by leveraging prior knowledge. 9,10 Reusing existing knowledge can accelerate the RL agent's learning process and make complex tasks learnable. It is crucial to decide how, when, and what knowledge to store in the knowledge space and reuse. 9 There is no generally valid solution for all domains. An improper transfer might harm the learning process instead of accelerating it, which is known as negative transfer. Therefore, it is crucial to design transfer principles for different RL training scenarios, especially for complex tasks. In this article, we use TL methods to improve the training efficiency and effectiveness of large-scale multi-agent tasks.
The state observed by agents in multi-agent training under partially observable settings changes dynamically. This poses an obstacle to the transfer of policies across multi-agent tasks with different numbers of agents. To solve this problem, we must address dynamic states with a state representation approach. To the best of our knowledge, no related work has explored this topic in a multi-agent scenario. In this article, we use an attention mechanism to handle dynamic observations so that the multi-agent observation dimension remains stable across tasks with different numbers of agents. This approach also lays the foundation for policy transfer.
Thus, on the basis of the above work, we use TL to help solve the large-scale multi-agent training problem. First, we propose a dynamic state representation network (DSRNet) to remove the transfer barrier between tasks with different numbers of agents. In turn, various classical RL algorithms can be combined with the transfer method. We then select typical methods to verify the capability of our approach in different experimental settings.
This article focuses on a real-time strategy (RTS) game to explore large-scale fully cooperative MARL. StarCraft is an RTS game that is very popular around the world and provides a suitable environment for AI researchers to simulate combat scenarios. SMAC 11 has become a standard benchmark for evaluating discrete cooperative MARL algorithms. We scaled SMAC to our large-scale multi-agent needs, using the classical centralized training distributed execution algorithm QMIX 12 as a baseline to test our transfer framework. Another set of experiments used MAgent, 13 a platform that supports a larger number of agents. On top of this platform, we chose the classical independent RL training methods double deep Q-network (Double DQN) 14 and asynchronous advantage actor-critic (A3C) 15 to validate our transfer method. Moreover, we conduct UAV collision avoidance planning simulations that show our framework's ability to support large-scale robot control training.
The main contributions of this article can be summarized as follows: (a) our approach verifies the feasibility of TL in large-scale multi-agent cooperation schemes; (b) we introduce an attention network for the single agent in the partially observable setting to represent a variable number of units; and (c) we achieve practical TL with good performance from a few agents to more agents in different environments, which sheds light on the training problem of very large-scale multi-agent scenes.
In the next part of this article, we present our work through sections on related work, problem formulation and background knowledge, MARL transfer framework, experiments, potential robotic applications, and conclusion.

Related work
TL has played an important role in accelerating single-agent RL by adapting learned knowledge from past relevant tasks. 10,16,17 Inspired by this scenario, TL in MARL [18][19][20][21] has also been studied with respect to transferring knowledge across multi-agent tasks to help improve learning performance. The above work has two main directions: knowledge transfer across tasks and transfer among agents. However, few works consider transferring knowledge across different numbers of agents, especially from a small number of agents to a large number of agents.
Attention mechanisms have become an essential model adopted in many deep neural networks. In particular, self-attention 22 trains the attention weight at a specific position in a sequence by considering all other positions in the sequence. Vaswani et al. 23 showed that a machine translation model composed only of self-attention could achieve state-of-the-art results. Wang et al. 24 reconstructed self-attention as a non-local operation to model spatial-temporal dependencies in video sequences. Nevertheless, self-attention mechanisms have not been fully explored in MARL.
State representation learning (SRL) aims to capture changes in the environment caused by the agent's actions; this special representation is particularly suitable for extracting dynamic states in RL tasks. The main function of SRL is to generate a low-dimensional state space in which an RL policy can perform well and be efficient. The studies [25][26][27][28][29] adapt SRL methods to make the RL training process faster by separating the representation learning process from the policy learning process.
The purpose of policy distillation is to remove parameters that are not necessary for the original model, thereby improving the generalization of the traditional model. 30 Distillation is performed by comparing the classification results of the teacher and student networks, without loss of information, using soft labels. Policy distillation, which is the distillation of one or more behavioral policies from a teacher model to a student model, has been introduced to RL. 31 Policy distillation has three cases: (1) the student model is trained with a negative log-likelihood (NLL) loss to predict the same task, (2) the student model is trained with a mean-squared-error (MSE) loss, and (3) the student model is trained with a Kullback-Leibler (KL) divergence loss. This approach allows the network size to be compressed without performance degradation, and multiple task-specific policies can be consolidated into a single policy. Policy distillation is widely used in single-agent RL. 32

A few previous works have focused on the transfer properties of MARL; for example, Barrett and Stone 33 proposed an ad hoc teamwork algorithm, Omidshafiei et al. 34 proposed the LeCTR algorithm to accomplish knowledge transfer between two agents, and Hernandez-Leal et al. 35 combined the Bayesian method and the Pepper model in multi-agent matchmaking. However, the above methods do not address the problem of large-scale agents and do not scale well as the number of agents grows, which is the challenge addressed in this work. In this article, we devise a method to efficiently distill large and heavy network policies into small and light networks in the deep MARL environment by drawing on these distillation methods.
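The three loss cases listed above can be sketched concretely. This is a minimal, illustrative numpy sketch, not the cited implementations; temperature handling and shapes are simplifying assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def nll_loss(teacher_logits, student_logits):
    # Case (1): student predicts the teacher's greedy action.
    target = int(np.argmax(teacher_logits))
    return -float(np.log(softmax(student_logits)[target]))

def mse_loss(teacher_logits, student_logits):
    # Case (2): match the raw action values directly.
    return float(np.mean((teacher_logits - student_logits) ** 2))

def kl_loss(teacher_logits, student_logits):
    # Case (3): match the full action distributions.
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([2.0, 0.0, 1.0])
student = np.array([2.0, 0.0, 1.0])
```

When teacher and student agree exactly, the MSE and KL losses vanish, while the NLL loss stays positive because the student assigns probability less than one to the greedy action.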

Partially observable stochastic games
Fully cooperative multi-agent tasks can be modelled as decentralized partially observable stochastic games (POSGs) 36 that extend from Markov decision processes (MDPs). In this article, we follow the POSG setting, where agents cannot obtain complete environmental information.
A POSG is composed of a tuple $\langle S, U, T, R_{1:n}, O, n, \gamma \rangle$, where $S = S_0 \times S_1 \times \dots \times S_n$ is the whole state space of the environment; $U = A_1 \times \dots \times A_n$ is the joint action space; $T : S \times U \times S \to [0, 1]$ is the state transition function; $R_i : S \times U \times S \to \mathbb{R}$ is the reward function for agent $i$; $O$ is the observation space for agents; $n$ is the number of agents; and $\gamma$ is the discount factor. The goal of the agents is to learn a policy $\pi$ that maximizes the expected discounted accumulated reward $\mathbb{E}\left[\sum_{t} \gamma^{t} r_{t}\right]$, where $t$ is the time step.

State representation learning
State representation learning (SRL) is a special form of representation learning that learns abstract state features in low dimensions. Formally, the SRL task is to learn a mapping function $\phi$ that translates the current dynamic high-dimensional state into a concise low-dimensional representation $r_t = \phi(s_t)$. We can then decompose the policy function as $\pi(s_t) = \rho(\phi(s_t))$, where $\rho$ is the policy learned over the representation space. In particular, Martin et al. 37 defined a good state representation as one that can represent the actual value of the current state and generalize the learned policy to unseen states, even unseen tasks.

Self-attention mechanism
Attention mechanisms have been widely adopted in computer vision and natural language processing. 38 Such mechanisms make neural networks focus on important feature representations.
Vaswani et al. 23 adopt queries, keys and values, described by three matrices $Q$, $K$ and $V$. The final attention is calculated as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

where $d_k$ is a scaling factor. In our method, we adopt a multi-head attention framework to learn the dynamic observation features and the relationships between different agents' observations.
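As a concrete illustration, the scaled dot-product attention above can be sketched in a few lines of numpy; the shapes and random inputs below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights

# Illustrative shapes: 3 queries attend over 4 key/value pairs of width 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, which is what lets attention summarize a variable number of inputs into a fixed-width vector.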

QMIX
To solve the multi-agent problem under the centralized training and decentralized execution paradigm, QMIX 12 proposed a method that learns a joint action-value function $Q_{tot}$. The approach adopts a mixing network to decompose the joint $Q_{tot}$ into each agent's independent $Q_i$. $Q_{tot}$ can be computed as

$Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta, \phi) = g_{\phi}\left(s, Q_1(\tau^1, u^1; \theta^1), \dots, Q_n(\tau^n, u^n; \theta^n)\right)$

where $\phi$ denotes the parameters of the mixing network.
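A minimal numpy sketch of the monotonic mixing step may clarify the decomposition. This is an assumption-laden illustration, not the authors' implementation: QMIX obtains the mixing weights from hypernetworks conditioned on the global state, which are folded here into simple state-conditioned linear maps, and abs() enforces the monotonicity constraint.

```python
import numpy as np

def qmix_mixer(agent_qs, state, W1, b1, W2, b2):
    """Mix per-agent Q-values into Q_tot with state-conditioned weights."""
    n_agents = len(agent_qs)
    # abs() keeps the mixing weights non-negative, so dQ_tot/dQ_i >= 0.
    w1 = np.abs(W1 @ state).reshape(n_agents, -1)  # (n_agents, hidden)
    h = np.maximum(agent_qs @ w1 + b1, 0.0)        # the paper uses ELU; ReLU here
    w2 = np.abs(W2 @ state)                        # (hidden,)
    return float(h @ w2 + b2)

rng = np.random.default_rng(1)
n_agents, state_dim, hidden = 4, 6, 8
W1 = rng.normal(size=(n_agents * hidden, state_dim))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(hidden, state_dim))
s = rng.normal(size=state_dim)
qs = rng.normal(size=n_agents)
q_tot = qmix_mixer(qs, s, W1, b1, W2, 0.0)

# Monotonicity check: raising any single agent's Q never lowers Q_tot.
qs_up = qs.copy()
qs_up[0] += 1.0
```

The non-negative weights guarantee that improving any individual $Q_i$ cannot decrease $Q_{tot}$, which is what allows decentralized greedy action selection to be consistent with the centralized objective.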

MARL transfer framework
In this study, we propose a multi-agent transfer framework based on policy distillation and state representation. This framework consists of two main components: a state representation that allows policies to transfer across different multi-agent tasks, and a policy distillation approach that compresses the model's parameters to lower the transfer cost. The entire transfer process is illustrated in Figure 1.

Dynamic state representation network
In our POSG setting, each agent's total observation consists of the environment's state information, the agent's own state information and other agents' partial observations. In an RTS game scenario, the game agent's state information can be divided into three parts: the agent's own observations, the allies' information and the enemies' information. 39 From another perspective, these observational states can be divided into dynamic-dimension observations, whose dimensionality increases linearly with the number of agents, and static-dimension observations, whose dimensionality is fixed. To facilitate knowledge transfer and model reloading between multi-agent tasks with different numbers of agents, we use multi-head attention networks to align and represent the dynamic observations in a static observation space. Precisely, we propose DSRNet to pre-process the dynamic observation space.
Observation classification. SMAC provides a range of different micromanagement control scenarios for cooperative MARL research. 11 A certain SMAC scenario has several homogeneous or heterogeneous types of allies and enemies. Agents can only receive partial observations within their range of view at every time step. The range can be described as a circular area around every unit with a radius equal to the observable range, as shown in Figure 2. In this range, agents can observe the following attributes for all alive units: shield, health, relative x, relative y, and distance.
We can classify observed features into two categories based on whether the dimension of the feature changes: static features and dynamic features. Each agent $i$'s partial observation at step $t$ can be written as $o_t^i = [o_t^{i,\mathrm{static}}, o_t^{i,\mathrm{dyn}}]$, where the dynamic part, whose size changes across tasks, is processed by the attention mechanism. The right part of DSRNet implements multi-head attention to compress and represent the dynamic observation information. More precisely, we learn a representation $r_t^{i,j}$ for agent $i$'s observation $o_t^{i,j}$ of each visible unit $j$, using scaled dot-product attention layers followed by a concatenation layer that aggregates the outputs of the attention heads. The final output of the multi-head attention network can be formulated as $r_t^i = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$. Next, the outputs of the above two parts of DSRNet are concatenated and fed to the downstream NN layers. The final output is the Q-values generated by the QMIX algorithm.
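The observation split described above can be sketched as follows: attention pools a variable number of per-unit observation vectors into a fixed-size representation, so the policy input dimension stays constant whether 4 or 50 agents are visible. All names, shapes and weight matrices below are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def represent_observation(static_obs, dynamic_obs, Wq, Wk, Wv):
    """static_obs: (d_s,); dynamic_obs: (n_visible, d_o) for any n_visible."""
    q = Wq @ static_obs                    # query from the agent's own state
    K = dynamic_obs @ Wk.T                 # keys from the visible units
    V = dynamic_obs @ Wv.T                 # values from the visible units
    w = softmax(K @ q / np.sqrt(len(q)))   # attention over visible units
    pooled = w @ V                         # fixed size regardless of n_visible
    return np.concatenate([static_obs, pooled])

rng = np.random.default_rng(2)
d_s, d_o, d = 5, 4, 6
Wq = rng.normal(size=(d, d_s))
Wk = rng.normal(size=(d, d_o))
Wv = rng.normal(size=(d, d_o))
own = rng.normal(size=d_s)
r4 = represent_observation(own, rng.normal(size=(4, d_o)), Wq, Wk, Wv)
r16 = represent_observation(own, rng.normal(size=(16, d_o)), Wq, Wk, Wv)
```

Both representations have the same shape even though one agent sees 4 units and the other 16, which is the property that lets one policy network serve tasks of different scales.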
By adapting our DSRNet, the models learned on small-scale tasks can easily be reloaded as initialization models for large-scale tasks. This training scheme can greatly accelerate learning and improve the final policy.
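A minimal sketch of this reloading step, assuming parameters are stored as name-to-array dictionaries (an assumption; the paper does not specify a serialization format):

```python
import numpy as np

def transfer_init(small_task_params, large_task_params):
    """Seed the large-task model with every small-task parameter whose
    name and shape match; leave the rest at their fresh initialization."""
    merged = dict(large_task_params)
    for name, w in small_task_params.items():
        if name in merged and merged[name].shape == w.shape:
            merged[name] = w.copy()
    return merged

# DSRNet keeps the per-agent input dimension fixed, so shared layers
# match across tasks; a head with a different shape is left untouched.
small = {"dsrnet.W": np.ones((6, 4)), "head.W": np.ones((6, 3))}
large = {"dsrnet.W": np.zeros((6, 4)), "head.W": np.zeros((6, 5))}
init = transfer_init(small, large)
```

The shape check makes the copy safe even if some layers do differ between tasks; only the matching parameters carry over.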

Gradient-based policy distillation
We propose a policy distillation approach that performs well in a multi-agent environment. In MARL, a policy is a rule of action, described by a model, that maximizes the reward for the agents. Thus, distilling the action probability distribution $p(a_T)$ of the teacher network $T$ into the action probability distribution $p(a_S)$ of the student network $S$ can likewise be considered policy distillation. In this article, we design a gradient-based policy distillation, in the spirit of knowledge distillation, to efficiently distill $p(a_T)$. The specific distillation update is

$\nabla = \sum_{a} \mathrm{softmax}(p(a_T)) \left( p(a_S) - \mathrm{softmax}(p(a_T)) \right)$

$\theta_S \leftarrow \theta_S - lr \cdot \nabla$

where $lr$ is the learning rate. The student model then uses the resulting probability distribution $p(a_S)$ as its final action policy.
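A small sketch of such a gradient-based distillation step, assuming a linear student policy (an illustrative assumption, not the paper's network), shows the student's distribution being pulled toward the teacher's:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def distill_step(theta_S, obs, teacher_logits, lr=0.5):
    """One distillation step for a linear student p(a_S) = softmax(theta_S @ obs)."""
    target = softmax(teacher_logits)   # teacher's action distribution
    student = softmax(theta_S @ obs)   # current student distribution
    # Cross-entropy gradient w.r.t. the student's logits is (p_S - p_T).
    grad_logits = student - target
    return theta_S - lr * np.outer(grad_logits, obs)

obs = np.array([1.0, 0.5, -0.5, 0.2, 0.1])   # one fixed observation
teacher_logits = np.array([2.0, 0.0, -1.0])
theta = np.zeros((3, 5))                     # student starts uniform
for _ in range(2000):
    theta = distill_step(theta, obs, teacher_logits)
student_policy = softmax(theta @ obs)
```

Repeated steps drive the student's action distribution to match the teacher's on the observed state; in practice the update is averaged over a replay buffer of observations rather than a single one.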

Experiments
In this section, we test the performance of our transfer framework based on two sets of video game experiments, one based on the extended SMAC 40 environment and the other based on MAgent. 13 Moreover, we also test our transfer framework on a UAV planning simulation. 41

Figure 3. The DSRNet structure. The left part is the network architecture of the multi-headed attention mechanism, while the right part is the overall representational network for agent observations.

The StarCraft multi-agent challenge
The StarCraft multi-agent challenge (SMAC) 40 is based on the popular RTS game StarCraft II and focuses on micromanagement challenges, where an independent agent controls each unit and must act based on local observations. It is a popular benchmark for fully cooperative multi-agent tasks and provides many battle scenarios.
Our experiments are based on the open-source libraries PyTorch, SMAC, and PySC2. We performed our experiments on a single Linux server. SMAC provides different battle scenarios and difficulty options; however, to simplify the transfer process and enrich the experiments, we suitably modified the original map settings. We expanded the number of agents based on the original map "3m," which has three marines on both teams, and limited the active attack options of our ally units. In the transfer process, agents first learn on a mission with 4 marines versus 4 marines, which we name 4m. Then, agents progressively learn on an 8 marines versus 8 marines mission (8m) and a 16 marines versus 16 marines mission (16m), as shown in Figure 4. The red units are allies in these maps, while the blue units are enemies. We train only the red units; the blue units are set to the very hard level of the game's built-in AI. These settings and scenarios make it more difficult for us to win without losing the fairness of the game.
To ensure the validity of the experiment, the parameters were fixed throughout the experiment. These parameters include observation space, action space, game mechanics, environmental parameters, and game difficulty. We also strictly ensure that the model execution is distributed, that is, the agent can obtain data from only its own valid observation range when executing the strategy. The adjustment of the hyperparameter settings greatly affects the final result of QMIX, 42 so we strictly use the same hyperparameter settings as in the original SMAC experiment, as shown in Table 1.
The primary evaluation metrics for a single task are the average win rate and reward, which vary with the number of steps the agents run. The evaluation is performed periodically during training, so the transfer performance can be compared easily via the average win rate or reward. In our experiments, each curve is the average of five independent runs.
We evaluate the effects of our solution for large-scale MARL tasks based on the proposed framework. These experimental results are based on one main source task (4m) and two target maps (8m and 16m), as shown in Figure 4. The basic algorithm used here for training multiple agents is QMIX. The final experimental results are described in detail in Figure 5.
We show very strong results in Figure 5. The left part (a) of this figure shows the average win rate and reward on the 8m mission, and the right part (b) shows these for 16m. The performance of the policy model with transfer is better than that of the original QMIX algorithm. To be precise, the effect of the transfer can be understood in three ways. First, there is a clear initial performance improvement at the beginning of training, which is particularly evident in the reward curve. Second, there is an increase in asymptotic performance in the later stages of training, that is, an increase in the final win rate. Third, if an 80% win rate is taken as an acceptable threshold, the transferred algorithm reaches it much faster, especially on the 16m mission, where the transferred algorithm reaches it in only 100,000 time steps, compared to 500,000 time steps for learning from scratch, saving 80% of the training steps.

MAgent battle scenarios

Our MAgent experiments focus on the Battle scenarios. In this scenario, each agent has the following action options within its local observation range: move to another position or attack an opponent. Here, we assign a −0.001 reward for each move, a 0.1 reward for attacking an enemy, and a −0.1 penalty for agents that are killed, together with a bonus for wiping out all opposing agents. We consider missions with different numbers of agents (30 vs. 30, 40 vs. 40, 50 vs. 50).
As shown in Figure 6, red and blue indicate two different pairs of troops. They receive a bonus for killing their opponents and a penalty for being killed. Agents must cooperate with their teammates to destroy all enemy agents.
We validate the effect of transfer with DSRNet against non-transfer baselines using two different independent RL algorithms (A3C and Double DQN). Table 2 shows the average number of remaining teammates and killed opponents in a 50 versus 50 target task. The performance of Double DQN and A3C trained from scratch is significantly improved with the transfer framework; that is, more teammates remain, and a higher average number of enemies are killed.

UAV trajectory planning simulation
In this part, we test our transfer framework in a simulated UAV environment based on the ML-Agents Toolkit. 41 As shown in Figure 7, this platform contains static and dynamic obstacles. The main task of cooperative UAV control is to traverse the pathway while avoiding collisions.
We assume the UAVs are moving in a 3D space, and the terminal time is set to T = 10 s. The initial positions of the multi-UAV team are drawn from a Gaussian distribution with a variance of 0.6 centered at (−50, −50, −50, 0, 0, 0). We set one type of dynamic obstacle, represented by blue balls with a random radius below 1 m. The position coordinates of the dynamic obstacles change with time as (−50 + 100t/T, −50 + 100t/T, 50 − (−50 + 100t/T)²/50). Moreover, we also generate static obstacles such as a white cube with channels, red stoppers and green cylinders.
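For reference, the dynamic-obstacle trajectory above can be computed directly; the obstacle crosses the arena diagonally while its height follows a parabolic arc.

```python
import numpy as np

def obstacle_position(t, T=10.0):
    """Position of a dynamic obstacle at time t in [0, T]."""
    x = -50.0 + 100.0 * t / T
    y = -50.0 + 100.0 * t / T
    z = 50.0 - x ** 2 / 50.0   # the squared term reuses -50 + 100t/T
    return np.array([x, y, z])
```

At t = 0 the obstacle sits at (−50, −50, 0), peaks at (0, 0, 50) halfway through, and ends at (50, 50, 0) at t = T.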
In this experimental setting, we use the QMIX algorithm to model the speed control and evasion strategies of the UAVs, and we compare the methods on the success rate of collision avoidance. Collision avoidance includes two parts: avoiding collisions between UAVs and obstacles, and avoiding collisions among the UAVs themselves. Figure 8 shows the success probability of collision avoidance. We select different numbers of UAVs ranging from 2 to 32, and the final probability is averaged over five runs. Figure 8(a) shows a decreasing trend in the success probability of collision avoidance between UAVs as the team grows; however, our transfer framework clearly slows this decrease. Moreover, Figure 8(b) shows that the UAVs avoid dynamic and static obstacles well. Our transfer framework combining QMIX and DSRNet also helps improve the success probability, especially as the number of UAVs increases (up to 32).

Potential robotic applications
Robot collaboration is an important challenge in robotics. Robot swarming integrates simple, inexpensive and modular units into task-based working groups that can perform large and complex tasks. 43 The use of machine learning, especially MARL, is a promising direction for supporting inter-agent communication and coordinated collaboration. For example, with the MARL transfer framework proposed in this work, the knowledge learned by a few robots can be extended to a large number of robots for cooperative tasks, such as industrial cooperative control and physical UAV swarm control. 44 Furthermore, our framework supports both partially observable centralized training with distributed execution and independent training, which adapts well to agent deployment and resource allocation.

Conclusion
This article proposes a new scheme for training large-scale multi-agent cooperation tasks via TL. The scheme is achieved by means of a dynamic state representation network (DSRNet) for a variable number of agents in the partially observable setting. Additionally, DSRNet can work with the centralized training distributed execution (CTDE) algorithm QMIX and with the independent RL algorithms A3C and Double DQN. We have conducted extensive comparison experiments on the well-known multi-agent testing platforms SMAC and MAgent, and we also applied our method to UAV trajectory planning simulation experiments. When implementing the scheme, both the training time and the final results show impressive improvement. The results provide novel insights into the problem of training large-scale multi-agent tasks.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.