Dynamic deployment of multi-UAV base stations with deep reinforcement learning

Unmanned aerial vehicles (UAVs) can be utilized as aerial base stations (BSs) to provide auxiliary communication services. In this letter, we propose a deep reinforcement learning (DRL)-based dynamic deployment method for multi-UAV communications. A phasic policy gradient (PPG) scheme is designed to improve the sample efficiency and the attention paid to UAV information in multi-UAV deployment. Simulation results are provided to verify the effectiveness of the proposed method.

Introduction: Unmanned aerial vehicles (UAVs) equipped with miniaturized communication devices can serve as aerial base stations (BSs) that provide auxiliary communication services for ground users (GUs), and have become a promising way to meet the communication demands of 6G and beyond [1, 2].
Since the transmit power of a UAV-BS is limited, reasonable deployment locations of UAV-BSs are essential to improve the spectrum efficiency. With the rapid development of artificial intelligence (AI) [3, 4], deep reinforcement learning (DRL) provides an effective way to solve the dynamic deployment and cooperation problem in multi-UAV-BS communications [5].
In this letter, we propose a universal DRL-based dynamic deployment method for UAV-BSs. The movement of the UAV-BSs is modelled as a Markov decision process (MDP), and a dedicated reward function is designed to cover the moving GUs. For the case of a large number of GUs, we further propose a phasic policy gradient (PPG) approach with an auxiliary task, which incorporates UAV information preprocessing to improve the sample efficiency and the attention paid to UAV information. Finally, simulation results are provided to verify the effectiveness of the proposed method.
System model: We consider a scenario where M UAV-BSs serve N GUs. Let (x_i, y_i, z_i), i ∈ {1, 2, …, M}, and (x_j, y_j, 0), j ∈ {1, 2, …, N}, denote the locations of UAV i and GU j, respectively. All UAVs adapt to the GUs' movements in the target area and track them so that the GUs remain within coverage. The service period is divided into T time slots, t ∈ {1, 2, …, T}, and the GUs' locations are updated once per slot. The locations of UAV i and GU j at time slot t are then written as u_i^UAV(t) and u_j^GU(t). Besides, each UAV is equipped with an ultra-large-scale multiple-input multiple-output (MIMO) antenna array to form high-precision directive beams that reduce interference.
The air-to-ground (ATG) channel consists of line-of-sight (LoS) and non-line-of-sight (NLoS) links, and the small-scale multipath fading is neglected for simplicity. According to extensive measurement results, the probability of a LoS link between UAV i and GU j can be modelled as

P_LoS(φ_{i,j}) = 1 / (1 + a exp(−b(φ_{i,j} − a))),   (1)

where φ_{i,j} is the elevation angle of the communication link between UAV i and GU j, and a and b are environment-related parameters. Accordingly, the probability of an NLoS link is P_NLoS = 1 − P_LoS. The path losses of the LoS and NLoS links are respectively defined as

L_LoS = 20 log_10(4π f_c d_{i,j} / c) + ζ_LoS,   (2)
L_NLoS = 20 log_10(4π f_c d_{i,j} / c) + ζ_NLoS,   (3)

where f_c is the carrier frequency, d_{i,j} = ‖u_i^UAV − u_j^GU‖ is the distance from UAV i to GU j, ζ_LoS and ζ_NLoS are the additional path losses, and c is the speed of light.
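As an illustration, the expected ATG path loss described above can be sketched as follows; the parameter values (a, b, ζ_LoS, ζ_NLoS, f_c) are placeholder assumptions for an urban-like setting, not the values used in this letter.

```python
import math

def atg_pathloss(uav, gu, a=9.61, b=0.16, zeta_los=1.0, zeta_nlos=20.0, fc=2e9):
    """Expected ATG path loss in dB: LoS-probability-weighted mix of the
    LoS and NLoS path losses (illustrative parameter values)."""
    c = 3e8                                       # speed of light, m/s
    dx, dy, dz = uav[0] - gu[0], uav[1] - gu[1], uav[2]
    d = math.sqrt(dx * dx + dy * dy + dz * dz)    # UAV-to-GU distance
    phi = math.degrees(math.asin(dz / d))         # elevation angle, degrees
    p_los = 1.0 / (1.0 + a * math.exp(-b * (phi - a)))   # LoS probability
    fspl = 20 * math.log10(4 * math.pi * fc * d / c)     # free-space term
    return p_los * (fspl + zeta_los) + (1 - p_los) * (fspl + zeta_nlos)
```

A GU directly under the UAV sees a high elevation angle and hence a mostly-LoS, lower-loss channel than a distant GU at a shallow angle.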
Therefore, the path loss of the ATG channel can be expressed as

L̄_{i,j} = P_LoS L_LoS + P_NLoS L_NLoS.   (4)

The UAVs are required to cover as many moving GUs as possible in the shortest possible time, which yields the following optimization problem:

max Σ_{t=1}^{T} Σ_{i=1}^{M} Σ_{j=1}^{N} T_{i,j}(t)   (5a)
s.t. T_{i,j}(t) ∈ {0, 1},   (5b)
Σ_{i=1}^{M} T_{i,j}(t) ≤ 1, ∀j,   (5c)
0 ≤ x_i(t) ≤ x_max,   (5d)
0 ≤ y_i(t) ≤ y_max,   (5e)
|l_i^t − l_i^{t−1}| ≤ a_max,   (5f)
|ϑ_i^t| ≤ ϑ_max,   (5g)

where T_{i,j}(t) indicates the connection between UAV i and GU j: T_{i,j}(t) = 1 if GU j is within the coverage of UAV i, i.e. d_{i,j} ≤ R with R the rated coverage radius, and T_{i,j}(t) = 0 otherwise. Constraints (5b) and (5c) ensure that each GU can form a connection with at most one UAV. Constraints (5d) and (5e) limit the UAVs' flight range to the target area. With l_i^t denoting the distance UAV i flies in time slot t, (5f) and (5g) constrain the flight trajectory: a_max is the acceleration limit, imposed to save energy and for flight safety, and ϑ_max is the maximum steering angle, which prevents excessive changes of the azimuth angle and makes the flight trajectory smoother.
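A minimal sketch of the coverage objective: count covered GUs under the rule that each GU connects to at most one UAV within the rated radius R (a simple nearest-first assignment; the letter's optimizer is the DRL policy, not this helper).

```python
import math

def count_covered(uavs, gus, R=1000.0):
    """Number of GUs covered, with each GU connecting to at most one UAV.

    uavs: list of (x, y, z) UAV positions; gus: list of (x, y) GU positions.
    """
    covered = 0
    for gx, gy in gus:
        for ux, uy, uz in uavs:
            if math.hypot(ux - gx, uy - gy) <= R:  # within rated coverage radius
                covered += 1
                break                               # at most one UAV per GU
    return covered
```

Note that a GU inside two overlapping coverage disks is still counted once, matching the at-most-one-connection constraint.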
DRL can deploy an offline-trained policy model: the action a_t to be performed is directly mapped from the current state s_t. The MDP is modelled by a tuple (S, A, P, R, γ), where S is the state space (s_t ∈ S), A is the action space (a_t ∈ A), P is the transition probability, R is the reward space, and γ is the discount factor. We define Q^{π_θ}(s, a) = E[R_t | s = s_t, a = a_t] as the expected return under policy π_θ, with R_t = Σ_{t=0}^{T−1} γ^t r_t and θ the parameters of the actor network. Our goal is to find a robust and general policy π* that maximizes the cumulative reward over the period T:

π* = arg max_θ E[ Σ_{t=0}^{T−1} γ^t ( r_t + ψ S(π_θ(s_t)) ) ],   (6)

where S(π_θ(s_t)) is the entropy of policy π_θ at state s_t and ψ is a trade-off coefficient. In the training phase, we use the stochastic policy π_θ(s_t) to select the next action a_t for state s_t; the entropy term encourages exploration so that the critic can more accurately evaluate the state values V_φ(s_t), where φ are the parameters of the critic network. The actual state value is defined as

V(s_t) = E[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t ].   (7)

Since experience samples cannot be reused, the traditional policy gradient is sample-inefficient, while proximal policy optimization (PPO) is a stable alternative. We therefore define the objective function of the actor in the PPO manner:

L^{actor}(θ) = E_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ],   (8)

where ρ_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t) and Â_t is the advantage function estimated by the critic, which measures the superiority of the action a_t taken by the current policy π_θ at state s_t. To estimate it with a balanced trade-off between bias and variance, we apply the generalized advantage estimator (GAE):

Â_t = Σ_{k=0}^{T−t−1} (γλ)^k δ_{t+k},  δ_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t),   (9)

where λ ∈ (0, 1) controls the emphasis on bias versus variance. The loss function of the critic is defined as

L^{critic}(φ) = E_t[ (V_φ(s_t) − V̂_tar)^2 ],   (11)

where the value target V̂_tar = Â_t + V_φ(s_t) is also calculated by the GAE method.
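The GAE recursion described above can be sketched as follows: a backward pass accumulates the discounted temporal-difference residuals into advantages, and the value targets are obtained by adding the critic's value estimates back.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite rollout.

    rewards, values: per-step lists of equal length; the value after the
    final step is taken as 0 (episode assumed to terminate there).
    Returns (advantages, value_targets).
    """
    T = len(rewards)
    adv = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # TD residual
        last = delta + gamma * lam * last                # discounted accumulation
        adv[t] = last
    targets = [a + v for a, v in zip(adv, values)]       # target = advantage + value
    return adv, targets
```

With γ = λ = 1 and a zero-initialized critic, the advantage at each step reduces to the undiscounted reward-to-go, which is a quick sanity check on the recursion.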
To share the extracted features between the actor and critic networks while avoiding the interference caused by a shared backbone, we use separate actor and critic networks together with an auxiliary task that induces the actor to learn from the critic: in addition to the policy output π_θ(s_t), an additional value head V_θ(s_t) is added to the actor network to predict the state value. After a period of learning in the normal way with (8) and (11), we further let the value head V_θ(s_t) learn the state value from past experience samples by minimizing

L^{aux}(θ) = E_t[ (V_θ(s_t) − V̂_tar)^2 + β KL(π_θold ‖ π_θ) ],   (12)

where KL(π_θold ‖ π_θ) is the KL divergence between π_θold and π_θ, π_θold is the current policy after the first phase (normal learning), and β weights the policy constraint. In this way, the actor learns the state value while improving its policy and distils better state features, and the KL term keeps the new policy from deviating too much from the old one. The two phases are then repeated periodically.

We set s_t = (u_i^UAV, u_j^GU, n_t, d_dis) as the state vector of the UAVs and a_t = (ϑ_i^t, l_i^t) as the joint action of the UAVs. Here, n_t denotes the number of covered GUs at time slot t, and d_dis = {d_i^t, i ∈ {1, 2, …, M}} collects the distances between each UAV and the centre of the GU area, given by (13):

d_i^t = ‖u_i^UAV(t) − u_c^GU(t)‖,   (13)

where u_c^GU(t) is the central location of the GUs. The reward function is then defined as

r_t = ξ_num Δn_t + ξ_g exp(λ_g n_t / N) − ξ_κ κ_t − ξ_dis d_sum,   (14)

where Δn_t = n_t − n_{t−1} is the change in the number of covered GUs, and exp(λ_g n_t / N) is a function of the coverage rate whose exponential form provides an additional gradient to the reward. κ_t is the number of coordinates at which the UAVs cross the boundary, and d_sum = Σ_{i=1}^{M} d_i^t is the sum of the distances between each UAV and the GUs' central location. The coefficients ξ_num, ξ_g, ξ_κ and ξ_dis control the weight of each reward item.
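The per-step reward described above can be sketched as follows; all coefficient values (ξ_num, ξ_g, ξ_κ, ξ_dis, λ_g) are illustrative placeholders, not the values used in the letter.

```python
import math

def reward(n_t, n_prev, kappa_t, d_sum, N=30,
           xi_num=1.0, xi_g=1.0, xi_k=1.0, xi_dis=1e-4, lam_g=1.0):
    """Per-step reward: coverage-change bonus + exponential coverage-rate
    bonus - boundary-crossing penalty - distance-to-GU-centre penalty.
    Coefficient values are placeholder assumptions."""
    delta_n = n_t - n_prev                    # change in covered GUs
    rate_bonus = math.exp(lam_g * n_t / N)    # exponential coverage-rate term
    return (xi_num * delta_n + xi_g * rate_bonus
            - xi_k * kappa_t - xi_dis * d_sum)
```

The exponential term rewards high absolute coverage even when the step-to-step change Δn_t is zero, while the κ_t and d_sum terms discourage leaving the area and drifting away from the GU cluster.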
In practice, we expand the proportion of UAV position information in the state vector: a feature-extraction layer preprocesses the UAV position information, and its output is then concatenated with the remaining state dimensions before being fed into the actor and critic networks. This allows the neural networks to better extract the features of the UAVs' motion.
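A minimal sketch of this preprocessing step, with hypothetical shapes and a single tanh feature-extraction layer standing in for the letter's (unspecified) network architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_state(uav_pos, gu_pos, n_t, d_dis, W, b):
    """Embed the UAV positions through a feature-extraction layer, then
    concatenate with the remaining state components (hypothetical layout)."""
    uav_feat = np.tanh(W @ np.ravel(uav_pos) + b)  # learned UAV-position features
    rest = np.concatenate([np.ravel(gu_pos), [n_t], np.ravel(d_dis)])
    return np.concatenate([uav_feat, rest])

# Example: 3 UAVs with (x, y, z), 2 GU summary points, 16-dim feature layer
W = rng.standard_normal((16, 9))
b = np.zeros(16)
s = build_state(rng.random((3, 3)), rng.random((2, 2)), 0.5, rng.random(3), W, b)
```

The UAV positions thus occupy 16 of the 24 state dimensions here, giving them a larger share of the input than their raw 9 coordinates would.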
Simulations and results: In this section, simulation results are provided to verify the performance of the proposed method. Particle swarm optimization (PSO) and static deployment (SD) algorithms are used as baselines. The PSO algorithm searches for the optimal deployment locations in each episode to maximize coverage, while SD keeps the UAV-BSs at fixed positions throughout each episode.

We consider a 10 km × 10 km urban area with M = 3, N = 30, and a rated coverage radius R = 1 km. Note that this coverage radius cannot cover all the GUs under certain GU distributions, so the UAVs need to cover as many users as possible. The GU cell with poor quality of service (QoS) is assumed to be a 3 km × 3 km square region in which all GUs are redistributed randomly every episode, and the location of the GU cell itself also varies stochastically. We aim to train a universal and robust policy model that enables the UAVs to find the best trajectories to cover as many GUs as possible under an arbitrary distribution of GUs within the cell.
Over 2000 training episodes, Figure 1 shows how the cumulative reward changes per episode. The test coverage rates of the DRL, PSO and SD methods are shown in Figure 2 and reach 94.2%, 92.4% and 66.2%, respectively. These results show that PPO improves the policy steadily but is limited by inefficient learning of state features, which makes feature extraction difficult in early training and causes the cumulative reward to rise slowly. PPG further exploits the collected data by learning V_θ(s_t) in an off-policy manner, allowing the actor and critic to share the objective; this enables feature sharing between the two networks and accelerates feature learning.

Conclusions: In this letter, we proposed a DRL-based dynamic deployment approach for UAVs serving moving GUs in a poor-QoS area. The PPG approach combined with auxiliary task learning enables better extraction of state features, and the entropy bonus of the actor keeps the decision agent's strategy more random during training to increase exploration. Finally, the effectiveness of the proposed method was verified by simulation experiments.