Multi-UAV Path Learning for Age and Power Optimization in IoT with UAV Battery Recharge

In many emerging Internet of Things (IoT) applications, the freshness of the information is an important design criterion. Age of Information (AoI) quantifies the freshness of the received information or status update. This work considers a setup of deployed IoT devices in an IoT network, where multiple unmanned aerial vehicles (UAVs) serve as mobile relay nodes between the sensors and the base station. We formulate an optimization problem to jointly plan the UAVs' trajectories while minimizing the AoI of the received messages and the devices' energy consumption. The solution accounts for the UAVs' battery lifetime and flight time to recharging depots to ensure the UAVs' green operation. The complex optimization problem is efficiently solved using a deep reinforcement learning algorithm. In particular, we propose a deep Q-network, which works as a function approximator to estimate the state-action value function. The proposed scheme converges quickly and results in a lower ergodic age and ergodic energy consumption when compared with benchmark algorithms such as the greedy algorithm (GA), nearest neighbour (NN), and random walk (RW).


I. INTRODUCTION
The Internet of Things (IoT) era is enabling new time-sensitive applications through the deployment of sensor nodes that collect information in real time. Use cases include intelligent transportation, environmental monitoring, and human safety. To address time sensitivity in such applications, a metric termed Age of Information (AoI) was introduced in [1] to quantify the freshness of the information about a certain process. It is defined as the time elapsed since the generation of the packet that was most recently delivered to the destination node. Using unmanned aerial vehicles (UAVs) as mobile relay units has proven very effective for minimizing the AoI under energy limitations [2]. The UAV relays can reduce the transmission distance of IoT nodes by moving close to the source nodes and then relaying the transmitted information to the destination node [3]. This facilitates communication and saves energy in remote areas, where it is cumbersome to replace the batteries of the sensor nodes.
Recently, learning schemes such as deep reinforcement learning (DRL) have been extensively applied in solving the problem of jointly minimizing the AoI and energy consumption in IoT. However, the suitability of a DRL algorithm is strongly conditioned on the dimension of action and state spaces, which turns out to be a curse in massive scenarios [4]. This issue can be handled by deploying multiple UAVs to collect information along with device clustering to reduce the state-action spaces.
This work has been partially supported by Academy of Finland 6G Flagship program (Grant no. 346208), FIREMAN (Grant no. 326301), and the European Commission through the Horizon Europe project Hexa-X (Grant Agreement no. 101015956). The authors are with Centre for Wireless Communications (CWC), University of Oulu, Finland. Email: firstname.lastname@oulu.fi.
Several works have considered the use of UAVs for AoI minimization. For instance, the authors in [5] jointly optimized the scheduling policy and flight trajectory of the UAV to minimize the weighted sum AoI. The work in [6] proposed a DRL model to minimize the AoI in a single-hop vehicular network. In [7], the authors presented a multi-agent DRL solution that coordinates the UAVs to efficiently perform wireless energy transfer (WET) and wireless information transfer (WIT). To minimize the AoI in a massive deployment of up to fifty devices, the work in [8] presented a model-free DRL solution, whereas the authors in [9] formulated the problem as a mixed-integer program and proposed a convex-optimization-based solution.
To this end, the contributions of this paper are summarized as follows:
• We propose a DRL solution to jointly minimize the AoI and the devices' energy consumption in a massive deployment of up to one hundred IoT devices.
• Our model accounts for the UAVs' battery constraints and flying time to recharging depots.
• We apply k-means to perform device clustering, while accounting for the UAVs' scheduling capacity.
• Our approach outperforms the baseline RW, greedy, and NN models in terms of age and IoT energy consumption.

A. System Model
We consider a 2D grid world with a set $\mathcal{K} = \{1, 2, \dots, K\}$ of $K$ low-power IoT devices. The devices are randomly distributed in the grid world, and each device $k$ is assigned a coordinate $c_k = (x_k, y_k)$ after being projected onto the 2D plane, as in [10], [11]. The IoT devices are served by a set $\mathcal{U} = \{1, 2, \dots, U\}$ of $U$ rotary-wing UAVs. Each UAV flies over the grid world to collect information from the devices and relays the collected information to the base station (BS) located at the center of the grid world (i.e., at $(0, 0)$). The grid world has a set $\mathcal{D}$ of fixed charging depots located at its four corners.
Each UAV starts and ends its trajectory at one of the charging depots. The grid world is divided into square cells, and in each time slot a UAV either moves to an adjacent cell in one of four directions (east, west, north, or south) or hovers in place. Time is divided into discrete slots $[\tau, 2\tau, \dots]$, where $\tau$ is the time that the UAV needs to move from the center of one cell to the center of an adjacent cell. It is given by $\tau = d_g / \upsilon_t$, where $d_g$ is the distance between the centers of two adjacent cells and $\upsilon_t$ is the velocity of the UAV. The system model is illustrated in Fig. 1.

1) Energy Consumption: Let the scheduling policy of the IoT devices be $S(t) \in \mathcal{S} = \{0, 1, \dots, K\}$, where $S(t) = (k_1, k_2, \dots)$ means that the nodes $k_1, k_2, \dots$ are scheduled to transmit at time slot $t$. Each UAV forwards the received packets to the BS. We assume line-of-sight (LOS) communication between the sensors and the UAVs, and between the UAVs and the BS. The channel gain between UAV $u$ and the BS at time slot $t$ is given by (1), where $g_0$ is the channel gain at the reference distance of 1 m, $d_{u,BS}$ is the distance between the UAV and the BS, $h_u$ is the altitude of the UAV, $h_{BS}$ represents the height of the antennas at the BS, and $c_u(t)$ is the position of UAV $u$ at time instant $t$ [10]. The transmission power $P_k$ of IoT device $k$ is given by (2), where $M$ is the packet size of the sensor updates, $B$ is the signal bandwidth, $\sigma^2$ is the noise power, and $d_{u,k}$ is the distance between UAV $u$ and IoT device $k$ [11].
We discretize the battery capacity $E_{\max,u}$ of each UAV into $N_u$ energy quanta, where the amount of energy in each quantum is $E_{\max,u}/N_u$. Denote the battery level of UAV $u$ at time slot $t$ as $e_u(t) \in \mathcal{E}_u = \{0, 1, \dots, e_{u,\max}\}$. The battery is drained by the energy consumed to relay an update packet to the BS, $e^R_u(t)$, and by the energy consumed due to flying or hovering, $e^F_u(\upsilon_t)$. The battery evolution of the UAVs is described by (3), where $\lceil \cdot \rceil$ denotes the ceiling operation. The energy consumed to relay an update packet to the BS is given by (4) and (5), whereas the energy consumed due to flying or hovering is given by (6), where $P_u(\upsilon_t)$ is the power consumption of the UAVs when moving or hovering, formulated in [12] as
$$P_u(\upsilon_t) = P_0\left(1 + \frac{3\upsilon_t^2}{S_{tip}^2}\right) + P_1\left(\sqrt{1 + \frac{\upsilon_t^4}{4 s_0^4}} - \frac{\upsilon_t^2}{2 s_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho \mu_0 Z \upsilon_t^3, \quad (7)$$
where $P_0$ and $P_1$ represent the blade profile power and the induced power when the UAVs are hovering, respectively, $\upsilon_t$ is the velocity of the UAVs, and $S_{tip}$ is the tip speed of the blade. Meanwhile, $s_0$ is the mean rotor induced velocity when hovering, $d_0$ is the fuselage drag ratio, $\rho$ is the air density, $\mu_0$ is the rotor solidity, and $Z$ is the area of the rotor disk.
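As a rough illustration of the flying-energy part of this model, the sketch below evaluates the rotary-wing propulsion power of [12] and converts the energy spent in one slot into battery quanta. The numeric parameter values, the function names, and the rule of rounding each slot's energy up to whole quanta are assumptions made for this example; they are not taken from Table I.

```python
import math

# Illustrative rotary-wing parameters (assumed values, not from Table I).
P0 = 79.86      # blade profile power when hovering [W]
P1 = 88.63      # induced power when hovering [W]
S_TIP = 120.0   # tip speed of the rotor blade [m/s]
S0 = 4.03       # mean rotor induced velocity when hovering [m/s]
D0 = 0.6        # fuselage drag ratio
RHO = 1.225     # air density [kg/m^3]
MU0 = 0.05      # rotor solidity
Z = 0.503       # rotor disk area [m^2]


def propulsion_power(v):
    """Power [W] drawn by a rotary-wing UAV flying at speed v [m/s]."""
    blade = P0 * (1.0 + 3.0 * v**2 / S_TIP**2)
    induced = P1 * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * S0**4)) - v**2 / (2.0 * S0**2)
    )
    parasite = 0.5 * D0 * RHO * MU0 * Z * v**3
    return blade + induced + parasite


def flying_quanta(v, tau, e_max, n_quanta):
    """Battery quanta used in one slot of length tau [s] (assumed ceiling rounding)."""
    quantum = e_max / n_quanta            # energy per quantum [J]
    return math.ceil(propulsion_power(v) * tau / quantum)


hover = propulsion_power(0.0)    # reduces to P0 + P1 when hovering
cruise = propulsion_power(10.0)
```

With these assumed values, flying at a moderate speed draws less power than hovering, which is a known property of this propulsion model.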
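2) AoI Calculation: We define the discrete AoI as the time elapsed since the last time a device transmitted a packet. The AoI also serves as a measure of fairness in scheduling the devices. If a device transmits an update packet, its AoI is reset to one. The AoI of device $k$ is given by
$$A_k(t+1) = \begin{cases} 1, & \text{if device } k \text{ transmits at time slot } t, \\ \min\{A_{\max}, A_k(t) + 1\}, & \text{otherwise}, \end{cases} \quad (8)$$
where $A_{\max}$ denotes the maximum allowed AoI in the model.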
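The AoI rule described above (reset to one on transmission, otherwise grow by one slot up to the maximum allowed AoI) can be sketched in a few lines. The function name and the zero-based device indices are choices made for this example only.

```python
def update_age(ages, scheduled, a_max):
    """Return next-slot AoI values: reset to 1 if scheduled, else grow up to a_max."""
    return [1 if k in scheduled else min(a_max, age + 1)
            for k, age in enumerate(ages)]


ages = [3, 7, 10]
ages = update_age(ages, scheduled={1}, a_max=10)   # device 1 transmits
```

Here the first device ages by one slot, the second is reset, and the third stays capped at the maximum.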

C. Problem Formulation
The main objective of the UAVs is to jointly minimize the weighted average AoI and the transmission power of the IoT devices. Hence, the optimization problem P1 is formulated as in (9), where $\delta_k$ is the weight that denotes the importance of device $k$ and $c_{d,u}$ are the coordinates of the charging depot from which UAV $u$ takes off. Here, $\lambda$ is a multiplicative weight that controls the trade-off between the AoI and the transmission power: the larger the value of $\lambda$, the more the objective function favors the power over the AoI. If $\lambda = 0$, the model learns to produce the best AoI without taking the transmission power into account. The constraints ensure that the UAVs always have enough energy to move and serve the devices, and force the initial and final positions of each UAV to be at one of the charging depots.
The optimization problem (9) is a non-linear integer program whose complexity grows with the number of deployed devices. In addition, the UAVs face a very large, almost continuous, state space. To overcome the curse of dimensionality, we propose a DRL approach based on a deep Q-network (DQN), which works as a function approximator to estimate the Q-function and solve the given problem efficiently.

A. Clustering and Rate-Mobility Characterization
Consider that each device $k$ is assigned to a cluster $l \in \mathcal{L}$, where $\mathcal{L} = \{1, 2, \dots, L\}$ is the set of clusters, and let $n_l$ denote the number of devices in cluster $l$. A UAV tries to communicate with all devices within a cluster $l$ according to a given policy. To this end, before moving from one grid position to another, the UAV sends an uplink grant to all devices in the selected cluster. These devices should be able to transmit their updates before the UAV arrives at the next position. Hence, the number of devices in a cluster $n_l$ is bounded in terms of the fixed transmission rate $R^b_l$ of the devices in cluster $l$, as given by (10). Note that this number is directly related to the average rate and the speed of the UAV. The BS performs the clustering using k-means according to the positions of the devices, with the cluster size capped at the calculated maximum number of devices [13]. The scheduling policy can then be redefined as $S(t) \in \mathcal{S} = \{0, 1, \dots, L\}$, where $S(t) = l$ means that the nodes in cluster $l$ are scheduled to transmit at time slot $t$.
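To make the rate-mobility constraint concrete, the sketch below bounds the cluster size under the assumption that the devices of a cluster transmit one after another, each sending one packet at the fixed rate, and that all transmissions must fit within the time the UAV needs to cross one cell. This sequential-transmission assumption, the function names, and the numbers are illustrative only, so the resulting bound may differ from (10).

```python
import math


def max_cluster_size(rate_bps, packet_bits, cell_m, speed_mps):
    """Largest number of sequential packet transmissions that fit in one slot."""
    tau = cell_m / speed_mps                       # slot length [s]
    return math.floor(rate_bps * tau / packet_bits)


def num_clusters(num_devices, n_max):
    """Fewest clusters needed when each cluster holds at most n_max devices."""
    return math.ceil(num_devices / n_max)


n_max = max_cluster_size(rate_bps=1e3, packet_bits=1e3,
                         cell_m=100.0, speed_mps=10.0)
clusters = num_clusters(num_devices=100, n_max=n_max)
```

Under this assumption, a faster UAV shortens the slot and therefore lowers the admissible cluster size, which in turn raises the number of clusters.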

B. Markov Decision Processes Formulation
We formulate the problem as a Markov decision process (MDP) described by the tuple $\langle s, a, r, p \rangle$, where $s$ is the state, $a$ is the action, $r$ is the reward function, and $p$ is the state transition probability. At time instant $t$, the agent (UAV) observes the current state $s(t)$ of the environment and tries to follow the optimal policy by selecting the best action $a(t)$, which maximizes the reward $r(t)$, and then transitions to the next state $s(t+1)$ with probability $p(s(t), s(t+1))$. For convenience, we adopt an episodic MDP, where an episode starts with each UAV at one of the charging depots and ends when at least one UAV needs to recharge its battery at the nearest charging depot.
1) State space: The state of the system at time slot $t$ is defined as $s(t) = (c(t), A(t), \beta(t))$, where $c(t)$ is a vector containing the position $c_u(t) \in \mathcal{C}$ of each UAV at time slot $t$, and $A(t) = (A_1(t), A_2(t), \dots, A_L(t))$ contains the average AoI of the IoT devices in each cluster, with $A_l(t) \in \mathcal{I} = [1, 2, \dots, A_{\max}]$. The vector $\beta(t) = (\beta_1(t), \beta_2(t), \dots, \beta_U(t))$, with $\beta_u(t) \in \mathcal{B}$, contains for each UAV the difference between its battery status and the sum of the energy required to reach the nearest charging depot $d \in \mathcal{D}$ and the energy consumed by packet relays, considering the worst case in which the UAV relays a packet in every time slot. Finally, the state space of the system is given by $\Sigma = \mathcal{C}^U \times \mathcal{I}^K \times \mathcal{B}^U$.
2) Action space: The action at time slot $t$ is defined as $a(t) = (f_u(t), S_u(t))$, where $f_u(t)$ is the movement of UAV $u$ and $S_u(t)$ is the scheduling policy of UAV $u$. Each UAV $u$ selects a cluster $l$ and serves all the devices within it. The action space thus comprises all combinations of movement and scheduling decisions of the UAVs.
3) Transition probability: The transition between states depends on the three components of the state. The AoI is updated according to (8), and $\beta$ is updated according to the energy calculations discussed in II-B1. The position $c_u$ of each UAV is updated according to the selected movement $f_u(t)$ as
$$c_u(t+1) = \begin{cases} c_u(t) + (0, d_g), & f_u(t) = \text{North}, \\ c_u(t) - (0, d_g), & f_u(t) = \text{South}, \\ c_u(t) + (d_g, 0), & f_u(t) = \text{East}, \\ c_u(t) - (d_g, 0), & f_u(t) = \text{West}, \\ c_u(t), & f_u(t) = \text{Hovering}. \end{cases} \quad (11)$$
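The position part of the state transition can be sketched as a lookup of unit steps scaled by the cell size. The action labels, the axis convention (x pointing east, y pointing north), and the function name are assumptions of this example; boundary handling at the edge of the grid is deliberately left out.

```python
# Unit steps for the five movement actions (assumed axes: x = east, y = north).
MOVES = {
    "north": (0, 1),
    "south": (0, -1),
    "east": (1, 0),
    "west": (-1, 0),
    "hover": (0, 0),
}


def next_position(position, action, cell_m):
    """Move a UAV by one cell of side cell_m [m] in the chosen direction."""
    dx, dy = MOVES[action]
    x, y = position
    return (x + dx * cell_m, y + dy * cell_m)


pos = (0.0, 0.0)
for action in ["north", "east", "east", "hover"]:
    pos = next_position(pos, action, cell_m=100.0)
```

After these four actions the UAV sits two cells east and one cell north of its starting point.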

4) Reward function: The reward is designed to minimize the weighted sum of the AoI as well as the average transmit power of all IoT devices. The immediate reward $r_u(t)$ of UAV $u$ at time instant $t$ is defined in (12), which is the DRL counterpart of the objective function in (9a).
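One way to picture such a reward is as the negative of a cost that adds the weighted AoI to the trade-off weight times the total transmit power. The sketch below assumes exactly this unnormalized form, which is a guess for illustration and not the expression in (12); all names and numbers are hypothetical.

```python
def reward(ages, weights, powers, lam):
    """Negative cost: weighted AoI plus lam times total transmit power (assumed form)."""
    age_cost = sum(w * a for w, a in zip(weights, ages))
    power_cost = sum(powers)
    return -(age_cost + lam * power_cost)


ages = [2, 5, 1]
weights = [1.0, 1.0, 2.0]        # importance weight of each device
powers = [0.5, 0.0, 0.25]        # transmit power of each device in this slot [W]

r_age_only = reward(ages, weights, powers, lam=0.0)
r_balanced = reward(ages, weights, powers, lam=4.0)
```

Setting the trade-off weight to zero removes the power term entirely, while larger values make the same transmit powers cost more reward.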

C. DQN solution
The state-action value function (Q-function) $Q^{\pi}(s, a)$ describes how good an action $a$ is at state $s$ while following the policy $\pi$ [14]. It is updated at each time instant as follows
$$Q(s(t), a(t)) = Q(s(t), a(t)) + \alpha \left[ r(t) + \gamma \max_{a} Q(s(t+1), a) - Q(s(t), a(t)) \right], \quad (13)$$
where $\alpha$ is the learning rate, $r(t)$ is the immediate reward, $\gamma \max_{a} Q(s(t+1), a)$ is the discounted state-action value at time instant $t+1$, and $\gamma$ is the discount factor.
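For intuition, the update in (13) can be written for a small lookup table before any neural network is involved. The sketch below is a tabular version with hypothetical state and action labels; the DQN replaces the table with a learned approximator.

```python
from collections import defaultdict


def q_update(q, state, action, r, next_state, actions, alpha, gamma):
    """Apply one temporal-difference update to the table q in place."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = r + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


actions = ["north", "south", "east", "west", "hover"]
q = defaultdict(float)              # unseen state-action pairs start at 0
q[("s1", "east")] = 2.0

new_value = q_update(q, "s0", "north", r=-1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```

In this step the best next-state value is 2.0, so the temporal-difference error is -1.0 + 0.9 × 2.0 - 0 = 0.8 and the entry moves halfway towards it.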
The DQN consists of two neural networks: the first (current network) works as a Q-function estimator, whereas the second (target network) works as a target Q-function network [4]. This approach addresses the problem of large dimensionality in complex models. Moreover, the model defines an exploration rate $\epsilon$, which decays with time. To break the correlation between samples and reuse past samples, the DQN introduces experience replay, where it stores the past experiences $(s(t), a(t), r(t), s(t+1))$ in a buffer and randomly samples a small batch for training. Algorithm 1 summarizes the proposed DRL framework and Fig. 2 illustrates the DQN architecture and its interaction with the environment.

Algorithm 1: The proposed DRL algorithm
1  Define parameters from Table I.
2  Calculate $n_l$ using (10).
3  Set the number of clusters $L = K / n_l$.
4  Apply k-means to perform clustering.
5  Initialize the replay buffer and $t = 1$.
6  Define $\epsilon$, $\gamma$, $\alpha$, $O$, and the number of episodes $E$.
7  Choose a value for $\lambda$ in (12).
8  for $e = 1, \dots, E$ do
9      while no recharging is needed (i.e., $\beta_1(t) > 0$) do
10         Explore a random action $a$ with probability $\epsilon$, or select the optimal action $a = \arg\max_a Q(s(t), a)$ with probability $1 - \epsilon$.
11         Save $s(t), a(t), r(t), s(t+1)$ in the replay buffer.
12         Sample a mini-batch from the buffer.
13         Update the current network.
14         Update the target network every $O$ instants.
15         $t = t + 1$.
16     end
17 end
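Two building blocks of this procedure, the replay buffer and the exploration rule, can be sketched with the Python standard library alone. The class and function names, the tiny buffer capacity, and the toy Q-values are placeholders for illustration; the network updates of Algorithm 1 are not reproduced here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) tuples; the oldest are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def epsilon_greedy(q_values, epsilon):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


random.seed(0)
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.2)
    buffer.push(t, action, -1.0, t + 1)
batch = buffer.sample(2)
greedy = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Sampling uniformly from the buffer, rather than training on consecutive transitions, is what breaks the correlation between successive samples.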

IV. NUMERICAL RESULTS
In this section, we discuss the simulation results of the proposed DRL algorithm and compare them with the baseline models, namely the GA, NN, and RW. The GA minimizes only the age, by scheduling and moving towards the clusters with the highest age. This roughly corresponds to the case $\lambda = 0$, with the UAV applying time division multiple access (TDMA) to distribute resources fairly. The NN always schedules the nearest cluster in order to minimize the transmit power. We consider a grid world of 1100 m × 1100 m, which is divided into 11 × 11 cells. The simulation parameters are defined in Table I.
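The selection rules of the two deterministic baselines can be sketched as follows. This is a simplified reading of the description above, in which clusters are represented by their centroids and ties go to the lowest index; the function names and the sample values are hypothetical.

```python
import math


def greedy_cluster(cluster_ages):
    """GA baseline: index of the cluster with the highest average AoI."""
    return max(range(len(cluster_ages)), key=lambda l: cluster_ages[l])


def nearest_cluster(uav_pos, centroids):
    """NN baseline: index of the cluster whose centroid is closest to the UAV."""
    return min(range(len(centroids)),
               key=lambda l: math.dist(uav_pos, centroids[l]))


cluster_ages = [4.0, 9.5, 2.0]
centroids = [(100.0, 0.0), (-400.0, 300.0), (0.0, -200.0)]

ga_choice = greedy_cluster(cluster_ages)
nn_choice = nearest_cluster((0.0, 0.0), centroids)
```

In this toy case the two rules disagree: the oldest cluster is also the farthest one, which is the tension between age and transmit power that the trade-off weight is meant to balance.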
We build a DQN with five hidden layers (64, 128, 256, 128, and 128 neurons), $\alpha = 0.0001$, the Adam optimizer, a replay buffer of size 100,000, and $\gamma = 0.99$, and train it for 100,000 episodes using the PyTorch framework on an NVIDIA Tesla V100 GPU. The space complexity of the proposed DQN, measured by its number of parameters (weights and biases), is 344,290 parameters, which require around 30 MB of memory. In terms of computational complexity, the model performs 170,816 multiplications and additions. Executing one episode takes 0.0918 s with the proposed algorithm, compared to 0.0665 s for the RW. Throughout this section, the term "ergodic" refers to the time and statistical average.
Figure 3 presents an example trajectory of two UAVs for a trained episode. We can notice that with the NN in Fig. 3a, the UAVs move randomly and schedule the nearest devices. In Fig. 3b, the GA chooses the devices with the highest age regardless of the large path losses. Fig. 3c shows the trained DRL scheme. Since more devices are located in the upper right section of the map, both UAVs tend to fly over the cluster centroids close to this region, which indicates the learning behaviour. This also suggests that a free flight passing above these centroids could be a low-complexity suboptimal trajectory.
Figure 4a depicts the accumulative reward for the DRL and RW schemes for different values of $\lambda$. It is not a surprise that higher $\lambda$ values yield lower accumulative rewards, due to the nature of the reward function in (12). However, the DRL scheme offers a significant improvement in the reward compared to the RW for all $\lambda$ values. Looking at Figs. 4b and 4c, it was also expected that, for all schemes except the DRL, neither the age nor the power consumption is affected by $\lambda$. This shows the adaptability of the DRL scheme, where one can choose to prioritize either the age or the power consumption using the same algorithm. Thus, it can approach the age achieved by the GA scheme or the low power consumption achieved by the NN scheme.
This trade-off can be observed in Fig. 5, which shows the achievable regions of age and power for the DRL scheme for different values of $\lambda$. The DRL scheme appears as lines, since it benefits from the variation of $\lambda$, whereas the other schemes are just static points. Another important insight is that increasing the number of UAVs, as well as decreasing the number of IoT devices, improves both the age and the transmit power in the achievable region.

V. CONCLUSIONS
In this paper, we considered a relatively large IoT network, where multiple UAVs serve as mobile relay nodes with the objective of minimizing the age of information and the energy consumption. The problem was formulated as an optimization problem to plan the trajectory of the UAVs from one charging depot to another such that the ergodic age and energy consumption of the network are minimized. We addressed the problem by proposing a DRL-based solution, where the BS clusters the IoT devices according to their positions and the UAV flight time between grid cells to improve the performance. Our proposed approach outperforms the baseline solutions, namely GA, NN, and RW. In particular, the proposed DRL-based solution provides the best age-energy trade-off in a wide range of scenarios involving different numbers of UAVs and IoT nodes. Another contribution of this work is the simplicity of the proposed solution, which addresses the problem of high dimensionality in the action space, thus enabling its application in a massive IoT deployment scenario with the number of IoT devices in the hundreds as a future extension.