Autonomous 3-D UAV Localization Using Cellular Networks: Deep Supervised Learning versus Reinforcement Learning Approaches

Unmanned aerial vehicles (UAVs) are becoming an integral part of numerous commercial and military applications. In many of these applications, the UAV is required to self-navigate in highly dynamic urban environments. This means that the UAV must be able to determine its location autonomously and in real time. Existing localization techniques rely mainly on the Global Positioning System (GPS) and do not provide a reliable real-time localization solution, particularly in dense urban environments. Our objective is to propose an effective alternative that enables the UAV to autonomously determine its location independently of the GPS and without message exchanges. We therefore propose utilizing the existing 5G cellular infrastructure to enable the UAV to determine its 3-D location without the need to interact with the cellular network. We formulate the UAV localization problem to minimize the error of the RSSI measurements from the surrounding cellular base stations. While exact optimization techniques can be applied to accurately solve such a problem, they cannot provide the real-time calculation that is needed in such dynamic applications. Machine learning based techniques are strong candidates for providing a near-optimal localization solution within the required real-time budget. Accordingly, we propose two machine learning-based approaches, namely, deep neural network and reinforcement learning based approaches, to solve the formulated UAV localization problem in real time. We then provide a detailed comparative analysis for each of the proposed localization techniques along with a comparison with the optimization-based techniques as well as other techniques from the literature.


I. INTRODUCTION
Unmanned aerial vehicles (UAVs) are increasingly demonstrating their potential for use in various military and civilian applications. UAVs have the advantages of a generally small size, convenient use and strong maneuverability, which make them suitable for such applications. The applications of UAVs in the military field include intelligence reconnaissance, emergency response, and geographic survey, while the civilian field applications include package delivery, aerial photography, infrastructure inspection, electric power inspection, agricultural plant protection, search and rescue and environmental monitoring [1]. However, there are several challenges for the UAV to deliver on the aforementioned applications. One of the most important enabling requirements is the ability of the UAV to determine its location at any given point in time. The localization technique used currently by most commercial UAVs is a combination of the GPS and the Inertial Measurement Unit (IMU). However, in dense urban environments and indoor environments, the global navigation satellite systems (GNSS) signals do not provide an accurate and reliable localization solution due to reflections by high-rise structures and line-of-sight (LoS) blockage. Also, the dependence on detectable transmissible signals may compromise the success of certain missions. Hence, it is necessary to design an alternative high-precision positioning method that is not based on GPS or other detectable signals.
As an alternative to the global positioning system (GPS), the radio received signal strength indicator (RSSI) measurements from the surrounding cellular infrastructure can be used for localization purposes due to their simplicity and cost-effectiveness. Since the cellular infrastructure is widely deployed in urban areas worldwide, the reliance on cellular signals can offer an attractive alternative to GPS for UAV localization applications. We focus our attention on the 5G technology in this paper since it is expected to be widely deployed around the world. According to the 3rd Generation Partnership Project (3GPP), the 5G technology offers attractive features such as dense small cell deployments, millimeter wave (mmWave) communications, and device-to-device (D2D) communications [28]. The 5G technology addresses the current cellular network challenges, including ultra-low latency and higher reliability and capacity requirements, by optimizing the network operations to guarantee, in real time, the QoS needs of emerging wireless and IoT services and consequently is expected to be widely deployed [29].
Cellular-connected UAV communication possesses substantially different characteristics that pose new technical challenges as opposed to cellular communication with terrestrial mobile devices. Such challenges include dominance of line-of-sight (LoS) interference and reduced ground base station (GBS) antenna gain [2]. UAVs flying at high altitudes are served from the side-lobes of the base station (BS) antennas [3]. Several studies suggest that cellular networks with communication techniques toward 5G, such as higher antenna gains due to beamforming, can potentially be utilized to provide improved connectivity for UAVs [2] [4]. As such, several recent efforts and open work items are directed towards the integration of the UAVs with 5G cellular networks. Our objective is the autonomous localization of the UAV without the need to actually interact with the cellular network. Accordingly, we focus on the broadcast signals transmitted by the 5G cellular networks which can be detected by any cellular-enabled device. Specifically, the 5G cellular network periodically broadcasts the Synchronization Signal Block (SSB), which is composed of 4 subblocks including the Primary Synchronization Signal (PSS), the Secondary Synchronization Signal (SSS), the Physical Broadcast Channel (PBCH) and the Demodulation Reference Signal (DMRS), mainly for synchronization, cell search and initial beamforming [13]. The UAV can detect and measure the Synchronization Signal Reference Signal Received Power (SS-RSRP) from the surrounding base stations for self-localization purposes.
The UAV localization problem formulation based on cellular measurements is non-trivial and difficult to solve. While exhaustive optimization techniques can be applied to accurately solve such a problem, they cannot deliver the real-time calculation that such dynamic environments require. Machine Learning (ML) based techniques could potentially provide a near-optimal localization solution with the practical real-time calculation that is needed in such dynamic applications. Deep Learning (DL) is a branch of supervised machine learning involving neural networks with several layers capable of capturing complex non-linear relationships between the inputs and the outputs. Deep learning-based approaches typically fall into one of two categories, namely, regression-based models and fingerprint-based techniques, the latter requiring extensive data collection to build a fingerprint database for the training. Reinforcement Learning (RL) is another branch of machine learning where the agent learns through direct interaction with the environment. The Reinforcement Learning model is based on the Markov decision process (MDP). Specifically, Reinforcement Learning agents make decisions, observe the results, and then automatically adjust their policies so that the selected actions maximize a reward.

A. Related Work
There are several recent studies that have investigated the use of cellular signals and/or Machine Learning based approaches in the localization of mobile devices or UAVs.
In [5], the authors propose utilizing higher order Voronoi tessellations at the base station to localize mobile devices in an outdoor environment. In the proposed base station ordering localization technique (BoLT) algorithm, the mobile device sends back a list containing the order of neighboring BSs based on the received signal power, achieving a localization accuracy of a few meters. In [6], CellinDeep, a deep learning-based localization system for mobile devices in an indoor environment, is proposed by creating a footprint map of RSSI measurements at different locations during the training phase. The area is divided into grid cells of 1 m² and RSSI measurements from 17 cell towers are recorded at each cell prior to the training phase. The localization problem is then presented as a classification problem to find the grid cell with the maximum likelihood, followed by a fine localizer module achieving a fine-grained accuracy of 0.78 m. Alternatively in [7], the authors introduce DeepCReg, a convolutional neural network based regressor that leverages cellular data to estimate the location of a mobile device in an outdoor environment. The system achieves a median localization accuracy of 2.82 m in the 2-D localization problem for mobile devices. In [8], an optimization-based approach utilizing carrier phase measurements is proposed to localize and navigate a UAV in 3-D assuming limited GPS presence. According to the authors, this technique realizes a Root Mean Square Error (RMSE) location error of 0.8 m using 7 CDMA BSs and 0.36 m using measurements from 9 CDMA BSs. In both cases, the UAV had access to the GPS for 10 seconds, then the GPS was cut off. During the time the GPS was available, the cellular signals were used to cluster and characterize the clock deviations.
However, to the best of our knowledge, none of the presented works provide an accurate and effective real-time solution to the 3-D localization problem for the UAV in an outdoor environment. The presented techniques either mostly focus on the localization of the agent in the 2-D space assuming the height of the mobile device is known yielding low accuracy when extended to 3-D or require extensive knowledge of the environment during the training phase to build a fine-grained accurate fingerprint map. Such proposed approaches and data collection processes would be infeasible and unscalable in large 3-D outdoor geographic areas and unknown environments.

B. Paper Contributions and Structure
The objective of this paper is to provide an accurate real-time solution for the 3-D autonomous localization of UAVs in an outdoor environment using existing 5G networks independent of GPS or other detectable mobile signals. This is the first work, to the best of our knowledge, to analyze the effectiveness of various machine learning based approaches to provide a real-time solution to the 3-D UAV self-localization problem using 5G cellular networks in an outdoor environment with near-optimal accuracy. The major contributions of this paper can be summarized as follows:
• We formulate the UAV localization problem as an optimization problem in which the drone needs to rely on the RSSI measurements of the surrounding 5G base stations without having to actually interact with these base stations. The objective function minimizes the overall mean least square error to determine the UAV location.
• We propose a 3-D UAV localization algorithm through multi-lateration that is based on 5G RSSI measurements from 4 base stations. We develop an optimization-based approach to determine the optimal bound of the solution to the formulated localization problem. We present Nelder-Mead as a heuristic approach as well as Exhaustive Search based solutions.
• We propose a deep supervised learning approach to provide a near-optimal localization solution with the real-time calculation that is needed in such dynamic environments. We adopt a black-box mapping approach between the inputs, which are the noisy measured distances based on 5G RSSI readings and the 4 gNBs' coordinates, and the output, which is the estimated UAV coordinates.
• We propose a reinforcement learning based approach to provide a practical real-time calculation for the localization problem.
• We provide an in-depth comparative analysis between the proposed deep neural network and reinforcement learning based approaches to assess the efficiency and specific use case for each approach to solve the localization problem.
• We provide a comparative analysis with a benchmark solution to the UAV localization problem using cellular signals proposed in the literature based on cellular carrier phase measurements and weighted non-linear least squares estimation.
• We conduct the complexity analysis for our proposed deep and reinforcement learning approaches to examine whether they meet the real-time calculation requirement.
The rest of this paper is organized as follows. In Section II, we introduce our environmental assumptions and present our system model. In Section III, we formulate the 5G based 3-D UAV localization problem. In Section IV, we solve the formulated UAV localization problem by applying the proper optimization techniques to determine the optimal bound of the solution. In Sections V and VI, we propose two machine learning based techniques to provide real-time near-optimal localization. Specifically, we propose a deep supervised learning approach in Section V as well as a reinforcement learning approach in Section VI. In Section VII, we present and analyze our analytical results as well as present an alternative approach proposed in the literature based on carrier phase measurements for comparison purposes. Finally, Section VIII concludes this paper.

II. SYSTEM MODEL
The RSSI measurements from at least four cellular base stations are needed to localize the UAV in 3-D given that the intersection of 4 spheres (one sphere per gNB) is a point. Accordingly, we consider a suburban outdoor environment with four 5G gNBs located around 100 to 300 m apart as shown in Fig. 1. Let the index $1 \le i \le 4$ specify a given gNB$_i$ and let $d_i$ represent the Euclidean distance between each gNB$_i$ and the UAV. The estimated coordinates of the UAV are denoted by the vector $x$ while the coordinates of the antennas of every gNB$_i$ are denoted by the vector $g_i$, where $x, g_i \in \mathbb{R}^n$ and $n = 3$. Let $h_i$ represent the height of the UAV with reference to the antenna of gNB$_i$ and $r_i$ represent the distance between gNB$_i$ and the vertical projection of the UAV onto the horizontal plane of the gNB$_i$ antenna.
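We adopt the free-space path loss model in accordance with [9] and assume standard practical antenna configurations in accordance with [3], [4]. The measured received power from each base station, given a transmit power of $P_t$, is given by

$$P_i = P_t - L(h_i, r_i)$$

where $L(h_i, r_i)$ represents the probabilistic mean model path loss in both LoS and NLoS links and is given by

$$L(h_i, r_i) = P_{LoS}\, L_{LoS} + P_{NLoS}\, L_{NLoS}$$

where $L_{LoS}$ and $L_{NLoS}$ represent the mean path losses in case of LoS and NLoS scenarios, respectively. $P_{LoS}$ is the probability of LoS conditions between the UAV and the gNB$_i$ while $P_{NLoS}$ is the probability of NLoS conditions. $L_{LoS}$ and $L_{NLoS}$ are given by

$$L_{LoS} = 20 \log_{10}\!\left(\frac{4 \pi f_c d_i}{c}\right) + \eta_{LoS}, \qquad L_{NLoS} = 20 \log_{10}\!\left(\frac{4 \pi f_c d_i}{c}\right) + \eta_{NLoS}$$

where $f_c$ is the carrier frequency, $c$ is the speed of light and $\eta_{LoS}$ and $\eta_{NLoS}$ represent the mean additional losses for LoS and NLoS, respectively. $P_{LoS}$ and $P_{NLoS}$ are functions of the environment-dependent parameters $a$ and $b$ and are given by

$$P_{LoS} = \frac{1}{1 + a \exp\!\left(-b\left[\frac{180}{\pi}\arctan\!\left(\frac{h_i}{r_i}\right) - a\right]\right)}, \qquad P_{NLoS} = 1 - P_{LoS}$$

Consequently, the path loss model can be expressed as

$$L(h_i, r_i) = 20 \log_{10}\!\left(\frac{4 \pi f_c d_i}{c}\right) + P_{LoS}\, \eta_{LoS} + P_{NLoS}\, \eta_{NLoS}$$

Then the estimated distance, $d_i$, between gNB$_i$ and the UAV is given by

$$d_i = \frac{c}{4 \pi f_c}\, 10^{\left(L(h_i, r_i) - P_{LoS}\, \eta_{LoS} - P_{NLoS}\, \eta_{NLoS}\right)/20}$$

For every gNB$_i$, measuring the received SNR and determining the estimated distance, $d_i$, between the UAV and gNB$_i$ will translate into an estimated location of the UAV denoted by the vector $x_i$, which is potentially any point on the surface of the sphere with center $g_i$ and radius $d_i$ given by

$$\lVert x_i - g_i \rVert = d_i$$

The multi-lateration algorithm, to be described shortly, uses the RSSI measurements, $P$, and the measured path losses, $L$, between the UAV and each of the 4 gNBs given by

$$P = [P_1, P_2, P_3, P_4]^T, \qquad L = [L_1, L_2, L_3, L_4]^T$$

The estimated distances, $D$, based on the noisy RSSI readings between the UAV and every gNB$_i$ are given by

$$D = [d_1, d_2, d_3, d_4]^T$$

We assume that the probability density function of the distance measurement noise, $\varepsilon_i$, follows a zero-mean Gaussian distribution in accordance with [1]. Accordingly, we model the distance $d_i$ between the UAV and gNB$_i$ as

$$d_i = \bar{d}_i + \varepsilon_i$$

where $\bar{d}_i$ corresponds to the actual distance between the UAV and gNB$_i$.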
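The path loss model described above can be sketched numerically. The following is a minimal illustration, assuming the widely used air-to-ground form in which the LoS probability is a sigmoid of the elevation angle and the mean path loss is the free-space loss plus the probability-weighted additional losses. The function names and the numeric parameter values in the usage note (carrier frequency, `a`, `b`, `eta_los`, `eta_nlos`) are illustrative assumptions and are not taken from Table 2.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def p_los(h, r, a, b):
    """Assumed sigmoid LoS probability for UAV height h and horizontal distance r (m)."""
    theta = math.degrees(math.atan2(h, r))  # elevation angle in degrees
    return 1.0 / (1.0 + a * math.exp(-b * (theta - a)))


def mean_path_loss_db(h, r, fc, a, b, eta_los, eta_nlos):
    """Mean path loss in dB: free-space loss plus probability-weighted additional loss."""
    d = math.hypot(h, r)  # straight-line distance between the UAV and the gNB
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * d / C)
    p = p_los(h, r, a, b)
    return fspl + p * eta_los + (1.0 - p) * eta_nlos


def distance_from_path_loss(loss_db, p, fc, eta_los, eta_nlos):
    """Invert the mean path loss for the distance, given the LoS probability p."""
    excess = p * eta_los + (1.0 - p) * eta_nlos
    return C / (4.0 * math.pi * fc) * 10.0 ** ((loss_db - excess) / 20.0)
```

For example, with hypothetical suburban-like values `a=4.88`, `b=0.43`, `eta_los=0.1`, `eta_nlos=21.0` and `fc=3.5e9`, computing the loss for `h=50`, `r=100` and feeding it back through `distance_from_path_loss` recovers the original straight-line distance. In practice the inversion is only approximate, because the LoS probability itself depends on the unknown geometry.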
According to [16], the centers of 3 spheres in the 3-D space with radii $d_i$, $d_j$ and $d_k$, given by the coordinates of the gNBs $g_i$, $g_j$ and $g_k$, are the vertices of a triangle defining a 2-D plane. The intersection points of the 3 spheres lie on a straight line orthogonal to this plane.
The intersection of this straight line with the plane is a point expressed in barycentric coordinates with respect to the triangle. Accordingly, we approximate the UAV location by $x_L = (x_L, y_L, z_L)$, obtained from this point, the normal vector $n_L$ to the 2-D plane and the vector $h_L$ corresponding to the height of the UAV relative to the plane, where the index $L \in [1, 4]$. Next, our objective is to find the estimated location $x_e = (x, y, z)$ that minimizes the error of measurements from the 4 gNBs.
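The three-sphere construction can be illustrated with the standard closed-form trilateration procedure. This is a sketch under stated assumptions rather than the exact barycentric formulation of [16]: it builds an orthonormal frame in the plane of the three gNBs, locates the foot point in that plane, and then moves along the plane normal by the recovered height. The helper names are hypothetical.

```python
import math


def sub(u, v): return [a - b for a, b in zip(u, v)]
def add(u, v): return [a + b for a, b in zip(u, v)]
def scale(u, k): return [a * k for a in u]
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def trilaterate(g1, g2, g3, d1, d2, d3):
    """Return the two candidate points at distances d1, d2, d3 from g1, g2, g3."""
    base = norm(sub(g2, g1))
    ex = scale(sub(g2, g1), 1.0 / base)        # first in-plane axis
    i = dot(ex, sub(g3, g1))
    ey_raw = sub(sub(g3, g1), scale(ex, i))
    ey = scale(ey_raw, 1.0 / norm(ey_raw))     # second in-plane axis
    n = cross(ex, ey)                          # unit normal to the gNB plane
    j = dot(ey, sub(g3, g1))
    u = (d1 ** 2 - d2 ** 2 + base ** 2) / (2.0 * base)
    v = (d1 ** 2 - d3 ** 2 + i ** 2 + j ** 2) / (2.0 * j) - (i / j) * u
    # Noisy distances can make the squared height slightly negative; clamp at zero.
    height = math.sqrt(max(d1 ** 2 - u ** 2 - v ** 2, 0.0))
    foot = add(g1, add(scale(ex, u), scale(ey, v)))
    return add(foot, scale(n, height)), sub(foot, scale(n, height))
```

The two returned candidates are mirror images across the gNB plane. For a UAV flying above the base stations, the candidate on the upper side of the plane is the physically meaningful one, and repeating the construction for each triple of gNBs yields one estimate per triple.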

III. PROBLEM FORMULATION
We formulate the UAV localization problem as an optimization problem to minimize the sum of the least square estimate for the errors of the measured distances between the UAV and the $m = 4$ gNBs. The objective can be written as

$$\min_{x} \sum_{i=1}^{m} \left( \lVert x - g_i \rVert - d_i \right)^2$$

subject to constraints that bound the path loss to each gNB by the maximum tolerable path loss, $L_{th}$, and enforce the minimum height requirement of the UAV. The objective function in (19) results in a constrained nonlinear programming (NLP) problem classified as an NP-hard problem, which is difficult to solve. Accordingly, we utilize the penalty method to reformulate the objective function in the form of an unconstrained NLP problem in which each constraint violation is added to the objective, weighted by a penalty coefficient $\psi_i > 0$.

For each gNB$_i$, we assume each distance measurement noise, $\varepsilon_i$, is an independent random variable with standard deviation $\sigma_i$ and probability density function given by

$$f(\varepsilon_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left(-\frac{\varepsilon_i^2}{2\sigma_i^2}\right)$$

The joint probability density function of the distance measurement errors is given by the product of the four probability densities [31]:

$$f(\varepsilon) = \prod_{i=1}^{4} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left(-\frac{\varepsilon_i^2}{2\sigma_i^2}\right)$$

Next, we derive theoretical lower and upper bounds for the localization error vector to analyze the impact of the accuracy of the RSSI measurements from the four gNBs on the UAV localization error. The Cramér-Rao Lower Bound (CRLB) is one of the commonly utilized metrics for characterizing the lower bound of the accuracy limitations in RF-based localization applications [5]. The CRLB is calculated through the inverse of the Fisher information matrix of the RSSI measurement errors [32]. Let the Fisher information matrix, $I$, be defined as an $n \times n$ matrix with element $I_{j,k}$ given by

$$I_{j,k} = -\,\mathbb{E}\!\left[\frac{\partial^2 \ln f(\varepsilon)}{\partial x_j\, \partial x_k}\right]$$

The CRLB for the distance measurement error then follows from the inverse of $I$. We also derive an upper bound to guarantee a minimum confidence level for the UAV localization error. Specifically, the upper bound is the value below which the magnitude of the distance measurement noise stays with at least that confidence level. As such, the theoretical lower and upper bounds for the UAV localization error can be established based on the channel propagation conditions of the RSSI measurements from the four gNBs.
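The least-squares objective and its penalty reformulation can be sketched as follows. This is an illustration under assumptions: only a height constraint is modelled, the penalty is a quadratic hinge, and the names `ls_error`, `penalized_objective` and the default `psi` are hypothetical. The path loss constraint of the formulation would enter as an additional hinge term of the same kind.

```python
import math


def ls_error(x, gnbs, dists):
    """Sum of squared differences between geometric and measured distances."""
    return sum((math.dist(x, g) - d) ** 2 for g, d in zip(gnbs, dists))


def penalized_objective(x, gnbs, dists, h_min, h_max, psi=1e3):
    """Unconstrained objective: least-squares error plus a quadratic height penalty."""
    # Violation is zero inside [h_min, h_max] and grows linearly outside it.
    violation = max(0.0, h_min - x[2]) + max(0.0, x[2] - h_max)
    return ls_error(x, gnbs, dists) + psi * violation ** 2
```

With noise-free distances the objective evaluates to zero at the true UAV location, and any candidate outside the allowed height band is pushed away by the penalty term regardless of how well it fits the measured distances.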

IV. THE OPTIMIZATION BASED APPROACH
Our objective in this section is to accurately solve the localization problem in (21). Several techniques can be utilized to provide optimal and near-optimal solutions with high accuracies. The exact solution can be found by utilizing the Exhaustive Search method bounded by the coverage area of the 4 gNBs. Despite its accuracy, this technique involves a significant complexity given the large area in which the UAV is allowed to fly. To illustrate its complexity, we perform a complexity analysis presented in big-O notation. We let n denote the algorithm's variable input size, including the RSSI readings from m gNBs as measured by a UAV to be localized in 3-D at a given time instant and their corresponding coordinates. In Table 1, we illustrate the computational function of each complexity notation. Initially, we solve the multilateration algorithm, which contributes a complexity that is linear in n. We then state the worst-case complexity of the computational steps for each optimization algorithm. For Exhaustive Search, the complexity is proportional to the flying space of the UAV bounded by the coverage area of the 5G gNBs, which is typically very large; the resulting expression is given in (29). To solve the localization problem at hand without the substantial number of iterations and complexity, we propose to utilize the Nelder-Mead (NM) algorithm as a heuristic optimization approach. The Nelder-Mead algorithm is convenient and efficient in solving unconstrained multidimensional objective functions and does not require any differentiation to search for the optimal point [17]. The Nelder-Mead algorithm is composed of four operations, namely, reflection, expansion, contraction, and shrink [18]. The reflection, expansion, contraction and shrink coefficients are given by α, γ, ρ and σ, respectively. As shown in Algorithm 1, we use the estimates obtained by equation (15) as the initial test points.

Next, we analyze the computational complexity of the steps described in Algorithm 1. This method includes an outer loop of n iterations and an inner loop of t iterations. Accordingly, the computational complexity of the Nelder-Mead based localization technique, given in big-O notation in (30), combines the cost of the multilateration step with the cost of the nested iterations. The computational complexity for both approaches is proportional to the number of iterations needed to reach the optimal solution. As such, we conclude that the Exhaustive Search and Nelder-Mead algorithms are not guaranteed to meet the real-time requirement, but they are used to determine the true bound of the solution as benchmarks for comparison with the machine learning based approaches proposed in the next sections.
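The four Nelder-Mead operations can be sketched in a few lines. The version below is a generic textbook variant and not a reproduction of Algorithm 1: the coefficient defaults (α = 1, γ = 2, ρ = 0.5, σ = 0.5) are the commonly used values, the stopping rule is an assumption, and the initial simplex in the usage note is an arbitrary guess rather than the multilateration estimates of equation (15).

```python
import math


def nelder_mead(f, simplex, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5,
                max_iter=1000, tol=1e-12):
    """Minimize f starting from a list of n+1 points in n dimensions."""
    pts = [list(p) for p in simplex]
    for _ in range(max_iter):
        pts.sort(key=f)
        best, worst = pts[0], pts[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        dim = len(best)
        # Centroid of every point except the worst one.
        centroid = [sum(p[k] for p in pts[:-1]) / (len(pts) - 1) for k in range(dim)]
        reflected = [c + alpha * (c - w) for c, w in zip(centroid, worst)]
        if f(best) <= f(reflected) < f(pts[-2]):
            pts[-1] = reflected                                   # reflection
        elif f(reflected) < f(best):
            expanded = [c + gamma * (r - c) for c, r in zip(centroid, reflected)]
            pts[-1] = expanded if f(expanded) < f(reflected) else reflected  # expansion
        else:
            contracted = [c + rho * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                pts[-1] = contracted                              # contraction
            else:                                                 # shrink toward the best point
                pts = [best] + [[b + sigma * (v - b) for b, v in zip(best, p)]
                                for p in pts[1:]]
    return min(pts, key=f)


def range_error(x, gnbs, dists):
    """Sum of squared range residuals used as the objective in this sketch."""
    return sum((math.dist(x, g) - d) ** 2 for g, d in zip(gnbs, dists))
```

A call such as `nelder_mead(lambda x: range_error(x, gnbs, dists), start_simplex)` with four gNB positions and a four-point starting simplex returns a location estimate. The number of iterations is not known in advance, which is the property that makes the method unsuitable as a real-time guarantee.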

V. THE PROPOSED DEEP LEARNING BASED APPROACH
Our objective in this section is to develop a machine learning based model capable of capturing the non-linear correlated relationships between the UAV location and the measured 5G RSSI readings from the 4 gNBs. In this section, we treat the optimization problem as a black-box mapping process between the inputs and the outputs. We propose to leverage deep supervised learning to solve the formulated error minimization problem in a computationally efficient manner such that it can be practically used for real-time UAV localization. We propose utilizing an Artificial Neural Network (ANN) as a framework to learn the mapping between the inputs and outputs. There are various types of ANNs. We propose to leverage a Multi-Layer Perceptron (MLP) composed of a deep feedforward neural network architecture, as no feedback loop is required between the inputs. The model is trained offline to resolve the 3-D UAV localization problem as a regression problem, as opposed to a classification problem, to enable scalability with the 3-D UAV localization area. The training data needed for the proposed deep supervised learning approach are synthetically generated by applying the Nelder-Mead and Exhaustive Search optimization-based techniques that we discussed in Section IV.
As summarized in Algorithm 2, the input to this black-box mapping is the noisy measured distances based on the 5G RSSI readings together with the gNB coordinates, and the output is the estimated UAV coordinates. We train a feed-forward neural network to learn this mapping. As shown in Fig. 2, the input vector is a 16 × 1 vector composed of the 3-D coordinates of the 4 gNBs and their corresponding RSSI readings, while the output is a 3 × 1 vector corresponding to the estimate of the 3-D UAV location. We study the effect of various neural network architectures, backpropagation training functions and hyperparameter tuning on the performance of the proposed deep learning algorithm. The neural network weights are trained in a supervised manner with forward and back-propagation algorithms to optimize a given performance function [19].
There are various implementations of backpropagation that build on the standard backpropagation algorithm. Specifically, we use Levenberg-Marquardt backpropagation and Bayesian Regularization, which are two backpropagation training functions used for neural network approximations. The Levenberg-Marquardt (LM) algorithm is a robust backpropagation algorithm for performance function approximation [21], [22]. It is a pseudo-second order method which estimates the Hessian matrix, a square matrix used to describe the local curvature of the function, using the sum of outer products of the gradients. The Levenberg-Marquardt algorithm does not consider the outliers in the data, which may lead to overfitting noise. To overcome this problem, Bayesian Regularization can be applied to the neural network learning problem. Bayesian Regularization expands the cost function used by the model to minimize the error as well as the effective number of parameters using the minimal weights [23]. Bayesian Regularization adds little overhead to the Levenberg-Marquardt process given that a Hessian approximation is already computed. Next, we perform the complexity analysis of the proposed deep learning approach in big-O notation. The complexity of the proposed deep learning algorithm operating in the online phase is linear, with a positive constant factor that is determined by the neural network dimensions, including the number of hidden layers, Hd. The computational complexity of the deep learning based technique is therefore proportional only to the neural network size, as opposed to the number of iterations required by the optimization-based techniques. As such, the proposed deep learning-based technique can be used in real time to solve the formulated UAV localization problem.
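The claim that the online cost depends only on the network size can be made concrete with a small forward-pass sketch. The layer widths `[16, 32, 32, 3]`, the tanh hidden activation and the random untrained weights are illustrative assumptions; only the 16-element input and 3-element output come from the description above.

```python
import math
import random


def init_mlp(sizes, seed=0):
    """Create untrained (weights, biases) pairs for consecutive layer sizes."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    """One inference pass: tanh on hidden layers, linear output for regression."""
    for k, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if k < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def multiply_adds(sizes):
    """Number of multiply-add operations in one forward pass."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
```

For the hypothetical widths `[16, 32, 32, 3]`, one location estimate costs `16*32 + 32*32 + 32*3 = 1632` multiply-adds. That count is fixed once the architecture is chosen, in contrast to the iteration-dependent cost of the optimization-based techniques.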

VI. THE PROPOSED REINFORCEMENT LEARNING BASED APPROACH
As mentioned earlier, our objective is to come up with a real-time efficient solution for the UAV 3-D localization problem. Therefore, we also devised an RL based algorithm in order to determine the most suitable ML based technique for this problem. RL is a branch of ML that learns directly by interacting with the environment. It addresses problems where there is no explicit training data available. Q-Learning (QL) is a type of RL involving agents which make decisions, observe the results, and then automatically adjust their policies so that the selected actions maximize a reward. It is worth noting that a QL agent can be trained in an offline or online implementation. The traditional table-based QL, proposed by Watkins [10], maximizes the expected value of the total reward over all successive steps by taking an action in the current state and following an optimal policy afterwards. The tabular QL does not scale with the increase in the size of the state space. In most real applications, there are too many states to visit and keep track of. For scalability, function approximation is used to approximate the value function of each state-action pair through a number of iterations. An artificial neural network representation, also known as a deep Q-network (DQN), can be used for non-linear function approximations [24][25][26][27]. The goal is to select the action which has the maximum Q-value using the following update rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where s is the state, a is the action, α is the learning rate, r is the reward attained for the current state-action pair, s' is the next state and a' ranges over the actions available in s'. The discount factor γ determines the importance of future rewards, i.e. a high discount factor sets the priority towards distant rewards, whereas a lower value will force the agent to consider only immediate rewards. In the next subsections, we first present the Q-learning environmental setup. We then detail the proposed QL architecture and the overall operation for the formulated UAV localization problem.
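As a small illustration of the Q-learning update rule, the sketch below applies the standard tabular update to a dictionary-backed Q-table. The function name `q_update`, the default hyperparameters and the string state and action labels in the usage note are hypothetical; the tabular form is shown only to make the roles of α and γ concrete, since the proposed approach uses a neural network approximation instead.

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to q[(s, a)] and return the new value."""
    # Unvisited state-action pairs default to a Q-value of zero.
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```

With `gamma=0.0` the bootstrap term vanishes and the Q-value simply tracks the immediate reward: starting from an empty table, two updates with `r=-1.0` and `alpha=0.5` move `Q(s, a)` to -0.5 and then to -0.75.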

A. Q-LEARNING ENVIRONMENT
In this subsection, we define the state, action and reward for the Q-learning agent. We define the state, s, as a 16 × 1 vector composed of the 3-D coordinates and the corresponding RSSI readings from the 4 gNBs. Let the action space correspond to the allowable flight area of the UAV bounded by the coverage area of the 4 gNBs to maintain the SINR threshold along with the minimum height requirement of the UAV as defined by the constraints in equation (19). We discretize the 3-D action space, A, into equally spaced increments. For example, a 200 × 200 × (h_max − h_min) m³ area can be divided into 1 m increments along each axis. Assuming h_min = 30 m and h_max = 100 m, the three axes then have 201, 201 and 71 grid points, respectively, so A contains 201 × 201 × 71 = 2868471 discrete candidate locations, each of which is a 3 × 1 vector. We also investigate the effect of discretizing the action space into smaller increments, e.g. 0.1 m, in our evaluation to enable localizing the UAV to decimeter accuracy.
We define the action as a 3 × 1 vector corresponding to the estimated UAV location. We define the reward as the negative of the mean square error. We approximate the Q-value function by setting the discount factor γ to zero to consider only the immediate reward.
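The size of the discretized action space and the reward can be sketched as follows. The grid count follows directly from the example above; the reward expression is an assumption, taken here as the negative mean of the squared differences between the distances implied by a candidate location and the measured distances. The function names are hypothetical.

```python
import math


def axis_points(lo, hi, step):
    """Number of grid points on one axis, including both end points."""
    return int(round((hi - lo) / step)) + 1


def action_space_size(side_x, side_y, h_min, h_max, step):
    """Number of discrete candidate locations in the flight volume."""
    return (axis_points(0, side_x, step)
            * axis_points(0, side_y, step)
            * axis_points(h_min, h_max, step))


def reward(action, gnbs, dists):
    """Assumed reward: negative mean squared range residual of a candidate location."""
    errs = [(math.dist(action, g) - d) ** 2 for g, d in zip(gnbs, dists)]
    return -sum(errs) / len(errs)
```

`action_space_size(200, 200, 30, 100, 1.0)` gives 2868471 candidates, and refining the step to 0.1 m raises the count to 2806804701, which illustrates how quickly the action space grows with finer discretization.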

B. DEEP Q-NETWORK
In this subsection, we present the proposed architecture to estimate the Q-value function. We utilize a DQN representation, termed the critic network, to estimate the Q-value for a given state-action pair. The critic network architecture is shown in Fig. 3. The input to the neural network is a 19 × 1 vector composed of the state and action pair, and the output is a scalar estimate of the immediate reward. The network is composed of an addition layer followed by Hr hidden layers with tanh activation functions. The Q-learning agent interacts directly with the environment and the critic network is trained according to the set hyperparameters. We train the critic network by applying experience replay and batch training, where the state, action, reward, and transition of each step are stored together as one item in the experience pool, and a fixed number of samples from the experience pool is selected in each step and fed into the neural network to update its weights [30]. We apply Bayesian regularization backpropagation to determine the critic network learnable parameter, θ, that minimizes the MSE for the Q-value approximation.
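The experience replay mechanism can be sketched with a fixed-length buffer and uniform random sampling. The class name `ReplayBuffer` and its methods are hypothetical and do not correspond to any particular toolbox; the buffer length and batch size would be taken from the agent hyperparameters.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-length experience pool with uniform random batch sampling."""

    def __init__(self, capacity, seed=None):
        self.items = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        """Append one (state, action, reward, next_state) experience."""
        self.items.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a batch without replacement, capped at the current buffer size."""
        return self.rng.sample(list(self.items), min(batch_size, len(self.items)))

    def __len__(self):
        return len(self.items)
```

Sampling uniformly from the pool, rather than training on consecutive steps, reduces the correlation between the samples in each batch.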

C. THE OVERALL Q-LEARNING APPROACH
In this subsection, we detail the overall algorithm and computational complexity for the proposed reinforcement learning based approach. As shown in Fig. 4, the current 16 × 1 state vector, s, is input to the agent. The agent multiplies the state vector, s, by a 1 × |A| vector of ones, J, where |A| is the number of discrete actions. The resulting 16 × |A| matrix, which holds one copy of the state per candidate action, is then applied to the addition layer of the critic network to evaluate the approximate Q-value for each state-action pair.
The action with the maximum Q-value is then selected as the most likely estimate for the UAV location. The proposed reinforcement learning based approach is summarized in Algorithm 3. We now perform the complexity analysis of the proposed RL based approach in big-O notation. In this case, the complexity depends on the discretization step size of the flying space of the UAV bounded by the coverage area of the 5G gNBs. To estimate a UAV location at a single point, we first resize the state vector by performing a copy operation with constant time complexity. A critic network is used to approximate the function mapping between the inputs, corresponding to the gNB coordinates, the RSSI readings and an estimate of the UAV location, and the output, which is a scalar value corresponding to the reward. The complexity of the critic network is linear, with a positive constant factor that is determined by the neural network dimensions, including the number of hidden layers of the critic network, Hr. Finally, the action with the maximum reward is selected as the estimated UAV location. Consequently, the complexity of the reinforcement learning algorithm is proportional to the critic network size and the size of the discrete action space, independent of any iterations needed to converge to an optimal solution, as is the case for optimization-based techniques. As such, the proposed RL based technique can be used to solve the formulated UAV localization problem while meeting the real-time requirement.
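The online selection step can be sketched as an exhaustive scoring of the discrete actions followed by an argmax. This is an illustration under assumptions: a trained critic network is replaced by a stand-in function that returns the negative mean squared range residual, the state is represented as a list of gNB position and measured distance pairs rather than the 16 × 1 vector, and all names are hypothetical.

```python
import math
from itertools import product


def grid(lo, hi, step):
    """Equally spaced grid values from lo to hi inclusive."""
    n = int(round((hi - lo) / step))
    return [lo + k * step for k in range(n + 1)]


def stand_in_critic(state, action):
    """Stand-in for the trained critic: negative mean squared range residual."""
    return -sum((math.dist(action, g) - d) ** 2 for g, d in state) / len(state)


def select_action(state, xs, ys, zs, critic=stand_in_critic):
    """Score every candidate location once and return the highest-scoring one."""
    return max(product(xs, ys, zs), key=lambda a: critic(state, a))
```

The cost of `select_action` is one critic evaluation per grid point, so it scales with the size of the action space and the critic size but involves no iterative convergence loop. On a coarse 10 m grid that contains the true location, the stand-in critic selects that location exactly when the distances are noise-free.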

Algorithm 3: Reinforcement Learning Approach
Input: Action space A, batch size ℬ, experience replay buffer length.
1: Random initialization of network weights and biases
2: for training episodes ← 1 to K
3:   while (MSE > ℯ) do
4:     Sample a random batch of size ℬ from the training data.

VII. EVALUATION RESULTS
In this section, we conduct simulations to evaluate the performance of each of the proposed approaches to solve the UAV localization problem formulated in section III. We consider a 500 × 500 × 100 m³ outdoor flight area with the system and environmental parameter settings shown in Table 2. We assume random noise with a zero-mean Gaussian distribution and standard deviation σi. As shown in Table 3, we consider various gNB-UAV placement scenarios. The deep learning hyperparameter settings used during the training phase are shown in Table 4. The reinforcement learning agent hyperparameter settings used to train the critic network are summarized in Table 5. We use the mean-square error (MSE) as the merit function to fit the weights of the neural networks during the offline phase. For benchmarking purposes, we evaluate the optimal results obtained with Exhaustive Search, Nelder-Mead, and the carrier phase optimization-based approach proposed in [8], which is described in detail in section VII.A. Then, we perform an in-depth comparative analysis of the performance of each of our proposed deep neural network and reinforcement learning based approaches against the optimal benchmark techniques.
Specifically, we assess the efficiency and specific use case of each of the deep neural network and reinforcement learning based approaches to solve the localization problem. In our proposed deep and reinforcement learning approaches, the neural networks are trained during the offline phase with the objective of providing a real-time near-optimal localization solution. All the results are based on a data size of 100 runs and a 98% confidence analysis.
Table 3: gNB-UAV placement scenarios
Scenario 1: gNBs are uniformly placed ~150 m apart and the UAV is conveniently located in the middle above the gNBs.
Scenario 2: gNBs are uniformly placed ~150 m apart and the location of the UAV relative to the gNBs is random.
Scenario 3: gNBs are randomly placed and the UAV is located above in the middle relative to the gNBs.
Scenario 4: gNBs are randomly placed and the location of the UAV is random.
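As an illustration of the 98% confidence analysis over 100 runs mentioned above, the following sketch computes a two-sided confidence interval for the mean localization error. It assumes a normal approximation to the sampling distribution of the mean, which the text does not specify; the function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(samples, confidence=0.98):
    """Return (mean, lower, upper) using a normal approximation (an assumption)."""
    n = len(samples)
    center = mean(samples)
    # Two-sided interval: put (1 - confidence) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(samples) / math.sqrt(n)
    return center, center - half_width, center + half_width
```

With 100 runs the normal approximation is usually close to the Student-t interval; for much smaller sample sizes a t-quantile would be the more cautious choice.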

A. THE CARRIER PHASE APPROACH
In order to compare our proposed techniques to a representative technique from the literature, we present the optimization-based approach utilizing carrier phase measurements that is proposed in [8] to solve the UAV localization problem. The authors utilize cellular signals to localize and navigate a UAV assuming limited GPS presence. They assume the initial presence of GPS to accurately determine the carrier phase ambiguity for each of the N base stations (BSs), then leverage the relative stability of the BS clocks to estimate the UAV location at a later stage when GPS is cut off. The carrier phase observable in meters from BSi at time t = k, corresponding to equation (11), is expressed in terms of the wavelength of the carrier signal, the observed carrier phase reading ɸ for BSi, the carrier period T and the zero-mean Gaussian measurement noise. The carrier phase observable can be re-parametrized in terms of the UAV and BSi 2-D position vectors, the receiver and cellular BS clock biases, the speed of light and the carrier phase ambiguity for BSi. The clock bias and carrier phase ambiguity terms can be combined into one term. According to the study, cellular BSs possess tight carrier phase synchronization, which results in very similar clock biases up to an initial bias, and the authors proceed to leverage this relative frequency stability to reduce the number of parameters to be estimated.
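For orientation, one plausible form of the re-parametrized observable, assembled only from the quantities named in the paragraph above, is given below. All symbols are assumed notation, since the original symbols are not recoverable here, and the expression should be read as a sketch rather than as the exact equation of [8].

```latex
% Assumed notation: z_i(k) observable for BS_i at time k; r_r(k), r_{s_i} the UAV and
% BS_i 2-D positions; \delta t_r, \delta t_{s_i} the receiver and BS_i clock biases;
% c the speed of light; \lambda the carrier wavelength; N_i the carrier phase
% ambiguity; v_i(k) the zero-mean Gaussian measurement noise.
\begin{align}
  z_i(k) &= \lVert r_r(k) - r_{s_i} \rVert_2
            + c\,[\delta t_r(k) - \delta t_{s_i}(k)] + \lambda N_i + v_i(k), \\
  c\,\delta t_i(k) &\triangleq c\,[\delta t_r(k) - \delta t_{s_i}(k)] + \lambda N_i, \\
  z_i(k) &= \lVert r_r(k) - r_{s_i} \rVert_2 + c\,\delta t_i(k) + v_i(k).
\end{align}
```

The second line is the lumping of the clock bias and ambiguity terms into a single per-BS term, which is what allows the later reduction in the number of estimated parameters.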
Accordingly, the authors reparametrize the clock biases in terms of a time-varying common bias term as well as a BSi-specific clock bias deviation term. The carrier phase observable is then expressed in terms of the common bias term, an overall measurement noise term that sums two noise contributions, and an initial carrier phase bias term. The initial bias term is estimated at the beginning of the setup, during the GPS availability duration, from the known initial UAV position and the initial measurement. This estimation ignores the initial measurement noise. However, the authors propose to average several measurements over the total duration in which the GPS is initially available. To estimate the common bias term, the authors propose to lump the N BS clocks into clusters, each of size Nl, where N = N1 + … + NL and L is the total number of clusters. Extending this approach to our application scenario, in which the UAV is localized through measurements from four base stations, we let N = 4 and L ≤ 2. Note that, given that the 2-D position of the UAV is being estimated together with L clock clusters, the number of clusters, L, cannot exceed N − 2. Next, the authors utilize a weighted nonlinear least squares (WNLS) estimator to estimate the UAV location. The estimator is iterative, indexed by the WNLS iteration index j, and relies on the measurement noise covariance and the measurement Jacobian H. This method contains an outer loop and an inner loop of iterations. Accordingly, the computational complexity of this localization technique is proportional to the number of iterations of the computational steps needed to converge to an optimal solution and, as such, the technique is not guaranteed to meet the real-time requirement.
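To make the iterative nature of a WNLS estimator concrete, the following sketch runs Gauss-Newton updates for a 2-D position and a single common bias from range-like measurements. It is a simplified illustration under stated assumptions, not the estimator of [8]: one clock cluster (L = 1), a diagonal noise covariance, a measurement model of the form range plus bias, and hypothetical function names.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def wnls_position(stations, z, sigma, x0, max_iters=20, tol=1e-9):
    """Gauss-Newton WNLS estimate of [x, y, bias] for z_i = ||r - s_i|| + bias + noise."""
    est = list(x0)
    for _ in range(max_iters):
        # Accumulate the normal equations (H^T W H) dx = H^T W (z - h(est)),
        # with W = diag(1 / sigma_i^2).
        hth = [[0.0] * 3 for _ in range(3)]
        htr = [0.0] * 3
        for (sx, sy), zi, si in zip(stations, z, sigma):
            dx, dy = est[0] - sx, est[1] - sy
            rng = math.hypot(dx, dy)
            row = [dx / rng, dy / rng, 1.0]  # Jacobian row of range + bias
            resid = zi - (rng + est[2])
            w = 1.0 / si ** 2
            for p in range(3):
                htr[p] += w * row[p] * resid
                for q in range(3):
                    hth[p][q] += w * row[p] * row[q]
        step = solve_linear(hth, htr)
        est = [e + d for e, d in zip(est, step)]
        if max(abs(d) for d in step) < tol:
            break
    return est
```

Each pass rebuilds the Jacobian and solves a small linear system, so the run time scales with the number of iterations needed to converge, which is the property the complexity discussion above refers to.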
It is worth noting that this approach has major drawbacks that render it impractical. It relies on the initial presence of GPS, which in urban areas has the limitations that we discussed earlier. Moreover, the study's results assume that an altimeter is available to accurately estimate the UAV height, and the approach estimates only the 2-D position vector of the UAV due to the poor vertical diversity of cellular phase measurements.

B. THE OPTIMIZATION BASED RESULTS
We first perform simulations to determine the optimal bound of the solution by evaluating the performance of the Exhaustive Search technique, which can usually be done only at smaller system scales due to its complexity. We then assess the performance of the Nelder-Mead technique relative to the global optimal solution obtained by the Exhaustive Search. The objective is to examine the possibility of using the Nelder-Mead technique at larger scales, instead of resorting to exact optimization techniques, due to its lower complexity. This helps researchers decide on the use of such an alternative in similar problems to save considerable time and resources while obtaining reasonably close optimal bounds. In Fig. 5(a), we show the mean UAV localization error, assuming random placement of each of the four gNBs and the UAV, versus different standard deviation values of the noisy RSSI readings. The results show that Nelder-Mead, as a heuristic optimization approach, performs closely to Exhaustive Search. By examining the results, we find that the Nelder-Mead and Exhaustive Search methods result in a mean localization error of 0.7 and 0.5, respectively, for a random UAV-gNB placement scenario and random standard deviations 0.01 < σi < 0.1. In Fig. 5(b), we show the maximum localization error for both algorithms as compared to the bounds derived in section III. We evaluate based on a 98% confidence analysis, assuming optimal propagation conditions for each of the four RSSI measurements. As expected, the maximum UAV localization error is within the theoretical bounds for both optimization-based algorithms.

C. THE DEEP LEARNING BASED RESULTS
In this section, we evaluate the performance of the deep supervised learning-based approach in solving our formulated UAV localization problem. Fig. 6 demonstrates the effect of various neural network architectures, backpropagation training functions and hyperparameter tuning on the performance of the proposed deep learning algorithm. As shown, the localization accuracy improves as the number of nodes increases, since a higher number of nodes enables the neural network to learn the nonlinear correlations. We also investigate the effect of varying the batch size on the neural network performance. As shown in Fig. 6, smaller batch sizes accelerate the training but converge to a suboptimal accuracy compared to larger batch sizes. The performance also improves with increasing the number of network layers. However, the rate of improvement slows down as the number of layers increases, as the network becomes susceptible to overfitting due to the increase in the number of parameters. We also demonstrate the effect of training the neural network using different training functions. The mean square error is lower when applying Bayesian Regularization as opposed to Levenberg-Marquardt, given that the validation criterion prevents overfitting while allowing training for a larger number of epochs. We also show that the localization accuracy increases with the number of training epochs. However, the performance improvement decreases significantly when the number of training epochs exceeds 220, as the network weights converge to a local suboptimal solution.
Our simulation results show that a neural network architecture that consists of 4 hidden layers with at least 100 nodes per layer is needed to capture the non-linear relationships between the inputs and the outputs. We also studied the effect of training the neural network with training data obtained by applying the different optimization algorithms. Our simulation results show that a neural network with a 4-hidden-layer architecture, trained for at least 260 epochs with a minimum training data size of 10⁶ and utilizing Bayesian regularization, yields the best results. Our simulation results also show that the mean localization error of a 4-layer neural network trained for 280 epochs utilizing Bayesian regularization, with a training data size of 10⁶ obtained by applying the Exhaustive Search and Nelder-Mead algorithms, is 2.6 m and 2.7 m, respectively. This confirms our earlier observation regarding the closeness of the optimal bound results of the Exhaustive Search and the Nelder-Mead optimization techniques.
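To give a sense of the parameter growth that accompanies deeper architectures, the following sketch counts the weights and biases of a fully connected network. The layer layout is an assumption for illustration: 16 inputs (four gNB coordinate triples plus four RSSI readings), hidden layers of 100 nodes, and 3 outputs for the UAV coordinates; the text does not state the input and output sizes of the deep network.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed layouts: 16 inputs, 3 or 4 hidden layers of 100 nodes, 3 outputs (x, y, z).
three_hidden_layers = count_parameters([16, 100, 100, 100, 3])
four_hidden_layers = count_parameters([16, 100, 100, 100, 100, 3])
```

Under these assumptions, each additional 100-node hidden layer adds 10,100 trainable parameters, which is consistent with the observation that deeper networks need more data and epochs to avoid overfitting.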

D. THE REINFORCEMENT LEARNING BASED RESULTS
In this section, we evaluate the performance of our proposed reinforcement learning-based approach as compared to the optimal solution obtained by applying the Exhaustive Search and Nelder-Mead techniques and the near-optimal solution obtained by applying the deep learning approach. Fig. 7 shows the effect on performance of increasing the number of layers of the critic network. As expected, the performance improves when the number of network layers increases from 1 to 2, given the quadratic relationship between the inputs and the outputs. However, the performance degrades as the number of layers increases to 3. This can be attributed to the need for additional training episodes and/or a larger experience buffer size to accurately train the network with a higher number of weights and biases. We also show the effect of the experience replay batch size in training the critic network. The performance improves when the experience replay batch size used to train the critic network is increased. Finally, we demonstrate the effect of the action space discretization step size: we are able to reach near-optimal localization results when the action space is discretized to decimeter accuracy. Our simulation results show that a 2-layer critic network with 30 nodes per layer is capable of capturing the non-linear relationship between the inputs, consisting of the RSSI readings, gNB coordinates and the UAV action space, and the output, which is the expected reward or localization error to be minimized. We show that a QL agent with an experience replay batch size of 8 × 10⁴, trained utilizing Bayesian regularization for at least 11,000 episodes with an action space discretization step size of 0.1 m, yields the best error result of 0.87 m, which is comparable to the optimal solution.
Accordingly, we conclude that the proposed reinforcement learning based approach can be utilized to effectively solve the formulated localization problem with comparable accuracy to the optimization-based techniques.
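Because the inference cost of the reinforcement learning approach scales with the size of the discrete action space, it can help to see how quickly that size grows as the step shrinks. The sketch below assumes, purely for illustration, a uniform grid over the full 500 × 500 × 100 m flight volume; the text does not state that the agent evaluates such a full grid, and the function name is hypothetical.

```python
def action_space_size(extent_m, step_m):
    """Number of grid points in a box with side lengths extent_m at spacing step_m."""
    count = 1
    for length in extent_m:
        # round() guards against floating-point error in length / step_m.
        count *= round(length / step_m) + 1
    return count

flight_volume_m = (500, 500, 100)
size_at_1m = action_space_size(flight_volume_m, 1.0)
size_at_decimeter = action_space_size(flight_volume_m, 0.1)
```

Under this assumption, moving from a 1 m step to a 0.1 m step multiplies the number of candidate actions by roughly a thousand, which illustrates the accuracy versus complexity trade-off of the discretization.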

E. OVERALL RESULTS AND ANALYSIS
In this section, we provide a comparative analysis of each of the techniques, including Exhaustive Search, Nelder-Mead, deep learning, reinforcement learning and the carrier phase technique proposed in [8], in terms of mean localization error and time complexity. First, we show the mean localization error for each approach under different gNB-UAV placement scenarios. As shown in Fig. 8(a), the results demonstrate that our proposed deep and reinforcement learning approaches can perform closely to the optimal results realized by the iterative Exhaustive Search, Nelder-Mead and carrier phase approaches.
Next, we show in Fig. 8(b) the time complexity, in big-O notation, that we have derived for each of the presented localization techniques as a function of the problem dimension n. The results demonstrate that the computational complexity of our proposed deep and reinforcement learning based approaches is lower than that of the optimization-based approaches as the problem dimension increases, and accordingly provides the advantage of meeting the real-time constraints needed in such dynamic environments.
Our simulation results show the effectiveness of our proposed deep and reinforcement learning based approaches in solving the UAV localization problem: both perform closely to the optimal bound while providing the required real-time results. The proposed reinforcement learning algorithm achieves a lower localization error than the deep supervised learning-based algorithm, at the expense of a higher computational complexity that is proportional to the discrete action space size.