AMF Optimal Placement based on Deep Reinforcement Learning in Heterogeneous Radio Access Network

To support diverse service requirements such as massive Machine Type Communications and Ultra-Reliable and Low-Latency Communications in 5G scenarios, Network Function Virtualization (NFV) plays an important role in the 5G network architecture for managing and orchestrating network services. As the key network function responsible for mobility management, the Access and Mobility Management Function (AMF) can be deployed flexibly at the edge of the radio access network based on NFV to improve the performance of mobility management. In this paper, the optimal placement of AMF is addressed based on Deep Reinforcement Learning (DRL) in a heterogeneous radio access network, aiming to minimize a network utility that combines the average delay of mobility management requests at AMF, the average number of wired hops to relay the requests and the cost of AMF instances. By considering time-varying features including user mobility and the arrival rate of user mobility management requests, an AMF optimal placement approach is proposed for long-term optimization. Simulation results show that the proposed DRL-based AMF optimal placement approach outperforms the baselines.


Introduction
5G service requirements motivate the evolution of the core network, since the existing core network architecture can hardly meet the demands of new services such as enhanced Mobile Broadband (eMBB), massive Machine Type Communication (mMTC), Ultra-Reliable and Low-Latency Communication (URLLC) and vehicular networks [1,2]. Mobility management, one of the functions of the existing core network, is an indispensable technology that must be reworked for these demands. To deal with the above challenges, Network Function Virtualization (NFV) and Software Defined Networking (SDN) are enabled in the 5G core network architecture [3,4], and NFV-based mobility management is addressed to meet the various service requirements of 5G.
A comprehensive survey of up-to-date solutions based on SDN and NFV for the current EPC network architecture is presented in [4]. Among the solutions concerning mobility management, two categories are based on NFV: one is based on the Mobility Management Entity (MME), including virtual mobility management (vMME) and distributed mobility management (DMME); the other is based on the NG core architecture, which introduces new virtual function components into the network architecture [5,6]. Both categories raise important challenges, mainly related to the placement of the telco-specific VNFs (i.e., MME, PGW, SGW, AMF, SMF and AUSF) to ensure optimal connectivity for users while reducing the deployment cost [7].
* Correspondence: hjin@bupt.edu.cn. The Key Laboratory of Universal Wireless Communications for Ministry of Education, Beijing University of Posts and Telecommunications, 100876 Beijing, China. Full list of author information is available at the end of the article. † Equal contributor.
From the perspective of research on virtual mobility management, several issues have been studied on the placement of vMME/vEPC NFs. Some focus on the mapping of vEPC [7][8][9][10][11][12][13], which is mainly related to the optimization of mobility management on the data plane, while others concentrate on the mapping of vMME [7,[11][12][13][14][15] and the optimal placement of mobility-management-related VNFs. For example, in [7], the placement of a virtual network slice over a federated cloud is proposed by deploying virtual instances of key vEPC NFs, considering both 4G NFs and new NFs of 5G. In [8], the concept of soft Evolved Packet Core (softEPC) is presented to instantiate the virtual network functions of EPC flexibly and dynamically over a physical network topology. Regarding the optimization objectives, operation cost, load balancing and delay/response time are the major ones in these works [7,11,12,14].
Regarding the new mobility-management-related 5G architecture, the main function of mobility management is decomposed into several components, including the Access and Mobility Management Function (AMF), Session Management Function (SMF), Unified Data Management (UDM), User Plane Function (UPF) and Authentication Server Function (AUSF). Different placements of AMF, SMF, UDM, UPF and AUSF bring about different mobility management performance, which sheds light on this research direction [5,6]. In [7], the placement of a virtual network slice is addressed by deploying AMF, SMF, UPF and AUSF in 5G. In [16], based on a Bayesian coalition formation game, the optimal coalition is formed to deploy AMF, UPF and SMF collaboratively; the number of VNF instances to instantiate and the virtual resources allocated to each instance are optimized to maximize the profit of each cloud provider while guaranteeing the Quality of Service (QoS).
Among the network function components related to virtual mobility management in the new 5G architecture, AMF is the most important VNF, and its deployment and performance optimization becomes the bottleneck of mobility management [15]. As a typical NFV-based virtual network component on the control plane, AMF processes mobility management requests from mobile users, with different characteristics of mobility management events in different service scenarios. The instances of AMF can be adjusted by optimization algorithms, which are designed from the aspect of QoS for mobility management and of operators' OpEx/CapEx for AMF. In [15], from the perspective of scalability of the 5G core network, auto-scaling of AMF instances is addressed to cope with the mobility management challenges posed by the massive number of connected devices. With a static threshold method, the decision to scale out or scale in the AMF instances is made by setting threshold parameters on the network load to avoid wasting resources. In [13], EPC is modeled as a simplified VNF-based queuing model, and an auto-scaling algorithm is proposed to obtain the tradeoff between auto-scaling and cost in the 5G network. Although the above static-threshold-based method is effective and simple for auto-scaling of instances, it leads to service level agreement violations, and the thresholds are difficult to configure and change frequently [17]. That is to say, the intrinsic dynamics of user mobility and the arrival rate of user mobility management requests are important factors which should be taken into account in the optimization of mobility management.
To overcome the above threshold configuration problem, machine learning (ML) based approaches have been used [18][19][20]. In [18], [19] and [20], the traffic load of AMF is forecast by a Convolutional Neural Network (CNN), a Deep Neural Network (DNN) and a Long Short-Term Memory network (LSTM), respectively, based on real datasets from mobile networks. With the forecast traffic, AMF instances are scaled optimally. The ML methods achieve better performance than threshold-based solutions. However, historical datasets are required to train the models, and some idealized assumptions are needed. Due to the inherent dynamics of user mobility and mobility management requirements, it is difficult for traditional approaches, which require complete knowledge of the scenario, to cope with the optimization of virtual mobility management. As a model-free method requiring neither datasets nor prior information about the environment, reinforcement learning (RL) has been used as a promising approach to solve resource and service management problems in mobile networks [21,22]. For example, in [23], a Deep Reinforcement Learning (DRL) based approach is proposed to optimize the allocation of different slice requests to the network slices. In [24], a dynamic resource reservation and virtual radio access network resource slicing algorithm is proposed based on DRL to control the increase or decrease of the slice resources based on QoS and user resource utilization. In [25], an accelerated two-step RL algorithm is proposed for proper VNF sizing and placement in a real network under various environments. In [26], a DRL approach is proposed to automatically make optimal network resource allocation decisions by observing the network states in the scenario of in-network caching and device-to-device (D2D) communications. In [27], a deep learning based real-time energy-efficient power control approach is presented, which can operate in an online fashion even in high-mobility scenarios.
From the above investigation, to the best of our knowledge, the optimal placement of AMF based on DRL, taking into account time-varying features including user mobility and the arrival rate of user mobility management requests, remains a challenging open issue.
In this paper, the optimal placement of AMF based on DRL is addressed in the heterogeneous radio access network. The main contributions are listed as follows:
1. The optimal placement of AMF in the heterogeneous radio access network is formulated as a Markov Decision Process (MDP) problem to minimize a network utility including the average delay of mobility management requests at AMF, the average number of wired hops to relay requests and the cost of AMF instances.
2. The time-varying features are considered, including user mobility and the arrival rate of user mobility management requests.
3. To deal with the large state and action spaces, a DRL-based algorithm called AMF-OP-DRL is proposed, in which a DNN is used to estimate the value function of Q-learning. The convergence of AMF-OP-DRL is presented under different batch sizes and learning rates, its performance is evaluated against baseline algorithms, and the impact of several factors on the accumulated network utility is analyzed, including the arrival rate of mobility management requests, the number of VNF instances on each AMF server and the weight coefficients in the network utility.
The rest of this paper is organized as follows. In Section 2, the system model for AMF optimal placement in the HetNet is described. In Section 3, the optimization problem for placing AMF is formulated. The DRL-based algorithm is described in Section 4. In Section 5, the performance of the optimization algorithm is evaluated by simulation. Finally, conclusions are given in Section 6.

System model
In order to meet the mobility management demands of various 5G services, AMF can be placed near the Radio Access Network (RAN). By allowing universal virtual machines (VMs) to host AMF, the number of AMF instances can be orchestrated according to the mobility management requirements and the AMF resources.
In this section, for the scenario of a heterogeneous radio access network, the system model for AMF optimal placement is provided, and the framework of AMF orchestration based on MANO is illustrated. The placement of AMF servers with multiple instances is then modelled as a queuing system. Fig. 1 shows the system model of AMF optimal placement in a heterogeneous radio access network with several macro base stations (MBSs) and small-cell base stations (SBSs).

Scenario of AMF optimal placement in heterogeneous radio access network
In Fig.1, MBSs are deployed to provide wide area coverage, and SBSs are provided to enhance network coverage and increase transmission capacity. Assuming that MBSs and SBSs are deployed and managed by the same network operator, both MBSs and SBSs are equipped with universal servers, and each server is enabled with several AMF instances. There are wired links between BSs. The AMF servers can be placed on MBSs and SBSs in the RAN, which are orchestrated and managed based on MANO.
Each User Equipment (UE) accesses AMF via its attached SBS: the UE interacts with AMF either directly, when an AMF server is placed on its attached SBS, or via one-hop or multi-hop relaying over BSs, which means that all BSs act as relay nodes for accessing AMF.

The framework of AMF optimal placement based on MANO

The framework of AMF optimal placement based on MANO is shown in Fig. 2. In the framework, the related resources of the physical infrastructure include computation resources and network resources. The computation resources residing in the MBSs and SBSs are abstracted as virtual computation resources for AMF, and the network resources refer to the networking links available for transmitting AMF information, which are virtualized by the virtualization layer of the framework.
According to the principles of NFV management and orchestration [3,28], the AMF orchestration modules based on MANO mainly include the NFV Orchestrator for AMF, the VNF Manager for AMF and the Virtualized Infrastructure Manager for AMF.
NFV Orchestrator for AMF is a use case of the NFV Orchestrator, which is responsible for fulfilling resource orchestration functions and network service orchestration functions for AMF. It supports lifecycle management of AMF as a network service, automated management of AMF instances, policy management and evaluation, as well as enhancement of AMF service instances and AMF instances. It collects usage information of NFVI resources by AMF instances or groups of AMF instances, and supports management of the relationship between AMF instances and the NFVI resources allocated to them by using the NFVI resources repository and information received from the Virtualized Infrastructure Manager for AMF [3].
VNF Manager for AMF is in charge of lifecycle management of AMF instances, including AMF instantiation, AMF instance software update/upgrade, modification, scaling out/in and up/down and termination, as well as AMF instance assisted or automated healing and AMF lifecycle management change notification, etc. [3,29].
Virtualized Infrastructure Manager for AMF is designed to orchestrate the allocation/upgrade/release/reclamation of NFVI resources (including the optimization of their usage), and it manages the association of virtualized resources with physical compute, storage and networking resources. It collects performance and fault information of hardware, software and virtualized resources, and forwards performance measurement results and fault/event information related to virtualized resources. It also maintains a repository of inventory-related information for NFVI hardware and software resources, and supports discovery of the capabilities and features (e.g., related to usage optimization) of such resources [3,29].
Based on the layered functionality and the producer-consumer paradigm, the Virtualized Infrastructure Manager for AMF and the NFV Orchestrator for AMF periodically collect monitoring parameters for AMFs from the NFVI resource repositories (e.g., the arrival rates of mobility management requests at AMFs for different service scenarios). The monitoring parameters are processed and selected as the inputs of the AMF optimization algorithm for automated AMF instance management, which resides in the NFV Orchestrator for AMF. Based on the result of the AMF optimization algorithm, the VNF Manager for AMF adjusts AMF instances (e.g., AMF instance placement optimization with certain rules and constraints), while the Virtualized Infrastructure Manager for AMF orchestrates the association of the virtual resources of AMFs with physical computation and networking resources (e.g., the optimal AMF placement and the number of AMF instances in the physical infrastructure).
From the above framework, the NFV Orchestrator for AMF and the Virtualized Infrastructure Manager for AMF are of paramount importance to the optimal placement of AMF. Different placement results bring about different AMF deployment costs and different virtual mobility management performance.
One of the important criteria for evaluating the performance of virtual mobility management is the response time of mobility management requests on the control plane, which is mainly caused by the propagation, transmission and processing delays of mobility management requests. Propagation delay depends on the distance over which requests are transmitted, transmission delay is caused by physical-layer communication, and processing delay includes the queuing delay and serving delay of requests. Regarding the processing of requests at AMF, the queuing delay is affected by the capacity of AMF, while the serving delay is related to the average serving time of each mobility management request by one AMF instance. If AMF is placed in the radio access network, the backhaul delay, the queuing delay within relay routers and the aggregation delay in the backhaul and CN are avoided [30].
According to the above analysis and the system model shown in Fig. 1, it is challenging to investigate the tradeoff between the cost of AMF placement, the communication cost caused by AMF placement and the guarantee of the response time of mobility management requests. In the following subsections, the serving process of AMF with multiple instances is formulated as a queuing system, including a mobility management request model and an AMF queuing model. The key notations in the modelling of AMF optimal placement are shown in Table 1.

Mobility management request model
Based on the system model of Fig. 1, assume that there are N_BS BSs, including MBSs and SBSs, and that N_AMF universal servers are deployed at BSs, each AMF server supporting multiple active instances. N_UE UEs reside in the radio access network. Let BS denote the set of BSs, and UE denote the set of UEs that can access the AMF servers.
Assuming that the orchestration of AMF runs over equal-length time steps, mobility management requests are collected from UEs at the beginning of each time step, and the orchestration decision for AMF is made at each time step. At time step k, the mobility management request rate of UE i is denoted by λ_i(k), which follows a Poisson distribution. In mobility management, the most frequently processed events are Service Request (SR), Service Release Request (SRR) and Handover Request (HR) [31][32][33], whose rates are represented by λ_SR,i(k), λ_SRR,i(k) and λ_HR,i(k), respectively. Thus, the total request rate of UE i is modeled as the sum of the rates of the three kinds of mobility management events, as in (1):

λ_i(k) = λ_SR,i(k) + λ_SRR,i(k) + λ_HR,i(k).    (1)
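As a quick illustration, the superposition property of independent Poisson processes means the aggregate rate in (1) is simply the sum of the per-event rates. A minimal sketch, using the per-user event rates given later in the simulation section:

```python
# Sketch of the request model in (1): the total request rate of UE i is the
# superposition of the three Poisson event streams (SR, SRR, HR).

def total_request_rate(lam_sr, lam_srr, lam_hr):
    """The sum of independent Poisson streams is itself Poisson
    with the summed rate."""
    return lam_sr + lam_srr + lam_hr

# Per-user rates from the simulation section (procedures/s/user).
lam_i = total_request_rate(0.0045, 0.0045, 0.0012)
```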

AMF queuing model
Assuming that all mobility management request packets have the same size, the propagation and transmission delay of one request packet over one hop between BSs is identical. Let H_BSi,BSj be the number of wired hops between BS i and BS j. When a user sends a mobility management request, the request first arrives at its serving BS and is then transmitted to an AMF server for processing. If an AMF server is placed on the serving BS, the request is processed there without any communication cost between BSs; otherwise, the request is relayed to the closest BS on which an AMF server is deployed and processed there.
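The relaying rule above can be sketched as follows. This is a hypothetical illustration that assumes the 4 × 3 BS grid used later in the simulation section and takes H_BSi,BSj as the Manhattan distance on that grid; neither assumption is mandated by the model itself:

```python
# Hypothetical sketch: wired hops between BSs on a 4x3 grid, with each
# request relayed to the closest BS hosting an active AMF server.

def hops(i, j, cols=4):
    """Manhattan hop count between BS i and BS j on a grid (0-indexed)."""
    xi, yi = i % cols, i // cols
    xj, yj = j % cols, j // cols
    return abs(xi - xj) + abs(yi - yj)

def hops_to_nearest_amf(serving_bs, placement, cols=4):
    """Hops from the serving BS to the closest BS with p_j = 1."""
    active = [j for j, p in enumerate(placement) if p == 1]
    return min(hops(serving_bs, j, cols) for j in active)

# AMF servers on BS 0 and BS 11 (opposite corners); UE served by BS 5.
print(hops_to_nearest_amf(5, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))  # -> 2
```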
The serving AMF with multiple active instances on a BS is modeled as an M/M/c queue, where c is the number of active (running) AMF instances on each AMF server.
Assume that each AMF instance serves mobility management requests with serving rate µ and that the serving time follows an exponential distribution; let τ_s = 1/µ denote the average serving time of a request. The mobility management requests from UEs follow a Poisson distribution. Let λ denote the request arrival rate of the AMF queuing model, τ_q the average queuing time, and ρ = λ/µ. Let Sa = λ/(cµ) be the saturation rate of the M/M/c system; by the stability condition of the M/M/c queue, Sa < 1 [35]. Following the standard M/M/c results [34], the average queuing time is obtained as (2):

τ_q = P_0 ρ^c / (c! · cµ · (1 − Sa)^2),    (2)

where P_0 is the empty-system probability given in (3):

P_0 = [ Σ_{n=0}^{c−1} ρ^n/n! + ρ^c / (c! (1 − Sa)) ]^{−1}.    (3)

The average waiting time of mobility management requests is then derived as (4):

τ_w = τ_q + τ_s = τ_q + 1/µ.    (4)
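A minimal sketch of the M/M/c computation in (2)–(4), using the standard Erlang delay result; a useful sanity check is that for c = 1 it reduces to the familiar M/M/1 waiting time λ/(µ(µ − λ)) + 1/µ:

```python
from math import factorial

def mmc_waiting_time(lam, mu, c):
    """Average waiting time (queuing + serving) of an M/M/c queue,
    following the standard Erlang-C derivation; requires Sa = lam/(c*mu) < 1."""
    rho, sa = lam / mu, lam / (c * mu)
    assert sa < 1, "queue is unstable"
    p0 = 1.0 / (sum(rho**n / factorial(n) for n in range(c))
                + rho**c / (factorial(c) * (1 - sa)))          # Eq. (3)
    tau_q = p0 * rho**c / (factorial(c) * c * mu * (1 - sa)**2)  # Eq. (2)
    return tau_q + 1.0 / mu                                      # Eq. (4)
```

For instance, `mmc_waiting_time(5, 10, 1)` gives 0.2 s, matching the M/M/1 formula, and adding instances (larger c) at the same load strictly reduces the waiting time.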
The above modelling of AMF queuing in the heterogeneous radio access network indicates that different placements of AMF servers with multiple instances lead to different waiting times of mobility management requests and different communication costs.

Problem formulation
In order to characterize the dynamics of AMFs in the heterogeneous radio access network, including the placement of AMF servers with active instances on the MBSs and SBSs, the mobility of users, and the delay incurred by AMFs in processing mobility management requests, a Markov process is used to model the optimal placement problem. This motivates the adoption of the Markov Decision Process (MDP) to formulate the problem. In addition, since our aim is to achieve a tradeoff among the cost of AMF placement, the communication cost caused by AMF placement and the guarantee of the response time of mobility management requests, under constraints including the minimum number of placed AMF servers and the maximum arrival rate of mobility management requests, the reward of the MDP is closely related to the Delay, Hops and Cost_AMF models.
In the following subsection, the motivation of problem formulation based on MDP is analyzed, and then the formulation of AMF optimal placement is addressed.

Motivation of problem formulation based on MDP
The AMF queuing model and the mobility management request model reveal that user mobility plays a key role in two ways. On the one hand, the arrival rate of mobility management requests influences the delay of the AMF queuing model, which in turn affects the decision on active AMF server placement, as well as the cost of AMF placement and the number of hops needed to relay the requests. On the other hand, a lower arrival rate of user requests at an AMF server allows its traffic to be offloaded to other active AMF servers, which makes it possible to turn off some AMF servers on some BSs and save AMF consumption cost. Facing the dynamics of user mobility, mobility management requests and AMF placement, it is natural to formulate the AMF optimal placement problem from the perspective of an MDP.
An MDP provides a formalism for reasoning about planning and acting in the face of uncertainty, defined by a tuple (S, A, {TP}, r): S is the set of possible states, A is the set of available actions, {TP} gives the transition probabilities to each state when action a is taken in state s, and r is the reward function. The process evolves as follows. At an initial state s(0), the agent takes an action a(0); the system then transits to the next state s(1) according to the transition probabilities {TP_s(0)a(0)}, and the agent receives a reward r(0). As the process continues, a state sequence s(0), s(1), s(2), . . . is generated. The agent in an MDP aims to maximize a discounted accumulative reward [22].
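For illustration, the MDP formalism can be exercised by value iteration on a toy two-state, two-action MDP with known transition probabilities (all numbers here are made up; the AMF problem itself is model-free, which is why DRL is adopted later):

```python
import numpy as np

# Toy MDP: TP[s, a, s'] are transition probabilities, r[s, a] rewards.
TP = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

# Value iteration: repeated Bellman optimality backups.
V = np.zeros(2)
for _ in range(500):
    V = np.max(r + gamma * TP @ V, axis=1)
```

After enough iterations V is (numerically) a fixed point of the Bellman optimality operator, i.e. the optimal state-value function of this toy MDP.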

The formulation of AMF optimal placement
In this subsection, the AMF placement is formulated as an MDP in the heterogeneous RAN. The reward function, called the network utility, is defined based on the AMF instance cost, the average waiting time of mobility management requests on AMF and the total communication cost to relay mobility management requests. The modelling is described as follows.
At a given placement decision time step k, let U(k) denote the binary serving-indicator set between UEs and BSs, namely U(k) = {u_i,j(k), 1 ≤ i ≤ N_UE, 1 ≤ j ≤ N_BS}, where u_i,j(k) = 1 if UE i is in the coverage area of BS j, and u_i,j(k) = 0 otherwise.
Let P(k) denote the AMF placement on BSs at time step k, P(k) = {p_j(k), 1 ≤ j ≤ N_BS}. If p_j(k) = 1, one AMF server with c active instances is placed on BS j; otherwise, p_j(k) = 0. p_j(k) can be modelled as a two-state Markov chain [26], and the transition from p_j(k) to p_j(k+1) depends on the system action at time step k. For simplicity, let λ_Bj(k) represent the mobility management request arrival rate at BS j at time step k, with λ_Bj(k) = Σ_{i=1}^{N_UE} u_i,j(k) λ_i(k). The change of λ_Bj(k) is modeled as a Markov process, and λ_Bj(k) is computed at the beginning of every time step; the state transition from λ_Bj(k) to λ_Bj(k+1) depends on the distribution of users and the arrival rates of mobility events at time step k+1.
According to the assumption in the AMF queuing model, the mobility management requests of each BS are processed by its closest active AMF server, so the aggregate arrival rate λ_Am(k) at the m-th AMF server is the sum of the rates λ_Bj(k) of the BSs it serves. Based on (4), the average waiting time of mobility management requests on the m-th AMF server is computed as (5):

τ_w,m(k) = τ_q(λ_Am(k)) + 1/µ.    (5)

Thus, the average delay of each mobility management request over all AMF servers is obtained as (6):

Delay(k) = Σ_m λ_Am(k) τ_w,m(k) / Σ_m λ_Am(k).    (6)

The average number of hops among BSs for each UE due to AMF placement is obtained as (7):

Hops(k) = (1/N_UE) Σ_{i=1}^{N_UE} Σ_{j=1}^{N_BS} u_i,j(k) H_BSj,AMF(j),    (7)

where H_BSj,AMF(j) is the number of wired hops from BS j to its closest active AMF server. Considering the cost of AMF, which can be modeled by the total number of AMF instances in the network [13], the cost of AMF instances is obtained as (8):

Cost_AMF(k) = c Σ_{j=1}^{N_BS} p_j(k),    (8)

where c is the number of instances in one AMF server.
Let Network_Utility denote the total utility for AMF placement, which includes the average hops to relay the requests for each UE, the cost of the number of AMF instances and the average delay of requests on AMF; the normalized network utility is obtained as (9):

Network_Utility(k) = w1 · Delay(k)/D_max + w2 · Hops(k)/H_max + w3 · Cost_AMF(k)/A_max,    (9)

where w1, w2 and w3 are weight coefficients that can be configured by the network operator, and D_max, H_max and A_max are the maximum values of Delay, Hops and Cost_AMF, respectively.
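The weighted normalization can be sketched as follows; the equal weights and the maxima D_max, H_max and A_max below are placeholder values, since the paper leaves their configuration to the network operator:

```python
def network_utility(delay, hops, cost, w=(1/3, 1/3, 1/3),
                    d_max=1.0, h_max=6, a_max=60):
    """Normalized network utility of Eq. (9); weights and maxima are
    illustrative placeholders, to be configured by the operator."""
    w1, w2, w3 = w
    return w1 * delay / d_max + w2 * hops / h_max + w3 * cost / a_max
```

By construction the utility is 0 when all three terms are 0 and 1 when each term reaches its configured maximum (with weights summing to 1).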
Given a decision time step k, the placement of AMF is an optimization problem that minimizes the Network_Utility defined in (9), formulated as (10):

min_{P(k)} Network_Utility(k)
s.t. C1: p_j(k) ∈ {0, 1}, 1 ≤ j ≤ N_BS,
     C2: Σ_j p_j(k) ≥ 1,
     C3: λ_Am(k) ≤ cλ_max for each active AMF server m,    (10)

where C1 indicates that each BS either hosts an AMF server or not, C2 means that at least one AMF server exists in the network, and C3 represents that the maximum request arrival rate of each AMF server cannot exceed cλ_max.
Based on the above analysis of λ_Bj(k) and p_j(k), the state, action and reward of the MDP are defined as follows:
• State: the state vector is s(k) = [λ_B(k), P(k)], where λ_B(k) is the vector of mobility management request arrival rates at every BS at time step k, and P(k) is the vector of AMF server placement states on every BS.
• Action: the action is a(k) = (α_j(k)), where α_j(k) ∈ {0, 1}, and α_j(k) = 1 means the AMF server at BS j is active, otherwise α_j(k) = 0. The size of the full action space is 2^N_BS; to reduce it, the agent controls the on-off state of only one AMF server in each time step [36].
• Reward: after an action is taken in each time step, the agent receives the immediate reward r(k) based on the network utility, defined as (11):

r(k) = −Network_Utility(k) if constraint C3 holds, and r(k) = −Penalty otherwise,    (11)

where Penalty is selected to be greater than the maximum value of Network_Utility attained when the arrival rate of each AMF server does not exceed cλ_max.

Assuming that the action is drawn from a stochastic policy π(a(k) | s(k)) = TP(a = a(k) | s = s(k)), in which the transition probability set TP maps each state-action pair (s(k), a(k)) to (s(k+1), a(k+1)), the optimization problem can be modeled as an MDP ⟨S, A, TP, r⟩, where S and A are the state and action spaces and r is the reward function. The objective of the MDP is to maximize the discounted accumulated reward starting from state s(k), defined as the state-value function in (12):

V_π,AMF(s(k)) = E[ Σ_{t=0}^{∞} γ^t r(k + t) | s(k) ],    (12)

where γ ∈ [0, 1] is a discount factor penalizing future rewards. Therefore, the optimal state-value function V_π*,AMF under the optimal policy π* is obtained as (13):

V_π*,AMF(s(k)) = max_π V_π,AMF(s(k)).    (13)

Hence, the value iteration algorithm could be adopted to compute V_π*,AMF; however, the transition probability set TP is not known, so it is a model-free problem.
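The state/action/reward interface above can be sketched as follows. The even split of traffic over active servers and the constant values are simplifying assumptions for illustration only, not the paper's routing rule:

```python
import numpy as np

N_BS = 12
C, LAM_MAX, PENALTY = 5, 9.0, 10.0  # instances per server, per-instance limit

def step_placement(P, action):
    """Toggle the on-off state of one AMF server (the reduced action space):
    action j flips p_j and leaves the other servers unchanged."""
    P = P.copy()
    P[action] ^= 1
    return P

def reward(lam_B, P, utility):
    """Immediate reward of Eq. (11): negative network utility, replaced by a
    penalty when no server is active or an active server exceeds c*lam_max
    (traffic is split evenly over active servers in this sketch)."""
    active = int(P.sum())
    if active == 0:
        return -PENALTY
    if lam_B.sum() / active > C * LAM_MAX:
        return -PENALTY
    return -utility

state = np.concatenate([np.full(N_BS, 0.5), np.ones(N_BS)])  # s(k)=[lam_B, P]
```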
Therefore, the remaining task is to find a way of obtaining a good policy for the formulated MDP problem.

A DRL based solution to the AMF optimal placement
According to [22], reinforcement learning algorithms, especially Q-learning, are widely used to solve MDP problems when the state space, the explicit transition probabilities and the reward function are not available. Although Q-learning is widely adopted for network control in wireless networks without knowing the transition probabilities and reward function in advance, it has three limitations: (1) traditional Q-learning stores the Q-values in a tabular form; (2) to achieve the optimal policy, Q-learning usually needs to revisit each state-action pair infinitely often [22]; (3) the state for Q-learning is often manually defined [36,37]. These limitations make it impractical for Q-learning to solve optimization problems with large state and action spaces. DRL overcomes these problems and has the potential to achieve better performance for three reasons. Firstly, a DQN stores the learned Q-values in the form of connection weights between layers. Secondly, with the help of replay memory and the generalization capability of NNs, DRL achieves good performance with fewer interactions with complex environments. Thirdly, DRL avoids manual feature design by directly learning representations from high-dimensional raw network data with the DQN [36,37]. Considering these benefits, the NFV Orchestrator for AMF uses DRL to learn the control policy of the on-off states of AMF servers by interacting with the dynamic environment, so as to minimize the discounted accumulative network utility over K decision steps, i.e., to maximize the long-term reward. Based on the above analysis, a DRL-based approach to the AMF optimal placement problem (AMF-OP-DRL) is presented.
Based on RL, the Q function of the model is defined as (14):

Q_π,AMF(s(k), a(k)) = E[ r(k) + γ V_π,AMF(s(k+1)) | s(k), a(k) ].    (14)

Comparing (12) with (14), the optimal state-value function can be obtained by calculating the optimal Q function Q_π*,AMF(s(k), a(k)), which is obtained by the iterative formula in (15) [22]:

Q(s(k), a(k)) ← Q(s(k), a(k)) + α [ r(k) + γ max_{a'} Q(s(k+1), a') − Q(s(k), a(k)) ],    (15)

where α is the learning rate. Based on (15), the optimal policy can be obtained by computing, storing and updating the Q-values of the state-action pairs in a Q-table. Due to the large state space, however, it is difficult to obtain enough samples to traverse every state. To solve this problem, we resort to a DNN to approximate the Q function [37]. Based on the DNN, the Q function Q_π,AMF(s(k), a(k)) is represented as Q_AMF(s(k), a(k); θ), where θ is the set of weights.
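The tabular update in (15) fits in a few lines of Python; repeatedly applying it with a fixed reward r and a self-loop converges to the fixed point r/(1 − γ), which is a handy way to check the implementation:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, Eq. (15):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```

For a single state with reward 1 per step and γ = 0.9, the Q-value converges to 1/(1 − 0.9) = 10.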
Following the neural network structure proposed in [37], a replay memory and a target DQN are also adopted: the replay memory breaks the correlations between samples, while the parameters of the target network are fixed and periodically updated, which stabilizes the training process and accelerates convergence. Fig. 3 shows the training process of AMF placement.
The goal of the DRL agent is to learn the placement policy of AMF by interacting with the environment so as to minimize the discounted accumulated network utility over K decision steps, which can be represented by (16):

max_π E[ Σ_{k=0}^{K−1} γ^k r(k) ].    (16)

The DRL model is trained by iteratively minimizing the loss function at each training step, which can be denoted as (17):

L(θ) = E_{(s,a,r,s′)∼MEM} [ ( r + γ max_{a′} Q_AMF(s′, a′; θ−) − Q_AMF(s, a; θ) )^2 ],    (17)

where θ are the parameters of the evaluation network, θ− are the parameters of the target network, and MEM refers to the replay memory. The detailed procedure of the proposed AMF-OP-DRL algorithm is shown in Algorithm 1.
9: Select an action a(k) randomly with probability ϵ, or a(k) = arg max_{a∈A} Q_AMF(s(k), a; θ) with probability 1 − ϵ.
10: According to the executed action a(k) and the user movement in the network, the DRL agent observes the reward r(k) and the next state s(k+1).

11:
Store the transition (s (k) , a (k) , r (k) , s (k + 1)) into the replay memory M EM . 12: Sample a mini-batch transitions randomly from M EM and perform a gradient update on the loss function defined in (15) with respect to θ. 13: Reset θ − = θ every COU N T S steps. 14: end for 15: end for 16: Output : the placement policy of AMF: P (k + 1).
Where COU N T S is the parameter for epochs, and θ − is reset by θ for every COU N T S steps.
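The replay-memory and target-network bookkeeping of Algorithm 1 (steps 11-13) can be sketched as below. The capacity MEM_SIZE, batch size BATCH, the dict-based stand-in for the network parameters, and the stubbed gradient update are all illustrative assumptions; COUNTS plays the role of the target-network reset period from the text.

```python
import random
from collections import deque

# Illustrative sketch of the replay-memory / target-network machinery.
# MEM_SIZE, BATCH and COUNTS are assumed hyperparameters, not from the paper.
MEM_SIZE, BATCH, COUNTS = 10_000, 32, 100

memory = deque(maxlen=MEM_SIZE)   # replay memory MEM
theta = {"w": 0.0}                # evaluation-network parameters (stand-in)
theta_target = dict(theta)        # target-network parameters theta_minus

def step(k, transition):
    """Store one transition, sample a mini-batch, periodically sync the target net."""
    memory.append(transition)                  # step 11: store (s, a, r, s')
    if len(memory) >= BATCH:
        batch = random.sample(memory, BATCH)   # step 12: sample a mini-batch
        # ... the gradient update on the loss w.r.t. theta would go here ...
    if k % COUNTS == 0:                        # step 13: theta_minus <- theta
        theta_target.update(theta)
```

Decorrelated mini-batches plus a slowly moving target keep the bootstrapped targets in (17) stable, which is why both mechanisms are retained from [37].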

Results and discussion
In this section, the performance of the AMF-OP-DRL algorithm is evaluated via simulation and compared with several baseline algorithms; the simulation parameters are given in 5.1 and the performance evaluation in 5.2.

Simulation parameters
In the simulation, 12 BSs are distributed over a rectangular area of size 560 m × 390 m in a 4 × 3 grid. The coverage area of each BS is 140 m × 130 m [33].

Mobility management related parameter settings
There are 12000 UEs moving in the area. The user mobility pattern follows the Random Walk (RW) model, in which each user chooses a random direction from [0, 2π] and a speed in [v_min, v_max]. In the simulation, the direction is simplified to four directions: at each time step, every UE moves in one of the four directions with equal probability [38]. The user speed ranges between [1, 3] m/s. The arrival of mobility management requests follows a Poisson distribution: the arrival rate of HR is 0.0012 procedures/s/user, that of SR is 0.0045 procedures/s/user and that of SRR is 0.0045 procedures/s/user [31]. Each AMF server is equipped with a fixed number of AMF instances.
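The mobility and traffic model above can be sketched as follows; the helper names and the one-second time step are our own, while the area size, the four-direction walk, the [1, 3] m/s speed range and the per-user request rates are taken from the text.

```python
import random

# Hedged sketch of the simulated mobility/traffic model: a four-direction
# random walk per UE plus Poisson request arrivals. Helper names are ours.
AREA_W, AREA_H = 560, 390                              # simulation area in metres
RATE_HR, RATE_SR, RATE_SRR = 0.0012, 0.0045, 0.0045    # procedures/s/user

def move(x, y, dt=1.0):
    """Move one UE in one of four directions with equal probability, clamped to the area."""
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    step = random.uniform(1, 3) * dt                   # speed drawn from [1, 3] m/s
    return (min(max(x + dx * step, 0), AREA_W),
            min(max(y + dy * step, 0), AREA_H))

def expected_requests(n_users, dt=1.0):
    """Mean number of mobility management requests generated in dt seconds."""
    return n_users * (RATE_HR + RATE_SR + RATE_SRR) * dt
```

With 12000 UEs, the aggregate mean arrival rate is 12000 × 0.0102 ≈ 122.4 procedures/s across the whole area, which is the load the placed AMF servers must absorb.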

Parameter settings of the AMF M/M/c queuing system
According to the modeling of the AMF M/M/c queuing system, the queuing system is stable when Sa < 1. Based on formula (4), the impact of c and µ on the average waiting time is presented in Fig. 4 and Fig. 5, respectively. The figures indicate that as λ changes, the value of Sa differs depending on c and µ. Based on the constraint on the average waiting time of requests, the parameters c, µ and λmax can be obtained.
In the simulation, we choose µ = 10 procedures/s, c = 5 and λmax = 9 procedures/s, so that Sa = 0.9. That is to say, whenever the arrival rate exceeds 45 procedures/s, the AMF server cannot process the mobility management requests beyond its processing capacity and the system is saturated.
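The waiting-time constraint behind this parameter choice can be checked with the standard M/M/c (Erlang-C) mean-waiting-time formula; this is our reading of (4) and should be treated as an assumption about its exact form.

```python
from math import factorial

# Hedged sketch of the M/M/c mean waiting time used to pick c, mu and
# lambda_max. This is the textbook Erlang-C formula, assumed to match (4).
def erlang_c(c, a):
    """Probability that an arriving request must wait; offered load a = lam / mu."""
    top = a**c / factorial(c) * (c / (c - a))
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(lam, mu, c):
    """Average waiting time in queue W_q for a stable M/M/c system (lam < c * mu)."""
    a = lam / mu
    return erlang_c(c, a) / (c * mu - lam)

saturation = 45 / (5 * 10)   # Sa = lam / (c * mu) = 0.9 at the chosen operating point
```

At the chosen operating point (λ = 45, µ = 10, c = 5) the system is stable but heavily loaded, which is consistent with the text's saturation threshold of 45 procedures/s per AMF server.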
The cost of one AMF instance is set to 1, and w1, w2 and w3 are set to 1, 1 and 5, respectively. According to (10), in order to ensure that the reward is small enough when the placement decision is overloaded, we choose the Penalty threshold at the point where the arrival rate of each AMF server reaches cλmax; namely, Penalty is set to 7 in the simulation. It can also be set larger than this threshold.
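The penalized utility can be sketched as below; the exact functional form of (10) is given in the paper, so the linear combination and the per-server overload test here are illustrative assumptions built from the stated weights and thresholds.

```python
# Hedged sketch of the weighted network utility with overload penalty (10).
# The linear combination below is illustrative; weights and Penalty follow the text.
W1, W2, W3 = 1, 1, 5
PENALTY = 7
C, LAMBDA_MAX = 5, 9          # c instances per server, lambda_max per instance

def utility(delay, hops, amf_cost, arrival_rates):
    """Weighted utility, replaced by the Penalty when any AMF server is overloaded."""
    if any(lam > C * LAMBDA_MAX for lam in arrival_rates):
        return PENALTY        # arrival rate above c * lambda_max = 45 procedures/s
    return W1 * delay + W2 * hops + W3 * amf_cost
```

Because the agent minimizes utility, returning the large Penalty value whenever a server exceeds cλmax steers the learned policy away from overloaded placements.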

Parameter settings of DQN
The DQN consists of an input layer, one hidden layer and an output layer, with 24 neurons in the input layer, 16 in the hidden layer and 13 in the output layer [39]. Table 2 shows the parameter settings of the DQN.
The detailed layout architecture of the DNN in the simulation is given in Fig. 6.
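The reported 24-16-13 layout can be sketched as a plain forward pass; the random weight initialization and the ReLU activation below are our assumptions, since the text only specifies the layer sizes.

```python
import numpy as np

# Illustrative forward pass matching the reported 24-16-13 DQN layout.
# Weight initialization and ReLU activation are assumptions, not from [39].
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(24, 16))   # input -> hidden
W2 = rng.normal(scale=0.1, size=(16, 13))   # hidden -> output

def q_values(state):
    """Map a 24-dimensional state vector to 13 Q-values (one per placement action)."""
    h = np.maximum(state @ W1, 0.0)          # hidden layer, ReLU
    return h @ W2                            # output layer, linear Q-values
```

One output neuron per action lets a single forward pass score every candidate placement at once, which is what makes the arg-max action selection in Algorithm 1 cheap.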

Performance evaluation of the AMF-OP-DRL algorithm
In this subsection, the performance of the AMF-OP-DRL algorithm is evaluated, including the convergence performance of AMF-OP-DRL, the impact of the arrival rate on the number of AMF servers, the impact of the weight coefficients on Delay, Hops and AMF cost, as well as the performance comparison with baseline algorithms.

Convergence performance of AMF-OP-DRL
The convergence performance of the AMF-OP-DRL algorithm based on DQN is presented in Fig. 7 for batch sizes of 32, 64 and 128. The performance is best when the batch size is 32: a small batch size introduces more randomness and requires a longer training time to converge, whereas a large batch size tends to settle in a locally optimal result.
The impact of the DQN learning rate on convergence performance is shown in Fig. 8, with the learning rate set to 0.01, 0.001 and 0.0001, respectively. The DQN achieves the best performance when the learning rate is 0.01.
Since the complexity of the DRL algorithm is related to the number of BSs in the network, the convergence performance is also investigated for a larger network: the number of BSs is set to 56 and the number of UEs to 12000. The arrival rates of HR, SR and SRR are the same as in 5.1.1. The number of neurons in the input layer of the DQN is adjusted to 112, the number of neurons in the output layer is set to 57, and there are two hidden layers with 60 neurons each. The hyperparameters of the DQN are tuned according to the convergence behavior of the algorithm. The convergence performance is shown in Fig. 9: the algorithm begins to converge at around epoch 700, which indicates that the AMF-OP-DRL algorithm still performs well when the number of BSs is large.

The impact of request arrival rate to the number of AMF servers
The impact of the request arrival rate on the number of AMF servers is evaluated in Fig. 10, Fig. 11 and Fig. 12, respectively, when w1 : w2 : w3 = 1 : 1 : 5.

Fig. 10 shows the number of AMF servers under different HR arrival rates and different numbers of VNF instances on each AMF server, with the HR arrival rate varying from 0 to 0.007 procedures/s/user, the arrival rates of SR and SRR fixed at 0.0045 procedures/s/user, and the number of AMF instances changing from 4 to 6. The number of AMF servers increases with the HR arrival rate, and the optimal number of AMF servers differs across configurations. With 4 or 5 instances, the number of AMF servers is more sensitive to the increase of the HR arrival rate than with 6 instances, because the processing capacity of an AMF server with 4 or 5 instances is lower than that of one with 6 instances. There is a crossing point between the curves for 5 and 6 instances when the HR arrival rate is 0.006 procedures/s/user, which depends on the processing capacity of the AMF servers as well as the weights of the optimization objectives. In the case of 6 instances per AMF server, when the HR arrival rate increases from 0.005 to 0.006, the agent obtains the optimal network utility under the increased request load at the cost of adding one AMF server.

Fig. 11 shows the impact of the SR arrival rate on the number of AMF servers, with the arrival rate of HR set to 0.0012 procedures/s/user and that of SRR to 0.0045 procedures/s/user. Fig. 12 shows the impact of the SRR arrival rate when the arrival rate of HR is 0.0012 procedures/s/user and that of SR is 0.0045 procedures/s/user. As shown in Fig. 11 and Fig. 12, when the arrival rate changes from 0 to 0.004 procedures/s/user, the number of AMF servers with 5 and 6 instances is the same.
Since the load of the AMF servers in both cases is not saturated, the number of AMF servers stays the same as the arrival rate varies from 0.002 procedures/s/user to 0.004 procedures/s/user. When the arrival rate reaches 0.005 procedures/s/user, the saturation rate of the AMF servers with 5 instances becomes larger than that of the servers with 6 instances, and the number of AMF servers diverges.

The impact of weight coefficients on Delay, Hops and AMF cost
The impact of the weight coefficients on the AMF cost is shown in Fig. 13, in which each point is obtained as the average value over the training steps after the algorithm converges. The AMF cost is lowest when w1 : w2 : w3 = 1 : 1 : 5: since the proportion of the AMF cost in the network utility function is the highest, the DRL agent learns to place the least number of AMF servers under the constraints. When w1 : w2 : w3 = 1 : 5 : 1, the DRL agent reduces the wired hops to AMF as much as possible, and the AMF cost changes only slightly as the HR arrival rate increases; that is, the policy prefers to place an AMF server on every BS, so that each user's mobility management requests are processed by the AMF on its serving BS. When w1 : w2 : w3 = 5 : 1 : 1, the AMF cost lies between the two solutions above, because the average delay of requests at AMF is affected by the arrival rate and the serving rate of the AMF instances, so the number of AMF servers must strike a balance. Fig. 14 shows the impact of the weight coefficients on the average wired hops. When w1 : w2 : w3 = 1 : 1 : 5 and w1 : w2 : w3 = 5 : 1 : 1, the average wired hops decreases as the HR arrival rate increases, since the number of AMF servers grows to cope with the increasing request rate. The wired hops change only slightly when w1 : w2 : w3 = 1 : 5 : 1, since the number of AMF servers is fixed after several training steps.
The impact of the weight coefficients on the average waiting time of mobility management requests is presented in Fig. 15. When w1 : w2 : w3 = 1 : 1 : 5, the curve fluctuates because the DRL agent adjusts the number of AMF servers as the HR arrival rate increases; since the number of AMF instances on each AMF server is fixed, the fluctuation depends on the number of AMF instances and their serving capacity. The average delay of mobility management requests at AMF increases only slightly when w1 : w2 : w3 = 5 : 1 : 1 and w1 : w2 : w3 = 1 : 5 : 1, because the increase of the request arrival rate is small compared with the serving rate of each AMF server.
The analysis of Fig. 13, Fig. 14 and Fig. 15 indicates that the AMF placement strategy is affected by the weight coefficients of the network utility. Therefore, the weight coefficients can be adjusted according to different service requirements on mobility management. For example, the weights of the wired hops and the average delay of requests at the AMF server can be set higher in the URLLC service scenario, while in an mMTC scenario the weight of the AMF cost can be set higher due to the massive number of UEs distributed in the radio access network.

Performance comparison with baseline algorithms
In this subsection, the performance of the AMF-OP-DRL algorithm is compared with baseline algorithms. To the best of our knowledge, there is no existing work on AMF optimal placement in a heterogeneous radio access network. Therefore, the proposed AMF-OP-DRL algorithm is compared with the following baseline algorithms:
• AMF-OP-Greedy: It traverses all combinations of N_AMF BSs and selects the combination with the lowest accumulated network utility, where N_AMF is the number of BSs selected to place AMF servers, which is learned by the DRL agent at each step.
• AMF-OP-HBP: The AMF servers are placed on the BSs ranked top N_AMF in arrival rate, and the network utility is calculated according to the resulting AMF placement policy.
• AMF-OP-Random: The AMF servers are randomly placed on N_AMF BSs, and the network utility is obtained from the selected AMF placement policy.
• AMF-OP-KAB: The algorithm is based on the K-armed bandit algorithm. In every step, there are N_AMF actions to choose from; each action corresponds to whether an AMF server is placed on a BS and has an average reward value. The agent successively takes an action and then receives a random reward, aiming to maximize the accumulated reward [22]. For every training step, the AMF placement policy is obtained by the ϵ-greedy strategy: the algorithm randomly chooses an action when a random number is less than the probability ϵ, and otherwise selects the action with the maximum average reward. ϵ is set to 0.1 in the simulation.
The comparison result is shown in Fig. 16, in which each point is the average value over 100 epochs. It reveals that the accumulated network utility of AMF-OP-HBP and that of AMF-OP-Random are more than twice that of the AMF-OP-DRL algorithm.
AMF-OP-HBP is even worse than AMF-OP-Random, because AMF-OP-HBP chooses the BSs with the highest arrival rates, which brings a greater probability of overload.
AMF-OP-KAB performs better than AMF-OP-HBP and AMF-OP-Random, but worse than AMF-OP-DRL: the number of state-action pairs in AMF-OP-KAB is large, so its average reward estimates of the actions do not fit as well as those of the DRL. In AMF-OP-DRL, since a DNN is used to approximate the Q function, the performance of the DRL-based control algorithm comes much closer to that of AMF-OP-Greedy, while the greedy algorithm traverses all possible combinations of AMF placement and thus incurs a high time complexity when the number of BSs is large.
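The ϵ-greedy selection and incremental average-reward update of the AMF-OP-KAB baseline, as described above, can be sketched as follows; the function names and the stubbed reward source are our own.

```python
import random

# Hedged sketch of the AMF-OP-KAB baseline: epsilon-greedy selection over
# running average rewards. Function names are illustrative; epsilon = 0.1
# follows the text.
EPSILON = 0.1

def select_action(avg_reward):
    """Pick a random arm with probability epsilon, otherwise the best arm so far."""
    if random.random() < EPSILON:
        return random.randrange(len(avg_reward))
    return max(range(len(avg_reward)), key=lambda a: avg_reward[a])

def update(avg_reward, counts, a, r):
    """Incremental running-average update for arm a after observing reward r."""
    counts[a] += 1
    avg_reward[a] += (r - avg_reward[a]) / counts[a]
```

Because each arm keeps only a scalar average and ignores the system state, this baseline cannot capture the state-dependent value that the DQN approximates, which matches the gap observed in Fig. 16.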
From the above analysis, AMF-OP-DRL algorithm outperforms the four baseline algorithms.
Regarding the complexity of the proposed algorithm, if there are n BSs in the system, then for the offline-trained neural network model the computational complexity of AMF-OP-DRL is O(n^2), while the computational complexity of AMF-OP-Greedy is O(C(n, m)), where m is the number of AMF servers to be placed, since the greedy algorithm enumerates all combinations of m out of n BSs.
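The gap between the two complexities can be illustrated by counting candidate evaluations; the combinatorial count C(n, m) below is our reading of "traverses all combinations", and the helper names are ours.

```python
from math import comb

# Illustrative comparison of evaluation counts: O(n^2) inference for the
# offline-trained AMF-OP-DRL model versus exhaustive enumeration in
# AMF-OP-Greedy, read here as C(n, m) candidate placements.
def greedy_evals(n, m):
    """Number of candidate placements the greedy baseline must score."""
    return comb(n, m)

def drl_evals(n):
    """Order-n^2 inference cost of the trained model, per the text."""
    return n * n
```

For the 12-BS scenario with m = 4 the greedy baseline already scores 495 placements versus 144 for the trained model, and for the 56-BS scenario the combinatorial count dwarfs n^2, which is why the greedy search does not scale.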

Conclusion
In this paper, a DRL based algorithm is proposed to optimize the placement of AMF in a heterogeneous network so as to minimize the long-term network utility, including the average delay of mobility management requests, the average wired hops to relay the requests and the cost of AMF instances. Time-varying features, including user mobility and the arrival rate of user mobility management requests, are considered in the model of the DRL environment. The convergence performance of the algorithm is evaluated under various batch sizes, learning rates and numbers of BSs. The impact of the HR/SR/SRR arrival rates and the weight coefficients on the performance of AMF-OP-DRL is investigated. The performance of AMF-OP-DRL is also compared with four baselines: AMF-OP-Greedy, AMF-OP-HBP, AMF-OP-Random and AMF-OP-KAB. Simulation results show that it outperforms those four baseline algorithms. In future work, the joint optimization of auto-scaling and optimal placement of AMF will be addressed.