Optimization of Task Offloading Strategy for Mobile Edge Computing Based on Multi-Agent Deep Reinforcement Learning

Combined with wireless power transfer (WPT) technology, mobile edge computing (MEC) can provide a continuous energy supply and computing resources for mobile devices, extending their battery life and broadening the business scenarios they can support. This article first designs an MEC model in which mobile devices move randomly and hybrid access points (HAPs) provide both data transmission and energy transmission. On this basis, the selection of the target server and the amount of data to offload are taken as the learning objectives, and a task offloading strategy based on multi-agent deep reinforcement learning is constructed. The MADDPG and SAC algorithms are then combined to address the instability of the multi-agent environment and the difficulty of convergence. The experimental results show that the improved algorithm based on MADDPG and SAC has good stability and convergence. Compared with other algorithms, it achieves good results in energy consumption, delay and task failure rate.


I. INTRODUCTION
With the rapid development and widespread adoption of Internet of Things (IoT) technology, cloud computing can no longer meet demand in business scenarios that collect very large amounts of data and require immediate, continuous interaction, such as online games, real-time streaming media, and augmented reality [1]. To solve these problems, MEC builds an open platform for data collection, processing and analysis at the edge of the network, so that mobile devices can actively offload computing tasks to edge servers, thereby reducing service response time, extending device battery life, and protecting data security and user privacy [2]. In addition, with the large-scale deployment of 5G and the continuous development of mobile communication systems, most energy-intensive applications, including video streaming and AR/VR transmission, now run on battery-powered mobile devices, which leads to huge energy consumption and interruption of user services. To meet the continuous service needs of mobile devices, WPT technology charges mobile devices and IoT devices wirelessly, using the principle that radio frequency (RF) signals can carry energy in the far field. It is a flexible, controllable, on-demand and low-cost solution that offers a stable power supply, short response time, simple installation and environmental friendliness [3]. To satisfy both the information download requests and the wireless charging requests of mobile devices, a HAP can perform wireless information transmission (WIT) and WPT in the same frequency spectrum, based on the broadcast characteristics of RF signals and the wireless channel.
The key difference between a HAP and a traditional access point (AP) is that the former provides both WPT and WIT services, which fundamentally addresses the low computing ability and short battery life of mobile devices and makes business scenarios more diversified [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaofei Wang.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although edge servers can relieve the computing pressure and battery pressure on mobile devices through HAPs, WIT and WPT share the same frequency spectrum, so only one of the two operations can be performed at any given time. If all mobile tasks are offloaded to the edge server for processing, the amount of transmitted data becomes too large, the WIT service time becomes too long and the WPT service time too short, which eventually causes the mobile device to run out of power and interrupts user services. Likewise, if a large number of mobile tasks are offloaded to a small number of edge servers, the limited computing performance and network bandwidth of those servers cause serious task congestion and significant service delays [5]. Therefore, to make full use of MEC computing resources and ensure that mobile devices can provide continuous services, a strategy is needed that determines the amount of offloaded data and the target server for each mobile task, so that edge servers and mobile devices process tasks collaboratively and improve the user's quality of service. In this article, we propose a strategy that uses multi-agent deep reinforcement learning to decide how much of each task to offload and where to offload it in MEC. It comprehensively considers the computing performance, signal range and geographic location of the edge servers, together with the computing performance, remaining battery capacity, energy transmission, location information and application data amount of the mobile devices.
The strategy effectively reduces the energy consumption, delay and task failure rate of mobile devices and edge servers, and improves the service quality of the entire MEC platform. The main contributions of this article are threefold:
1) An MEC model is constructed in which mobile devices move randomly and HAP nodes provide both data transmission and energy transmission. Since a HAP node can perform only one of WIT and WPT at any time, the running time of each within a unit of time must be designed reasonably to ensure the quality of user service. In this article, the time required for data transmission is calculated from the amount of offloaded data and the transmission rate, which depends on the location of the mobile device. The remaining time is used as the energy transmission time to charge the device, so that the total power of the device suffices to complete the computing and transmission tasks in the unit of time.
2) Based on the actual application scenario of MEC, the target server selection and the amount of data to be offloaded are taken as the learning objectives, and a task offloading strategy based on multi-agent deep reinforcement learning is constructed. This article combines the MADDPG algorithm and the SAC algorithm to address instability and convergence difficulty in the multi-agent environment. The MADDPG algorithm optimizes the strategy of each agent through centralized training and distributed execution, which reduces the variance of the algorithm; the SAC algorithm introduces maximum entropy into the reward function so that the agent explores more possible actions, which enhances its exploration ability and robustness. Both algorithms are used here to handle the continuous action space. Since target server selection is a discrete problem, this article uses the Gumbel-Softmax reparameterization trick to handle discrete actions without losing gradient information.
3) The energy consumption, cost, delay and task failure rate of several task offloading strategies are compared comprehensively to analyze their advantages and disadvantages. Experiments show that the improved algorithm based on MADDPG and SAC has good stability and convergence. Compared with other algorithms, it achieves good results in energy consumption, delay and task failure rate when the number of mobile devices is large.
The remainder of this article is organized as follows. Section II discusses related work. Section III describes the MEC model and evaluation metrics such as energy consumption, delay, cost and task failure rate. Section IV presents the proposed task offloading algorithm. Section V describes the experimental setup and performance evaluation. Finally, Section VI concludes the paper and gives directions for future work.

II. RELATED WORK
With the popularization of 5G technology, the task offloading problem in MEC has received extensive attention and has been widely studied in recent years. For instance, reference [6] proposed a multi-user, multi-MEC-server system based on Orthogonal Frequency-Division Multiple Access (OFDMA) to investigate task offloading strategies and wireless resource allocation for latency-critical applications. Reference [7] modeled the MEC architecture mathematically and optimized the MEC computation offloading strategy to decide when to offload a user's computing tasks to the MEC server; it verified the effectiveness of the strategy on a face recognition application by measuring the round-trip time. Compared with local execution on mobile devices, the strategy greatly reduces service delay and saves device energy. Reference [8] studied the multi-user service delay problem in the MEC offloading scenario and proposed a partial computation offloading model that optimizes the allocation of communication and computing resources through strategies such as optimal data segmentation. Compared with purely local execution and purely edge cloud execution, the proposed partial offloading strategy minimizes the delay of all devices, thereby improving the user's quality of experience. However, when the heuristic algorithms used in the above studies are applied to large-scale task offloading problems, they take too long to generate decisions because of the high dimensionality of the problem, and they can only find approximately optimal solutions. They therefore cannot meet the expected requirements in practical use.
Reinforcement learning is also widely used for the MEC task offloading problem. Reference [9] proposed that mobile tasks can select among multiple base stations in the MEC for offloading according to demand and, on this basis, studied a task offloading strategy based on a deep reinforcement learning algorithm to maximize long-term performance. Reference [10] used RL-based resource management algorithms to optimize the amount of data processed by cloud servers and edge servers, which reduces service delay and operating costs. Reference [11] used a deep reinforcement learning algorithm to decide whether to offload tasks to an edge server that serves multiple users, so as to reduce energy consumption and average computing delay. It further optimized the edge server selection strategy by adding the wireless transmission rate, charging power and battery power to the learning process, so that the entire computing cluster provides better service performance.
Based on the above studies, deep reinforcement learning, which combines the advantages of deep learning and reinforcement learning, is self-learning and self-adaptive compared with heuristic algorithms. It needs fewer parameters and has better global search capability, so it can solve more complex, higher-dimensional and more realistic tasks [12]. However, current research on MEC task offloading based on deep reinforcement learning uses a single agent and mainly focuses on one learning target, such as the amount of data to offload or the selection of the edge server. Work on the amount of offloaded data considers only how much data is processed by the local device and by the remote server cluster, and ignores the impact of the performance and location of different devices. Work on target server selection considers only whether and where the data is offloaded, and ignores the fact that application data can be split for collaborative computing to save computing and transmission time. Therefore, this article takes the target server selection and the amount of data to be offloaded as joint learning goals of a multi-agent deep reinforcement learning algorithm, so as to improve the resource utilization of the entire MEC platform [13].
In addition, to ensure that the multi-agent algorithm converges and generalizes well on the MEC partial offloading problem, this article combines the advantages of the SAC algorithm and the MADDPG algorithm to form a partial offloading strategy:
1) Deep reinforcement learning algorithms are sensitive to hyperparameters and easily fall into local optima. To address this, the SAC algorithm adds maximum entropy to the reward function, so that the actor explores as many actions as possible while still completing the task and learns multiple approximately optimal trajectories. The SAC algorithm is therefore well suited to learning new tasks and to initializing more complex tasks. It also has stronger exploration ability and robustness, and mitigates unstable convergence.
2) In a multi-agent environment, the strategy of each agent is constantly learning and changing, which makes the environment unstable from the point of view of a single agent, so the variance of traditional deep reinforcement learning algorithms grows with the number of agents. The MADDPG algorithm addresses this instability through centralized training and distributed execution. When the MADDPG algorithm is updated, the overall optimization takes the training strategy of every agent into account, which improves the stability and robustness of the algorithm. In addition, the MADDPG algorithm allows each agent to have its own reward function, so it can be used for cooperative or adversarial problems.

III. SYSTEM MODEL

A. MEC MODEL
To bring computing and cache resources to the network edge and close to users, edge servers are usually deployed at HAPs so that mobile devices can obtain WIT and WPT services [14]. As shown in Figure 1, the entire MEC system is composed of HAP nodes, edge servers and mobile devices. Each HAP has a certain signal coverage area, and mobile devices within this range can offload tasks to an edge server for computation and receive a certain amount of power. However, as a mobile device moves, its connection to the edge server becomes unstable as the relative distance grows, and once the device leaves the signal range the service is interrupted. In the entire MEC system, the HAP node performs both energy transmission and data transmission. In each unit of time it gives priority to energy transmission to ensure that the mobile device has a certain processing capability, and then provides data upload and download services so that the mobile device can offload tasks. As shown in Figure 2, the mobile device decides whether to offload a task according to the strategy. If it offloads, the mobile device first converts the HAP radio frequency signal into electrical energy and stores it in the battery at each time step; the battery then provides energy for data transmission and task computation. One part of the data is transmitted from the mobile device to the edge server for remote computation, and the edge server returns the result; the other part is computed directly on the mobile device. This makes full use of the computing resources of both edge servers and mobile devices, but the delay caused by data transmission and the energy consumed by computation and communication on the mobile device must be considered.
If it does not offload, the mobile device stores the converted electrical energy in the battery and uses it directly for local computing. Local computing uses only the computing resources of the mobile device, so only its computing delay and energy consumption need to be considered [15].

B. PROBLEM MODEL
In this article, all computing devices in the MEC system are represented by $(ES, MD)$. $ES$ is a set of $m$ edge servers $\{es_1, es_2, \ldots, es_i, \ldots, es_m\}$, where the signal range $SR_i$ and processing performance $EC_i$ of each edge server are set separately. $MD$ is a set of $n$ mobile devices $\{md_1, md_2, \ldots, md_j, \ldots, md_n\}$, where the processing performance of each mobile device is $MC_j$. The application running on device $md_j$ is denoted by $\{data_j, L^{max}_j\}$, where $data_j$ is the total amount of data to be processed by the application and $L^{max}_j$ is its maximum allowable completion time; $data_j$ is directly proportional to the complexity of the application. The data of each application is assumed to be partitionable at full granularity, that is, the application can be divided into subprograms of any size. All computing devices operate according to the set of time steps $T = \{1, 2, \ldots, t, \ldots\}$, and the amount of data to be processed by the application in each time step is $C$. In the current time step $t$, the MEC system needs to determine the target server $TS_t$ and the amount of offloaded data $\lambda_t C$ based on the location of each mobile device, its energy supply power, remaining power, remaining amount of application data, and the edge servers it can connect to [16]. The remaining data $(1-\lambda_t)C$ is processed on the mobile device, and the power this consumes cannot exceed the remaining power of the device; otherwise the application processing fails. The delay, energy consumption, cost, and charging capacity of mobile devices and edge servers in the current time step $t$ are modeled as follows.

1) DELAY MODEL
Assume that the CPU frequencies of the mobile device and the target server are $\eta^{local}_t$ and $\eta^{offload}_t$, respectively, and that the amounts of data processed locally and remotely in a unit of time are $(1-\lambda_t)C$ and $\lambda_t C$, respectively. The time required for mobile device $j$ to process data locally is:
$$\tau^{local}_{j,t} = \frac{(1-\lambda_t)C}{\eta^{local}_t} \tag{1}$$
The delay caused by data offloading mainly includes the data uploading delay and the data processing delay. The data uploading delay is determined by the transmission rate $TR$ between the edge server and the mobile device, which at time step $t$ is:
$$TR_t = w \log_2\left(1 + \frac{p_{tran}\, h_t^2\, d_t^{-\theta}}{N}\right) \tag{2}$$
where $w$ is the upload bandwidth, $p_{tran}$ is the transmission power of the mobile device, $d_t$ is the distance between the edge server and the mobile device, $\theta \geq 2$ is the path loss exponent, $h_t$ is the channel attenuation coefficient, and $N$ is the Gaussian noise power. The uploading delay is therefore:
$$\tau^{up}_t = \frac{\lambda_t C}{TR_t} \tag{3}$$
The data processing delay of the edge server is:
$$\tau^{exec}_{i,t} = \frac{\lambda_t C}{\eta^{offload}_t} \tag{4}$$
The time required for mobile device $j$ to offload data is:
$$\tau^{offload}_{j,t} = \tau^{up}_t + \tau^{exec}_{i,t} \tag{5}$$
Since local processing and offloading are performed simultaneously, the time required by mobile device $j$ to process the unit data in time step $t$ is:
$$H_{j,t} = \max\left(\tau^{local}_{j,t},\ \tau^{offload}_{j,t}\right) \tag{6}$$
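As a worked illustration of the delay terms in this subsection, the following Python sketch computes the per-step delay of one device. It is a minimal sketch under stated assumptions rather than the authors' implementation: the transmission rate is assumed to follow a Shannon-capacity form with received power p_tran * h^2 * d^(-theta), processing time is assumed to equal data amount divided by CPU frequency, and all function and parameter names are hypothetical.

```python
import math


def transmission_rate(w, p_tran, d, theta, h, noise):
    """Uplink rate, assuming a Shannon-capacity form (an assumption)."""
    snr = p_tran * (h ** 2) * d ** (-theta) / noise
    return w * math.log2(1.0 + snr)


def step_delay(lam, C, eta_local, eta_offload, rate):
    """Return (local time, offload time, step delay) for one device.

    lam is the offloaded fraction, C the data amount per time step.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    tau_local = (1.0 - lam) * C / eta_local   # local processing time
    tau_up = lam * C / rate                   # upload time
    tau_exec = lam * C / eta_offload          # edge processing time
    tau_offload = tau_up + tau_exec
    # Local processing and offloading run in parallel, so the slower
    # branch determines the delay of the time step.
    return tau_local, tau_offload, max(tau_local, tau_offload)
```

For example, with lam = 0.5, C = 100, eta_local = 50, eta_offload = 200 and rate = 10, the local branch takes 1 time unit and the offload branch 5.25, so the step delay is 5.25.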

2) ENERGY CONSUMPTION MODEL
Because mobile devices move around and cannot easily be recharged in time, whereas edge servers are deployed near base stations where the power supply is easy to manage, this article takes the energy consumption of mobile devices as the main research object. It consists of computing energy consumption and transmission energy consumption. The computing energy consumption of mobile device $j$ at time step $t$ is:
$$E^{local}_{j,t} = k\,(\eta^{local}_t)^2\,(1-\lambda_t)C \tag{7}$$
where $k$ is the energy consumption coefficient, which depends on the type of CPU. The data transmission energy consumption of mobile device $j$ at time step $t$ is:
$$E^{tran}_{j,t} = (p_0 + \alpha\, p_{tran})\,\tau^{up}_t \tag{8}$$
where $p_0$ is the fixed power consumption of the mobile device for communication, $\alpha$ is the signal power amplifier coefficient, $p_{tran}$ is the transmission power of the mobile device, and $\tau^{up}_t$ is the data transmission time. The total offload energy consumption of mobile device $j$ at time step $t$ is therefore:
$$E^{offload}_{j,t} = E^{local}_{j,t} + E^{tran}_{j,t} \tag{9}$$
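The energy terms of this subsection can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the quadratic dependence of computing energy on CPU frequency is a commonly used model that is assumed here, the transmission term is assumed to be the fixed power plus the amplified transmission power multiplied by the upload time, and all names are hypothetical.

```python
def local_energy(k, eta_local, lam, C):
    """Computing energy on the device (assumed form k * f^2 * data)."""
    return k * eta_local ** 2 * (1.0 - lam) * C


def transmission_energy(p0, alpha, p_tran, tau_up):
    """Upload energy: (fixed power + amplified transmit power) * time."""
    return (p0 + alpha * p_tran) * tau_up


def offload_energy(k, eta_local, lam, C, p0, alpha, p_tran, tau_up):
    """Total device energy in one time step: computing plus transmission."""
    return (local_energy(k, eta_local, lam, C)
            + transmission_energy(p0, alpha, p_tran, tau_up))
```

With the simulation values p0 = 0.4 W, alpha = 40 and p_tran = 0.1 W, the device draws 4.4 W while uploading, so a 2 s upload costs 8.8 J under this assumed form.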

3) COST MODEL
Users must pay a fee to use the computing resources provided by the edge server. This article uses a dynamic price model based on the remaining amount of computing resources: the fewer resources remain, the higher the resource price. Users therefore tend to choose lower-priced service nodes as the offloading server, which reduces user expenses while increasing resource utilization. Because the computing resources of the mobile device belong to the user, no payment to the operator is needed for local computing, so the cost model only accounts for the resources used on the edge server. The dynamic price model gives the cost $Cost_{i,t}$ of the $i$-th edge server in terms of $\tau^{exec}_{i,t}$, the computing time of the edge server; $pc_i$, the unit price of its computing resources; and $\gamma$, the ratio of computing resources currently occupied on the edge server.
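To make the pricing idea concrete, the sketch below shows one possible dynamic price rule in Python. The exact price function is not recoverable from the text, so the linear surcharge (1 + gamma) used here is purely a hypothetical choice that satisfies the stated property (the price rises as fewer resources remain); the function names are likewise illustrative.

```python
def dynamic_cost(tau_exec, unit_price, occupied_ratio):
    """Cost of using an edge server for tau_exec time units.

    The (1 + occupied_ratio) surcharge is a hypothetical price rule:
    a busier server (fewer remaining resources) is more expensive.
    """
    if not 0.0 <= occupied_ratio <= 1.0:
        raise ValueError("occupied_ratio must lie in [0, 1]")
    return tau_exec * unit_price * (1.0 + occupied_ratio)


def cheapest_server(tau_exec, unit_prices, occupied_ratios):
    """Index of the server with the lowest cost for the same workload."""
    costs = [dynamic_cost(tau_exec, pc, g)
             for pc, g in zip(unit_prices, occupied_ratios)]
    return min(range(len(costs)), key=costs.__getitem__)
```

Under this rule, two servers with the same unit price but occupancy 0.9 and 0.1 are priced differently, and a cost-sensitive user would pick the second one.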

4) ENERGY SUPPLY MODEL
Mobile devices in MEC can use an RF-DC converter to convert the RF signal into electrical energy and store it in the battery. The electrical energy collected per unit time decreases with the relative distance between the mobile device and the edge server. The energy conversion formula of mobile device $j$ is:
$$HB_{j,t} = \upsilon\, p_{tran}\, d_t^{-\theta}\, G \tag{11}$$
where $\upsilon \in (0, 1)$ is the energy conversion efficiency, $p_{tran}$ is the transmission power, $d_t$ is the distance between the edge server and the mobile device, $\theta \geq 2$ is the path loss exponent, and $G$ is the integrated channel gain between the edge server and the mobile device.
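The energy supply relation can be sketched in Python as below. The product form and the optional charging duration are assumptions made for illustration (the text only lists the symbols involved and states that the harvested energy falls with distance); the function and parameter names are hypothetical.

```python
def harvested_energy(upsilon, p_tran, d, theta, G, duration=1.0):
    """Energy harvested from the RF signal over `duration` time units.

    Assumed form: efficiency * power * gain * distance^(-theta) * time.
    """
    if not 0.0 < upsilon < 1.0:
        raise ValueError("upsilon must lie in (0, 1)")
    if d <= 0.0:
        raise ValueError("distance must be positive")
    return upsilon * p_tran * G * d ** (-theta) * duration
```

With upsilon = 0.8, p_tran = 0.1 and G = 20 from the simulation settings and theta = 2, doubling the distance from 1 to 2 reduces the harvested energy per unit time from 1.6 to 0.4 under this assumed form.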

IV. ALGORITHM DESIGN
Reinforcement learning is a sequential decision-making method that learns by trial and error in the target environment and adjusts its strategy according to feedback in order to maximize rewards. Despite its advantages, it scales poorly and is essentially limited to fairly low-dimensional problems, because reinforcement learning algorithms share the memory complexity, computational complexity, and sample complexity of other algorithms. To handle high-dimensional decision-making problems that are difficult for reinforcement learning, deep reinforcement learning combines the perceptual ability of deep learning with the decision-making ability of reinforcement learning, and uses deep neural networks as powerful function approximators to deal with high-dimensional state and action spaces. In this article, an MDP model of the MEC environment is built and combined with a multi-agent deep reinforcement learning algorithm to solve the edge server selection and data offloading decision problems [10].

A. MDP MODEL

1) STATE SPACE
To capture the characteristics of both mobile tasks and edge server resources in MEC, this article defines the state space of the $j$-th mobile device at time step $t$ as $S^j_t = (TR_{1,t}, \ldots, TR_{i,t}, \ldots, TR_{m,t}, U_{1,t}, \ldots, U_{i,t}, \ldots, U_{m,t}, RD_{j,t}, RB_{j,t}, HB_{j,t})$, where $TR_{i,t}$ is the transmission rate between the mobile device and the $i$-th edge server, $U_{i,t}$ is the CPU utilization of the $i$-th edge server, $RD_{j,t}$ is the amount of remaining data that the mobile device needs to process, $RB_{j,t}$ is the remaining power of the mobile device, and $HB_{j,t}$ is the power generated by the mobile device through RF-DC conversion.
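The state definition above amounts to concatenating per-server and per-device quantities into one vector of length 2m + 3. A minimal Python sketch of that assembly follows; the function name and argument names are hypothetical, and any normalization of the entries is left out.

```python
def build_state(rates, utilizations, remaining_data,
                remaining_power, harvested_power):
    """Assemble the observation of one mobile device at one time step.

    rates[i] and utilizations[i] refer to the i-th edge server; the last
    three entries describe the device itself.
    """
    if len(rates) != len(utilizations):
        raise ValueError("need one rate and one utilization per server")
    return (list(rates) + list(utilizations)
            + [remaining_data, remaining_power, harvested_power])
```

For m = 2 edge servers the resulting vector has 7 entries, ordered as in the definition of the state space.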

2) ACTION SPACE
To offload part of a mobile task to the target edge server for collaborative computing, this article designs two agents that share the same state space but have different action spaces: one determines the target server and the other determines the amount of data to offload. For target server selection, the action space corresponds to the set of edge servers, so the discrete action space is $A^{edge} = (es_1, es_2, \ldots, es_m)$, with $(0/1)^j_i$ indicating whether the task of the $j$-th mobile device is offloaded to the $i$-th edge server. For example, $A^{edge}_j = (0, 0, 1, \ldots, 0)$ indicates that the target server of the $j$-th mobile device is the 3rd edge server. For the amount of offloaded data, the continuous action $A^{percent} = \lambda_t$ is the fraction of the data amount $C$ offloaded in time step $t$, kept to two decimal places.
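The two action spaces can be illustrated with a short Python sketch that decodes the raw outputs of the two agents into an environment action. This is an assumption-laden illustration: the helper names are hypothetical, and clipping the fraction to [0, 1] before rounding is a design choice of this sketch, not something the text specifies.

```python
def decode_server(one_hot):
    """Return the 0-based index of the selected edge server."""
    return max(range(len(one_hot)), key=one_hot.__getitem__)


def decode_fraction(raw):
    """Clip the offloading fraction to [0, 1] and keep two decimals."""
    return round(min(max(raw, 0.0), 1.0), 2)
```

For instance, the one-hot action (0, 0, 1, 0) decodes to index 2, that is the 3rd edge server, and a raw fraction of 0.4567 becomes 0.46.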

3) REWARD FUNCTION
The multi-agent deep reinforcement learning problem of selecting the target server and the amount of offloaded data is a fully cooperative game. Both agents aim to reduce the delay, energy consumption, cost, and task failure rate as much as possible, so they share the same reward function:
$$R_t = -\left(\sigma \sum_{j=1}^{n} H_{j,t} + \xi \sum_{j=1}^{n} E^{offload}_{j,t} + \delta \sum_{i=1}^{m} Cost_{i,t} + \omega \sum_{j=1}^{n} I(RB_{j,t} = 0)\right) \tag{12}$$
where $\sum_{j=1}^{n} H_{j,t}$ and $\sum_{j=1}^{n} E^{offload}_{j,t}$ are the total delay and total energy consumption of all mobile devices; $\sum_{i=1}^{m} Cost_{i,t}$ is the total cost of all edge servers; $\sigma$, $\xi$ and $\delta$ are the weights of these three indicators; $I(RB_{j,t} = 0)$ equals 1 when the remaining power of mobile device $j$ is exhausted and 0 otherwise, which measures whether the task is processed successfully; and $\omega$ is the penalty for a task processing failure.
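A shared reward of this kind can be sketched in Python as a negative weighted sum with a failure penalty. The sign convention and the absence of any normalization are assumptions of this sketch, and the parameter names are hypothetical; a device is counted as failed when its remaining power is zero or below.

```python
def shared_reward(delays, energies, costs, remaining_power,
                  sigma, xi, delta, omega):
    """Reward shared by both agents in one time step (assumed form).

    delays, energies and remaining_power hold one value per mobile
    device; costs holds one value per edge server.
    """
    failures = sum(1 for rb in remaining_power if rb <= 0)
    penalty = (sigma * sum(delays) + xi * sum(energies)
               + delta * sum(costs) + omega * failures)
    return -penalty
```

Because the reward is identical for both agents, any action that lowers delay, energy, cost or the number of depleted devices benefits both, which matches the fully cooperative setting described above.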

B. METHODOLOGY

1) SAC
SAC is an improved actor-critic algorithm based on maximum entropy reinforcement learning. By maximizing entropy, it lets the actor explore as many actions as possible while still completing the task, and thereby obtains several approximately optimal trajectories [17]. The SAC algorithm therefore has stronger exploration ability and robustness and does not easily fall into local optima. Its optimal strategy is:
$$\pi^* = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[\gamma^t \left(r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right)\right], \qquad H(\pi(\cdot|s_t)) = -\log \pi(\cdot|s_t) \tag{13}$$
where $\pi^*$ is the optimal decision, $T$ is the time horizon, $\rho_\pi$ is the trajectory distribution under the decision $\pi$, $\gamma \in [0, 1]$ is the discount coefficient, $r : S \times A \to \mathbb{R}$ is the reward function, $s_t \in S$ is the environmental state at time step $t$, $a_t \in A$ is the action taken at time step $t$, $\alpha > 0$ is a weighting coefficient that controls the entropy term (the larger its value, the more the strategy tends to explore), and $H(\pi(\cdot|s_t))$ is the entropy of the strategy $\pi$ in state $s_t$. To solve high-dimensional continuous control problems, the SAC algorithm approximates the state value function $V_\phi(s_t)$, the soft Q function $Q_\theta(s_t, a_t)$ and the strategy function $\pi_\varphi(a_t|s_t)$ with neural networks, and updates their parameters alternately by stochastic gradient descent (SGD). The strategy function follows a Gaussian distribution whose mean vector and covariance matrix are both produced by a neural network. The objective function of the state value function $V_\phi(s_t)$ is:
$$J_V(\phi) = \mathbb{E}_{s_t\sim D}\left[\frac{1}{2}\left(V_\phi(s_t) - \mathbb{E}_{a_t\sim\pi_\varphi}\left[Q_\theta(s_t,a_t) - \alpha\log\pi_\varphi(a_t|s_t)\right]\right)^2\right] \tag{14}$$
where $D$ is the replay buffer of past experience and $\phi$, $\theta$, $\varphi$ are the parameters of the respective neural networks. The soft Q function is updated in the same way as in other Q-learning algorithms, by minimizing the Bellman residual. The difference is that its value target contains the entropy term, and its objective function is:
$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim D}\left[\frac{1}{2}\left(Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t)\right)^2\right], \qquad \hat{Q}(s_t,a_t) = r(s_t,a_t) + \gamma\left(Q_{\bar\theta}(s_{t+1},a_{t+1}) - \alpha\log(\pi_\varphi(a_{t+1}|s_{t+1}))\right) \tag{15}$$
where $\bar\theta$ is the parameter of the target Q network. The target parameters are synchronized with the Q network parameters $\theta$ at regular intervals, and the loss is computed from the difference between the outputs of the two Q networks, which improves the stability and convergence of training.
The strategy function is updated through the soft Q function. The action probability distribution of the strategy in state $s_t$ is expected to conform to the distribution of the Q values, so the strategy network parameters are updated by minimizing the KL divergence:
$$J_\pi(\varphi) = \mathbb{E}_{s_t\sim D}\left[D_{KL}\left(\pi_\varphi(\cdot|s_t)\,\Big\|\,\frac{\exp\left(Q_\theta(s_t,\cdot)/\alpha\right)}{Z(s_t)}\right)\right], \qquad a_t = f_\varphi(\varepsilon_t; s_t) \tag{16}$$
where $Z(s_t)$ is the normalizing term obtained by summing over the Q values of all actions under the current strategy in state $s_t$, the function $f$ computes the mean and variance of the Gaussian distribution, and $\varepsilon$ is the noise used for Gaussian sampling. The strategy update formula and its gradient based on action sampling are therefore:
$$J_\pi(\varphi) = \mathbb{E}_{s_t\sim D,\,\varepsilon_t}\left[\alpha\log\pi_\varphi(f_\varphi(\varepsilon_t;s_t)|s_t) - Q_\theta(s_t, f_\varphi(\varepsilon_t;s_t))\right]$$
$$\hat\nabla_\varphi J_\pi(\varphi) = \nabla_\varphi\,\alpha\log\pi_\varphi(a_t|s_t) + \left(\nabla_{a_t}\alpha\log\pi_\varphi(a_t|s_t) - \nabla_{a_t}Q_\theta(s_t,a_t)\right)\nabla_\varphi f_\varphi(\varepsilon_t;s_t) \tag{17}$$
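To illustrate how the entropy term enters the soft Q update, the following Python sketch computes the entropy-regularized target and the squared Bellman residual for a single transition. It is a scalar sketch of the standard soft actor-critic target, not the authors' network code; the function names are hypothetical and the categorical entropy helper is included only to show what the entropy bonus measures.

```python
import math


def soft_q_target(reward, gamma, q_next, log_pi_next, alpha):
    """Soft Bellman target: r + gamma * (Q_target(s', a') - alpha * log pi(a'|s'))."""
    return reward + gamma * (q_next - alpha * log_pi_next)


def soft_q_loss(q_value, target):
    """Half squared Bellman residual for one sample."""
    return 0.5 * (q_value - target) ** 2


def categorical_entropy(probs):
    """Entropy of a discrete distribution; largest for the uniform one."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

An unlikely next action (a strongly negative log probability) raises the target, which is how the update rewards strategies that keep exploring; a larger alpha strengthens this effect.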

2) MADDPG
Traditional reinforcement learning is difficult to apply directly to a multi-agent environment, mainly because every agent is constantly learning and improving its strategy. From the point of view of a single agent, changes in the strategies of the other agents make the environment non-stationary, so the agent's state transition probability differs from one situation to another: $P(s'|s, a, \pi_1, \ldots, \pi_n) \neq P(s'|s, a, \pi'_1, \ldots, \pi'_n)$ for any $\pi_i \neq \pi'_i$. As a result, multi-agent reinforcement learning cannot directly use experience replay for training [18]. Moreover, the complexity of the environment grows with the number of agents, and estimating the gradient by sampling leads to large variance, so policy gradient algorithms cannot be trained effectively in a multi-agent environment. These problems arise mainly because the agents do not interact with one another, so the system as a whole is neglected. This article therefore uses the MADDPG algorithm to solve the target server selection and task offloading amount problems. The MADDPG algorithm is an extension of the DDPG algorithm that handles the mutual awareness of multiple agents through centralized training and decentralized execution. During training, the Actor of the DDPG algorithm chooses action $a_t$ according to the current state $s_t$; the Critic then uses the state-action value function to compute the Q value as feedback on the Actor's action, and the difference between the estimated Q value and the actual Q value is used to update the network parameters; the Actor in turn improves its strategy based on the Critic's feedback. In MADDPG, the Critic can additionally obtain the states and actions of the other agents during training to compute a more accurate Q value.
That is, each agent evaluates the value of its current action based not only on its own state but also on the behavior of the other agents, which realizes centralized training. After training, the Actor of each agent only needs to act according to its own state and does not need information from other agents, which realizes decentralized execution [19].
This article assumes that φ = (φ 1 , . . . , φ k ) represents the strategy parameters of k agents, π = (π 1 , . . . , π k ) is the corresponding strategy, and the strategy gradient formula of the i-th agent is: where s i represents the observation value of the i-th agent, x = (s 1 , . . . , s k ) represents the state vector containing the observation values of all agents, and Q π i (x,a 1 , . . . , a k ) represents the Q value evaluated by centralized Critic for the ith agent. Because each agent learns a different Q π i function, it can have different reward values to complete cooperation or competition tasks. For the agent's deterministic strategy µ φ i (abbreviated as µ i ), the gradient formula is: , a 1 , . . . , a k )| a i =µ i (s i ) ]. (19) The element composition of the experience replay buffer D is (x,a 1 , . . . , a k , r 1 , . . . , r k , x ), which records the observation values, actions and rewards at the current moment and observation values at the next moment of all agents. The update formula of the centralized Critic's action value function Q µ i is: where Qū i represents the target network andμ = (µφ 1 , . . . , µφ k ) is the set of target strategies with delay parametersφ i . At the same time, since the MADDPG algorithm can only solve the problem of continuous action space, and the problem of edge server selection is a discrete problem, this article utilizes re-parameterization of Gumbel Softmax to perform category sampling without losing gradient information, so as to realize the mapping relationship between continuous actions and discrete actions. The calculation formula is: where p represents the probability vector of k-dimensional, and the parameter τ > 0 is used to control the smoothness of the softmax function. The larger the value is, the smoother the distribution generates, and the smaller the value is, the closer the distribution is to the discrete one-hot distribution [20]. 
Therefore, gradually reducing τ during training yields a discrete distribution that is closer to reality. Although the Actor in the MADDPG algorithm uses a random strategy to ensure sufficient exploration, the Critic's deterministic strategy considers only one optimal action for each state and cannot explore all possible optimal actions, so the algorithm easily falls into a local optimum. To solve this problem, this article combines the MADDPG algorithm with the SAC algorithm so that it can explore as many optimal paths as possible in a multi-agent environment, thereby enhancing the robustness and generalization of the algorithm. The improved algorithm flow is as follows:

12.     calculate overall critic loss $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} L_i(\theta)$ and update online Q network parameter $\theta$
13.     for each agent i do
14.       update online policy network parameter $\phi_i$ by calculating the gradient value $\nabla_{\phi} J_{\pi}(\phi)$ according to (17)
15.     end for
17.     update target policy network parameter $\bar{\phi}_i$ of each agent i by $\bar{\phi}_i \leftarrow \tau \phi_i + (1 - \tau)\bar{\phi}_i$, $\tau \in [0, 1]$
18.   end for
19. end for

According to the above algorithm flow, Figure 3 shows the flow chart of the improved algorithm for two agents.

A. SIMULATION ENVIRONMENT
This article builds the task offloading model of MEC by comprehensively considering the computing performance, signal range and geographic location of the edge servers; the computing performance, remaining power, charging power and location information of the mobile devices; and the data amount of the different application services. The initial locations of the edge servers and mobile devices are simulated based on the Melbourne CBD area in the EUA data set, and the location of each mobile device changes over time following the Truncated Levy Walk mobility model, which ensures that it moves within the area covered by the signal [21]. The signal coverage radius of the HAP is randomly distributed in [100, 400], the uploading bandwidth of the mobile device is w = 10 MHz, the fixed communication power is p_0 = 0.4 W, the data transmission power is p_tran = 0.1 W, the signal power amplifier coefficient is α = 40, the energy conversion efficiency is υ = 0.8, the integrated channel gain is G = 20, and the initial battery capacity is 4000 mAh. It is assumed that each mobile device can send only one application request at a time [22]-[25]. In order to account for the computing performance and power consumption of different edge servers and mobile devices, this article refers to the Standard Performance Evaluation Corporation (SPEC) to set the device configurations and the average performance-to-power ratio. A larger ratio indicates that the device consumes less energy at the same performance, so its energy consumption coefficient k is smaller [26], [27]. The models of the edge servers and of the mobile devices each follow a uniform distribution, and the detailed information is shown in Table 1.
Because different types of applications differ in data amount and popularity, the application requested by each mobile device is sampled according to its popularity value, and its data amount follows a uniform distribution within the configured interval. Detailed settings for the different applications are shown in Table 2.
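The request generation described above can be sketched as a two-stage draw: first the application type, weighted by popularity, then the data amount, uniform within that application's interval. The application names, popularity values, intervals and the megabyte unit in the sketch below are placeholders chosen for illustration and are not the values of Table 2.

```python
import random

# Hypothetical application catalogue: names, popularity values and data-size
# intervals are placeholders, not the settings from Table 2.
APPLICATIONS = {
    "app_a": {"popularity": 0.5, "data_mb": (1.0, 5.0)},
    "app_b": {"popularity": 0.3, "data_mb": (5.0, 20.0)},
    "app_c": {"popularity": 0.2, "data_mb": (20.0, 50.0)},
}


def sample_request(rng):
    """Pick an application by popularity, then a data amount uniformly in its interval."""
    names = list(APPLICATIONS)
    weights = [APPLICATIONS[name]["popularity"] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    low, high = APPLICATIONS[name]["data_mb"]
    return name, rng.uniform(low, high)


rng = random.Random(42)
requests = [sample_request(rng) for _ in range(5)]
for name, size in requests:
    print(f"{name}: {size:.2f} MB")
```

Passing a seeded `random.Random` instance keeps the training data reproducible, while a differently seeded instance can generate the random application data used for testing.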

B. RESULT ANALYSIS
In order to ensure that the strategies generated by the deep reinforcement learning algorithms are efficient and usable, this article first selects the location information of 126 edge servers and of a certain number of mobile devices from the EUA data set as the initial position of each device. Then, each algorithm is trained in the simulation environment with the movement trajectory of each mobile device and the requested application data held fixed. Finally, random movement paths and application data for each device are used to test the trained decision models, so as to compare the generality and efficiency of each strategy.

Figure 4 shows the average reward value per episode obtained by each deep reinforcement learning algorithm during training. The larger the value, the better the decision model. The figure shows that the DDPG algorithm and the SAC algorithm converge poorly in a multi-agent environment. Compared with the DDPG algorithm, the SAC algorithm reaches a higher reward value after convergence but converges more slowly. This is mainly because the SAC algorithm needs more iterations to explore more decision paths, which in turn makes it easier to find better solutions. In addition, the MADDPG algorithm and the improved MADDPG + SAC algorithm perform better in a multi-agent environment, and of the two the improved MADDPG + SAC algorithm has the higher reward value after convergence.

Figure 5 shows the resource consumption generated by each algorithm during task offloading. In this experiment the number of mobile devices increases, and the total data amount of all applications accounts for 50% to 150% of the processing capacity of the entire edge server cluster. The offloading strategy based on the Mobile algorithm achieves good results in terms of cost but performs poorly in terms of energy consumption and task failure rate. This is mainly because the Mobile algorithm gives priority to processing application data on the local device and only gradually offloads to edge servers when local resources are insufficient. In this article, energy consumption is counted only for mobile devices and cost only for edge servers, so this strategy incurs the lowest cost but consumes the most energy. At the same time, according to the data in Table 1, the processing capacity of a mobile device is much lower than that of an edge server, so processing application data on the mobile device leads to a higher delay and failure rate.

In addition, the offloading strategy based on the Edge algorithm performs best in terms of energy consumption, while its performance on the other indicators is average. The main reason is that the Edge algorithm preferentially offloads subtasks to the edge server cluster for processing, so the resource utilization of all edge servers stays at a high level and the cost is high, while the corresponding mobile devices consume less energy. Because the processing performance of the edge servers can meet the requirements of more tasks, the Edge algorithm performs better than the Mobile algorithm in terms of task failure rate.
The DDPG, SAC, MADDPG and MADDPG + SAC algorithms all use deep reinforcement learning to generate offloading strategies automatically from data. As shown in Figure 5, as the number of mobile devices grows, the offloading strategy generated by the DDPG algorithm performs well in terms of cost, while the SAC algorithm performs better than the DDPG algorithm in terms of energy consumption and task failure rate. The strategies generated by these two algorithms perform only moderately on the various indicators, mainly because their training results are unstable in a multi-agent environment and they have difficulty converging to the optimal solution. In contrast, the MADDPG algorithm can effectively learn stable strategies through centralized training and distributed execution, and its comprehensive performance is better than that of the DDPG and SAC algorithms. When the number of mobile devices is largest, the improved MADDPG + SAC algorithm performs best among all the deep reinforcement learning algorithms in terms of energy consumption, delay and task failure rate.

VI. CONCLUSION
In order to solve the task offloading problem of mobile devices in large-scale heterogeneous MEC clusters, this article first proposes using multi-agent deep reinforcement learning to decide how much data to offload and where to offload it. Then, the offloading strategies generated by each algorithm are simulated on the EUA data set. Finally, the advantages and disadvantages of each strategy are verified by comparing energy consumption, cost, delay and task failure rate. According to this comparison, the improved MADDPG + SAC algorithm performs well overall.
In future work, we intend to improve the multi-agent reinforcement learning algorithm through transfer learning. Reusing knowledge that comes from previous experience or from other agents can help the algorithm learn more complex MEC tasks and make the task offloading strategy more practical.
HAIFENG LU was born in 1993. He received the master's degree from the Computer Science Department, Donghua University, Shanghai, in 2017, and the Ph.D. degree from the School of Information Science and Engineering, East China University of Science and Technology. His current research interests include edge computing and reinforcement learning.