D3PG: Dirichlet DDPG for Task Partitioning and Offloading with Constrained Hybrid Action Space in Mobile Edge Computing

Mobile Edge Computing (MEC) has been regarded as a promising paradigm to reduce service latency for data processing in the Internet of Things (IoT) by provisioning computing resources at the network edge. In this work, we jointly optimize task partitioning and computational power allocation for computation offloading in a dynamic environment with multiple IoT devices and multiple edge servers. We formulate the problem as a Markov decision process with a constrained hybrid action space, which cannot be well handled by existing deep reinforcement learning (DRL) algorithms. Therefore, we develop a novel DRL algorithm called Dirichlet Deep Deterministic Policy Gradient (D3PG), built on Deep Deterministic Policy Gradient (DDPG), to solve the problem. The developed model learns to solve multi-objective optimization, including maximizing the number of tasks processed before expiration and minimizing the energy cost and service latency. More importantly, D3PG can effectively deal with a constrained distribution-continuous hybrid action space, where the distribution variables are for task partitioning and offloading, while the continuous variables are for computational frequency control. Moreover, D3PG can address many similar issues in MEC and in general reinforcement learning problems. Extensive simulation results show that the proposed D3PG outperforms state-of-the-art methods.


I. INTRODUCTION
The Internet of Things (IoT) [1,2] is considered the foundation for a wide range of applications, including self-driving cars, smart cities, and environment monitoring. Although IoT devices can handle small tasks with a reasonable amount of energy consumption, many computation-intensive tasks are beyond their capacity. Moreover, many applications, such as self-driving cars and smart factory robots, require real-time responses, and IoT devices struggle to respond to users when tasks require a relatively large amount of computational resources. Furthermore, most IoT devices are extremely sensitive to energy consumption when running on wireless power. Since IoT devices have limited computational resources and energy supplies, they offload computation- and delay-intensive tasks to online servers. However, it is challenging to offload a large number of tasks through the core network and process them on remote servers, because the networks would be congested, thereby increasing the delay. Mobile Edge Computing (MEC) has been proposed to handle tasks in proximity and reduce the burden on the core network [3]. Unfortunately, MEC is not a panacea for the above problem, because MEC servers are equipped with much less computational resources than central cloud servers; therefore, task offloading and scheduling optimization are vital to exploit the limited resources, improve service quality, and reduce costs.
Various methods have been proposed to optimize MEC resource usage and fully utilize the limited computational resources on MEC servers. To reduce the idle time of edge servers and respond to IoT users in a timely manner, an offloading task can be sliced into small sub-tasks, and the sub-tasks can be processed on heterogeneous edge servers [4]. Slicing a task [5] into small sub-tasks and offloading them to edge servers allows the limited computational power on the edge servers to be fully utilized. Conventional optimization methods (e.g., CVX and MIP) [6], machine learning [7], deep learning [8], and reinforcement learning [9,10] have been introduced to address computation offloading challenges. However, it is very challenging to adopt the aforementioned methods for MEC task partitioning and scheduling. First, it is considerably challenging to describe a practical MEC network in mathematical forms that fit conventional optimization methods. Second, machine learning and deep learning usually require labeled data to train the models, and it is extremely difficult for humans to manually compute and label slicing and scheduling datasets.
Deep reinforcement learning (DRL) methods can mitigate the issues mentioned above. To optimize task partitioning, offloading, and computing power allocation, existing methods typically use DRL and optimization techniques to handle those decision variables separately, which can lead to poor overall system performance, instead of addressing the joint optimization problem in an end-to-end manner. To jointly optimize those decision variables, we need to deal with a hybrid action space. Moreover, the problem is even more complex because there are constraints on the action space: the percentages of all sub-tasks offloaded from a given task should sum to one. The majority of existing DRL models can only address discrete [10] or continuous action spaces [11]. Several works, such as Hausknecht et al. [13], try to address hybrid actions with approximation or relaxation of the continuous action space [12]. However, they cannot address the constrained action space in edge computing. Wu et al. [14] use softmax to capture the task partitioning actions to satisfy the constraints of the action space, which is a proportional action space whose components should sum to one. However, softmax has no exploration mechanism to explore all the possible actions and derive the optimal policies.
In this work, we propose a novel deep reinforcement learning approach called Dirichlet Deep Deterministic Policy Gradient (D3PG), based on Deep Deterministic Policy Gradient (DDPG), to jointly optimize task partitioning, computation offloading, and computational frequency control. The developed model can flexibly decide how to partition the tasks, offload the sub-tasks to the edge servers, and select the computational frequencies of the edge servers to execute the sub-tasks. The goal of the model is to make these complex decisions based on the observations so as to maximize the number of tasks completed before expiring and to minimize the energy consumption and time cost. The model makes the decisions and jointly optimizes multiple objectives. The main contributions of this work are:
• We develop a novel D3PG model to optimize MEC resource allocation and improve service quality. The proposed model generates a distribution-continuous hybrid action space to address various issues flexibly. Specifically, each action includes a distribution, formulated as a Dirichlet distribution, for partitioning and offloading tasks, and continuous components for frequency control.
• A configurable optimization target is proposed to address multiple joint optimization problems. The model optimizes multiple objectives in an end-to-end manner, and it does not require further optimization like existing methods.
• We have tested the developed method by extensive simulations, and the results show that our method outperforms the state-of-the-art methods.
The rest of the paper is organized as follows. Section II investigates related works. Section III presents the system modeling and problem formulation. Section IV introduces the proposed method in detail. Section V provides the simulation results, and Section VI concludes this work.

II. RELATED WORKS
In the literature, many methods have been proposed for task scheduling and offloading. A joint computation offloading and system resource allocation problem for MEC has been formulated as a mixed-integer non-linear program in [15]. The authors then transformed the non-linear program into a linear program to reduce the complexity and tackle the challenges. Liang et al. [16] adopt linear-fractional programming (LFP), a generalization of linear programming (LP), and a greedy algorithm to optimize the offloading rate and energy consumption. Tasks can also be divided into sub-tasks and processed on local devices or edge servers simultaneously [17,18] based on optimization. Similarly, Gao et al. [19] propose a method to find the optimal ratio for partitioning a task into two sub-tasks for edge servers and local IoT devices. He et al. [20] propose an optimization method to partition deep learning inference tasks and offload the partitions to edge servers to find the optimal delay for computing resource allocation. These standard optimization methods are relatively straightforward to develop and have addressed many optimization problems in task offloading. However, MEC network environments are far too complex to describe in mathematical forms, and it is considerably challenging to extend conventional optimization to high-dimensional observations.
Machine learning and deep learning [8] models can learn from historical labeled data and predict future computational offloading so that the system can make plausible decisions for MEC. Shen et al. [21] surveyed machine learning methods for resource slicing and planning for next-generation networks. They also summarized various machine learning and deep learning methods adopted in computational offloading and content offloading for MEC. Lyu et al. [22] use stochastic gradient descent, a popular machine learning training method, to learn to partition data for offloading to spatially distributed edge servers; the authors argue that the proposed method can make optimal data partitioning decisions with respect to time delay. Yang et al. [23] proposed a statistical machine learning method to minimize the energy consumption of edge inference [24,25], where deep learning inference tasks are processed on MEC. Ale et al. [26] introduced a deep recurrent neural network to capture and predict user requests so that content offloading decisions can be made and resources allocated based on the prediction. In summary, computation offloading problems can be formulated as supervised classification problems that minimize cost using deep learning [7]. Deep learning models can also be adapted for bandwidth allocation optimization to maximize the system utility [6]. Machine learning and deep learning methods can learn and predict offloading decisions optimized with respect to exploiting resources and reducing costs for the MEC. However, machine learning and deep learning require labeled training data, which takes enormous effort to collect and label. Moreover, it is challenging for humans to label data for optimization problems because it is difficult to take optimal actions in such a complex system.
To mitigate the above issues, reinforcement learning [9] and deep reinforcement learning [10,27] methods are extensively adopted for resource planning and optimization in MEC. In the reinforcement learning framework, we do not need labeled data to train the models; instead, the learning agents interact with the environment (i.e., the MEC networks) to learn and find the optimal policies with respect to the objective function (e.g., minimizing energy consumption). A Q-learning (a typical reinforcement learning algorithm) based method has been proposed [5] to make task offloading decisions; the agents learn to decide whether the current tasks are offloaded to edge servers or processed on local devices to minimize the delay. Similarly, deep Q-learning has been adopted to decide task offloading and select target edge servers for smart vehicles [28].
Cheng et al. [29] adopted a DRL method to minimize the time delay of computation offloading to Unmanned Aerial Vehicle (UAV) based edge servers, and the DRL model outperforms brute-force search and the greedy algorithm. Similarly, Baek et al. [30] proposed a Deep Q-Network (DQN) [10] variant, replacing the convolutional neural network with a recurrent neural network, to control and select task offloading to edge servers; the offloading actions generated by the policies exploit the limited MEC resources and maximize the number of tasks processed before expiring. Another DQN variant [31] is proposed for resource allocation optimization by incorporating Bayesian learning and Long Short-Term Memory (LSTM) [32]. In [4], the authors adopted DRL models to optimize task partitioning and scheduling for vehicular networks, allowing two edge servers to process a task collaboratively. Yu et al. [33] proposed a DRL model to optimize task partitioning and offloading; the model makes decisions based on the profiles of sub-tasks and chooses local devices or edge servers to process the sequentially dependent sub-tasks.
Although reinforcement learning and DRL methods have been adopted to address many task offloading and resource allocation problems in MEC, the existing methods can only deal with relatively small action spaces and are inflexible in partitioning large tasks. For example, most of the DRL models for MEC can only take a binary action that decides whether a task is offloaded to the MEC server or processed on the local device. Another type of DRL model can find an optimal proportion (percentage) of a task to offload to edge servers. In addition, the majority of reinforcement learning models deal with discrete [10] or continuous action spaces [11]. Several works try to handle hybrid actions with approximation or relaxation of the continuous action space [12]. Hausknecht et al. [13] relaxed the action space to support hybrid actions. Masson et al. [34] handled discrete actions with Q-learning and continuous actions with policy search. Similarly, Khamassi et al. [35] use Q-learning and policy gradient to achieve the same results. These methods assume on-policy learning and handle discrete and continuous actions separately. Xiong et al. [36] and Fu et al. [37] use a hierarchical structure [38] to deal with the discrete actions and generate the continuous actions based on the discrete actions. Neunert et al. [39] provided a mixed policy to handle discrete-continuous action spaces. However, none of these methods can handle the constrained action space in edge computing well.

III. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model
As shown in Fig. 1, there is a group of N IoT users U = {u_1, ..., u_N} during a given time slot t. The IoT devices rely on a set of K MEC servers M = {m_1, ..., m_K} to process tasks through computation offloading. A task from the i-th user can be defined as Ω_i = (D_i, C_i, Δ^max_i), where D_i, C_i, and Δ^max_i denote the data size of the task, the CPU cycles required to compute the task, and the maximum tolerable latency (expiration time) of the task Ω_i, respectively. Further, the tasks can be partitioned into smaller sub-tasks and offloaded to different MEC servers for parallel processing. Specifically, a task Ω_i can be partitioned into several smaller sub-tasks, ξ_i = (ξ_{i,1}, ..., ξ_{i,j}, ..., ξ_{i,K}), where j is the index of the sub-task, and the number of sub-tasks is no greater than the number of MEC servers K. It takes δ_{i,j} to complete each sub-task ξ_{i,j}. In other words, the time cost of all the sub-tasks of Ω_i can be denoted as a vector δ_i = (δ_{i,1}, ..., δ_{i,j}, ..., δ_{i,K}), i ≤ N and j ≤ K, and the time cost δ_{i,j} for sub-task ξ_{i,j} can be computed by δ_{i,j} = δ^T_{i,j} + δ^Q_j + δ^C_{i,j} + δ^R_j, where δ^T_{i,j}, δ^Q_j, δ^C_{i,j}, and δ^R_j are the transmission time, the queuing time, the computing time of the sub-task ξ_{i,j}, and the remaining running time of the task being processed at the server, respectively.
The transmission time δ^T_{i,j} can be given as δ^T_{i,j} = D_{i,j} / ζ_{i,j}, where D_{i,j} is the data size of the j-th sub-task and ζ_{i,j} is the current transmission rate from the i-th user to the j-th MEC server. The transmission rate ζ_{i,j} can be given by ζ_{i,j} = B_j log_2(1 + P_{i,j} h_{i,j} L_{i,j} / N_0), where B_j is the bandwidth, P_{i,j} is the transmission power, and h_{i,j}, L_{i,j}, and N_0 are the Rayleigh fading, path loss, and noise power, respectively.
Once the sub-task ξ_{i,j} arrives at the edge server after transmission, it needs to wait until all the sub-tasks ahead of it in the queue and the one on the server are completed. δ^R_j is the remaining computing time of the sub-task currently being processed by the j-th MEC server, and it is the difference between the estimated completion time δ^C and the starting time δ^s: δ^R_j = δ^C − δ^s, where δ^C = C_k / f_k, C_k is the CPU cycles required to compute the k-th offloaded sub-task on the current edge server, and f_k is the computing frequency allocated to process this sub-task. Further, assume there are J tasks or sub-tasks offloaded to the j-th edge server before the current sub-task is assigned to it. Then, the waiting time in the queue δ^Q_j for the j-th sub-task on the j-th MEC server can be computed by δ^Q_j = Σ_{k=1}^{J} C_k / f_k. After all the sub-tasks ahead of the current one are completed, the sub-task ξ_{i,j} can be processed. The required computation time δ^C_{i,j} for the j-th sub-task on the j-th MEC server is given by δ^C_{i,j} = C_{i,j} / f_j, where C_{i,j} is the CPU cycles required to compute the j-th sub-task of Ω_i, and f_j is the frequency of the j-th MEC server for processing the sub-task.

Fig. 1: System Model
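For concreteness, the additive delay model above can be sketched in a few lines of Python; the function name and the numeric inputs are illustrative, not part of the system model.

```python
def subtask_latency(data_size, rate, cycles, freq, queued_cycles, remaining):
    """delta_ij = delta^T (transmission) + delta^Q (queuing)
               + delta^C (computing) + delta^R (remaining time)."""
    t_tx = data_size / rate          # delta^T = D_ij / zeta_ij
    t_queue = queued_cycles / freq   # delta^Q: cycles queued ahead at this server
    t_comp = cycles / freq           # delta^C = C_ij / f_j
    return t_tx + t_queue + t_comp + remaining

# Two sub-tasks of one task; the task meets its deadline only if the
# slowest sub-task finishes before the maximum tolerable latency.
delays = [subtask_latency(1e6, 1e7, 2e9, 2e9, 4e9, 0.05),
          subtask_latency(5e5, 1e7, 1e9, 2e9, 0.0, 0.0)]
on_time = max(delays) <= 3.0
```

The completion condition on the last line is exactly the per-task deadline check used below: the whole task counts as completed only when its slowest sub-task beats Δ^max_i.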
A task Ω_i is considered completed before the corresponding deadline if all of its sub-tasks are completed no later than the maximum tolerable latency. In other words, if the last sub-task has been processed before the expiration time, the task is processed successfully; otherwise, it is considered expired and a failure to respond to the user. We assign a positive flag (+1) to the task partitioning and offloading coordinator when the task is completed before the corresponding deadline, and set this flag to 0 when it fails to respond to the user in time: Λ_i = 1 if max(δ_i) ≤ Δ^max_i, and Λ_i = 0 otherwise. Note that the agent gets a punishment (energy cost) whenever it takes an action for partitioning and offloading; therefore, the agent gets negative feedback when it fails to respond to the users in time. Finally, the energy consumption to process task Ω_i is given by E_i = E^T_i + E^C_i. The energy consumption due to transmission, E^T_i, is the sum of the energy consumption for transmitting the sub-tasks, and is given by E^T_i = Σ_j P_{i,j} δ^T_{i,j}. As shown in [40,41], the computation energy consumption can be calculated by E^C_i = Σ_j c (f_{i,j})² C_{i,j}, where c = 10^{-26} and f_{i,j} is the frequency used to compute the j-th sub-task.
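A minimal sketch of the energy accounting follows, assuming the standard c·f²·C computation-energy model attributed to [40,41] (the quadratic dependence on frequency is an assumption here); all names and numbers are illustrative.

```python
C_EFF = 1e-26  # the constant c from the text

def transmission_energy(powers, tx_times):
    # E^T_i: transmit power times transmission time, summed over sub-tasks
    return sum(p * t for p, t in zip(powers, tx_times))

def computation_energy(freqs, cycles):
    # E^C_i: c * f^2 * C per sub-task (quadratic form assumed, per [40, 41])
    return sum(C_EFF * f ** 2 * c for f, c in zip(freqs, cycles))

e_total = (transmission_energy([0.5, 0.5], [0.1, 0.05])
           + computation_energy([2e9, 1e9], [2e9, 1e9]))
```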

B. Problem Formulation
Theoretically, we can define the objective function as maximizing Σ_{i=1}^{N} (β_1 Λ_i − β_2 E_i − β_3 max(δ_i)), where β_1, β_2, and β_3 are normalization factors. The optimization objective indicates maximizing the number of tasks processed before expiring while minimizing energy consumption. Each action contains two vectors, Φ_i for task partitioning and F_i for frequency control. Specifically, Φ_i = {φ_1, ..., φ_j, ..., φ_K} slices the tasks according to the edge servers, and φ_j denotes the percentage of the task offloaded to the j-th server; F_i = {f_1, ..., f_j, ..., f_K} presents the recommended frequencies, and f_j denotes the recommended frequency for the j-th sub-task. This formulation is relatively straightforward as a mathematical definition. However, it is inflexible for balancing the reduction of energy consumption against the increase in the number of completed tasks. Therefore, we formulate the objective function so as to maximize the expected accumulated rewards, max_π E[Σ_t γ^t r_t | s_i, π], where s_i is the current system observation and π denotes a policy; a policy maps observation states to actions. The reward corresponding to each action a_i taken by the coordinator is defined as r_i = α w_1 Λ_i − (1 − α)(w_2 log(E_i) + w_3 log(max(δ_i))) + C, where α is a weight that allows the network providers to adjust the reward function based on their interests; w_1, w_2, and w_3 are normalization terms to convert Λ, log(E_i), and log(max(δ_i)) into the same scale; and C is a small incentive to encourage agents to maintain the stability of the MEC servers.
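One plausible implementation of the reward is sketched below, assuming a weighted form in which α trades the completion flag against the log-scaled energy and latency costs; the exact weighting and all coefficient values are assumptions for illustration.

```python
import math

def reward(completed, energy, max_delay,
           alpha=0.5, w1=1.0, w2=0.1, w3=0.1, incentive=0.01):
    """Reward favoring on-time completion (Lambda) and penalizing
    log(E_i) and log(max(delta_i)); the weighting by alpha is assumed."""
    bonus = w1 * (1.0 if completed else 0.0)
    penalty = w2 * math.log(energy) + w3 * math.log(max_delay)
    return alpha * bonus - (1 - alpha) * penalty + incentive

r = reward(completed=True, energy=90.0, max_delay=3.15)
```

Whatever the exact weighting, the essential property is that a completed task always earns strictly more reward than the same task missing its deadline.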

IV. D3PG: DIRICHLET DEEP DETERMINISTIC POLICY GRADIENT
In this section, we provide a brief introduction to DRL and then introduce the developed model in detail.

A. Background
The DRL setting is very similar to standard reinforcement learning. The learning agents interact with the environment to learn and make decisions. For reinforcement learning, we need a description of the MEC network, which we call the environment. We assume the environment represents the MEC network and provides an interface for the agent to interact with it. In other words, reinforcement learning agents make decisions based on the observations provided by the environment, and the decisions are optimized with respect to the expected long-term rewards.
To formulate the MEC network environment as an MDP, we need to specify the components of the MDP, including the state space, action space, transition function, and reward, denoted as (S, A, P, R). The transition function of the MDP can be given as p(s′|s, a) = Pr(s_{t+1} = s′ | s_t = s, a_t = a), which shows the probability of transitioning from the current state s to the next state s′, where s_{t+1} is the state of the (t+1)-th time step, and s_t and a_t are the state and action of the t-th time step, respectively. However, the transition function of the MEC network is unknown. Therefore, we prefer to design a model-free method to overcome this challenge. The long-term accumulated reward is also known as the return, and it can be given as G_t = R_t + γR_{t+1} + γ²R_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k}, where R_t is the immediate reward and the remaining terms are the estimated future rewards discounted by γ in the target optimization problem (Eq. 11); therefore, the optimization problem can be solved by maximizing the long-term rewards (Eq. 15).
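The return above can be computed recursively from the back of a reward sequence, since G_t = R_t + γG_{t+1}:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...,
    computed backwards via the recursion G_t = R_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.9)  # 1 + 0.9 + 0.81 = 2.71
```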
The goal of RL models is to find optimal policies that maximize the long-term expected rewards. A policy π can be considered a function mapping states to actions, and no policy can collect more rewards than an optimal policy π*. Note that it is possible to find multiple optimal policies. To find the optimal policy π* that maximizes the expected long-term reward, standard RL methods store all of the possible state-action pairs in a tabular data structure, and each state-action pair is associated with an expected long-term reward. This tabular data structure, called a Q-table, is designed to hold state-action pairs and the corresponding reward values. The Q-table can also be formulated as an action-value function Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]. Intuitively, the RL models save the learned policies, mapping states to actions, into the Q-table; the agent then looks up the best actions in the table given the states. However, the Q-table can quickly explode in size, and it is incredibly challenging to search for the optimal policies when dealing with high-dimensional or continuous state spaces. Therefore, Deep Neural Networks (DNNs) have been employed to capture the high-dimensional observations and generate plausible policies that maximize the expected long-term rewards.
The essential idea of incorporating DNNs into reinforcement learning is to employ deep learning to process the complex observations and reinforcement learning to take complex actions. Although DNNs can be considered non-linear approximators for capturing the high-dimensional states, DNNs are notoriously unstable in reinforcement learning because of noisy feedback and other attributes of reinforcement learning presented later in this section. Mnih et al. [10] adopted DNNs as approximators to capture the high-dimensional states and extract the feature maps without domain knowledge, called Deep Q-Network (DQN). Moreover, they introduced a method that delays parameter updates to stabilize the learning process. Specifically, they copy the network as a target network, which is frozen during most of the training episodes and whose parameters are updated once every N episodes. In DQN, the authors also adopted an experience replay buffer [42] to decouple the correlation of the sequential interaction trajectories. During the learning process, the agents draw training samples from the replay buffer U(D) to train the model, and the Temporal Difference (TD) [43] learning method is adopted to train the model. Specifically, the agents try to minimize the gap between the target value and the current value with the following loss: L(θ_τ) = E_{(s,a,r,s′)∼U(D)}[(r + γ max_{a′} Q(s′, a′; θ⁻_τ) − Q(s, a; θ_τ))²], where Q(s′, a′; θ⁻_τ) and Q(s, a; θ_τ) denote the target and current Q-values, respectively. Here s, a, and θ_τ are the current state, action, and local network weights, respectively; similarly, s′, a′, and θ⁻_τ are the next state, next action, and target network weights, respectively. Further, the weights of the learning network can be updated by gradient descent, θ_τ ← θ_τ − η∇_{θ_τ}L(θ_τ), and the target network weights are fixed and only updated every N steps by θ⁻ ← θ. In this work, both the state space and the action space have continuous ranges. Therefore, we have to develop a DRL model that can address continuous state and action spaces; ideally, it is a policy-based model so that it does not rely on value
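The DQN target computation described above can be sketched as follows; the frozen target network is represented here by a plain lookup function, and the terminal-state handling is an addition not spelled out in the text.

```python
def td_targets(batch, q_target, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-) for each transition;
    q_target maps a state to the list of Q-values of the frozen target net."""
    ys = []
    for s, a, r, s_next, done in batch:
        ys.append(r if done else r + gamma * max(q_target(s_next)))
    return ys

# Toy target "network": two discrete actions per state.
q_table = {0: [1.0, 2.0], 1: [0.5, 0.0]}
ys = td_targets([(0, 1, 1.0, 1, False), (1, 0, -1.0, 0, True)],
                lambda s: q_table[s])
```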
functions to derive optimal policies. Policy-based reinforcement learning methods are easy to extend to high-dimensional or continuous action spaces because they do not have to reason about individual actions, as the objective function is the parameterized policy π. In policy-based DRL methods, the models derive the optimal policy directly by finding π ≈ π* that maximizes the expected return J(θ) = E_{τ∼P(τ;θ)}[R(τ)], where P(τ; θ) is the probability of selecting the trajectory of actions τ under the policy with parameters θ. The learning process (policy optimization) can be implemented with many methods, such as hill climbing, genetic algorithms, and Policy Gradient Algorithms [44]. A popular way to derive the optimal policies is the policy gradient update θ ← θ + η∇_θJ(θ), where the learning rate η decays over the time steps to avoid overshooting, which can result in non-optimal policies. Policy-based methods have many advantages, such as good convergence properties, easy extension to high-dimensional and continuous action spaces, and the ability to learn truly stochastic policies. However, policy-based RL methods also have some drawbacks. First, policy-based methods are susceptible to convergence to local optima, especially with non-linear function approximators. This issue is shared with value-based methods, but it is more challenging in policy-based methods. Second, the obtained knowledge is specific and does not always generalize well, because it only captures what the agent needs to optimize the policy and includes no other information. Third, it ignores much of the information in the data.

B. Proposed Model
Considering the advantages and disadvantages of value-based and policy-based DRL, we prefer to develop an Actor-Critic DRL model because it combines the advantages of both. Specifically, we develop a model called Dirichlet Deep Deterministic Policy Gradient (D3PG), which builds on Deep Deterministic Policy Gradient (DDPG) [11]. The developed model has to address a continuous action space and meet the constraints of MEC task partitioning. Specifically, a task can be partitioned into K sub-tasks, and the sizes of the sub-tasks can be represented by Φ_i = {φ_1, ..., φ_j, ..., φ_K}, where φ_j denotes the percentage of the full task contained in the j-th sub-task. Thus the sum is constrained by Σ_j φ_j = 1, since the total percentage is one. To satisfy the action space constraints, we employ the Dirichlet distribution to capture the constrained action.
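The key property exploited here is that Dirichlet samples always lie on the probability simplex. A minimal sketch using only the standard library (a Dirichlet draw is a normalized vector of independent Gamma draws; names are illustrative):

```python
import random

random.seed(0)

def partition_action(concentration):
    """Sample Phi ~ Dir(psi): normalize independent Gamma(psi_j, 1) draws.
    The result satisfies sum(phi) == 1 and phi_j in [0, 1] by construction."""
    draws = [random.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

phi = partition_action([1.0, 1.0, 1.0])  # three edge servers
```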
Fig. 2 shows the process of task partitioning and offloading as well as frequency control with the developed DRL model. The system has three components: the MEC network, the MEC environment, and the DRL agent. The DRL agent has no direct control over the MEC network; instead, the environment serves as a coordinator to bridge the MEC network and the DRL learning agent. We assume the environment represents the MEC network for simplicity, and the DRL agent interacts with the environment and learns from the trials.
The learning process can be considered as the following steps. First, the agent takes an action given the observation from the environment. Second, the environment provides feedback and the next state to the DRL agent; the agent then stores the current interaction data in the experience replay buffer for training the model. Each record of interaction includes the current state, action, reward, and next state, denoted as a tuple ⟨s_t, a_t, r_t, s_{t+1}⟩. The DRL agent keeps interacting with the environment to generate the training data. Third, the agent draws training data from the experience replay buffer to train the learning networks inside the DRL model. Each network has a backup copy called the target network, and the target networks are for stabilizing the training. The details of the elements of the MDP and the DRL training process are presented in the following subsections.
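The replay buffer in the second step can be sketched as a fixed-capacity deque of ⟨s_t, a_t, r_t, s_{t+1}⟩ tuples with uniform sampling (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores interaction tuples and serves uniform mini-batches,
    decoupling the correlation of sequential interactions."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)  # oldest records evicted first

    def push(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1)
batch = buf.sample(3)
```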
1) Action: Although DDPG models can address many DRL challenges with continuous action spaces, part of the action space in this work has a further constraint. Specifically, each action has two vectors: one vector Φ_i = {φ_1, ..., φ_j, ..., φ_n} for partitioning the tasks into sub-tasks according to the edge servers, and the other vector F_i = {f_1, ..., f_j, ..., f_n}, where n is the number of edge servers. In other words, the DRL model specifies the percentage φ_j of the task to offload to the j-th edge server; further, it recommends that the server process the sub-task using f_j percent of the maximum CPU frequency of the j-th edge server. All of the elements (sub-components of the action) are continuous and in the range [0, 1]. In other words, we can define the constraints as φ_j ∈ [0, 1] and f_j ∈ [0, 1], and the j-th edge server should not receive a sub-task when φ_j = 0. Moreover, the proportions of the sliced task have to sum to 1. Therefore, a specific action a_i can be given as a_i = Φ_i ⊕ F_i. Indeed, the softmax function can satisfy the constraint that Σ_{j=1}^{n} φ_j = 1. However, the softmax function does not have an exploration mechanism, which probably leads the model to a local optimum. In DRL, the learning agent has to explore the environment because the feedback (rewards and punishments) is not labeled data as in supervised machine learning; it is an evaluation score of the actions and policies. Therefore, we cannot regard the feedback as actual labels as in deep learning training. A possible way to mitigate this problem is to add an exploration method such as ε-greedy by drawing Φ_i from a random distribution (e.g., a uniform distribution) or adding a noise vector to each action. However, those methods are unstable for exploring the continuous action space. Therefore, we use a Dirichlet distribution to characterize Φ_i, which means Φ_i ∼ Dir(ψ). The Dirichlet distribution can not only satisfy the constraint on Φ_i but also naturally explore the possible actions to find the
optimal policies by sampling from the Dirichlet distribution. Given the random process of Dirichlet sampling, the agent achieves a stochastic policy without saving specific actions. The distribution Φ_i ∼ Dir(ψ) is defined as Dir(Φ_i | ψ) = (Γ(Σ_{j=1}^{K} ψ_j) / Π_{j=1}^{K} Γ(ψ_j)) Π_{j=1}^{K} φ_j^{ψ_j − 1}, where Γ(·) is the standard Gamma function given by Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt. Moreover, to meet the condition ψ_j > 0 of the Dirichlet function, we use an exponential to process the actor-network outputs for the slicing action, Ψ_i = e^z + ε, where z is the post-activation output of the previous layer, and ε is a very small positive number to avoid zero values. Theoretically, e^z is always greater than 0; however, it could be extremely close to 0 in some cases, which would raise issues in a real-world implementation, and ε is there to avoid such implementation errors. The other parts, the post-activations from the previous layer, are input to the Ornstein-Uhlenbeck process. As shown in Fig. 3, the results of the Dirichlet distribution and the Ornstein-Uhlenbeck process are concatenated into a complete action, a_t = Dir(ψ_t) ⊕ (μ(s_t | θ_t) + N_t), where ⊕ denotes the concatenation of two vectors, Dir(ψ_t) provides the sub-actions for task partitioning, and the rest of the elements are for frequency control. The Dirichlet distribution and the Ornstein-Uhlenbeck process, denoted μ(s_t | θ_t) + N_t, address the exploration challenges during the learning phase. Therefore, the developed model can keep exploring the environment and is unlikely to get stuck in non-optimal policies, because the DRL model keeps exploring by sampling actions from the Dirichlet distribution and adding noise to actions with the Ornstein-Uhlenbeck process.
2) State Space: Although the MEC environment is not fully observable to the DRL agent, which means the observation is not equivalent to the state, we treat the observation as the state as in other reinforcement learning methods. The state space at time slot t is denoted as s_t = ⟨M, ζ, Ω⟩, where M denotes the status of the MEC servers, ζ is the transmission rate matrix, and Ω is the set of the current tasks ready to
offload. Each component of the state has its own sub-components, as defined in Section III.
3) Transition Probability and Reward Function: As in standard reinforcement learning methods, we assume the environment satisfies the MDP assumption. However, we do not know the transition function (Eq. 14) of the environment. Finally, the reward function for a specific time step is defined by Eq. 13.

C. Loss Function and Gradient
Constructing the loss function is one of the critical steps of training DRL models, and we need two loss functions: one for the actor-network and another for the critic-network. However, it is unnecessary to provide an explicit loss function for the actor-network because the actor-network is optimized with respect to the critic value. The actor-network is policy-based and the critic-network is value-based; therefore, we can consider the critic as the Q-Network in the DQN model [10]. The critic value is the utility for the policy-based actor-network. The learning process of the critic is very similar to Q-learning and DQN. The training process of Q-learning uses Temporal Difference (TD) [43] to update the Q-values so that the agents can search policies from the Q-table. The Q-value $Q(s, a)$ update can be accomplished with

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \, \delta_t, \quad \delta_t = r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t),$$

where $\eta$ is the learning rate, $\delta_t$ is the TD error, and $\gamma \in [0, 1]$ is the discount factor of the expected future values. Theoretically, it has been proven [9,45] that a near-optimal Q-value can be obtained by iterating the above step until $|Q^{*}(s, a) - Q(s, a)| < \epsilon$, where $\epsilon$ is a very small positive number.
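The TD update above can be sketched in a few lines; the toy state and action indices, table size, and step values here are illustrative, not from the paper.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]  # TD error delta_t
    Q[s, a] += eta * delta
    return delta

Q = np.zeros((2, 2))  # toy Q-table: 2 states x 2 actions
delta = td_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0,1] = 0 + 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```

Iterating this update drives the table toward the fixed point of the Bellman optimality equation, which is the convergence condition stated above.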
Similarly, the action-value function of the actor-critic can be given as the Bellman equation

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right],$$

where $r(s_t, a_t)$ is the immediate reward when the agent takes action $a_t$ given state $s_t$, and the remaining term is the estimated future value under policy $\pi$ discounted by $\gamma$. In actor-critic DRL, the actions are made by the actor-network, which is parameterized by $\mu(s \mid \theta^{\mu})$. Assuming the policy is deterministic, we can derive the Q-value function when the actions are generated by the actor-network $\mu(s_{t+1})$:

$$Q^{\mu}(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma \, Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \right].$$

Therefore, following Q-learning [45] and DQN [10], the loss function to minimize can be given as

$$L(\theta^{Q}) = \mathbb{E}_{s_t \sim \rho^{\beta}}\left[ \left( Q(s_t, a_t \mid \theta^{Q}) - y_t \right)^2 \right], \quad y_t = r(s_t, a_t) + \gamma \, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q}),$$

where $\rho^{\beta}$ is the state transition probability given the action distribution $\beta$. Note that $y_t$ is parameterized by the actor-network through $\mu(s_{t+1})$. Further, the gradient of the objective can be derived with the chain rule as

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right].$$

To stabilize the training process, DDPG also requires target networks to compute the temporal differences. Therefore, $y_t$ is obtained from the target critic-network $\theta^{Q'}$ and target actor-network $\theta^{\mu'}$. The final loss function is

$$L(\theta^{Q}) = \mathbb{E}\left[ \left( Q(s_t, a_t \mid \theta^{Q}) - y_t \right)^2 \right], \quad y_t = r(s_t, a_t) + \gamma \, Q'\!\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),$$

where $y_t$ is the Q-value computed by the target critic-network.
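A numeric sketch of the TD target and critic loss, using tiny linear functions as stand-ins for the target actor and critic networks; the weights and states below are toy values, not the paper's architecture.

```python
import numpy as np

gamma = 0.99
W_mu_t = np.array([[0.5, -0.2]])    # toy target-actor weights mu'
w_q_t = np.array([0.3, 0.1, 0.4])   # toy target-critic weights Q'

def mu_target(s):
    """Deterministic target policy mu'(s)."""
    return W_mu_t @ s

def q_target(s, a):
    """Target critic Q'(s, a) as a linear map over [s; a]."""
    return w_q_t @ np.concatenate([s, a])

def td_target(r, s_next):
    """y_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1}))."""
    return r + gamma * q_target(s_next, mu_target(s_next))

def critic_loss(q_values, y):
    """Mean squared TD error, the critic's loss L(theta^Q)."""
    return np.mean((q_values - y) ** 2)

s_next = np.array([1.0, 2.0])
y = td_target(r=0.5, s_next=s_next)   # y ≈ 1.0346
loss = critic_loss(np.array([1.0]), y)
```

In the actual model, gradient descent on this loss updates the critic, while the actor follows the chain-rule gradient through the critic, as in the equations above.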

D. Training Process
The training process is shown in Alg. 1. The algorithm has three blocks: initialization, data generation and collection, and model training.
for episode ← 1 to M do
    Initialize a random process N for action exploration;
    Preprocess the initial state: S ← ψ(<x_1>);
    for time step τ ← 1 to T_max do
        // 2. Generate training data:
        Select an action according to the current policy and the exploration noise;
        Execute action a_t and observe reward r_t and next state S';
        Store experience (S, A, R, S') in D;
        // 3. Learning:
        Sample a random mini-batch of (s_i, a_i, r_i, s_{i+1}) from D;
        Update the critic by minimizing the loss;
        Update the actor policy using the sampled policy gradient;

The first block initializes the variables and the networks with random weights, creates the experience replay buffer, and copies the networks to the target networks. As mentioned before, we have two networks, the actor-network and the critic-network, and each has a target network to stabilize training. The experience replay buffer maintains the training data collected from interaction with the MEC network environment.
The second block of the algorithm collects data by interacting with the environment. As mentioned earlier, the action consists of a Dirichlet-distribution part and an Ornstein-Uhlenbeck part. Every interaction with the MEC generates a training sample, and each sample includes the current observation state, the reward (feedback), the next state, and a termination flag. The collected samples are stored in the experience replay buffer, which is a queue-like container. The buffer has a fixed size, and it discards the oldest data when it receives new data.
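The fixed-size, queue-like replay buffer described above can be sketched with a bounded deque; the field names and capacity are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: when full, the oldest
    transition is discarded as new data arrives."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # maxlen evicts oldest entries

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random mini-batch for the learning step."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):                 # push 5 transitions into a size-3 buffer
    buf.push(i, 0, 0.0, i + 1, False)
# the buffer now holds only the transitions for i = 2, 3, 4
```

Sampling uniformly from this buffer breaks the temporal correlation between consecutive transitions, which stabilizes the critic's regression targets.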
The third block trains the networks in the model. During training, a clipped smoothing noise is added to the target policy's actions. Again, the noise is only added to the frequency-control sub-actions because the remaining sub-actions are sampled from the Dirichlet distribution. As in the standard actor-critic setting, the policy is optimized with respect to the Q-values defined by the critic. Further, the target networks are updated with soft updates and delayed updates. The target network updates are delayed to reduce variance. This method is similar to the target-fixing method introduced in DQN; the only difference is that it updates the network more frequently. The soft updates keep a significant portion of the original weights instead of completely overwriting the networks, so the model does not have to wait a long time between target updates while still avoiding high variance. The factor τ controls the portion of the weights copied to the target networks.
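The soft update above amounts to an exponential moving average of the weights, θ' ← τθ + (1 − τ)θ'; a one-line sketch with toy weight vectors:

```python
import numpy as np

def soft_update(target, source, tau=0.01):
    """Blend a small fraction tau of the online weights into the
    target weights, keeping most of the old target values."""
    return tau * source + (1.0 - tau) * target

theta_target = np.array([1.0, 1.0])   # toy target-network weights
theta = np.array([2.0, 0.0])          # toy online-network weights
theta_target = soft_update(theta_target, theta, tau=0.1)
# -> [1.1, 0.9]
```

A small τ makes the target networks drift slowly, which keeps the TD targets y_t nearly stationary between updates.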

V. SIMULATION RESULTS
In this section, we present the details of the simulation and the analysis of the results. We adopt NumPy [46] for data preprocessing and PyTorch to build the DRL models. Again, the simulation has two parts: the MEC network environment and the DRL model. The MEC network has various network entities, such as edge servers and IoT users. Since we consider heterogeneous MEC networks, the edge servers are configured with different computational resources, and the IoT users frequently generate various tasks to offload. In addition, the MEC network also maintains network properties such as the channel gain and transmission rate matrices. The entities of the MEC networks are simulated as processes. To verify our model, we compared our model (D3PG) with existing methods, including DDPG, DDPG with softmax (DDPG-softmax), Twin Delayed Deep Deterministic Policy Gradient (TD3) [47], and a greedy algorithm. The DRL models are implemented with PyTorch. The key parameter settings are summarized in Table I. As mentioned in previous sections, both the DDPG and TD3 models have two types of neural networks: the actor-network, which takes the actions, and the critic-network, which evaluates the actor-network. For both models, the number of layers and the number of neurons in the hidden layers are the same. Specifically, the actor-networks have five layers, and the numbers of neurons are the size of the state space, 256, 512, 256, and the size of the action space, respectively. Similarly, the critic-networks have five layers, and the numbers of neurons are the size of the state space plus the size of the action space, 256, 512, 256, and 1, respectively. Note that the TD3 model has two critic-networks while DDPG has a single critic-network; therefore, TD3 consumes more computational power than DDPG.
Considering the randomness in the MEC network and the DRL models, we run each experiment five times and average the results. Fig. 4 shows the rewards with respect to the episodes; D3PG converges to near-optimal policies around 1,500 episodes. Each reward has three components: the number of completed tasks, the energy consumption, and the time cost; the weights of the components allow network providers to configure them based on their applications and business purposes. As we can see from Fig. 4, the D3PG model achieves better results than the other models because the Dirichlet distribution captures the partitioning actions and thereby improves the policies. DDPG-softmax outperforms the original DDPG and TD3 because softmax can also capture partitioning actions; however, it is highly likely to converge to local optima because softmax has no exploration mechanism for finding optimal policies. In fact, DDPG-softmax achieves relatively good results because we add noise to the actions, as in TD3, to help softmax explore the partition actions. Both the original DDPG and TD3 perform poorly because we have to force the partition actions to satisfy the action space constraint $\sum_{j=1}^{n} \phi_j = 1$, which can degrade the overall performance. The greedy algorithm does not require a learning process and collects more rewards in the early episodes; it outperforms the standard TD3 and DDPG. The D3PG and DDPG-softmax models can adequately handle the action space to maximize the accumulated long-term rewards, collecting more rewards in later episodes.
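For contrast with the Dirichlet sampling, the DDPG-softmax baseline described above enforces the simplex constraint deterministically; a minimal sketch of that projection (the input logits are illustrative):

```python
import numpy as np

def softmax_projection(z):
    """Project raw actor outputs onto the probability simplex via softmax,
    as the DDPG-softmax baseline does to enforce sum_j phi_j = 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

phi = softmax_projection(np.array([2.0, 1.0, 0.1]))
```

Because the mapping is deterministic, it satisfies the constraint but provides no exploration by itself, which is why action noise is still added to this baseline.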
Note that Fig. 4 shows the TD3 model accumulating negative rewards and performing worse than the other models. However, it learns to maximize the number of tasks completed before expiring, as shown in Fig. 5. Although the DRL model learns to maximize the expected long-term rewards, we can decompose the rewards to verify that the model addresses the joint optimization in an end-to-end manner. Fig. 5 shows the number of tasks completed before expiring. At the beginning, the edge servers can only process a minimal number of tasks per episode because the models take random actions that fail to allocate the resources properly. As the models interact with the MEC environment to learn and improve their policies, they learn to allocate the resources optimally and serve the maximum number of offloaded tasks. Fig. 6 shows the ratio of the number of tasks processed before expiring to the total number of tasks. Note that we set the computational cost and data size of the tasks considerably large for the edge servers. Therefore, some tasks are impossible to complete before their deadlines, and the completion ratio is only for comparison purposes. The completion ratio would be much higher if we reduced the size and computational demand of the tasks. Although the greedy algorithm collects more rewards than DDPG and TD3, it completes fewer tasks than the learning methods. The learning agents need to stabilize the environment and process as many tasks as possible to collect more rewards in an episode. D3PG outperforms the other methods in terms of completed tasks and task completion ratio because the constraints do not cripple the model. Moreover, the Dirichlet distribution can capture the uncertainty of the environment and explore optimal policies. Similarly, the models can reduce the energy consumption while maintaining the number of completed tasks, as shown in Fig. 7.

Fig. 7: Energy Consumption

The models can save energy through frequency control because the energy consumption is proportional to the square of the CPU spinning frequencies. Therefore, the DRL models can find optimal frequencies to process the offloaded tasks and balance completed tasks against energy consumption. TD3 consumes more energy than the rest of the models, which explains why TD3 collects fewer rewards, as shown in Fig. 4. Fig. 7 shows the energy cost per task, and expired tasks also consume energy. We can compute the wasted energy by multiplying the energy consumption by the task completion ratio shown in Fig. 6. Note that the weights of the energy term and the other terms affect how the model optimizes the energy cost, and the network provider can adjust the weights based on their demands. Although D3PG consumes more energy than DDPG-softmax, the two consume almost the same amount of energy per processed task, as shown in Fig. 8, and D3PG saves more energy than the DDPG and TD3 models. Interestingly, the greedy algorithm achieves the best task completion ratio, but it cannot stabilize the edge servers or plan resources for the long term, as shown in Fig. 10. The models can also reduce the time cost while maximizing the number of completed tasks and saving energy, as shown in Fig. 9. Although the reward function already contains the number of tasks completed before expiring, reducing the time cost improves the user experience. As we can see from Fig. 9, the average time cost decreases as the models converge to near-optimal policies. Fig. 9 shows that the D3PG model saves more time than the other models, which improves the quality of service. The DDPG and DDPG-softmax models do not learn to reduce the time cost because the weight of the time consumption is relatively small. Note that both the energy consumption and the average time cost cover the total cost of completed and expired tasks. Fig. 10 shows the stability of the models, measured by the number of time steps the MEC servers persist in each episode. From a reinforcement learning perspective, task partitioning and offloading in the MEC can be considered a continuing task; there is no endpoint of the offloading unless the servers are overloaded or crash. However, for simplicity of training, we formulate task partitioning and offloading as an episodic reinforcement learning task; therefore, an episode ends when one of the edge servers is overloaded. In the simulation, we also add an external trigger to terminate episodes: an episode is forced to terminate when the number of time steps exceeds a threshold, set as 1,000. Again, we run the training five times and average the results. As we can see from Fig. 10, the TD3 and D3PG models can reach nearly 1,000 steps per episode, which indicates that most of the episodes are stopped by the simulator, and they can maintain a stable MEC network. The greedy algorithm only chooses the action that maximizes the current reward at each time step and does not plan resources for the long term; therefore, the edge servers are easily overloaded under the greedy algorithm's control.

VI. CONCLUSIONS
In this work, we have studied task partitioning and computation offloading in a dynamic environment with multiple IoT devices and multiple edge servers. We developed an end-to-end DRL method to partition and offload tasks and to allocate the edge servers' computing power, achieving joint optimization of the expected long-term rewards. The model is optimized to maximize the number of tasks completed before their deadlines while simultaneously minimizing the energy consumption and the time cost. To deal with the constrained hybrid action space, we proposed a novel DRL model, namely D3PG, which integrates the Dirichlet distribution into DDPG for task partitioning decisions and uses the Ornstein-Uhlenbeck process for frequency control. The developed model has been verified with extensive simulations and comparisons with existing methods, and the results show that our model outperforms state-of-the-art DRL models. For future work, we will study the optimization of task partitioning and offloading for sub-tasks with dependency relations and task priorities.

Fig. 9:
Fig. 8: Energy to Tasks
Fig. 10: Stability

TABLE I: Parameter Settings