Task offloading algorithm for vehicle edge computing environments based on Dueling-DQN

With the wide application of the Internet of Vehicles, the rapid development of intelligent vehicles provides drivers and passengers with a good driving and riding experience. However, processing a large volume of data messages in real time on resource-limited vehicle terminals remains a huge challenge and imposes great energy consumption on terminal devices. In this paper, a semi-online task distribution and offloading algorithm based on Dueling-DQN is proposed for time-varying, complex vehicular environments. First, the vehicle offloading system is modelled with reinforcement learning; because the original optimization problem is a joint optimization problem with high complexity, it is divided into a vehicle scheduling sub-problem and a vehicle computation resource optimization sub-problem. The algorithm predicts different vehicle offloading behaviours and calculates the total reward after a series of vehicle offloading actions, so as to update the vehicle offloading decisions. Simulation results show that the algorithm improves the efficiency and energy consumption of computation tasks to a certain extent.


Introduction
With the rapid development of ubiquitous intelligent transportation systems and the rise of vehicular networks [1,2], the European Telecommunications Standards Institute and the 5G Automotive Association have proposed artificial-intelligence vehicle applications that interconnect objects and devices, namely vehicle-to-everything (V2X) [3], along with virtual methods that expand current vehicular networks, which has attracted widespread attention. However, vehicle applications need to process information about their surrounding environment (location, speed, images, etc.), and the amount of data is very large and complex. Existing vehicle cloud computing offers powerful computation capability, storage space and network resources, which facilitates efficient information sharing and related services among vehicles. However, because cloud servers are usually deployed in data centers far away from users, the time to deliver offloaded tasks to the cloud and to return the results increases severely. Coupled with the rapid growth in the number of vehicles and the large number of vehicle devices accessing the cloud computing center, the centralized big-data processing model of cloud computing cannot provide low-latency, highly reliable services. For such "low latency, high bandwidth, high reliability" application scenarios, introducing Mobile Edge Computing (MEC) [4] into vehicular networks is a good way to solve this problem. MEC reduces the bandwidth pressure on the core network: by deploying computation resources, network control functions and cached data near users on urban roads, it greatly shortens the data backhaul link, the processing time on the backbone network, and the energy consumption of mobile devices.
Usually, in a wireless fading environment, the time-varying channel state greatly affects the optimal offloading decision of the vehicle system. Especially in scenarios with many vehicles, the biggest obstacle is how to optimize the computation mode and the resource allocation. In recent years, these two issues have attracted the attention of many researchers. The authors of Reference [5] study intelligent task scheduling and resource allocation in MEC vehicular networks, propose a two-sided matching scheme to solve the task scheduling problem, and use a deep learning method to handle network resource allocation. The authors of Reference [6] regard each vehicle-to-vehicle (V2V) link as an agent and propose a distributed resource allocation method based on deep reinforcement learning that interacts with an unknown environment to learn and improve resource-sharing strategies. The authors of Reference [7] consider multiple concurrent tasks with different priorities, address uneven vehicle load through a genetic-algorithm-based task offloading strategy, and improve the success rate of vehicle safety tasks. The authors of Reference [8] study the task offloading delay in vehicle edge computing systems and propose an adaptive learning task offloading algorithm based on the multi-armed bandit (MAB) model, which encourages vehicles to learn the offloading delay performance of adjacent vehicles while minimizing task offloading delay. The authors of Reference [9] discuss the computation power and high-efficiency performance required by computation-intensive vehicle applications and propose a joint offloading and resource allocation method to maximize system network utility and balance load distribution.
The authors of Reference [10] first establish a stacked auto-encoder traffic flow prediction model, then apply a novel deep learning method that accounts for temporal and spatial distribution correlations, and finally propose a greedy layer-wise unsupervised learning scheme for pretraining deep neural networks, which greatly shortens training time.
During computation offloading, the large amount of data and its increasingly complex content expose shortcomings of traditional machine learning techniques. For example, extracting features from large volumes of data requires many iterations just to reach a local optimum [11]. Especially in a dynamic fast-fading environment, offloading strategies cannot be formulated in real time. Many existing deep learning methods must optimize all system parameters simultaneously; their high complexity and long iterations make them unsuitable for delay-sensitive vehicular networks. Therefore, for the MEC vehicular-network computation offloading system, this paper proposes a computation task distribution and offloading algorithm based on the dueling deep Q network (Dueling-DQN) [12]. The original optimization problem is decomposed into a vehicle task scheduling sub-problem and a vehicle computation resource optimization sub-problem, and a distributed semi-online computation model is proposed that learns from previous offloading experience under different wireless fading conditions. In the continuous state space, predicted vehicle behaviours need only be selected from a few candidate actions at a time, and the offloading action generation policy improves automatically, which reduces energy consumption and the processing time of the resource allocation sub-problem while maximizing the total vehicle computation offloading efficiency.

System Model
The intelligent offloading process of a vehicle is shown in Figure 1.

Figure 1. Intelligent offloading process of a vehicle.

Communication Model
With the increasing number of offloaded tasks, the traditional communication network model inevitably leads to high latency. In response, we adopt cellular V2X technology [13], in which each vehicle is equipped with a pool of radio resources from which it can autonomously select V2V communications. However, limited spectrum resources cannot handle the offloading problem efficiently: unless the vehicle occupying a channel releases that channel resource, other vehicles cannot be allocated the channel. To solve this problem, the vehicle resource pool can overlap with the cellular vehicle-to-infrastructure (V2I) interface to make better use of spectrum resources. OFDM converts the frequency-selective wireless channel into multiple parallel channels on different subcarriers. We assume that channel fading is roughly constant within a sub-band and independent across different sub-bands. Therefore, the signal to interference plus noise ratio (SINR) received by each V2V link during task offloading can be expressed as

$$\gamma_k = \frac{p_k h_k}{\sigma^2 + \sum_{i \neq k} p_i \tilde{h}_i} \qquad (1)$$

where $p_k$ represents the transmission power between the vehicle and the edge device through the wireless BS, $\sigma^2$ is the noise power, $h_k$ is the channel power gain from the k-th vehicle to the BS/RSU, and $\tilde{h}_i$ is the interference power gain between vehicle users. The vehicle offloading capacity is then equivalent to the processing rate of the task, as shown in Equation (2):

$$r_k = B \log_2\left(1 + \gamma_k\right) \qquad (2)$$

where $B$ is the remaining bandwidth of the current system. Assuming $S_{k,i}$ is the amount of data transmitted by the vehicle, the task transmission time can be calculated by

$$t_{k,i}^{tr} = \frac{S_{k,i}}{r_k} \qquad (3)$$
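As a concrete illustration of Equations (1)-(3), the following Python sketch computes the SINR, the achievable rate and the upload time for a single V2V link. All numeric values (powers, gains, bandwidth, task size) are illustrative assumptions, not the paper's simulation settings.

```python
import math

def v2v_sinr(p_k, h_k, interferers, noise_power):
    """SINR of the k-th V2V link: p_k*h_k / (sigma^2 + sum_i p_i * h~_i)."""
    interference = sum(p_i * h_i for p_i, h_i in interferers)
    return (p_k * h_k) / (noise_power + interference)

def offload_rate(bandwidth, sinr):
    """Achievable processing rate r_k = B * log2(1 + SINR), in bit/s."""
    return bandwidth * math.log2(1.0 + sinr)

def transmission_time(data_bits, rate):
    """Time to upload S_{k,i} bits of task data at rate r_k."""
    return data_bits / rate

# Illustrative numbers: 0.1 W transmit power, unit channel gain,
# one interferer, 1 MHz remaining bandwidth, 100 kB task.
sinr = v2v_sinr(0.1, 1.0, [(0.05, 0.2)], noise_power=1e-3)
rate = offload_rate(1e6, sinr)
t_tr = transmission_time(8e5, rate)
```

Higher interference power shrinks the SINR and hence the rate, which directly lengthens the task upload time in Equation (3).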

Local Computation Mode
In the local computation mode $m_0$, the vehicle can simultaneously replenish energy to the vehicle MEC equipment and compute its offloading tasks. We assume that task offloading only considers the computation delay, and that the output data of a task is much smaller than its input data, so the downlink transmission delay and the backhaul transmission delay can both be ignored. We define the computation capability of the vehicle MEC equipment as $F_L$ and let $e_{k,i}$ denote the computation resources required by the i-th task of the k-th vehicle; the local computation time is then

$$t_{k,i}^{L} = \frac{e_{k,i}}{F_L} \qquad (4)$$

Edge Computation Mode
Due to the limitation of time-division multiplexing, the vehicle MEC equipment offloads tasks to the BS/RSU after collecting energy, which is referred to as mode $m_1$ for short. In fact, the computation power and transmission power of the BS/RSU are more than two orders of magnitude higher than those of the vehicle MEC equipment [14], so compared with local computing, the computation results of offloaded tasks can be returned to the vehicle MEC device in a much shorter time. Let $e_{k,i}^{left}$ denote the remaining computation resources of the edge server; the edge server computation time is then

$$t_{k,i}^{E} = \frac{e_{k,i}}{e_{k,i}^{left}} \qquad (5)$$

To maximize offloading efficiency, we also ignore the energy consumed in receiving the computation result from the BS/RSU and only consider the energy consumption of data transmission. In this case, the vehicle MEC equipment is required to exhaust its collected energy for best performance. Therefore, with the binary indicator $x_{k,i} \in \{0,1\}$ selecting between the two modes, the total energy consumption can be expressed as

$$E_{k,i} = (1 - x_{k,i})\, P_L\, t_{k,i}^{L} + x_{k,i}\, P_k\, t_{k,i}^{tr} \qquad (6)$$

where $P_k$ represents the edge-device transmission power and $P_L$ the local computing power of the vehicle MEC equipment.

Problem Formulation
At present, many researchers use traditional intelligent optimization algorithms to solve edge computing problems, but these are unsuitable for the time-varying vehicle environment under fast-fading channels. In a real vehicle edge computing scenario, the diversity of vehicle tasks leads to different computation resource requirements, the vehicle state space grows exponentially with the number of vehicles, and individual edge devices may have insufficient computation resources or excessive load, which affects the success rate of computation tasks.
In this article, in order to solve these complex problems and offload tasks between vehicles and edge devices with higher performance, a semi-online task distribution and offloading algorithm based on Dueling-DQN is proposed, which can, without human intervention, update its offloading decision, optimize system computation resources, and reduce energy consumption and network delay. The optimization goal of this paper is therefore to maximize the computation resources served under the constraint of total server resources while minimizing the energy consumption of task offloading, as shown in Equation (7):

$$\text{s.t.} \quad C1:\ e_{k,i} \le e_{k,i}^{left}, \quad C2:\ r_k \le r_k^{max}, \quad C3:\ B_k \le B \qquad (7)$$

where $C1$, $C2$ and $C3$ respectively denote that the required computation resources, the vehicle offloading capacity, and the vehicle terminal transmission bandwidth cannot exceed their upper limits.

Algorithm Overview
Before solving the above problem, this section briefly describes the semi-online distribution and offloading (SODO) algorithm in the MEC vehicular-network scenario. The main framework of the intelligent offloading system follows a Markov decision process [15,16], in which the state transition probability and the reward depend only on the current environment state, and the action is determined by the vehicle; the system can therefore be represented by a state transition probability matrix. The generation of vehicle offloading actions and the prediction of offloading behaviour are described in more detail below.

Vehicle Offloading Action Generation
In the t-th time frame, each V2V link receives an environment observation $S_t$, and $A_{t,j}$ represents the offloading action of the j-th vehicle under the current offloading decision $\pi_t$. Our offloading model is binary. To generate binary offloading decisions, the offloading actions are first quantized into $M$ candidate actions, and each candidate is zero-filled from the high order into a 0-1 sequence of length $n$; the real-time decision information is then sent to the edge device, which decides whether to offload. The optimal vehicle offloading action in the t-th time frame is obtained by Equation (9):

$$A_t^{*} = \arg\max_{A_{t,j},\; j \in \{1,\dots,M\}} Q\left(S_t, A_{t,j}\right) \qquad (9)$$

Equation (9) can be evaluated in parallel to speed up the process, and the offloading action with the best computation resources and energy consumption is output accordingly.
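The quantization step above can be sketched in a few lines: each candidate index is expanded into an n-bit 0-1 offloading vector, and the best of the M candidates is selected. The scoring function here is a hypothetical placeholder standing in for the Q-value evaluation of Equation (9).

```python
def to_binary_action(index, n):
    """Expand a candidate index into an n-bit 0-1 offloading vector
    (1 = offload the task to the edge, 0 = compute locally)."""
    bits = bin(index)[2:].zfill(n)   # zero-fill from the high order
    return [int(b) for b in bits]

def best_action(candidates, n, score):
    """Evaluate the M candidate actions (a parallel-friendly map) and
    return the one maximizing the given score."""
    actions = [to_binary_action(c, n) for c in candidates]
    return max(actions, key=score)

# Hypothetical score: prefer offloading as many tasks as possible.
a = best_action([0, 3, 5], n=4, score=sum)
```

Because the M candidates are scored independently, the argmax in Equation (9) parallelizes trivially across candidates, which is what lets the edge device produce a decision within one time frame.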

Vehicle Behaviour Prediction
When a series of environmental conditions is taken into consideration, the problem becomes more complicated. Therefore, without requiring any labelled data, an improved algorithm for the MEC vehicular-network scenario is designed by predicting differences in vehicle behaviour [18]. Because vehicle behaviours differ and the number of vehicles keeps increasing, the vehicle offloading space grows rapidly. We usually prefer vehicles with better behaviour and abundant computation resources. For each task offloading, when the computation resources are overloaded, the system receives a punitive reward, which is the negative of the absolute value of the current system reward, as shown in Equation (10):

$$R_t = -\left| R_t^{cur} \right| \qquad (10)$$

Therefore, we use the total reward $R_{Total}$ of the system as the predictor of vehicle offloading behaviour, which better measures the merit of future vehicle behaviour and the choice between local mode and edge mode, and thus helps update the offloading decision.
At the t-th time frame, the replay memory unit is initialized, and a new training sample $(S_t, A_t^{*})$ is added to it. Because its capacity is limited, when it is full the newly generated training data replace the oldest data. We use the experience replay technique to train on multiple V2X samples, uniformly sampling a random batch of training data $(S_t, A_t^{*})$ from the replay memory. In the neural network, the Adam optimizer [19], a variant of stochastic gradient descent, is used to reduce the average cross-entropy loss; its main advantage is that, after bias correction, the learning rate of each iteration stays within a certain range, which keeps the parameters relatively stable.
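A minimal sketch of the replay memory unit described above: a fixed-capacity buffer that silently overwrites its oldest samples when full and returns uniformly random batches. The capacity and sample counts are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity replay unit: when full, the newest sample replaces
    the oldest, and training batches are drawn uniformly at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # deque drops the oldest entry

    def add(self, state, action):
        self.buffer.append((state, action))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=128)
for t in range(200):            # more samples than capacity: oldest are evicted
    memory.add(state=t, action=t % 2)
batch = memory.sample(32)
```

Uniform sampling from the buffer breaks the temporal correlation between consecutive V2X samples, which is the point of experience replay: consecutive frames of the same vehicle are highly correlated and would otherwise bias the gradient steps.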
In this paper, as the reinforcement learning algorithm gradually learns the optimal vehicle offloading behaviour, the performance of the entire system improves, and on the basis of satisfying the minimum delay, $R_{Total}$ is maximized under the current optimal offloading decision. The MEC vehicular-network procedure summarized above is described by the following pseudocode:
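Since the pseudocode itself is not reproduced here, the following self-contained Python sketch illustrates the loop structure only: epsilon-greedy selection among candidate offloading actions, reward accumulation into the total, experience replay, and a crude tabular averaging update standing in for the Dueling-DQN gradient step. The toy reward function, the `q` list and all constants are assumptions for illustration, not the paper's implementation.

```python
import random

def sodo_loop(step_fn, n_actions, frames, memory, batch_size=8, eps=0.1):
    """Skeleton of the semi-online loop: observe, act epsilon-greedily,
    store the sample, and update from a random replay batch."""
    total = 0.0
    q = [0.0] * n_actions                 # toy value table in place of the net
    for _ in range(frames):
        if random.random() < eps:
            a = random.randrange(n_actions)                 # explore
        else:
            a = max(range(n_actions), key=q.__getitem__)    # exploit
        r = step_fn(a)                    # environment reward (R_Total term)
        total += r
        memory.append((a, r))
        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)       # experience replay
            for a_b, r_b in batch:        # crude averaging stands in for the
                q[a_b] += 0.1 * (r_b - q[a_b])  # Adam gradient step
    return total, q

random.seed(0)
memory = []
total, q = sodo_loop(lambda a: 1.0 if a == 1 else -0.5, n_actions=3,
                     frames=200, memory=memory)
```

Even this toy loop shows the qualitative behaviour the paper relies on: the estimated value of the rewarding action rises above the others, so the exploitation branch converges onto the better offloading decision while occasional exploration keeps the alternatives sampled.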

SIMULATION RESULTS
In this section, simulation results are presented to validate the proposed SODO algorithm in the vehicular-network scenario. We built a custom simulator according to the Manhattan city case evaluation method defined in 3GPP TR 36.885 [20]. Vehicles are randomly distributed on the road according to a spatial Poisson process, and each vehicle can communicate with neighbouring vehicles. Without loss of generality, the channel state gain is assumed to remain unchanged within a time frame and to be independent across time frames.
Our DQN is a four-layer fully connected neural network with two hidden layers. We use ReLU as the activation function in the hidden layers, i.e., $f(x) = \max(0, x)$. In the initial stage, the learning rate is set to 0.001 and decreases exponentially, and training is performed with Adam. The detailed parameters can be found in Table 1.

Based on the Q-learning framework, the traditional DQN method uses a value function to evaluate the selection of actions. In general, this method tends to choose the action that maximizes the reward of the next step. However, its action selection strategy is greedy, which inevitably leads to overestimation of the Q value. To alleviate this problem, van Hasselt [21] proposed the double deep Q network (DDQN), which uses different value functions to evaluate the selection of actions, employing one network to select the action and another to approximate its true value. The target Q value can be expressed as

$$y_t = r_t + \gamma\, Q\!\left(S_{t+1},\, \arg\max_{a} Q(S_{t+1}, a; \theta);\, \theta^{-}\right) \qquad (11)$$

With the increasing number of vehicles, the vehicle state space grows exponentially, which inevitably brings complexity problems. Because Dueling-DQN pays more attention to the relationship between core system states and actions, it can reduce computational complexity and mitigate the state-space dimensionality problem. Dueling-DQN divides the Q network into two parts. The first part depends only on the state $S$ and not on the adopted action $A$; it is called the value function and denoted $V(S; \theta, \beta)$. The second part depends on both the state $S$ and the action $A$; it is called the advantage function and denoted $A(S, A; \theta, \alpha)$. The value function can then be re-expressed as

$$Q(S, A; \theta, \alpha, \beta) = V(S; \theta, \beta) + \left( A(S, A; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(S, a'; \theta, \alpha) \right) \qquad (12)$$

where $\theta$ is the shared network parameter, and $\alpha$ and $\beta$ are the parameters of the advantage and value streams, respectively.
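A minimal numpy sketch of the mean-subtracted aggregation in Equation (12). The single linear layers here stand in for the paper's fully connected value and advantage heads, and all shapes and weights are illustrative assumptions.

```python
import numpy as np

def dueling_q(features, w_v, w_a):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')): subtracting the mean
    advantage keeps the value/advantage decomposition identifiable."""
    v = features @ w_v      # value stream -> scalar state value, shape (1,)
    adv = features @ w_a    # advantage stream -> one entry per action
    return v + (adv - adv.mean())

rng = np.random.default_rng(0)
phi = rng.normal(size=4)          # shared hidden-layer features
w_v = rng.normal(size=(4, 1))     # value head weights
w_a = rng.normal(size=(4, 3))     # advantage head: 3 candidate actions
q = dueling_q(phi, w_v, w_a)
```

Because the advantages are centred, the mean of the Q values equals $V(s)$: the value stream alone estimates how good the state is, while the advantage stream only ranks the actions, which is why the architecture scales better as the action space grows.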
To find out the reasons for the performance differences between learning strategies, we compare SODO with DDQN, local computation and a greedy algorithm, and investigate the probability that a vehicle satisfies the delay constraint for different numbers of vehicles.
It can be seen from Figure 3 that as the number of vehicles increases, the number of V2V links also increases, so it becomes difficult to ensure that every vehicle satisfies the delay constraint. The local mode has the fastest decline rate and the worst performance among the above algorithms, because a single vehicle MEC device neither has enough energy nor can it handle a large amount of task data. The traditional DDQN method is not the best algorithm because it cannot capture the relationship between the core system state and actions, which increases the complexity of the problem. The greedy algorithm does not fully consider the dynamic changes of the network state, whereas SODO reduces the complexity and the dimensionality of the state space and overcomes the overestimation problem; therefore, its vehicles satisfy the delay constraint with the highest probability.

Figure 3. Probability of satisfying the vehicle terminal delay constraint versus the number of vehicles.

Figure 4 shows a comparison of the maximum computation rate of vehicle tasks under different offloading strategies. The result of the exhaustive method can be regarded as the upper bound, and SODO reaches 98% of this upper-bound performance. SODO exceeds the greedy algorithm by more than 20%, because the greedy algorithm, in contrast to reinforcement learning, always chooses the action with the largest immediate reward, which intensifies resource competition. SODO also exceeds the random algorithm by close to 50%; the random algorithm clearly performs worst among the four. Although the exhaustive method reaches the upper bound, it is impractical due to its high complexity. In short, SODO performs similarly to the exhaustive method and better than the greedy and random algorithms.
Figure 4. Performance comparison of the maximum processing rate versus the number of vehicles.

Finally, we study the impact of the different offloading algorithms on the energy consumption of vehicle tasks as the number of vehicles increases. As shown in Figure 5, SODO maintains the lowest energy consumption throughout the task offloading process, and the random algorithm the highest. When the number of vehicles is 20, the energy consumption of SODO is only 33.6%, 36.8% and 22.7% of that of DQN, DDQN and the random algorithm, respectively. This is because the distributed offloading model adopted in this paper improves task-processing efficiency. DQN cannot overcome the interference of overestimation, and although DDQN overcomes the overestimation problem, it still suffers from high spatial complexity. In short, SODO maximizes performance while greatly reducing energy consumption.

Conclusion
In this paper, SODO is used to solve the computation resource allocation problem arising from the diversity of vehicle offloading behaviours. The algorithm performs distributed task offloading: it does not need global information for each offloading decision, reduces transmission overhead, converges after a limited number of iterations, and obtains a better offloading strategy. The simulation results show that SODO maintains good performance while reducing energy consumption and improving the efficiency of computation offloading. In the future, we will consider more complex and dynamic environments and the characteristics of vehicle security services in accordance with actual vehicular-network scenarios.