Deep Reinforcement Learning-Based Resource Allocation for Cellular Vehicular Network Mode 3 with Underlay Approach

Vehicle-to-vehicle (V2V) communication has attracted increasing attention since it can improve road safety and traffic efficiency. In the underlay approach of mode 3, the V2V links must reuse spectrum resources already occupied by vehicle-to-infrastructure (V2I) links, which causes interference to the V2I links. Therefore, wireless resources must be allocated flexibly to improve the throughput of the V2I links while meeting the low-latency requirements of the V2V links. This paper proposes a V2V resource allocation framework based on deep reinforcement learning, in which the base station (BS) uses a double deep Q network to allocate resources intelligently. In particular, to reduce the signaling overhead for the BS to acquire channel state information (CSI) in mode 3, the BS optimizes the resource allocation strategy based on partial CSI in the proposed framework. The simulation results indicate that, compared with other methods, the proposed scheme can meet the low-latency requirements of the V2V links while increasing the capacity of the V2I links. In addition, the proposed partial CSI design has comparable performance to complete CSI.


Introduction
Vehicular communication is a key enabling technology for automated driving and intelligent transportation systems [1]. Various candidate technical solutions have been proposed, such as cellular vehicular communication and dedicated short-range communication (DSRC) based on IEEE 802.11p [2,3]. DSRC suffers from short coverage distance and high infrastructure cost [4,5]. Despite the deployment of DSRC-based vehicular communication prototypes in the United States, inherent issues with DSRC and recent developments in cellular technology have encouraged academic and industrial research on cellular-based vehicular communications [6]. The 5G Automotive Association (5GAA), a global and cross-industry organization created in September 2016, has taken the clear position of supporting cellular vehicular communication as the feasible solution for future mobility and transportation services [7]. Regions such as the United States, Europe, and Japan have carried out pilot deployments of cellular vehicular communication in actual operation [8]. Through cellular vehicular communication, road safety and traffic efficiency can be improved [9]; it has therefore attracted growing attention from industry and academia. 3GPP Release 14 introduced modes 3 and 4 for vehicular communication [10]. These modes enable vehicles to communicate directly over licensed cellular frequency bands, bypassing the BS. Modes 3 and 4 both support direct vehicular communications, but their radio resource allocation methods differ: in mode 3, the BS allocates resources to vehicle users, while in mode 4, vehicles autonomously select radio resources [10,11]. Mode 3 can further be divided into two types, the overlay method and the underlay method. In the overlay approach, V2I users and V2V users transmit on orthogonal spectrum resources, while in the underlay approach, V2V users reuse the spectrum occupied by V2I users. Since the uplink resources are under-utilized compared with the downlink ones, we only consider uplink cellular resource allocation.

System Model
As illustrated in Figure 1, an uplink scenario in a single-cell system is considered, where the BS lies in the center of the cell. The road configuration for vehicular communication is defined as an urban setting [30]. The cell consists of multiple V2I links, denoted as M = {1, ..., M}, and multiple V2V pairs, denoted as N = {1, ..., N}. To improve the spectral efficiency and ensure the data rate requirements of the V2I links, we assume that each V2I user is assigned one orthogonal subcarrier resource and that each V2I user allows at most one V2V pair to share its resource.
We define the channel power gains of the desired transmissions for the mth V2I link and the nth V2V pair on the mth subcarrier as $g_m^m$ and $g_n^m$, respectively. Similarly, let $g_{m,n}^m$ denote the interfering channel power gain from the mth V2I transmitter to V2V pair n's receiver. The interference channel power gain between the nth V2V link and the BS is denoted as $g_{n,b}^m$, and the interference channel gain from V2V pair l's transmitter to V2V pair n's receiver is denoted as $g_{l,n}^m$. In the simulations in this article, all the channel gains mentioned above include path loss, shadow fading, and small-scale fading.
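For concreteness, the following minimal Python sketch (not taken from the paper's simulator) shows one common way such a composite channel power gain can be generated; the path-loss exponent and shadowing standard deviation below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def channel_power_gain(distance_m, pl_exponent=3.68, shadow_std_db=8.0,
                       rng=np.random):
    """Composite gain = path loss x log-normal shadowing x Rayleigh fading."""
    path_loss = distance_m ** (-pl_exponent)                 # distance-based path loss
    shadowing = 10 ** (rng.normal(0.0, shadow_std_db) / 10)  # log-normal shadow fading
    h = rng.normal() + 1j * rng.normal()                     # complex Gaussian tap
    small_scale = (abs(h) ** 2) / 2                          # Rayleigh |h|^2, unit mean
    return path_loss * shadowing * small_scale
```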
Then, we can express the signal-to-interference-plus-noise ratio (SINR) of the mth V2I link and the nth V2V pair on subcarrier m, respectively, as

$$\gamma_m^m = \frac{p_m g_m^m}{\sigma^2 + \sum_{n=1}^{N} \rho_n^m p_n g_{n,b}^m}, \qquad (1)$$

$$\gamma_n^m = \frac{p_n g_n^m}{\sigma^2 + p_m g_{m,n}^m + \sum_{l \neq n} \rho_l^m p_l g_{l,n}^m}, \qquad (2)$$

where $p_m$, $p_n$, and $p_l$ denote the transmit powers of the mth V2I user, the nth V2V pair, and the lth V2V pair, respectively. $\rho_n^m \in \{0, 1\}$ is the subcarrier access indicator: $\rho_n^m = 1$ if the nth V2V pair is allowed to use subcarrier m, and $\rho_n^m = 0$ otherwise. $\sigma^2$ denotes the noise power. Furthermore, to reduce interference between V2V links, we assume each V2V link can access only one V2I link resource. Accordingly, with subcarrier bandwidth W, the data rates of the mth V2I link and the nth V2V pair on subcarrier m are respectively given by

$$R_m^m = W \log_2(1 + \gamma_m^m), \qquad (3)$$

$$R_n^m = W \log_2(1 + \gamma_n^m). \qquad (4)$$

We model the latency requirement of a V2V link as the successful delivery of a packet of size L within a limited time budget $T_{\max}$. Therefore, the latency requirement of the nth V2V link is satisfied if the following constraint holds:

$$\Delta T \sum_{t=1}^{T_{\max}/\Delta T} R_n^m[t] \geq L, \qquad (5)$$

where $\Delta T$ is the channel coherence time, and the index t in $R_n^m[t]$ indicates the capacity of the nth V2V link in different coherence time slots.
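To make Equations (1)-(4) concrete, the sketch below evaluates the two SINRs and the corresponding rates in Python; it assumes unit subcarrier bandwidth (W = 1 Hz) and represents the per-pair quantities as lists, mirroring the notation above.

```python
import numpy as np

def v2i_sinr(p_m, g_mm, rho, p_v2v, g_nb, noise_power):
    # Equation (1): noise plus interference from every V2V pair on subcarrier m.
    interference = sum(rho[l] * p_v2v[l] * g_nb[l] for l in range(len(rho)))
    return p_m * g_mm / (noise_power + interference)

def v2v_sinr(n, p_n, g_n, p_m, g_mn, rho, p_v2v, g_ln, noise_power):
    # Equation (2): interference at V2V receiver n comes from the V2I
    # transmitter on subcarrier m plus all other V2V pairs sharing it.
    interference = p_m * g_mn + sum(
        rho[l] * p_v2v[l] * g_ln[l] for l in range(len(rho)) if l != n)
    return p_n * g_n / (noise_power + interference)

def rate_bps(sinr, bandwidth_hz=1.0):
    # Equations (3)-(4): Shannon rate on the shared subcarrier.
    return bandwidth_hz * np.log2(1.0 + sinr)
```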

DRL for Resource Management
In this section, we describe how the DRL method is used within the proposed framework to solve the spectrum allocation problem, including the proposed partial CSI observation design and the double DQN learning algorithm.
For mode 3 resource allocation, our objective is to find the optimal resource allocation policy that meets the latency demands of the V2V links without excessively degrading the V2I links' data rates, under low signaling overhead. Moreover, the vehicular communication network needs an intelligent resource management framework; therefore, we adopt a DRL architecture to make resource-sharing decisions. In RL, an agent interacts with its environment: at each time step, the agent takes an action based on the current environment state. The action affects the environment, causing its state to change; the agent then receives a numerical reward and observes the new environment state. By continuously exploring the environment, the agent learns how the environment generates rewards and updates its policy accordingly. In RL, the agent's objective is to maximize the expected sum of discounted rewards, defined as

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad (6)$$

where $\gamma \in [0, 1]$ is called the discount factor.
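As a quick illustration of Equation (6), the discounted return can be accumulated backwards over a finite reward sequence; the reward values and discount factor below are arbitrary.

```python
# Minimal sketch of Equation (6) for a finite episode: gamma in [0, 1]
# trades off immediate against future rewards.
def discounted_return(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):   # G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 2.0, 3.0]))  # 1 + 0.95*2 + 0.95^2*3 = 5.6075
```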

We adopt a double DQN with experience replay for subcarrier assignment, as shown in Figure 2. The BS is regarded as the learning agent, and the vehicular network acts as the environment. The three key elements of the proposed model, i.e., state, action, and reward, are defined as follows.

As shown in Figure 2, each V2V pair observes local information, which includes the interference channel power gains and latency information. Since obtaining complete CSI in a vehicular network is difficult, only partial CSI is considered in our observation space design. This helps to reduce the signaling overhead caused by channel estimation and CSI feedback. To account for the low-latency communication requirements, key elements related to latency must be included. In Figure 2, $I_{n,t-1}$, $L_{n,t}$, and $T_{n,t}$ represent the interference power received on each subcarrier by the nth V2V link in the previous time slot, the remaining payload size for transmission, and the time left before violating the latency constraint, respectively. Among them, $I_{n,t}^m$ can be written as

$$I_{n,t}^m = p_m g_{m,n}^m[t] + \sum_{l \neq n} \rho_l^m p_l g_{l,n}^m[t]. \qquad (7)$$

In summary, the partial CSI observation space of the nth V2V pair can be expressed as

$$o_{n,t} = \{ I_{n,t-1}, L_{n,t}, T_{n,t} \}, \qquad (8)$$

where $I_{n,t} = (I_{n,t}^1, I_{n,t}^2, \ldots, I_{n,t}^M)$. In contrast, the complete CSI observation space is defined as

$$o_{n,t}^{c} = \{ G_{n,t}, L_{n,t}, T_{n,t} \}, \qquad (9)$$

where $G_{n,t} = (g_{n,t}, g_{m,n,t}, g_{l,n,t}, g_{n,b,t})$. Feeding back complete CSI to the BS would generate a large signaling overhead, which is unrealistic in the dynamic vehicular environment, especially for links not terminated at the BS, such as the V2V communication links and the interference links from other vehicles to the V2V links. To obtain $g_{n,t}$, $g_{m,n,t}$, and $g_{l,n,t}$, channel estimation at the receiver side is necessary, after which the estimates must be fed back to the transmitter, which in turn feeds them back to the BS. Therefore, we avoid relying on complete CSI. Note that in mode 4, users select spectrum resources based on sensing-based channel measurements, i.e., the level of interference that would be experienced if the sensing user transmitted on the corresponding spectrum resource [31].
Therefore, we are motivated to design a partial CSI resource allocation mechanism using interference power measurements. Compared with the complete CSI observation in (9), the partial CSI observation in (8) helps reduce the signaling overhead.
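The following sketch contrasts the two observation designs of Equations (8) and (9). Treating each gain set in $G_{n,t}$ as an M-dimensional vector is our assumption for illustration, and the dimensions are toy values; the point is that the partial design feeds back M + 2 scalars per V2V pair instead of 4M + 2.

```python
import numpy as np

def partial_obs(I_prev, L_remaining, T_remaining):
    # o_{n,t} = { I_{n,t-1}, L_{n,t}, T_{n,t} }, Equation (8)
    return np.concatenate([I_prev, [L_remaining, T_remaining]])

def complete_obs(g_n, g_mn, g_ln, g_nb, L_remaining, T_remaining):
    # Complete CSI stacks all four gain vectors of G_{n,t}, Equation (9)
    return np.concatenate([g_n, g_mn, g_ln, g_nb, [L_remaining, T_remaining]])

M = 4  # assumed number of V2I subcarriers for this toy example
print(partial_obs(np.zeros(M), 1060.0, 0.1).shape)                        # (6,)
print(complete_obs(*(np.zeros(M) for _ in range(4)), 1060.0, 0.1).shape)  # (18,)
```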
(1) State Space: As stated above, to reduce the signaling overhead of a centralized solution, each V2V link feeds back interference and delay information to the BS. Hence, the BS treats the partial CSI feedback from all V2V pairs as the current state. Thus, the state space of the BS is described as

$$s_t = \{ o_{1,t}, o_{2,t}, \ldots, o_{N,t} \}. \qquad (10)$$

(2) Action Space: In this paper, we focus on the subcarrier assignment problem in the V2V communication network. Hence, the action is defined as

$$a_t = \{ b_{1,t}, b_{2,t}, \ldots, b_{N,t} \}, \qquad (11)$$

where $b_{n,t} = m$, $\forall m \in M$, denotes that the mth subcarrier has been selected by the nth V2V pair.
(3) Reward: In the RL framework, the training process is driven by the reward function. The agent searches for a decision-making policy that maximizes the reward through interaction with the environment. Hence, the reward function must be formulated to match the requirements of the communicating devices. In vehicular communications, V2V links exchange critical safety information and have strict latency requirements, whereas V2I links support high-rate data transmission [32]. Our objective is to improve the sum throughput of the V2I links while meeting the latency requirements of the V2V links. Therefore, we propose the following reward function:

$$r_{t+1} = \lambda_c \sum_{m=1}^{M} R_m^m[t] + \lambda_v \sum_{n=1}^{N} G_n[t], \qquad (12)$$

where $\lambda_c$ and $\lambda_v$ are the weight factors controlling the contributions of these two parts to the reward. For the first objective, we simply include the sum capacity of all V2I links. To meet the low-latency requirement of each V2V link, we set the reward $G_n$ equal to the effective V2V transmission rate until the V2V link delivers its payload within the delay constraint, after which the reward is set to a constant c that is greater than the largest possible V2V transmission rate. Hence, the sooner a V2V link completes its transmission, the more reward is accumulated. The V2V-related reward at each time step t is therefore

$$G_n[t] = \begin{cases} R_n^m[t], & \text{if } L_{n,t} > 0, \\ c, & \text{otherwise.} \end{cases} \qquad (13)$$

Q-learning [33] is a classical RL algorithm that can learn the optimal strategy when the state-action space is small. In Q-learning, the action-value function gives the expected accumulated reward when starting from state s, taking action a, and following policy π thereafter:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[ G_t \mid s_t = s, a_t = a \big]. \qquad (14)$$

Similarly, the optimal action-value function satisfies

$$Q^{*}(s, a) = \mathbb{E}\big[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \big], \qquad (15)$$

where $s_{t+1}$ is the new state after taking action a. The action-value function is updated by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big], \qquad (16)$$

where $\alpha$ is the step-size parameter. Moreover, the choice of action $a_t$ in state $s_t$ follows an exploitation-exploration policy; a widely used choice is the $\epsilon$-greedy algorithm [34], defined as

$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a), & \text{with probability } 1 - \epsilon, \\ \text{random action}, & \text{with probability } \epsilon. \end{cases} \qquad (17)$$

Here, $\epsilon$ is the exploration rate.
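The tabular form of the update (16) with ε-greedy selection (17) can be sketched in a few lines of Python; the hyperparameter values are illustrative, and states are assumed to be hashable.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], initialized to zero

def select_action(state, actions, epsilon=0.05):
    if random.random() < epsilon:                      # explore, Equation (17)
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])       # Equation (16)
```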
However, in larger networks, complex states and large action sets force Q-learning to maintain a huge Q-table and slow its convergence, which limits its applicability. Thus, we apply Q-learning with a deep neural network (DNN) parameterized by θ as the action-value function approximator to learn the optimal policy; the resulting method is called DQN [35]. To accelerate convergence, we adopt a double DQN [36] with three key techniques.

(1) Replay Memory: At each time step, the BS observes the state $s_t$, determines the resource allocation $a_t$, and broadcasts it to all V2V users. After the V2V links transmit accordingly, the BS obtains the corresponding reward $r_{t+1}$, and the environment reaches the next state $s_{t+1}$. In this way, an experience sample $(s_t, a_t, r_{t+1}, s_{t+1})$ is formed and stored in the replay memory, which accumulates the agent's experiences over many episodes. Mini-batches are then uniformly sampled from the memory for neural network training.
(2) Fixed Q-Target: As shown in Figure 2, our proposed scheme consists of two DQNs, a target DQN and a training DQN, which share the same structure. The weights $\theta_{\text{target}}$ of the target network are updated with the training network weights $\theta_{\text{train}}$ at regular intervals. This improves network stability and convergence.
(3) Action Selection-Evaluation Decoupling: In double DQN, the training DQN is used to select the optimal action for the next state; the selected action and the next state are then input into the target DQN to generate the target value used to compute the training loss. By moving action selection from the target DQN to the training DQN, the risk of over-estimating the Q-values is reduced.
After gathering sufficient experience samples, a mini-batch of D experience samples is retrieved from the buffer to minimize the sum-squared error

$$\sum_{t \in D} \big( y_t - Q(s_t, a_t \mid \theta_{\text{train}}) \big)^2, \qquad (18)$$

where $y_t$ is the target Q-value, given by

$$y_t = r_{t+1} + \gamma\, Q\big( s_{t+1}, \arg\max_{a} Q(s_{t+1}, a \mid \theta_{\text{train}}) \mid \theta_{\text{target}} \big). \qquad (19)$$

Then, the updating process for the BS double DQN can be written as [36]

$$\theta_{\text{train}} \leftarrow \theta_{\text{train}} + \beta \sum_{t \in D} \big( y_t - Q(s_t, a_t \mid \theta_{\text{train}}) \big) \nabla Q(s_t, a_t \mid \theta_{\text{train}}), \qquad (20)$$

where $\beta$ is a nonnegative step size for each adjustment.
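The update of Equations (18)-(20) can be written compactly in PyTorch. The following is a minimal sketch under assumed network and batch interfaces, not the paper's exact implementation; the discount factor is set to 0.05 as stated in the Simulation Settings below.

```python
import torch
import torch.nn.functional as F

def double_dqn_update(train_net, target_net, optimizer, batch, gamma=0.05):
    # batch: states (float), actions (int64), rewards (float), next states
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = train_net(s_next).argmax(dim=1, keepdim=True)    # selection
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)  # evaluation
        y = r + gamma * q_next                                    # Equation (19)
    q = train_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)                                       # Equation (18)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # Equation (20)
    return loss.item()
```

In a full training loop, the fixed Q-target is maintained by periodically copying the weights, e.g., `target_net.load_state_dict(train_net.state_dict())` every fixed number of steps.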

Algorithm 1 summarizes the training process of the proposed scheme.

Algorithm 1 Training Process for the Proposed Scheme
...
6: Reset $L_{n,t} = L$ and $T_{n,t} = T_{\max}$, for all n ∈ N
7: for each iteration step t = 1, 2, ... do
8:   Each V2V pair observes $o_{n,t}$ and sends it to the BS
9:   Based on the current state $s_t = \{o_{1,t}, \ldots, o_{N,t}\}$, the BS selects the action according to the $\epsilon$-greedy policy, then gets a reward $r_{t+1}$ and transforms to the new state $s_{t+1}$
10:  Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ into the experience replay buffer
11:  Sample a mini-batch of D transition samples from the experience replay buffer
12:  Calculate the target value according to Equation (19)
...
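A Python skeleton of this training loop is given below; `env`, `agent`, and `buffer` and their methods are hypothetical interfaces introduced only to show the control flow of the steps above, not names from the paper.

```python
import random

def train_episode(env, agent, buffer, epsilon=0.05, batch_size=32):
    s = env.reset()                      # reset L_{n,t} = L and T_{n,t} = T_max
    done = False
    while not done:
        if random.random() < epsilon:    # epsilon-greedy, Equation (17)
            a = env.sample_random_action()
        else:
            a = int(agent.q_values(s).argmax())
        s_next, r, done = env.step(a)    # BS broadcasts allocation, gets reward
        buffer.add((s, a, r, s_next))    # store the transition
        if len(buffer) >= batch_size:
            agent.update(buffer.sample(batch_size))  # Equations (18)-(20)
        s = s_next
```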

Simulation Results and Analysis
In this section, we provide the simulation results to demonstrate the performance of the proposed resource allocation method.

Simulation Settings
We considered a single-cell scenario with the BS located at the center of the region. The simulation setup was based on the urban case in 3GPP TR 36.885 [30], and we followed the main simulation setup in [22]. The main simulation parameters are summarized in Table 1, and the channel models of the V2V and V2I links are given in Table 2. The double DQN at the BS consists of three fully connected hidden layers containing 1200, 800, and 600 neurons, respectively. The hidden layers use the ReLU activation function, f(x) = max(0, x). The RMSProp optimizer [37] is used to update the network parameters. The discount factor of the double DQN algorithm is set to 0.05, and the learning rate is set to 0.001. Moreover, $\theta_{\text{target}}$ is updated with $\theta_{\text{train}}$ every 500 steps.
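Under these settings, the network could be assembled as in the following PyTorch sketch; `state_dim` and `num_actions` depend on the scenario and are placeholders here (for the joint assignment of Equation (11), one natural choice is M**N output actions).

```python
import torch.nn as nn
import torch.optim as optim

def build_dqn(state_dim, num_actions):
    # Three fully connected hidden layers of 1200, 800, and 600 neurons
    # with ReLU activations, as described in the simulation settings.
    net = nn.Sequential(
        nn.Linear(state_dim, 1200), nn.ReLU(),
        nn.Linear(1200, 800), nn.ReLU(),
        nn.Linear(800, 600), nn.ReLU(),
        nn.Linear(600, num_actions),
    )
    optimizer = optim.RMSprop(net.parameters(), lr=0.001)  # RMSProp, lr = 0.001
    return net, optimizer

train_net, optimizer = build_dqn(state_dim=6, num_actions=4)
target_net, _ = build_dqn(state_dim=6, num_actions=4)
target_net.load_state_dict(train_net.state_dict())  # synchronize theta_target
```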
We trained the whole neural network for 4000 episodes and fixed the payload size of the V2V links during the training phase to 1060 bytes; the payload size is varied during the test phase.

Performance Comparisons under Different Parameters
To verify the effectiveness of the algorithm in this paper, we compared the proposed algorithm with the following three algorithms.
(1) meta-DRL [22]: this scheme uses a DQN to solve the spectrum resource allocation problem and applies the deep deterministic policy gradient (DDPG) algorithm to solve the continuous power allocation problem.
(2) Brute-Force method: the action (including channel and power) is searched exhaustively to maximize the rate of V2V links.
(3) Random method: the channel and power are randomly selected.

Since we used interference power measurements to design a partial CSI resource allocation mechanism, we also compared it with the complete CSI design to assess the performance of partial CSI. Figure 3 shows how the V2V links' successful transmission probability and the V2I links' throughput change as the payload size varies. Figure 3a shows that as the payload size increases, the probability of successful V2V transmission decreases for all schemes, including brute-force. This is because more information must be transmitted within the same delay constraint, so performance degrades. However, the proposed method is very close to brute-force and far superior to the random scheme. Since [22] considers power control of the V2V links, its V2V successful transmission probability is slightly higher than that of the proposed scheme. Furthermore, as can be seen from Figure 3a,b, the proposed partial CSI design achieves performance comparable to complete CSI in meeting the V2V links' delay requirements.

Figure 3b shows that as the payload increases, the total throughput of the V2I links gradually decreases. This is because when the V2V payload increases, the BS, in order to obtain higher rewards, selects actions that increase the V2V rate to meet the delay constraint, which increases the interference to the V2I links and reduces their rate. Nevertheless, the V2I throughput obtained by the proposed scheme is still higher than that of the random scheme and comparable to the complete CSI scheme. Furthermore, the proposed scheme outperforms [22] in terms of V2I throughput. In summary, the proposed scheme is close to brute-force in terms of both the V2V links' successful transmission probability and the V2I links' throughput, outperforms the random scheme, and performs comparably to complete CSI.

Figure 4 shows the change in the probability of successful V2V transmission as the vehicle speed increases.
It can be seen from the figure that the probability of successful V2V transmission under the proposed scheme decreases with speed. Compared with the low-speed case, the environment state changes more significantly when the vehicles move at high speed, leading to higher observation uncertainty and reduced learning efficiency. Therefore, as the vehicle speed increases, the V2V links' successful transmission probability declines. Nevertheless, the proposed scheme still maintains a high probability of successful transmission, which shows that the proposed algorithm remains stable in a highly dynamic environment.

To explain more clearly why the proposed solution outperforms the random one, we randomly selected an episode in the test phase and plotted the remaining payload of the V2V links over time, with delay constraint T = 100 ms and payload size 3 × 1060 bytes. Since, in the selected episode, all V2V links under both the proposed scheme and the random scheme completed their transmission tasks within 50 ms, we only show the data within 0-50 ms. Comparing Figure 5a,b, although all V2V links of both schemes completed their data transmission within the delay constraint, the transmission time required by the proposed scheme is much shorter than that of the random scheme. As shown in Figure 5a, all vehicles under the proposed scheme finish transmitting within 15 ms, while the random scheme in Figure 5b completes within 42 ms. This shows that the proposed solution is better suited to delay-sensitive services and to meeting the delay requirements of the V2V links.
Figure 6 shows the impact of the V2I transmit power on the V2I throughput and the probability of successful V2V transmission. As the V2I power increases, the throughput of the V2I links increases while, at the same time, the interference to the V2V links grows, so the probability of successful V2V transmission decreases. Therefore, the V2I transmit power must be set judiciously to satisfy both the throughput requirements of the V2I links and the delay requirements of the V2V links.

Conclusions
In this article, we developed a DRL-based resource-sharing scheme for the underlay approach of mode 3, in which the V2V links reuse the spectrum of the V2I links. Our goal was to improve the throughput of the V2I links while ensuring the V2V links' delay constraint. In particular, to reduce the signaling overhead incurred when the BS acquires complete CSI in mode 3, an intelligent resource allocation strategy based on partial CSI was proposed: the BS allocates spectrum based only on the fed-back partial CSI, which significantly reduces the signaling overhead. The simulation results show that, compared with other methods, the proposed scheme meets the V2V links' delay constraint while achieving higher V2I throughput, and the proposed partial CSI scheme performs comparably to the complete CSI scheme.