An Optimized Algorithm for Broadcast Beacon Frequency and Transmission Power in VANETs

Power control has long been an important branch of VANET research. To ensure driving safety, vehicle-to-vehicle (V2V) communication has simultaneously become a research hotspot. In the Internet of Vehicles, most messages generated by V2V communication are delivered by broadcasting. This paper proposes a joint optimization algorithm for power control and broadcast beacon frequency based on the LTE-V2V communication protocol. The algorithm applies deep reinforcement learning (DRL) and designs a reward function that represents the communication performance of LTE-V2V. Experimental results show that the proposed algorithm can better configure the transmit power and broadcast beacon frequency of the vehicle terminal and obtain more reliable communication in a dynamic traffic flow environment.


Introduction
Mobile communication of vehicles is a vital part of the Intelligent Transportation System (ITS). LTE-V2V (Long-Term Evolution vehicle-to-vehicle) communication, a kind of in-vehicle communication technology, is achieved through device-to-device (D2D) communication between vehicles that are geographically close to each other [1]. On the basis of LTE-V2V, the in-vehicle communication terminal can multiplex the frequency-domain and time-domain communication resources of cellular users for its broadcast beacons, thereby simultaneously improving the V2V data transmission rate and the resource utilization efficiency of cellular communication.
However, in the LTE-V2V vehicle cooperative sensing scenario, the delay, reliability, and data transmission rate of V2V wireless communication are mainly affected by two factors. On the one hand, the transmit power of the vehicle communication terminal affects cooperative sensing performance. Literature [2] takes the delay and reliability of the vehicle link as constraints, considers slow time-varying channel state information, and uses a power control and resource allocation algorithm to maximize the total rate of cellular users while accounting for fairness between users. On the other hand, vehicle cooperative sensing performance depends on the broadcast beacon frequency of in-vehicle communication terminals. Literature [3] proposes a broadcast relay algorithm that selects the next-hop relay node by adaptively estimating the link quality and the distance between the broadcast source and potential forwarders; its key idea is to select a certain number of relay nodes to improve packet reliability. Literature [4] proposes an algorithm that forms clusters based on the average speed difference between a vehicle and its neighbouring vehicles and relays messages through the cluster head. Existing LTE-V2V vehicle wireless communication technology mainly designs the broadcast beacon frequency and transmit power of the vehicle communication terminal independently, with fixed configurations, subject to the constraint that the QoS requirements of vehicle cooperative sensing applications are satisfied in a specific traffic flow environment [5,6].
However, this independent configuration method does not consider the potential coupling effect of the broadcast beacon frequency and transmit power on the communication performance of the LTE-V2V in-vehicle communication terminal.
To address these two factors affecting LTE-V2V communication performance, this paper proposes a joint optimization algorithm for LTE-V2V power control and broadcast beacon frequency based on DDPG, constructing a reward function suited to LTE-V2V. The communication environment is trained with the deep reinforcement learning algorithm to obtain the optimal control strategy for the transmit power and broadcast beacon frequency of the vehicle communication terminal, and the effectiveness is verified by experiments.

System model
Let N_subframe denote the number of wireless signal sub-frames corresponding to each broadcast beacon of the LTE-V2V vehicle communication terminal, and let η denote the bandwidth resource occupancy rate of the broadcast beacon (0 < η ≤ 1). Using equations (1) and (2), under the condition that the communication link between LTE-V2V in-vehicle wireless communication terminals is maintained reliably without interruption, we construct the functional relationship between the average multiplexing distance supported by LTE-V2V resources and the broadcast beacon frequency. Simultaneously, this paper considers the relative geographic distance error between LTE-V2V in-vehicle wireless communication terminals as acquired by the LTE base station, and the average duration of the wireless signal per sub-frame in the LTE-V2V wireless communication channel is defined as T_subframe. Consequently, the maximum number of LTE-V2V vehicle terminals that can simultaneously multiplex the resources supported by LTE-V2V, N_max, is expressed as a function of the broadcast beacon frequency of the LTE-V2V vehicle wireless communication terminal, f_LTE-V2V(b), as shown below. The density of LTE-V2V vehicles in the traffic flow environment is denoted ρ_Tx.
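As a minimal numerical illustration of this relationship, the sketch below assumes one simple form of the capacity equation (not the paper's exact equations (1) and (2)): the channel offers 1/T_subframe sub-frames per second, each terminal consumes f · N_subframe of them per second, and only a fraction η of the bandwidth is available, so the supportable terminal count falls as the beacon frequency rises.

```python
import math

def max_multiplexing_terminals(beacon_freq_hz: float,
                               subframes_per_beacon: int,
                               subframe_duration_s: float,
                               bandwidth_occupancy: float) -> int:
    """Illustrative capacity estimate (assumed relation, not the paper's
    exact equation): each terminal's share of the channel per second is
    beacon frequency * sub-frames per beacon * sub-frame duration, and
    the allowed bandwidth occupancy bounds the total share."""
    per_terminal_load = beacon_freq_hz * subframes_per_beacon * subframe_duration_s
    return math.floor(bandwidth_occupancy / per_terminal_load)

# Example: 10 Hz beacons, 2 sub-frames per beacon, 1 ms sub-frames, 80% occupancy
print(max_multiplexing_terminals(10.0, 2, 1e-3, 0.8))  # -> 40
```

Doubling the beacon frequency halves the number of terminals that can share the resource, which is the coupling the joint optimization exploits.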

A Joint optimized algorithm for power control and broadcast beacon frequency based on DDPG in VANETs
Reinforcement Learning (RL) is a kind of learning that maps environmental states to actions, and its learning aim is to maximize the cumulative reward the agent collects in its interactions with the environment [7]. RL can usually be modeled as a Markov Decision Process (MDP), generally defined as a five-tuple (S, A, P, R, γ), where S is the set of all environmental states, A is the set of executable actions of the agent, P is the state transition probability, R is the transient reward, and γ ∈ [0, 1] is a discount factor used to weight the reward value of future states. In the process of reinforcement learning, the agent observes that it is currently in state s_t and takes action a_t to interact with the environment; the environment then switches to the next state s_{t+1} according to the current state and the selected action. The purpose of the agent is to learn a strategy π that maximizes the expected cumulative discounted reward. In our model, the state is defined by the vehicle density, and the vehicle decides which strategy to adopt based on the current state. According to formula (9), the reward function is set accordingly, where s_t is the current environment state and a_t is the action the agent takes based on that state.
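The interaction loop and objective described above can be sketched as follows. The policy, transition, and reward functions here are stand-ins for illustration, not the paper's formula (9); only the structure (observe s_t, act a_t, receive r_t, move to s_{t+1}, maximize the discounted sum) reflects the text.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward G = sum_t gamma^t * r_t,
    which the agent seeks to maximize."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def rollout(policy, transition, reward, s0, steps=50, gamma=0.99):
    """One MDP episode: at each step the agent observes s_t, applies
    a_t = policy(s_t), collects r_t = reward(s_t, a_t), and the
    environment moves to s_{t+1} = transition(s_t, a_t)."""
    s, rewards = s0, []
    for _ in range(steps):
        a = policy(s)
        rewards.append(reward(s, a))
        s = transition(s, a)
    return discounted_return(rewards, gamma)
```

For example, a constant reward of 1 per step with γ = 0.5 over three steps gives G = 1 + 0.5 + 0.25 = 1.75.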

Simulation Results
To evaluate the feasibility of the proposed algorithm, the simulation experiment models an environment with 9 traffic flows on the GYM platform, where each change of traffic flow follows a Markov probability transition matrix. This paper considers the LTE-V2V communication protocol scenario in [8,9]. In the DDPG experiment, the Actor network has two fully connected hidden layers of 200 neurons each; the activation function of the hidden layers is ReLU and that of the output layer is Tanh. The Critic network has three hidden layers with 200, 50, and 200 neurons, respectively. The remaining neural network hyperparameters are listed in Table 2. From Fig. 2, we can observe that under the same transmit power and broadcast beacon frequency based on the LTE-V2V protocol, the probability of normal communication between vehicles becomes greater as the traffic density increases and the communication distance grows. In Fig. 3, the simulation results show that under the same transmit power (33 dBm), the density of each type of vehicle has a greater impact on the reward function as the broadcast beacon frequency increases. In Fig. 4, which plots the reward per episode, it can be clearly seen from the simulation results that with the proposed algorithm the reward value no longer changes greatly with changes in the current environment as the number of iteration steps increases.
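The Actor and Critic architectures described above can be sketched as forward passes with untrained random weights. The state and action dimensions below are assumptions for illustration only; the paper does not specify them, and a real DDPG implementation would also need target networks, a replay buffer, and gradient updates.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense(in_dim, out_dim, rng):
    # Small random weights, zero biases; placeholders for trained parameters.
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

def actor_forward(state, params):
    """Actor: two 200-unit ReLU hidden layers, Tanh output layer,
    so each action component is bounded in [-1, 1]."""
    h = state
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return np.tanh(h @ w + b)

def critic_forward(state, action, params):
    """Critic: hidden layers of 200, 50, and 200 units mapping the
    (state, action) pair to a scalar Q-value."""
    h = np.concatenate([state, action])
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return float((h @ w + b)[0])

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2  # assumed dimensions for illustration
actor_params = [dense(state_dim, 200, rng), dense(200, 200, rng),
                dense(200, action_dim, rng)]
critic_params = [dense(state_dim + action_dim, 200, rng), dense(200, 50, rng),
                 dense(50, 200, rng), dense(200, 1, rng)]
a = actor_forward(np.ones(state_dim), actor_params)
q = critic_forward(np.ones(state_dim), a, critic_params)
```

The Tanh output layer is what lets the bounded action vector be rescaled to the valid transmit-power and beacon-frequency ranges.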

Conclusion
To obtain the largest LTE-V2V communication cooperative sensing energy efficiency, this paper proposed a joint optimization algorithm based on deep reinforcement learning for broadcast beacon frequency and transmit power. Simulation results on the OpenAI GYM platform show that, in a dynamic traffic flow environment based on the LTE-V2V communication protocol, the proposed algorithm can better configure the transmit power and broadcast beacon transmission frequency of the vehicle communication terminal and, at the same time, obtain more reliable communication quality.