A Robust Resource Allocation Scheme for Device-to-Device Communications Based on Q-Learning

Device-to-device (D2D) communication, also called terminal pass-through technology, is one of the most effective technologies for 5G mobile communications. It allows devices to communicate directly under the control of a base station, without requiring the base station to forward traffic. Applying D2D communication to cellular networks can increase communication system capacity, improve spectrum efficiency, raise data transmission rates, and reduce base station load. Aiming at the problem of co-channel interference between D2D and cellular users, this paper proposes an efficient resource allocation algorithm based on Q-learning, which treats multiple D2D users as multi-agent learners: the system throughput is determined from the corresponding state-learning of the Q value list, and the action with the maximum Q value is obtained through dynamic power control for D2D users. During the Q-learning process, neither the mutual interference between D2D users and base stations nor exact channel state information is required, and a symmetric data transmission mechanism is adopted. The proposed algorithm maximizes system throughput by controlling the power of D2D users while guaranteeing the quality of service of the cellular users. Simulation results show that the proposed algorithm effectively improves system performance compared with existing algorithms.


Introduction
With the rapid development of the mobile Internet and the continuous updating of smart terminal technology, the number of wireless mobile users and wireless network traffic have exploded [Jameel, Hamid, Jabeen et al. (2019); Ahmad, Li, Waqas et al. (2018)]. It is predicted that by 2020 wireless network traffic will increase a thousand-fold, and D2D communication in cellular systems has become a research hotspot in recent years [Wang, Wang, Jin et al. (2015); Ren, Liu, Liu et al. (2015); Huang, Nasir, Durrani et al. (2016); Lin, Ouayang, Zhu et al. (2016)]. Because the power control problem is a non-linear objective optimization problem, more and more researchers apply mathematical models such as game theory [Chen, Li, Jiang et al. (2015)], stochastic optimization [Sakr and Hossain (2015)], graph theory [Ni, Collings, Lipman et al. (2015)], and mixed integer programming [Alfa, Maharaj, Lall et al. (2016)] to power control problems. For example, the authors in Chen et al. [Chen, Yu, Shan et al. (2016)] proposed a distributed D2D communication power control algorithm and compared it with traditional centralized open-loop power control. The authors in Zhang et al. [Zhang, Zhang, Yan et al. (2015)] formulated the power control problem in a hybrid D2D and cellular network as a Stackelberg game, with the cellular user as leader and the D2D user as follower. In Ji et al. [Ji, Caire, Molisch et al. (2016)], a price-based distributed power control method based on graph theory is proposed, with both centralized and distributed schemes for joint power control and channel allocation optimization. The authors in [ ] modeled the spectrum and power allocation problem as a convex optimization problem and, based on stochastic geometry theory, proposed a joint channel selection and power allocation optimization algorithm. In these studies, much prior knowledge (such as channel state information) is assumed to be known to D2D users.
However, because the traditional pilot signal (in the cellular link) is not implemented in D2D communication, it is difficult to know the precise interference characteristics between the D2D terminal and the base station; moreover, as the number of D2D users increases, algorithm complexity rises sharply. Therefore, open-loop power control and resource optimization algorithms based on the a priori assumption that D2D users and the BS exchange no information, and that D2D users have no channel state information, have research significance and application prospects. Recent research hotspots on key issues of D2D communication without prior knowledge such as channel state information (CSI) include: i) how to predict the network load status, ii) how to select the optimal channel access, iii) how to dynamically control the power of D2D communication users, iv) how to reduce the interference between D2D users and cellular users, and v) how to obtain maximum system throughput. In Asheralieva et al. [Asheralieva and Miyanaga (2016)], the authors proposed an autonomous Q-learning algorithm for channel selection by D2D pairs that does not require any information about other D2D pairs. Minimum SINR constraints are utilized, and a stochastic non-cooperative game mechanism is used to represent the optimization problem. However, this algorithm requires an optimal value of the Boltzmann temperature and optimizes only locally-observed throughput and state. In Yuan et al. [Yuan, Yuang, Feng et al. (2019)], the authors proposed a cooperative algorithm in which D2D transmitters act as relays to assist cellular users in utilizing the licensed spectrum; it considers a realistic scenario with incomplete CSI and performs a one-to-one matching game. However, this scheme requires conversion from a cooperative to a non-cooperative game and requires synchronized time slots for each one-to-one pair to obtain the payoff and feedback.
The convergence of this scheme also degrades as the number of cellular and D2D users increases. Moreover, the learning process is slow because there is no sub-grouping of the cellular and D2D pairs, which further reduces the convergence rate. Machine learning is an important core of artificial intelligence. The main idea is to simulate human learning behavior through computers: after participants are stimulated by the environment, they continuously change their behavior based on accumulated experience to better adapt to the environment and achieve their goals. In recent years, more and more researchers have applied machine learning methods to key problems of wireless communication systems [Ge, Song, Wu et al. (2019); Peng, Li, Abboud et al. (2017)]. Q-learning belongs to reinforcement learning (RL), a branch of machine learning, and its main elements include the environment, reward, action, and state. The learner interacts with the environment through the historical state and, through the learning algorithm, calculates and optimizes a target decision value of the system. The specific process of Q-learning is to establish a Q value list: by learning which action each state should take to maximize the Q value, the action with the highest Q value is selected as the final action. The learner repeatedly interacts with the controlled environment and uses the reward value to evaluate its performance, thereby reaching an optimal decision. This paper introduces the Q-learning idea into the study of dynamic resource allocation strategies for D2D communication. In a hybrid D2D and cellular network, a mathematical model for power control and resource optimization is constructed based on Q-learning. Multiple D2D user pairs in a cellular network are treated as a symmetric multi-agent system.
The D2D users' actions interact with the environment until the target Q value function converges to its maximum, and the user action in that state is the optimal resource allocation strategy. During the Q-learning process, the D2D terminal and the base station do not need to obtain accurate channel state information or mutual interference. The user learns a distributed optimal power allocation strategy from historical throughput and power values to optimize the overall system throughput. The rest of the paper is organized as follows. In Section 2, the system model is described. In Section 3, the proposed algorithm and its principle are analyzed. Section 4 provides the simulation results, while Section 5 concludes the paper.

System model and problem description

2.1 System model
The network structure used in this paper is shown in Fig. 1. D2D users and cellular users multiplex the spectrum in an underlay manner. Within a single cell, there is one macro base station (BS). The cell comprises M cellular users and N D2D user pairs, indexed by m ∈ {1, 2, …, M} and n ∈ {1, 2, …, N}, respectively. Users are assumed to be randomly distributed in the cell. There is no signaling exchange between the D2D users and the BS, that is, D2D users do not know the channel state information, and both transmitting and receiving users have single antennas. Suppose there are K orthogonal wireless channels in the BS, where the channel occupied by cellular user m is k_m ∈ {1, 2, …, K} and the channel used by D2D pair n is k_n ∈ {1, 2, …, K}. The average transmit power of each cellular user is a fixed value P_c, and the power of D2D pair n is p_n ∈ {P_1, P_2, …, P_L}, p_n ≪ P_c (n ∈ {1, 2, …, N}). A D2D pair is denoted (d_n, d_n′), n ∈ {1, 2, …, N}. The binary channel selection vector of user n is K-dimensional, ρ_n = [ρ_{n,1}, ρ_{n,2}, …, ρ_{n,K}], with ρ_{n,k} ∈ {0, 1} and Σ_k ρ_{n,k} ≤ 1; that is, each user selects at most one channel.
For a cellular user m (m ∈ {1, 2, …, M}), given the transmit power P_c and the occupied channel k, the rate R_{m,k}(t) of the cellular link can be expressed as

R_{m,k}(t) = log2(1 + P_c g_{m,k}(t) / (σ² + Σ_n ρ_{n,k} p_n h_{n,k}(t)))    (1)

where P_c is the power of the cellular user, p_n is the power of D2D pair n, σ² is the variance of the white Gaussian noise, g_{m,k}(t) is the channel gain of the cellular user on channel k at time t, and h_{n,k}(t) is the channel gain from the interfering D2D users to the cellular user. Similarly, for each D2D user pair (d_n, d_n′) (n ∈ {1, 2, …, N}), given the transmission power p_n and the occupied channel k, the D2D link rate R_{n,k}(t) can be expressed as

R_{n,k}(t) = log2(1 + p_n g_{n,n′}(t) / (σ² + P_c h_{m,n}(t)))    (2)

where g_{n,n′}(t) is the channel gain of the D2D user pair on channel k at time t and h_{m,n}(t) is the interference gain from the cellular user to the D2D receiver. Since D2D users do not know the precise channel state information, the values of g_{n,n′}(t) are unknown.
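As a concrete illustration of the link rates in Eqs. (1) and (2), the following Python sketch computes Shannon rates for one cellular link and one D2D link. All power, gain, and noise values here are assumed for illustration and are not taken from the paper's simulation settings.

```python
import math

def link_rate(p_tx, gain, noise_var, interference):
    """Shannon rate (bits/s/Hz) of a link: log2(1 + SINR), as in Eqs. (1)-(2)."""
    sinr = p_tx * gain / (noise_var + interference)
    return math.log2(1.0 + sinr)

# Cellular user on channel k, interfered by one D2D pair reusing the channel.
p_c, g_mk = 0.2, 0.8      # cellular power (W) and direct gain (assumed)
p_d, h_nk = 0.01, 0.3     # D2D power (W) and D2D-to-BS interference gain (assumed)
sigma2 = 1e-3             # noise variance (assumed)

r_cell = link_rate(p_c, g_mk, sigma2, p_d * h_nk)  # Eq. (1) with one interferer
r_d2d = link_rate(p_d, 0.9, sigma2, p_c * 0.05)    # Eq. (2), assumed gains
```

Because the D2D power p_d is much smaller than P_c, the interference term p_d * h_nk only mildly degrades the cellular rate, which is the underlay premise of the system model.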

Problem description
In heterogeneous networks, users tend to choose the network that always provides the best service. There is no information exchange between cellular users and D2D users, and D2D users know neither channel availability nor channel quality; it is therefore difficult to achieve fair and efficient cross-network resource allocation. For a user, the principle for choosing between D2D communication and the cellular base station is to obtain the best system performance. For users of D2D and cellular heterogeneous networks, the goal is maximum throughput at minimum power consumption, so the resource allocation and power control problems are modeled as a utility problem, with the overall rate (i.e., throughput), channel allocation, and power control as the utility functions. The goal is to maximize the system user rate (throughput). Therefore, the channel allocation and power control problem can be modeled as

max_{ρ, p} Σ_m R_{m,k_m}(t) + Σ_n R_{n,k_n}(t)    (4)

where the channel of cellular user m is k_m ∈ {1, 2, …, K}, the channel of D2D pair n is k_n ∈ {1, 2, …, K}, the power value of the D2D user pair is p_n ∈ {P_1, P_2, …, P_L}, and the power of the cellular network users is P_c. In the D2D system, Eq. (4) is difficult or even impossible to solve directly. The reasons are: first, D2D users do not know the precise channel state information, which means that the values of g_{n,n′}(t) are unknown, so the objective function (4) cannot be evaluated; second, Eq. (4) does not take into account the higher priority (QoS) of cellular users; third, the maximization of Eq. (4) depends on the channel sets of both user types, that is, the channel selection and power allocation of cellular users and D2D users are coupled. Since p_n ≤ P_L, ∀n ∈ {1, 2, …, N}, and based on basic logarithmic properties, Eq. (5) can be rewritten as the lower bound in Eq. (6). When Eq. (6) attains its lower limit, it corresponds to the worst-case network environment, that is, all D2D users transmit with the maximum available power, causing the largest mutual interference. Therefore, for power control, the lower limit of Eq. (6) is maximized, which yields Eq. (7). Eq. (7) is an NP-hard problem and difficult to solve directly. There are many methods for solving NP-hard problems, such as branch-and-bound and genetic algorithms, but these methods need to consider all D2D users and base stations simultaneously, i.e., they are centralized optimization algorithms. When the number of users increases, their complexity rises sharply; moreover, for the non-linear objective function of Eq. (7), the interference characteristics between D2D terminals and base stations must be known, and this information is difficult to obtain. To solve this problem, a joint resource allocation and power control algorithm based on Q-learning is proposed. D2D terminals and base stations do not need accurate channel state information or mutual interference measurements; users learn the best power allocation strategy from historical throughput and power values to optimize overall system throughput. That is, multiple D2D user pairs in a cellular network are treated as a multi-agent system, and a distributed joint channel selection and power control optimization algorithm based on Q-learning is designed. The advantage of this distributed algorithm is reduced complexity: each user needs only its own information to perform power control, avoiding the centralized computation described above.
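Once the gains are fixed, the worst-case lower bound that the power control in Eq. (7) maximizes can be evaluated numerically. The sketch below sums the D2D link rates when every pair transmits at the maximum power level; the gains and interference powers are assumed values for illustration only.

```python
import math

def d2d_rate(p_d, gain, noise_var, interference):
    """D2D link rate (bits/s/Hz) under a given aggregate interference power."""
    return math.log2(1.0 + p_d * gain / (noise_var + interference))

p_max = 0.01                 # maximum D2D power level P_L (assumed, W)
gains = [0.9, 0.7, 0.5]      # direct-link gains of three D2D pairs (assumed)
cross = [0.02, 0.03, 0.01]   # interference power each pair sees when all
                             # other pairs transmit at p_max (assumed)
sigma2 = 1e-3                # noise variance (assumed)

# Worst-case (lower-bound) sum throughput: every D2D pair at maximum power.
lower_bound = sum(d2d_rate(p_max, g, sigma2, c) for g, c in zip(gains, cross))
# Interference-free reference for comparison.
upper = sum(d2d_rate(p_max, g, sigma2, 0.0) for g in gains)
```

The gap between `lower_bound` and `upper` quantifies how much mutual interference costs in the worst case, which motivates learning per-user power levels instead of always transmitting at p_max.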

Q-learning
This paper uses one of the most widely used algorithms in reinforcement learning, Q-learning [Wang, Wang, Jin et al. (2015); Ren, Liu, Liu et al. (2015)]. The basic principle of Q-learning is that the agent, that is, the initiator of an action, causes a change in the state of the environment after taking the action. The impact of this change can be quantified as a reward, whose value reflects the reward or punishment used to evaluate the learner's action. The learner then chooses the next action based on the reward value and the current state of the environment; the selection principle is to increase the probability of receiving a positive reward value until convergence. The chosen action therefore depends not only on the immediate return value but also on the historical environmental state, and it influences the environmental state and final return value at the next moment. Accordingly, suppose an action set A and a state set S. The learner chooses an action a ∈ A; after the environment accepts the action, the state changes and an instantaneous return value r(s, a) is generated. Given the current state s ∈ S, the next action a′ ∈ A is related to the next state s′ and the cumulative return value V(s). The learning goal of Q-learning is to dynamically adjust the next action a′ ∈ A so that V(s) is maximized.
V(s) can be expressed as follows [Wang, Wang, Jin et al. (2015)]:

V(s) = r(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′)    (8)

where 0 < γ < 1 is the return factor and P(s′ | s, a) is the state transition probability from state s to state s′ when the learner performs action a. According to Bellman theory [Kiumarsi, Vamvoudakis, Modares et al. (2018)], the maximum value of the cumulative reward V(s) is expressed as

V*(s) = max_a [ r(s, a) + γ Σ_{s′} P(s′ | s, a) V*(s′) ]    (9)

that is, Q-learning is used to learn the r(s, a) and P(s′ | s, a) values. The Q function can be expressed as

Q(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)    (10)
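A minimal sketch of the Q function in Eq. (10), evaluated on a toy problem with hand-picked rewards and transition probabilities (all numeric values are assumed, purely for illustration):

```python
def q_value(r, trans, q_next, s, a, gamma=0.9):
    """Eq. (10): Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a')."""
    return r[(s, a)] + gamma * sum(
        p * max(q_next[s2].values()) for s2, p in trans[(s, a)].items()
    )

# Toy problem: one decision state s0, one successor s1 (all values assumed).
r = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0}                      # r(s, a)
trans = {("s0", "a0"): {"s1": 1.0}, ("s0", "a1"): {"s1": 1.0}}  # P(s'|s, a)
q_next = {"s1": {"a0": 2.0, "a1": 5.0}}                         # Q(s', a')

q = q_value(r, trans, q_next, "s0", "a0")   # 1.0 + 0.9 * 5.0 = 5.5
```

Note that the inner `max` over the successor state's Q values is what distinguishes Eq. (10) from the value function of Eq. (8), which evaluates a fixed action choice.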

Problem mapping
This paper introduces the idea of Q-learning into the power control algorithm. The elements of Q-learning (the learner agent, the state s ∈ S, the action a ∈ A, and the reward signal) are mapped onto the actual power control model. The specific mapping process is described below. D2D users are mapped to learners. Suppose there are N D2D users in a cell. For D2D user n, the state set is

s_n = {l_n, g_{n,n′}}    (11)

where l_n is the straight-line distance between user n and the base station, and g_{n,n′} is the direct link gain of the user pair (d_n, d_n′). In D2D communication, at time t, it is assumed that the return function r_B(t) represents the instantaneous return value if the user accesses the cellular network base station for communication, and r_D(t) represents the instantaneous return value of communicating through D2D. For each unit time slot t, the mobile user takes an action in state s, producing an instantaneous return value r_B(t) or r_D(t). The user learns the best strategy for each state by interacting with the environment based on its location, so that the return value R(s, a) is maximized. For the network scenario in this paper, the rate is used as the return function: the instantaneous change of the rate intuitively and accurately reflects network congestion, from which the throughput is derived. Based on the rate in state s, the cumulative return value of the rate is calculated to find the best power control that maximizes system throughput and QoS. Therefore, the instantaneous return functions can be expressed as

r_B(t) = R_{m,k}(t)    (12)
r_D(t) = R_{n,k}(t)    (13)

and the total return function is

R(s, a) = Σ_t γ^t r_t(s, a)    (14)

The action set is a ∈ {0, 1, 2, …, L}. When a = 0, the user pair (d_n, d_n′) connects to the macro base station; when a = l (l ∈ {1, 2, …, L}), the user pair communicates through D2D. Defining the probability of choosing the cellular link as P_B and the probability of selecting D2D communication as P_D [Kiumarsi, Vamvoudakis, Modares et al. (2018)], the selection probability can be expressed as

P(a | s) = e^{Q(s,a)/τ} / Σ_{a′∈A} e^{Q(s,a′)/τ}    (15)

where τ is the Boltzmann temperature parameter, expressed as a decaying function of time in Eq. (16), in which τ_0 is the initial temperature and t is the channel selection duration. When the value of τ is high, the probability distribution of channel selection is nearly uniform; when τ is low, the user's probability distribution over the cellular network and the D2D channels becomes sharply differentiated. Therefore, Eqs. (15) and (16) can be used as channel selection measures: the larger the probability value, the more likely the user is to choose that option.
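The Boltzmann (softmax) selection rule of Eq. (15) and the effect of the temperature can be sketched as follows; the Q values are assumed for illustration:

```python
import math

def boltzmann_probs(q_values, tau):
    """Softmax over Q values at temperature tau, as in Eq. (15)."""
    exps = [math.exp(q / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 0.5]                  # Q(s, a) for three actions (assumed)
hot = boltzmann_probs(q, tau=100.0)  # high temperature: near-uniform exploration
cold = boltzmann_probs(q, tau=0.1)   # low temperature: mass concentrates on argmax
```

This matches the behavior described above: at high τ the three probabilities are almost equal, while at low τ the action with the largest Q value is selected almost deterministically.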

Algorithm description
According to the above mapping rules, the specific implementation steps are as follows. In the first step, the D2D user initializes the instantaneous return value and the cumulative return value according to the current state s. In the second step, an action is selected from the action set {0, 1, …, L} by calculating Eqs. (15) and (16); the larger the probability value, the more likely the user is to choose that action. In the third step, the mobile user's next state s′ is used to calculate the total return value R(s, a) according to Eq. (14). In the fourth step, the value of the Q function is updated, and the above steps are repeated until convergence to the optimal strategy with the maximum cumulative return value, that is

Q_t(s, a) = Q_{t−1}(s, a) + α [ r_t(s, a) + γ max_{a′} Q_{t−1}(s′, a′) − Q_{t−1}(s, a) ]    (18)

where 0 < α < 1 is the learning rate. The specific algorithm is shown in Algorithm 1, and Fig. 2 illustrates the flowchart of the presented algorithm to better indicate the stepwise process.

Algorithm 1: Q-learning based joint channel selection and power control
1: Initialize Q(s, a), the instantaneous and cumulative return values
2: for each time slot t do
3: Generate a binary random number rand(.) for all users
4: if rand(.) < the exploration threshold then
5: Calculate the action probabilities by Eqs. (15) and (16)
6: Choose the channel with the higher probability for access
7: else
8: Choose the action that makes Q reach its maximum
9: end if
10: Perform the action according to Eq. (14)
11: Calculate the total return value R(s, a)
12: Observe the next state s′ and update Q according to Eq. (18)
13: end for

Figure 2: Algorithm flowchart

Simulation results and performance analysis
In this section, computer simulation is used to evaluate the proposed Q-learning based joint power control and channel selection optimization algorithm for D2D communication. The simulation platform uses OPNET's LTE-TDD network simulation package [Ding, Lei, Karagiannidis et al. (2017)]. The simulation parameters are shown in Tab. 1. We compare the performance of our scheme with the following three schemes: the first uses D2D communication with random access and is called Random; the second routes all traffic through the macro base station and is called All to BS; the third is the greedy algorithm for channel selection [Cao, Li, Zhao et al. (2017)], denoted as ε-greedy, with the parameter ε = 0.5. First, we compare the trend of the average utility (i.e., the average throughput of D2D user pairs) against the Q-learning convergence time at different Boltzmann [Cao, Li, Zhao et al. (2017)] temperatures. The results are shown in Fig. 3. The Boltzmann temperature affects the convergence rate of Q-learning. When the value of τ is small, that is, the Boltzmann temperature is relatively low, Q-learning converges quickly and the channel selection probability is concentrated. Conversely, as τ increases, the probability distribution of channel selection becomes almost uniform. At any value of τ, the average utility is less than the maximum possible throughput Th_max = 120 kbps. Fig. 4 shows the trend of the average convergence time of Q-learning for different numbers of D2D users. It can be seen that when the Boltzmann temperature is low, the convergence time of the proposed algorithm is comparable to the greedy algorithm; as the temperature rises, the convergence speed decreases and more time is needed to achieve convergence. Fig. 5 shows the trend of total user throughput for different numbers of D2D users. It can be seen from Fig. 5 that our proposed Q-learning scheme achieves the largest user throughput, followed by the greedy algorithm, while the All to BS scheme has the lowest throughput. This is because All to BS does not exploit the performance gain brought by D2D communication: all users communicate through the base station, which inevitably increases network load and congestion and decreases throughput. In Random mode, the performance gain is not obvious due to the randomness of the users' choice of D2D communication. It can also be seen from Fig. 5 that our proposed Q-learning scheme increases system throughput to a level close to the greedy algorithm, but with much lower algorithmic complexity. Also, as the Boltzmann temperature decreases, the number of users choosing D2D links increases, so the average user throughput also increases. Fig. 6 shows the user throughput of different power control algorithms in comparison with our proposed Q-learning based resource optimization algorithm under different minimum signal-to-interference-plus-noise ratios (SINRs), that is, different channel states. When user n selects a channel for (d_n, d_n′) at time t, ρ_{n,k} = 1 if channel k is selected and ρ_{n,k} = 0 otherwise, so the SINR can be expressed as

SINR_{n,k}(t) = ρ_{n,k} p_n g_{n,n′}(t) / (σ² + P_c h_{m,n}(t))

It can be seen from Fig. 6 that user throughput is a convex function of SINR_min. At the same SINR_min, the All to BS algorithm has the lowest system throughput. The reason is that when SINR_min is low, the channel state is poor and user throughput is small; however, when SINR_min is too high, the number of channels that meet the users' quality-of-service requirements decreases, which also reduces throughput. Therefore, the user throughput is a convex function of SINR_min. Since different D2D links work in different frequency bands, they do not interfere with each other. As can be seen from Fig. 7, the average transmission power of the proposed algorithm is significantly smaller than that of the other algorithms and very close to the greedy algorithm, but its complexity is much lower. Tab. 2 compares the computational complexity of the proposed algorithm with conventional algorithms. It is clear from the results that the proposed algorithm has lower computational complexity than the conventional algorithms, including the genetic algorithm (GA).

Conclusions and future recommendations
With the continuous development of Internet technology, the number of users and network traffic have exploded, and the era of the Internet-of-Things and big data is emerging. How to greatly increase network capacity has become the biggest problem facing the development of wireless networks. D2D communication allows direct communication between user equipment at close range, which can improve system throughput, achieve high spectral and energy efficiency, and multiply wireless network capacity; it is regarded as one of the most promising new technologies for future wireless communication systems. This paper introduces the idea of reinforcement learning into D2D communication power control research and proposes a joint resource allocation and power control algorithm based on Q-learning in D2D and cellular heterogeneous networks. The proposed algorithm maps the power control problem into a Q-learning problem with the rate as the return function; the instantaneous change of the rate intuitively and accurately reflects the instantaneous change of throughput and the occupation of the cellular network. Through continuous learning, the power of D2D users is adjusted, a Q value table of system throughput is obtained, and the action with the highest Q value is chosen as the final action, finally yielding the joint optimal strategy of channel selection and power control. Under the premise of ensuring the quality of service for cellular users, the maximum system throughput is obtained through D2D power control. Simulation results show that the proposed algorithm maximizes system throughput and ensures the overall performance of the network. Future directions extending this study are to consider interference analysis and the integration of mmWave communications, and to evaluate the performance under different usage scenarios.