Energy-Efficient Policy Based on Cross-Layer Cooperation in Wireless Communication

Cooperative communication has emerged as a new wireless network communication concept, in which parameter optimization through cross-layer cooperation plays an important role. A heuristic evaluation postdecision state learning algorithm (HE-PDS) is proposed for cross-layer cooperation. The proposed algorithm exploits the determinate state information and jointly considers the transmit power and channel state at the physical layer and the buffer congestion control at the media access control layer. Compared with the traditional Q algorithm and the PDS algorithm, respectively, the experimental results show that the cumulative average total cost of the HE-PDS algorithm decreases by about a factor of ten and by 8% under the maximum delay and throughput constraints, and that its power cost decreases by about 50% and 28% under various delay limits and by about 100% and 56% under the different throughput constraints. These results demonstrate that the proposed algorithm has much better energy-efficient performance and faster convergence speed than the traditional Q learning and PDS learning algorithms.


Introduction
Recently, the merits of cooperative communication in the physical layer have been explored. However, the impact of cooperative communication on the design of the higher layers is not yet well understood. As wireless devices often rely on battery power, how to minimize the energy consumption under constraints on both delay and throughput poses a great challenge and has attracted considerable research attention in recent years [1]. Moreover, the problem becomes more complicated under the influence of the fading channel state, the time-varying buffer state, and dynamic traffic characteristics [2]. Since the unknown environment can be modeled as a Markov decision process (MDP), it is reasonable to build the cross-layer transmission strategy on this property [3,4]. The state of the art of research on the energy-efficient problem in wireless communication can be mainly divided into two categories: cross-layer design and approximate algorithm design. Related research is as follows.
From the viewpoint of energy-efficient design, the authors in [5] analysed the throughput performance. However, the limited buffer was not taken into account in the performance analysis. References [6-8] considered energy-efficient packet transmission under a packet delay constraint. In [9,10], the authors investigated the balance between throughput and energy consumption. Although all these works obtained good energy-efficient performance, the trade-off among delay, throughput, and energy consumption was not fully considered. Exploiting the characteristics of the MDP model, [6,9-11] formulated optimal packet transmission as a control policy that was solved by a reinforcement learning (RL) algorithm. However, most of these works performed the computation offline, which restricts their application. In [7], the authors introduced postdecision state (PDS) learning to raise the convergence rate. Unfortunately, the transmission power state of the model was not taken into account. Although [11] took the power state into consideration, the tradeoff between exploration and exploitation was not fully considered. Therefore, the convergence performance of the algorithm needs to be further improved.
To address the aforementioned challenge, this paper extends our prior work [8] by considering the constraints of both delay and throughput simultaneously. We propose a heuristic evaluation postdecision state (HE-PDS) algorithm for packet transmission, which has not only low computation complexity but also faster convergence speed. The specific contributions of this paper include the following.
(i) A literature survey of various existing energy-efficient policies, analyzing their advantages and disadvantages.
(ii) An effective energy-efficient optimization model for decreasing the energy consumption in wireless communication is proposed.
(iii) A unified framework to realize a scheduling mechanism is proposed by jointly considering the transmit power and channel state at the physical layer and the buffer congestion control at the media access control layer.
(iv) An online RL algorithm is proposed that fully exploits the known state information about the system's dynamics to improve learning performance.
(v) Performance analysis of the proposed algorithm and an evaluation of the algorithm with respect to other existing algorithms.
The rest of this paper is organized as follows. A brief overview of the related works is presented in Section 2. The formulation of the problem within the structure of CMDP is presented in Section 3. Section 4 presents an online HE-PDS algorithm for cross-layer optimization. Experiments are given in Section 5. Finally, Section 6 summarizes the anticipated results and discusses some future research directions.

Cross-Layer Cooperation Model
Cooperative communication improves network performance in certain circumstances. However, when cooperative communication is not necessary, it makes the system more complex, increases the transmission delay, and reduces the efficiency of the system. As illustrated in Figure 1, we consider a point-to-point system in which a single user (a transmitter and receiver pair) transmits data from a finite buffer queue over a time-varying channel. We divide the transmission time into equal slots of length $\Delta t$, so that time slot $n$ denotes the discrete time interval $[n\Delta t, (n+1)\Delta t]$. We assume that the transmission and power management decisions are determined and the system state information remains constant during each time slot. According to the feedback of the delay, throughput, and channel state information obtained at the receiver, the transmission rate and transmission power are adaptively adjusted at the transmitter.

Physical Layer Model.
We consider a discrete-time block Rayleigh fading channel model with additive white Gaussian noise (AWGN) [12,13], where the noise power spectral density is $N_0/2$ and the wireless channel bandwidth is $W$. During each time slot, we assume that the power gain of the channel is constant and that channel state transitions occur only between adjacent states. In this paper, we use a finite state Markov channel (FSMC) to describe the wireless channel [14,15]. As shown in Figure 2, there are $K$ channel states, each of which can transition to its adjacent states with corresponding probabilities.
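As a minimal sketch of how such an adjacent-transition FSMC can be built, the following Python code partitions the SNR axis of a Rayleigh channel into intervals and applies the standard level-crossing-rate approximation. The function name `fsmc`, the thresholds, and the numeric values are illustrative assumptions, not parameters taken from this paper.

```python
import math

def fsmc(thresholds, avg_snr, doppler_hz, slot_s):
    """Steady-state and adjacent-transition probabilities of an FSMC.

    thresholds: increasing SNR boundaries [G_0, ..., G_{K-1}]; the last
    state extends to infinity. All inputs are illustrative assumptions.
    """
    bounds = list(thresholds) + [math.inf]
    k = len(thresholds)
    # Steady-state probability of each SNR interval (exponential SNR pdf).
    pi = [math.exp(-bounds[i] / avg_snr) - math.exp(-bounds[i + 1] / avg_snr)
          for i in range(k)]

    def lcr(level):
        # Level crossing rate of a Rayleigh channel at the given SNR level.
        return (math.sqrt(2 * math.pi * level / avg_snr)
                * doppler_hz * math.exp(-level / avg_snr))

    p = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if i + 1 < k:   # move up to the next SNR interval
            p[i][i + 1] = lcr(bounds[i + 1]) * slot_s / pi[i]
        if i > 0:       # move down to the previous SNR interval
            p[i][i - 1] = lcr(bounds[i]) * slot_s / pi[i]
        p[i][i] = 1.0 - sum(p[i])  # remaining mass stays in the same state
    return pi, p

pi, p = fsmc([0.0, 0.5, 1.0, 2.0], avg_snr=1.0, doppler_hz=10.0, slot_s=1e-3)
```

The approximation is only meaningful when the slot is short relative to the fading rate, so that each off-diagonal entry stays well below one; with the assumed values above every row of `p` sums to one.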
In a Rayleigh fading channel, the received instantaneous signal-to-noise ratio (SNR) $\gamma$ is exponentially distributed with probability density function

$$p(\gamma) = \frac{1}{\gamma_0} \exp\left(-\frac{\gamma}{\gamma_0}\right), \quad \gamma \ge 0,$$

where $\gamma_0 = E[\gamma]$ represents the average received SNR. The channel is said to be in state $h_k$ if the received SNR is in the interval $[\Gamma_k, \Gamma_{k+1})$. Let $N(\Gamma)$ be the level crossing rate (LCR), which is given by

$$N(\Gamma) = \sqrt{\frac{2\pi\Gamma}{\gamma_0}}\, f_D \exp\left(-\frac{\Gamma}{\gamma_0}\right),$$

where $f_D$ is the maximum Doppler frequency. Therefore, with slot length $\Delta t$, the state transition probability can be obtained by the following formula:

$$p_{k,k+1} \approx \frac{N(\Gamma_{k+1})\,\Delta t}{\pi_k}, \qquad p_{k,k-1} \approx \frac{N(\Gamma_k)\,\Delta t}{\pi_k},$$

where the steady state probability (SSP) is given by

$$\pi_k = \int_{\Gamma_k}^{\Gamma_{k+1}} p(\gamma)\, d\gamma = \exp\left(-\frac{\Gamma_k}{\gamma_0}\right) - \exp\left(-\frac{\Gamma_{k+1}}{\gamma_0}\right).$$

MAC Layer Model.
As shown in Figure 3, let the transmission buffer be a first-in first-out queue. In the $n$th time slot, the transmitter receives $l^n$ packets, stores them in the finite buffer, and sends some packets from the buffer. The traffic arrivals are assumed to be independent and identically distributed (IID) across slots. For simplicity, we assume that packet arrivals follow a Poisson process with rate $\lambda$. Therefore, the probability of $l$ packet arrivals in a slot of length $\Delta t$ is

$$P(l) = \frac{(\lambda \Delta t)^{l}\, e^{-\lambda \Delta t}}{l!}, \quad l = 0, 1, 2, \ldots$$

Afterwards, we denote the backlog at the transmitter buffer by $b \in [0, B]$, where $B$ is the capacity of the finite buffer and each packet contains $L$ bits. Arriving packets are dropped if the buffer is full. We also assume that packet arrivals occur at the end of each time slot.
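Let $z^n$ packets be sent by the transmitter in slot $n$, where $z^n \in \{0, 1, \ldots, b^n\}$. Because of the bit error ratio (BER), the number of packets $f^n$ correctly received at the receiver may be smaller than $z^n$; that is, $f^n(\mathrm{BER}^n, z^n) \le z^n$. Assuming independent packet losses, $f^n$ follows a binomial distribution:

$$\Pr(f^n = f \mid \mathrm{BER}^n, z^n) = \binom{z^n}{f} (1 - \mathrm{PER}^n)^{f} (\mathrm{PER}^n)^{z^n - f},$$

where PER is the packet error ratio, which satisfies $\mathrm{PER}^n = 1 - (1 - \mathrm{BER}^n)^{L}$. We also define $b_{\mathrm{int}}$ as the initial buffer state and $b^n$ as the buffer state at the $n$th slot. Therefore, the buffer state at the transmitter evolves recursively as follows:

$$b^0 = b_{\mathrm{int}}, \qquad b^{n+1} = \min\left(b^n - f^n(\mathrm{BER}^n, z^n) + l^n,\ B\right).$$

Figure 4: State diagram of the power model (states On and Idle; transition labels $p_{\text{on,on}}$, $p_{\text{on,idle}}$, $p_{\text{on,off}}$, $p_{\text{idle,on}}$).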
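The MAC-layer buffer dynamics can be illustrated with a short simulation step. This is a sketch under stated assumptions: the function names, the clipping of the send request to the backlog, and the use of Knuth's Poisson sampler are illustrative choices and are not taken from the paper.

```python
import math
import random

def packet_error_ratio(ber, bits_per_packet):
    # A packet is lost if any of its bits is in error (independent bit errors).
    return 1.0 - (1.0 - ber) ** bits_per_packet

def poisson_sample(rng, mean):
    # Knuth's multiplication method; adequate for small means.
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def step_buffer(b, z, ber, mean_arrivals, capacity, bits_per_packet, rng):
    """One slot: send z packets, count the goodput f, then admit arrivals."""
    z = min(z, b)  # cannot send more packets than the current backlog
    per = packet_error_ratio(ber, bits_per_packet)
    f = sum(rng.random() >= per for _ in range(z))  # binomial goodput
    arrivals = poisson_sample(rng, mean_arrivals)   # arrive at end of slot
    unclipped = b - f + arrivals
    overflow = max(unclipped - capacity, 0)         # packets dropped
    return min(unclipped, capacity), f, overflow

rng = random.Random(0)
state = step_buffer(10, 3, ber=0.0, mean_arrivals=0.0,
                    capacity=25, bits_per_packet=5000, rng=rng)
```

With an error-free channel and no arrivals, as in the call above, all three packets are delivered and the backlog drops from 10 to 7.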

Dynamic Power Management Model.
To reduce power consumption, we assume that the wireless card can switch to a low power state, similar to [8,11,16]. Specifically, the card may be in one of two power management states; that is, $x \in \{\text{on}, \text{idle}\}$. Furthermore, the power state can be switched to on or idle by the corresponding actions in the set $Y = \{y_{\text{on}}, y_{\text{idle}}\}$.
We define $P_{\text{on}}$ and $P_{\text{idle}}$ as the power consumed by the wireless card in the on and idle states, respectively. Let $P_{\text{tr}}$ be the power consumption when the state transitions from on to idle or vice versa. In the $n$th slot, if the packet throughput is $z$, the required power is a function of the channel state $h$, the power management state $x$, the power management action $y$, and the transmission power $P_{\text{tx}}$. The transmission power itself depends on the number of symbols per slot and follows the discussion in [10]. In the implementation of the power management action, we assume that the delay of switching from one power state to another is negligibly small. Let $P(y) = [p(x' \mid x, y)]_{x, x'}$ represent the transition probability matrix, which means that the power state is switched from $x$ to $x'$ under the power management action $y$. As shown in Figure 4, the sequence of power management states can be modeled as a constrained Markov chain with these transition probabilities.

Problem Formulation
As discussed in the second section, in a given channel state, since the energy consumption function is convex when packets are transmitted, there must exist an optimal solution to this problem [17]. Given the buffer state $b$, channel state $h$, and power management state $x$, we define a joint state vector $s^n = (b^n, h^n, x^n) \in S$, where $n$ denotes the $n$th time slot. Furthermore, we formulate this joint state process as a constrained Markov decision process (CMDP). Meanwhile, we use $a^n = (\mathrm{BEP}^n, y^n, z^n) \in A$ to represent the joint action, where BEP is the bit-error probability, $y$ is the power management action, and $z$ is the number of packets to be transmitted. For simplicity, we use the buffer overhead instead of the queue delay of the transmitter [8]. The buffer overhead consists of a holding cost and an overflow cost, where the holding cost at the start of each slot stands for the number of packets that still remain in the buffer. A weighting parameter is used to analyse the effect of the holding cost and overflow cost on wireless transmission, and the buffer cost is evaluated from these two terms.
For the average packet arrival rate $\lambda$, the system throughput is calculated from the packet error rate PER and the long-term average buffer overflow probability $P_{\text{drop}}$, that is, the per-slot overflow averaged over an infinite horizon. Therefore, throughput maximization is equivalent to minimizing the total number of lost packets, which includes both buffer overflows and packets lost because of bit errors; its mathematical model follows [10].
Figure 5: PDS model (labels: execute action $a^n$; new packets $l^n$ arrive).
In summary, we can reformulate the cross-layer energy-efficient transmission optimization as a problem of minimizing the long-term average power consumption under transmission delay and throughput constraints. In this optimization problem, the discount factor lies in the interval [0, 1], and $\pi: S \to A$ is a stationary policy that maps the system state into the transmission rate for each time slot. Similar to [18], by introducing Lagrange multipliers $\mu_1$ and $\mu_2$, this problem can be reformulated as an unconstrained MDP, whose system Lagrangian cost function combines the power cost with the multiplier-weighted delay and throughput costs.
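The idea of folding the delay and throughput constraints into a single per-slot cost can be sketched as follows. The exact cost expressions of this paper are not reproduced here; the additive forms, the function names, and the parameters `eta`, `mu1`, and `mu2` are assumptions chosen only to illustrate a Lagrangian relaxation.

```python
def buffer_cost(backlog, overflow, eta):
    """Assumed buffer cost: holding cost plus a weighted overflow penalty."""
    return backlog + eta * overflow

def lagrangian_cost(power, delay_cost, loss_cost, mu1, mu2):
    """Assumed unconstrained per-slot cost: power plus weighted penalties."""
    return power + mu1 * delay_cost + mu2 * loss_cost

# Hypothetical slot: 4 packets held, 2 dropped, 1 lost to bit errors.
delay = buffer_cost(backlog=4, overflow=2, eta=3.0)
total = lagrangian_cost(power=2.0, delay_cost=delay, loss_cost=3.0,
                        mu1=0.5, mu2=2.0)
```

Larger multipliers make constraint violations more expensive relative to power, so the learned policy trades energy for lower delay or fewer lost packets.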

Algorithm Description.
The state information of the environment is often assumed to be uncertain when the state-action pairs are learned in the traditional Q learning algorithm. Therefore, the known state information cannot be fully utilized in the learning process, which inevitably results in poor convergence performance. However, part of this information is known in most communication systems. Table 1 gives an example of what is known and what is unknown. In Figure 4, when $p_{\text{on,idle}}$ and $p_{\text{idle,on}}$ are known and determined, the power state transition probability in Table 1 can be classified as known and determined. Besides, if the transmission power in (8) is known, then the power consumption can be classified as known. Similarly, the packet arrival probability and holding cost can also be classified as known and stochastic when the BER in (4) is known.
To use the known state information, we introduce the postdecision state (PDS) and the PDS value function, as in [7]. In the PDS learning algorithm, the search for the optimal strategy is performed mainly over the PDS. Specifically, as shown in Figure 5, the PDS is a virtual state of the system after a selected action is performed. In addition, we assume that the buffer state changes from the current state to the PDS and that, afterwards, the channel state and power management state change from the PDS to the next state.
International Journal of Distributed Sensor Networks
Define the PDS as $\tilde{s} = (\tilde{b}, \tilde{h}, \tilde{x}) \in \tilde{S}$. The system transition probability function can then be organized as

$$P(s' \mid s, a) = \sum_{\tilde{s}} \tilde{p}(s' \mid \tilde{s}, a)\, p(\tilde{s} \mid s, a),$$

where $p: S \times A \to \tilde{S}$ is the transition probability from the current state to the PDS, which captures the known impact of the performed action $a$. The transition probability from the PDS to the next state is defined as $\tilde{p}: \tilde{S} \times A \to [0, 1]$, which captures the stochastic impact caused by the action $a$. The design objective of PDS learning is to obtain an optimal action $a^*$ that attains the optimal long-term value denoted by $Q^*(s, a)$. Define the PDS value function for the PDS learning algorithm as

$$\tilde{V}^*(\tilde{s}) = \tilde{r}(\tilde{s}) + \beta \sum_{s'} \tilde{p}(s' \mid \tilde{s}, a)\, V^*(s'),$$

where $\tilde{r}(\tilde{s})$ is the immediate reward obtained from the PDS to the next state. Meanwhile, the immediate reward obtained from the current state to the PDS is denoted by $r(s, a)$. The discount factor $\beta$ ($0 \le \beta \le 1$) is the level of "foresight" in making decisions.
The optimal scheme can be calculated by the following formula in traditional Q learning [19]:

$$V^*(s) = \min_{a \in A} \Big\{ R(s, a) + \beta \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big\},$$

where $R(s, a)$ is the reward obtained by taking action $a$ in state $s$, $P(s' \mid s, a)$ is the state transition probability, and $\beta$ is the discount factor. From the proof described in Appendix A, the optimal strategy for the PDS algorithm can be calculated by the following formula:

$$\pi^*_{\text{PDS}}(s) = \arg\min_{a \in A} \Big\{ r(s, a) + \sum_{\tilde{s}} p(\tilde{s} \mid s, a)\, \tilde{V}^*(\tilde{s}) \Big\},$$

where $r(s, a)$ and $p(\tilde{s} \mid s, a)$ are the known reward and transition probability from the current state to the PDS $\tilde{s}$, and $\tilde{V}^*$ is the optimal PDS value function.
Although PDS learning can reduce action exploration by using the determined information, its action selection does not balance the trade-off between exploration and exploitation. To overcome this problem, we propose an HE-PDS learning algorithm that uses a heuristic function and an evaluation function to improve the algorithm performance. Specifically, the heuristic function stands for the importance of executing an action and the evaluation function for its feasibility. In the resulting action-selection rule, $a_{\text{random}}$ is an action randomly chosen from the available action set $A$, which means that a nonoptimal action is intentionally selected to obtain information about the unknown state. Two weighting parameters control the influence of the heuristic function and the evaluation function, respectively, and $q$ is a random value in the interval (0, 1). The trade-off between exploration and exploitation is controlled by $p$ ($0 \le p \le 1$): the larger $p$ is, the smaller the random selection probability. The heuristic function $H(s, a)$ is used to influence the choice of actions. However, since the majority of the actions cannot meet the optimal requirements, we use the evaluation function $E(s, a)$ to reduce the number of candidate actions. The heuristic function and evaluation function are defined so as to minimize their error; in these definitions, $\eta$ is a small real value and $\pi^H(s)$ is the action suggested by the heuristic policy. To ensure the validity of the exploration process for all state-action pairs, a simulated annealing algorithm is used, similar to [20].
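The interplay of a heuristic bonus, a feasibility filter, and random exploration can be sketched as below. This is not the selection rule of the paper, which is not recoverable here; it is a cost-minimizing adaptation of heuristically accelerated Q learning, and the function names, the boolean `feasible` mask standing in for the evaluation function, and the weight `xi` are all assumptions.

```python
import random

def heuristic_bonus(q_row, suggested, eta=0.01):
    """Bonus that makes `suggested` the greedy (minimum-cost) action."""
    best = min(q_row)
    return [(q_row[a] - best + eta) if a == suggested else 0.0
            for a in range(len(q_row))]

def select_action(q_row, h_row, feasible, p, xi, rng):
    """Exploit with probability p, otherwise explore at random.

    feasible[a] plays the role of an evaluation function that prunes
    actions; xi scales the influence of the heuristic bonus.
    """
    candidates = [a for a in range(len(q_row)) if feasible[a]]
    if rng.random() <= p:
        # Greedy step on the heuristic-adjusted cost.
        return min(candidates, key=lambda a: q_row[a] - xi * h_row[a])
    return rng.choice(candidates)  # intentional nonoptimal exploration

q_row = [3.0, 1.0, 2.0]                   # hypothetical action costs
h_row = heuristic_bonus(q_row, suggested=2)
rng = random.Random(1)
action = select_action(q_row, h_row, [True, True, True], 1.0, 1.0, rng)
```

With `p = 1.0` the call above always exploits, and the bonus steers the choice to the suggested action 2 even though action 1 has the lowest raw cost; masking action 2 as infeasible would return action 1 instead.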
Thus, the probability that action $a$ is executed in the current state $s$ depends on the temperature parameter $T$, which controls the randomness of the action selection. In summary, the solving process of the energy-efficient problem is as follows. In the $n$th slot, the HE-PDS algorithm first observes the current state $s^n$ and then, based on the observation, selects and executes an action $a^n$. Finally, the algorithm obtains the immediate rewards $r(s^n, a^n)$ and $\tilde{r}(\tilde{s}^n)$ and enters the next learning cycle. During the learning process, the value $\tilde{V}^{n+1}(\tilde{s}^n)$ is adjusted with the learning rate $\alpha^n$. $Q^{n+1}(s, a)$ converges to the optimal $Q^*(s, a)$ when the sequence of learning rates satisfies $\sum_{n=0}^{\infty} \alpha^n = \infty$ and $\sum_{n=0}^{\infty} (\alpha^n)^2 < \infty$ and the corresponding maximum errors are bounded. The proof can be found in Appendix B.
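A single learning cycle of a PDS-style value update can be sketched as follows. The update shown is the generic PDS learning form (a convex combination of the old value and a one-step target); the exact HE-PDS update of this paper may differ. The names `pds_update`, `state_value`, `known_cost`, and `known_next`, as well as the schedule $1/(n+1)$, are assumptions; that schedule is one simple choice whose sum diverges while its sum of squares converges.

```python
def learning_rate(n):
    # 1/(n+1): the sum diverges while the sum of squares converges.
    return 1.0 / (n + 1)

def state_value(s, actions, known_cost, known_next, v_pds):
    """V(s) = min over a of [known cost + expected PDS value]."""
    return min(
        known_cost(s, a) + sum(prob * v_pds.get(pds, 0.0)
                               for pds, prob in known_next(s, a).items())
        for a in actions)

def pds_update(v_pds, pds, unknown_cost, next_state, n, beta,
               actions, known_cost, known_next):
    """Move the value of `pds` toward the observed one-step target."""
    alpha = learning_rate(n)
    target = unknown_cost + beta * state_value(
        next_state, actions, known_cost, known_next, v_pds)
    v_pds[pds] = (1 - alpha) * v_pds.get(pds, 0.0) + alpha * target
    return v_pds[pds]

# Toy model: the cost of action a is a, and the PDS equals the state.
cost = lambda s, a: float(a)
nxt = lambda s, a: {s: 1.0}
values = {}
first = pds_update(values, "x", 2.0, "y", 0, 0.9, [0, 1], cost, nxt)
second = pds_update(values, "x", 4.0, "x", 1, 0.9, [0, 1], cost, nxt)
```

Because the known cost and known transition are used directly inside `state_value`, only the PDS values have to be learned from samples, which is the source of the faster convergence claimed for PDS-based methods.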

The Procedure of the HE-PDS Learning Algorithm.
According to the analysis stated above, the working procedure of HE-PDS learning algorithm is summarized in Algorithm 1.

Numerical Results and Discussion
In this section, we compare the performance of the proposed algorithm with that of the traditional Q learning and PDS learning algorithms. In the numerical computation, we assume that the bits are mapped into QAM symbols by Gray coding at the physical layer, similar to [8,11]. The buffer length is $B = 25$ packets and the packet length is $L = 5000$ bits. We assume that the channel transition distribution is known. In particular, the channel states and their transition probabilities are described in Table 2, similar to [10]. The noise power density $N_0/2$ is set to $10^{-11}$ Watt/Hz. We let the channel bandwidth be equal to the symbol rate ($W = 1/T_s$), where $T_s$ is the duration of one M-QAM symbol and $1/T_s = 500 \times 10^{3}$ symbols/second.

Performance Comparison under the Fixed Delay and Throughput Constraints.
Figure 6 compares the cumulative average costs for 80000 time slots under the maximum delay (4/ packets) and throughput (0.1/ packets) constraints.
In each subgraph, the horizontal coordinate represents the simulation time slot, and the vertical coordinate denotes the cumulative average total cost, delay overhead, throughput cost, and energy consumption in the corresponding slot, respectively. From (a), (b), and (c), we observe that the HE-PDS and PDS algorithms reduce their cumulative average total cost, delay overhead, and throughput cost by around a factor of ten compared to the Q algorithm. In addition, the three metrics of the HE-PDS algorithm are about 8%, 10%, and 9% lower than those of the PDS algorithm, respectively. Furthermore, as shown in (d), the Q algorithm has lower energy consumption at the beginning of the simulation, but its costs increase and stabilize after about the 40000th time slot. Since they have no prior experience of the environment, the PDS and HE-PDS strategies incur larger power costs at the start of the simulation; however, these costs decrease as the simulation proceeds. In addition, PDS incurs higher power costs than HE-PDS, since HE-PDS balances the tradeoff between exploration and exploitation, which results in a sharper decline in consumption in (d).

Performance Comparison under Various Delay Limits.
To validate the performance of the HE-PDS algorithm under different delay limits, taking values of [3,4,5,6,7,8,9,10,11]/ packets, respectively, Figure 7 shows the delay-energy tradeoff obtained by the three algorithms. From Figure 7, we observe that the power costs of the HE-PDS algorithm are about 50% and 28% lower than those of the traditional Q algorithm and the PDS algorithm, respectively. We also observe that the power costs of all these algorithms decrease as the delay constraint increases. Besides, the Q algorithm reaches the steady state at 9/ packet/slot and the PDS algorithm reaches it at 8/ packet/slot; however, the value required to enter the steady state is significantly reduced to 5/ packet/slot for the HE-PDS algorithm. This suggests that HE-PDS has an obvious advantage in energy efficiency under various delay constraints.

Performance Comparison under Various Throughput Limits.
To verify the performance under various throughput limits, Figure 8 [...] Obviously, this observation is in accordance with formula (14).

Algorithm Convergence Analysis.
In this section, we evaluate how the parameter affects the convergence of the HE-PDS energy-efficient algorithm. The parameter is set to 0.98 and 0.85, respectively. We also set the delay constraint to 4/ packets and the throughput constraint to 0.1/ packets. In Figure 9, (a) illustrates the energy consumption under the fixed delay and throughput constraints for the two parameter values, and (b) and (c) show the convergence fluctuations of the algorithms. For simplicity of illustration, we define the relative fluctuation function at time slot $n$ as $\Phi(n) = \log_{\kappa}(\lVert e(n+1) - e(n) \rVert / \lVert e(n+1) \rVert)$, where $e$ is the energy consumption and $\kappa$ is a real value. Therefore, a smaller $\Phi$ value reflects smaller energy fluctuation, which means faster convergence. From Figure 9(a), we observe that the proposed algorithm converges with lower energy consumption than the other two algorithms. For example, when the parameter is 0.98, the HE-PDS algorithm converges to an energy consumption of approximately 170 mJ, while the Q and PDS algorithms converge to about 300 mJ and 290 mJ, respectively. In addition, when the parameter is reduced to 0.85, the proposed algorithm converges to an energy consumption about 70 mJ lower than that of the other two algorithms, both of which converge to about 220 mJ. Furthermore, as shown in (b) and (c), our proposed algorithm obtains the lowest relative fluctuation values, which means that HE-PDS has the fastest convergence rate. Specifically, the $\Phi$ value of the HE-PDS algorithm is about 21% and 23% smaller than those of the PDS and Q algorithms when the parameter is 0.98. Meanwhile, when the parameter equals 0.85, the $\Phi$ value of the HE-PDS algorithm is about 15% and 17% lower than those of the PDS and Q algorithms. This improvement is due to the fact that our proposed algorithm explicitly uses the heuristic function and evaluation function to effectively reduce the number of actions to be chosen. Consequently, the relative fluctuation results confirm that the HE-PDS algorithm achieves an obvious convergence improvement.
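The relative fluctuation measure used in this analysis can be computed from a trace of energy values as in the following sketch. The scalar energy trace, the default base of 10, and the convention of returning negative infinity when two consecutive values coincide are assumptions made for illustration.

```python
import math

def relative_fluctuation(energy, base=10.0):
    """log_base(|e(n+1) - e(n)| / |e(n+1)|) for each consecutive pair."""
    out = []
    for prev, cur in zip(energy, energy[1:]):
        diff = abs(cur - prev)
        if diff > 0 and cur != 0:
            out.append(math.log(diff / abs(cur), base))
        else:
            out.append(float("-inf"))  # no change: fully converged step
    return out

# Hypothetical energy trace in mJ that settles after one step.
trace = [100.0, 110.0, 110.0]
fluct = relative_fluctuation(trace)
```

More negative values indicate smaller slot-to-slot changes in energy consumption, which is the sense in which a lower curve corresponds to faster convergence.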

Conclusion
In this paper, we investigated the impact of cooperative communication and designed a cooperative cross-layer algorithm for an energy-efficient policy in wireless networks subject to both transmission delay and throughput constraints. Given the dynamic buffer, time-varying channel states, and system-level power consumption in a point-to-point transmission environment, the problem is formulated as a CMDP and further converted into an unconstrained MDP (UMDP) by means of Lagrange multipliers. We propose an HE-PDS learning algorithm that is based on the determinate state information and uses a heuristic function and an evaluation function to achieve an optimal energy-efficient strategy. Furthermore, the performance of different energy-efficient strategies is compared and the proposed scheme is verified through simulations. The results highlight that the proposed algorithm has much better energy-efficient performance and faster convergence speed than the other typical state-of-the-art schemes.
Therefore, the optimal strategies $\pi^*_{\text{PDS}}(s)$ and $\pi^*(s)$ are equivalent.

B. The Convergence of the HE-PDS Learning Algorithm
Based on formula (17), we know that

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.