Reinforcement Learning (RL)-Based Energy Efficient Resource Allocation for Energy Harvesting-Powered Wireless Body Area Network

Wireless body area networks (WBANs) have attracted great attention from both industry and academia as a promising technology for continuous monitoring of physiological signals of the human body. As the sensors in WBANs are typically battery-driven and inconvenient to recharge, an energy efficient resource allocation scheme is essential to prolong the lifetime of the networks while guaranteeing the stringent quality of service (QoS) requirements of WBANs. As a possible alternative solution to the energy efficiency problem, energy harvesting (EH) technology, with its capability of harvesting energy from ambient sources, can potentially reduce the dependence on the battery supply. Consequently, in this paper, we investigate the resource allocation problem for EH-powered WBANs (EH-WBANs). Our goal is to maximize the energy efficiency of the EH-WBANs with joint consideration of the transmission mode, relay selection, allocated time slots, transmission power, and the energy constraint of each sensor. In view of the characteristics of EH-WBANs, we formulate the energy efficiency problem as a discrete-time and finite-state Markov decision process (DFMDP), in which allocation strategy decisions are made by a hub that does not have complete and global network information. Owing to the complexity of the problem, we propose a modified Q-learning (QL) algorithm to obtain the optimal allocation strategy. The numerical results validate the effectiveness of the proposed scheme as well as the low computation complexity of the proposed modified QL algorithm.

In the work of Niyato et al. [31], the authors concentrated on sleep and wakeup scheduling for energy conservation. However, different from EH-powered WSNs, EH-WBANs mainly harvest energy from human body bio-energy sources [32]. These sources can be categorized into bio-chemical and bio-mechanical energy sources: bio-chemical sources convert electrochemical energy into electricity for invasive body sensors, while bio-mechanical energy can be obtained from the locomotion of the human body [33]. In the work of Quwaider et al. [34], the weighted sum of the outage probabilities was the objective function to be minimized; the harvested energy was assumed to be known a priori to the scheduler, and an optimal offline algorithm was proposed to obtain the optimal solution. A joint power-QoS control scheme, namely powered by energy harvesting-quality of service (PEH-QoS), was proposed in the work of Ibarra et al. [35]. The PEH-QoS scheme combines three interconnected modules: the power-EH aware management module, the data queue aware control module, and the packet aggregator system. The core idea of PEH-QoS is to use the amount of power available for transmission and the amount of data stored in the queue to determine the maximum number of packets that can be transmitted in each data communication process. The simulation results show that the energy efficiency of the body sensor can be improved.
Looking at these previous works, the energy efficiency issue is rarely considered; even the works that investigated EH-WBANs covered only limited aspects, such as the sum-rate, transmission power, and the tradeoff between different objectives. To fill this gap, and as the motivation behind this work, we investigate an energy efficient resource allocation scheme for EH-WBANs with the goal of maximizing the energy efficiency.
The main contributions of this paper are as follows:
• We consider a resource allocation problem for EH-WBANs with the goal of maximizing the average energy efficiency of body sensors. The resource allocation problem jointly considers the transmission mode, relay selection, allocated time slots, transmission power, and energy status to make the optimal allocation decision;
• We formulate the energy efficiency problem as a discrete-time and finite-state Markov decision process (DFMDP), and we propose a modified Q-learning algorithm, which reduces the state-action space of the original Q-learning algorithm, to solve the modeled problem;
• From the numerical analysis, we show that the proposed scheme obtains the best energy efficiency and converges more rapidly than the classical Q-learning algorithm by eliminating the irrelevant exploration space in the Q-table.
The remainder of this paper is organized as follows. In Section 2, the network model considered in this paper is presented. After that, the corresponding energy efficiency maximization problem is formulated and the proposed modified Q-learning algorithm is elaborated in Section 3. The simulation results are discussed in Section 4. Finally, we conclude the paper in Section 5.

Network Model Descriptions
In this section, we first depict the system model of the EH-WBAN, which is then followed by the details on the data transmission model, energy harvesting model, and energy efficiency model in EH-WBANs. Table 1 summarizes the different symbols and notations used throughout this paper.

Symbol Definition
δ^k_{S_n→S_m}: The k-th time slot is allocated to the n-th body sensor for transmitting data to the m-th body sensor; δ^k_{S_n→S_m} ∈ {0, 1}, n, m ∈ (1, 2, ..., N), ∀k ∈ ψ
δ^k_{S_m→H}: The m-th body sensor forwards the data from the n-th body sensor to the hub at the k-th time slot; δ^k_{S_m→H} ∈ {0, 1}, n, m ∈ (1, 2, ..., N), ∀k ∈ ψ
R_n: Data rate of the n-th body sensor

In this treatise, we consider a single EH-WBAN with one hub and multiple EH-powered body sensors, as illustrated in Figure 1. The hub is placed on the belt, while various body sensors, for example, the electrocardiogram (ECG) sensor, the electromyography (EMG) sensor, the electroencephalography (EEG) sensor, the glucose sensor, and the motion sensor, are placed at different positions on the body according to different detection purposes. For simplicity, we only consider the uplink transmission from body sensors to the hub, and only the body sensors are equipped with the EH function. Here, we denote the hub as H and the body sensors as S_n, n ∈ (1, 2, ..., N). In this work, we assume that both direct transmission and cooperative transmission modes are supported by the network layer, as recommended by the IEEE 802.15.6 standard [36]. In cooperative transmission mode, only two-hop transmission is permitted. We define a binary parameter α_{S_n} ∈ {0, 1}, n ∈ (1, 2, ..., N), to indicate the transmission mode currently utilized by the n-th body sensor: α_{S_n} = 1 denotes that the n-th body sensor transmits data to the hub directly, while α_{S_n} = 0 indicates that the n-th body sensor is in cooperative transmission mode. In the MAC layer, time-division multiplexing (TDM) is employed to prevent mutual interference. As the slotted system model is adopted, each transmission frame is divided into K time slots, and the time slot set is denoted as ψ = (1, 2, ..., K). We set t_0 = 0 and t_K = T. The duration of each slot is denoted as τ_k = t_k − t_{k−1}, ∀k ∈ ψ.
In this model, these time slots can be assigned to body sensors operating in either direct or cooperative transmission mode. Meanwhile, it should be noted that the transmission mode of each body sensor is determined by the resource allocation strategy in the proposed scheme. For example, if the optimal resource allocation strategy determines that a specific body sensor should transmit in cooperative mode, then that body sensor will use the time slot assigned by the strategy to transmit data cooperatively. In the case of direct transmission mode, we define a binary parameter β^k_{S_n} ∈ {0, 1}, (n ∈ (1, 2, ..., N), ∀k ∈ ψ), to indicate which time slot is assigned to a specific body sensor: β^k_{S_n} = 1 denotes that the k-th time slot is assigned to the n-th body sensor for direct transmission, while β^k_{S_n} = 0 means that it is not. More specifically, two further reasonable assumptions are made in this model: (1) the hub can only receive data from one body sensor in each time slot; (2) in each time frame, each body sensor can be assigned at most one time slot for direct transmission. Thus, we can derive two constraints as Equations (1) and (2). In the case of cooperative transmission mode, we assume that the K time slots in a time frame are allocated to both source-relay and relay-hub links. This assumption mainly serves to guarantee fairness between direct transmission and cooperative transmission when obtaining the optimal resource allocation strategy. Similarly, we define a parameter δ^k_{S_n→S_m} ∈ {0, 1}, (n, m ∈ (1, 2, ..., N), ∀k ∈ ψ), as an indicator that the k-th time slot is allocated to the n-th body sensor for transmitting data to the m-th body sensor, which is selected as the relay node of the n-th body sensor. Meanwhile, δ^k_{S_m→H} ∈ {0, 1}, (n, m ∈ (1, 2, ..., N), ∀k ∈ ψ), is denoted as the indicator that the m-th body sensor forwards the data from the n-th body sensor to the hub using the k-th time slot. In this mode, we assume that each source sensor can select one relay sensor and each relay sensor can only forward data from one source sensor in any time slot. Thus, we can obtain two constraints as Equations (3) and (4). Furthermore, because each link can be assigned at most one time slot, we obtain the constraint in Equation (5). Another aspect to note is that the data transmission from the source sensor to the relay sensor must occur before the transmission from the relay sensor to the hub. Therefore, we obtain Equation (6).
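The scheduling constraints described above can be checked mechanically. The sketch below is illustrative, not from the paper: the function name and the array encodings of β and δ are assumptions, and the hub-occupancy check in constraint (1) interprets "the hub receives from at most one sensor per slot" as covering both direct and relay-to-hub transmissions.

```python
import numpy as np

def allocation_is_feasible(beta, delta_sr, delta_rh):
    """Check the scheduling constraints sketched in Equations (1)-(6).

    beta[n, k]        -- 1 if slot k is assigned to sensor n for direct transmission
    delta_sr[n, m, k] -- 1 if slot k carries the source->relay link S_n -> S_m
    delta_rh[n, m, k] -- 1 if slot k carries the relay->hub link S_m -> H (data of S_n)
    """
    N, K = beta.shape
    # (1) the hub receives from at most one sensor per slot
    #     (interpreted to include relay->hub transmissions)
    if np.any(beta.sum(axis=0) + delta_rh.sum(axis=(0, 1)) > 1):
        return False
    # (2) each sensor gets at most one direct-transmission slot per frame
    if np.any(beta.sum(axis=1) > 1):
        return False
    # (3) each source transmits to at most one relay per slot
    if np.any(delta_sr.sum(axis=1) > 1):
        return False
    # (4) each relay forwards for at most one source per slot
    if np.any(delta_rh.sum(axis=0) > 1):
        return False
    # (5) each link is assigned at most one slot per frame
    if np.any(delta_sr.sum(axis=2) > 1) or np.any(delta_rh.sum(axis=2) > 1):
        return False
    # (6) the source->relay slot must precede the relay->hub slot
    for n in range(N):
        for m in range(N):
            sr = np.flatnonzero(delta_sr[n, m])
            rh = np.flatnonzero(delta_rh[n, m])
            if sr.size and rh.size and sr[0] >= rh[0]:
                return False
    return True
```

For instance, a frame in which sensor 0 transmits directly in slot 0 while sensor 1 relays through sensor 0 (source-relay in slot 1, relay-hub in slot 2) satisfies all six constraints.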

Data Transmission Model
In WBANs, body sensors with different monitoring purposes may have heterogeneous data rate requirements. In the proposed data transmission model, we suppose that each body sensor can serve as a source node in direct transmission mode or as a relay node in cooperative mode; the selection of mode is determined by the resource allocation scheme. We denote R_n as the data rate of the n-th body sensor, which can be expressed as Equation (7), where R^d_n is the data rate of the n-th body sensor in direct transmission mode, and R^c_n denotes the data rate of the n-th body sensor in cooperative transmission mode. On the basis of the above analysis, we can derive the instantaneous signal to interference plus noise ratio (SINR) of direct transmission and cooperative transmission. Equations (8)-(10) give the instantaneous SINR of the direct link, the source-relay link, and the relay-hub link in the k-th time slot, respectively.
where p d n,k denotes the instantaneous transmission power of the n-th body sensor in the k-th time slot when transmitting data to the hub; g S n →H is the transmission gain between the n-th body sensor and the hub; p s n 1 ,m,k denotes the instantaneous transmission power of n 1 -th body sensor in the k-th time slot when transmitting data to m-th body sensor, which is selected as the relay sensor of the n 1 -th body sensor; g S n 1 →S m is the transmission gain between the n 1 -th body sensor and m-th body sensor; and n 0 is the noise power.
where p^{s→r}_{n,m,k} denotes the instantaneous transmission power of the n-th body sensor when transmitting data to the m-th body sensor, which is selected as its relay sensor in the k-th time slot; p^{r→H}_{n,m,k} denotes the instantaneous transmission power of the m-th body sensor in the k-th time slot when forwarding data from the n-th body sensor to the hub; and I^{s→r}_{n,m,k} is the total instantaneous interference of the source-relay link in the k-th time slot. The expression for I^{s→r}_{n,m,k} comprises three terms: the first term indicates the interference from other source-relay links, the second term is the interference from direct transmissions between source body sensors and the hub, and the third term represents the interference from relay-hub links.
where I r→H n,m,k is the total instantaneous interference between the relay sensor and hub when the m-th body sensor is selected as the relay of the n-th body sensor in the k-th time slot.
Additionally, channel fading between body sensors and the hub is affected by many factors, such as clothing and obstruction by different body segments [37]; thus, the dynamic link characteristics should be taken into full consideration. In this paper, the channel fading takes into account both large-scale fading and small-scale fading. The channel gain g can be represented as Equation (11), where α denotes the large-scale fading, which includes path loss and shadowing and can be modeled as α = PL·β·d^(−ϕ), where PL is the path loss constant, β denotes the log-normal shadowing random component, d is the distance between the transmitter and receiver of a communication link, and ϕ is the power decay exponent; h is the small-scale fading, which is assumed to be Rayleigh fading with unit mean. According to Shannon's theorem, we can obtain the transmission rate of the direct mode as given in Equation (12), where B denotes the bandwidth of the channel. The transmission rate of the cooperative mode R^c_n can be divided into two parts: the transmission rate of the source-relay link R^{c,s→r}_n and the transmission rate of the relay-hub link R^{c,r→H}_n, as shown in Equations (13) and (14). However, in cooperative transmission mode, the transmission rate of the path between the source body sensor and the hub is limited by the weaker of the source-relay link and the relay-hub link. Hence, the transmission rate of the cooperative mode is R^c_n = min(R^{c,s→r}_n, R^{c,r→H}_n).
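The channel and rate model above can be sketched numerically. All parameter values below (bandwidth, path loss constant, decay exponent, noise power, distances, transmit power) are illustrative assumptions, not values from the paper; only the structure of Equations (11)-(14) is reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed values, not from the paper)
B = 1e6      # channel bandwidth [Hz]
PL = 1e-3    # path loss constant
phi = 3.5    # power decay exponent
n0 = 1e-10   # noise power [W]

def channel_gain(d, sigma_db=4.0):
    """g = alpha * h (Equation (11)): large-scale path loss with
    log-normal shadowing, times Rayleigh small-scale fading power."""
    shadow = 10 ** (rng.normal(0.0, sigma_db) / 10)  # log-normal shadowing
    alpha = PL * shadow * d ** (-phi)                # large-scale fading
    h = rng.exponential(1.0)                         # |Rayleigh|^2, unit mean
    return alpha * h

def shannon_rate(p_tx, g, interference=0.0):
    """R = B * log2(1 + SINR), as in Equation (12)."""
    return B * np.log2(1 + p_tx * g / (interference + n0))

# Cooperative rate is limited by the weaker hop: R_c = min(R_sr, R_rh)
R_sr = shannon_rate(1e-3, channel_gain(0.4))  # source -> relay hop
R_rh = shannon_rate(1e-3, channel_gain(0.6))  # relay -> hub hop
R_c = min(R_sr, R_rh)
```

The `min` at the end is the bottleneck rule of the two-hop path stated in the text: the end-to-end cooperative rate cannot exceed the rate of its slower hop.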

Data Serving Model
In this scenario, we assume that data are stored in the form of packets in the buffer of each device. The data arriving at each body sensor follow an independently and identically distributed (i.i.d.) sequence with an average rate of λ_d [38]. Practically, we assume that the buffer of each device is finite and served in a first-in-first-out (FIFO) fashion. We denote DQ^k_{S_n} as the instantaneous data queue length at the n-th body sensor in time slot k. The maximum traffic queue length of the body sensors is represented by DQ^max_{S_n}. Accordingly, we can obtain the update function of the instantaneous data queue length as Equation (15), where PS_data is the traffic packet size, the service term is the instantaneous service rate of the transmission link of the n-th body sensor in the (k−1)-th time slot, and A^{k−1}_{S_n} is the number of traffic packets arriving at the n-th body sensor in the (k−1)-th time slot.

Energy Harvesting Model
We denote E_{n,k} as the energy harvested by the n-th body sensor in the k-th time slot. {E_{n,1}, E_{n,2}, ..., E_{n,t}, ..., E_{n,K}} is the time sequence of energy harvested in a transmission frame; it is also an i.i.d. sequence with an average rate of λ_e [38]. We denote EQ^k_{S_n} as the instantaneous energy queue length of the n-th body sensor in the k-th time slot. The maximum energy queue length of the body sensors is represented by EQ^max_{S_n}. Therefore, we can obtain the update function of the instantaneous energy queue length as Equation (16), where PS_energy is the energy packet size with the unit of Joules/packet, and p_{n,k−1} denotes the transmission power of the body sensor in the (k−1)-th time slot. Depending on the transmission mode, p_{n,k−1} is set to one of p^d_{n,k−1}, p^{s→r}_{n,m,k−1}, and p^{r→H}_{n,m,k−1}. It is worth noting that, because the capacity of the energy storage device is finite, two constraints can be derived from Equation (16), as expressed in Equations (17) and (18): Equation (17) states that the currently consumed energy cannot exceed the total energy stored in the battery, while Equation (18) expresses that the total energy stored in the battery cannot exceed the maximum battery capacity.
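The two queue updates and their capacity constraints can be sketched as a single per-slot step. The function below is a minimal illustration under assumed names and units (quantities measured in packets, energy cost converted via PS_energy); it is not the paper's implementation.

```python
def update_queues(dq, eq, served_pkts, arrived_pkts, p_tx, e_harv,
                  dq_max, eq_max, ps_energy, tau):
    """One-slot update of the data queue (cf. Eq. (15)) and the energy
    queue (cf. Eq. (16)), with finite buffer and battery capacities.

    dq, eq       -- current data / energy queue lengths [packets]
    served_pkts  -- packets served by the transmission link in the slot
    arrived_pkts -- packets arriving in the slot
    p_tx         -- transmission power used in the slot [W]
    e_harv       -- energy packets harvested in the slot
    """
    # Data queue: drain what was served, add arrivals, clip to buffer size
    dq_next = min(max(dq - served_pkts, 0) + arrived_pkts, dq_max)
    # Energy spent in the slot, in energy packets; Eq. (17): the consumed
    # energy cannot exceed what is currently stored
    consumed = min(p_tx * tau / ps_energy, eq)
    # Eq. (18): the stored energy cannot exceed the battery capacity
    eq_next = min(eq - consumed + e_harv, eq_max)
    return dq_next, eq_next
```

For example, a sensor holding 5 data packets and 4 energy packets that serves 2 packets, receives 1 new packet, spends one energy packet on transmission, and harvests 3 more ends the slot with 4 data packets and 6 energy packets.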

Energy Efficiency Model
In this paper, we define the energy efficiency (EE) of WBANs as the ratio of the transmission rate to the consumed transmission power. Equation (19) gives the energy efficiency of the n-th body sensor in the k-th time slot.

Problem Formulation and Optimization Algorithm
From the energy efficiency maximization problem, we can see that it is a long-term multi-objective optimization problem. Simultaneously, because the variables p_{n,k} are continuous while α_{S_n}, β^k_{S_n}, and δ^k_{S_n} are binary, problem (21) is a mixed integer nonlinear programming problem, which cannot be directly solved by convex optimization methods. Even if we could transform the original problem into a tractable convex optimization problem, it would still require prior network information, such as channel state information (CSI), to achieve optimal performance. However, WBANs normally operate under dynamic channel conditions owing to posture and environmental variation [39,40]. Furthermore, from the queue update functions in Equations (15) and (16), we observe that the current queue states depend only on the current arrivals and the previous remainders in the queues. Thus, we can formulate problem (21) as a discrete-time and finite-state Markov decision process (DFMDP) [41]. More specifically, in this work, we formulate our scenario as a centralized DFMDP, in which the hub acquires the information about the network and users needed to make the optimal decision. The reasons for adopting a centralized DFMDP are as follows: (1) the hub has more abundant resources than the body sensors; (2) a centralized DFMDP reduces the network signaling overhead and redundancy compared with a distributed DFMDP, in which each body sensor would have to make decisions without complete knowledge of the global network information, increasing the computation complexity and consuming more energy. Meanwhile, owing to the high computation complexity of the problem, we propose a modified Q-learning algorithm to solve it.

DFMDP Model
DFMDP is a discrete-time stochastic control process that provides a mathematical framework for modeling decision-making problems in uncertain and stochastic environments [42]. Typically, a DFMDP is defined by a tuple (S, A, p, r), where S is a finite set of states, A is a finite set of actions, p is the transition probability from state s to state s' (∀s, s' ∈ S) after action a (∀a ∈ A) is performed, and r is the immediate reward obtained after a (∀a ∈ A) is executed. We denote π as a policy, that is, a mapping from a state to an action. Our goal is to find the optimal policy, denoted as π*, that maximizes the reward function over a finite time horizon in the DFMDP. Therefore, the detailed tuple in our proposed model is designed as follows:

1. The state of each body sensor S_n in the k-th time slot can be denoted as Sta^k_{S_n} ∈ S. In this model, the state comprises the data and energy queue lengths of each body sensor.

2. The action a (∀a ∈ A) in this scenario consists of the resource allocation variables, which include the transmission mode α_{S_n}, time slot allocation β^k_{S_n}, relay selection δ^k_{S_n}, and power allocation p_{n,k}. To ensure the integrity of the exploration of the action space, p^d_{n,k}, p^{s→r}_{n,m,k}, and p^{r→H}_{n,m,k} are subject to the maximum transmission power p^max_n.

3. The reward r is the immediate reward corresponding to the current state-action pair, which is given by Equation (20).
However, traditional value-based algorithms such as Monte Carlo [43] and temporal difference (TD) [44] algorithms have some shortcomings in practical applications; for instance, they cannot efficiently handle tasks with continuous action spaces, and the final solution may not be globally optimal. Therefore, we adopt a Q-learning algorithm over a discretized, reduced state-action space in this paper.
In order to address the formulated DFMDP problem, the Q-learning algorithm is an effective tool [45]. The core idea behind the Q-learning algorithm is to first define the value function V^π(s), which represents the expected return obtained by following policy π from each state s ∈ S. The value function V^π for policy π quantifies the goodness of the policy over an infinite-horizon discounted MDP, and can be represented as Equation (22):

V^π(s) = E_π[ Σ_k γ^k·r_k(s_k, a_k) | s_0 = s ] = E_π[ r_k(s_k, a_k) + γ·V^π(s_{k+1}) | s_0 = s ].
Additionally, it should be noted that, for an MDP with discrete state and action spaces, the Q-learning algorithm is capable of obtaining the optimal policy [42]. Because we aim to find the optimal policy π*, the optimal action at each state can be found by means of the optimal value function, as in Equation (23):

π*(s) = argmax_{a∈A} Q*(s, a).

If we denote Q*(s, a) = r_k(s_k, a_k) + γ·E_π[V^π(s_{k+1})] as the optimal Q-function for all state-action pairs, the optimal value function can be rewritten as V*(s) = max_a Q*(s, a). Q*(s, a) can be obtained through the iterative process of Equation (24):

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + α·[ r_k(s_k, a_k) + γ·max_{a'} Q_k(s_{k+1}, a') − Q_k(s_k, a_k) ],

where α is the learning rate, which determines the impact of new information on the existing Q-value, and γ ∈ [0, 1] is the discount factor. Algorithm 1 gives the pseudo-code of the Q-learning algorithm.

Algorithm 1: Q-learning algorithm
1. Initialize the Q-table Q(s, a) for all state-action pairs
2. Observe the current state s; initialize the values of α and γ
3. for episode = 1 to M do
4.   Select an action a based on the current state s
5.   Execute action a and obtain the immediate reward r and a new state s'
6.   Update the table entry Q(s, a) as expressed in Equation (24)
7.   Replace s ← s'
8. end for
9. Output: π*(s) = argmax_a Q*(s, a)

Initially, the resource allocation scheme randomly selects a transmission mode, relay node (if necessary), time slot, and transmission power, without considering the data and energy queue status of each body sensor, to obtain an initial state-action pair. Meanwhile, the algorithm initializes the values of α and γ. After the first iteration, from the current state-action pair, we obtain an immediate reward and a new state (the data and energy queue lengths of each body sensor). The algorithm then updates the state-action pair as expressed by Equation (24). Once either all Q-values converge or a maximum number of iterations is reached, the algorithm terminates, and the resulting policy indicates the action (resource allocation scheme) that maximizes the Q-value at each state. However, the convergence speed of the classical Q-learning algorithm may be too slow to find the optimal policy within an acceptable time, especially in practical and complicated models. Thus, one aspect that needs further consideration is cutting down the original state-action space of the Q-learning algorithm, as the convergence speed is sensitive to the size of the state-action space [46]. Therefore, we propose a modified Q-learning algorithm, which aims to improve the convergence speed by cutting down the irrelevant state-action pairs.
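Algorithm 1 and the update in Equation (24) can be sketched as a generic tabular Q-learning loop. The function below is an illustrative, self-contained version: the environment callback `step`, the ε-greedy exploration, and all hyperparameter defaults are assumptions for demonstration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_learning(n_states, n_actions, step, alpha=0.9, gamma=0.1,
               epsilon=0.1, episodes=200, horizon=50):
    """Tabular Q-learning in the spirit of Algorithm 1.

    step(s, a) is an environment callback returning (reward, next_state);
    the names and defaults here are illustrative assumptions.
    """
    Q = np.zeros((n_states, n_actions))          # step 1: initialize Q-table
    for _ in range(episodes):                     # step 3: episode loop
        s = int(rng.integers(n_states))           # step 2: observe a state
        for _ in range(horizon):
            # step 4: epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            r, s_next = step(s, a)                # step 5: act, observe r, s'
            # step 6, Equation (24):
            # Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next                            # step 7: s <- s'
    # step 9: pi*(s) = argmax_a Q*(s, a)
    return Q.argmax(axis=1), Q
```

As a toy usage example, in an environment where action 1 always yields reward 1 and action 0 yields nothing, the learned policy selects action 1 in every state.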

The Proposed Modified Q-Learning Algorithm
In order to ensure the integrity of the exploration space, the values of state and action are set to integers from 0 to their maximum values. However, this assumption results in unnecessary exploration of state-action pairs in the original space. The proposed modified Q-learning algorithm prunes these irrelevant pairs on both the state side and the action side. On the state side, we define the valid state space as the set of states satisfying two requirements simultaneously: available energy and serviceable data. Table 2 demonstrates the valid state space that should be explored; in Table 2, * represents a valid state and 0 indicates an invalid state. From the viewpoint of a valid state alone, all actions would still be explored, so we also investigate the irrelevant actions in this work. On the action side, for exploration space integrity, the power allocation action p_{n,k} in the Q-table ranges from 0 to p^max_n. However, p^max_n may exceed the power that the currently available energy can support. In this case, the range of the action value can be cut down from [0, p^max_n] to the range supported by the available energy. In this way, we can further reduce the state-action space.
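The two pruning rules can be sketched as simple predicates applied before exploration. The helper names, the power-level list, and the conversion of power to energy-packet cost below are illustrative assumptions; only the pruning logic follows the scheme described above.

```python
import numpy as np

def valid_power_actions(power_levels, available_energy, ps_energy, tau):
    """Action-side pruning: keep only the power levels whose per-slot
    energy cost (in energy packets) fits the current energy queue."""
    costs = np.asarray(power_levels) * tau / ps_energy
    return [p for p, c in zip(power_levels, costs) if c <= available_energy]

def state_is_valid(dq, eq):
    """State-side pruning (cf. Table 2): a state is worth exploring only
    if there are both serviceable data and available energy."""
    return dq > 0 and eq > 0
```

A sensor holding 2 energy packets, for instance, would only explore the power levels whose slot cost is at most 2 packets, and a state with an empty data queue or an empty energy queue would be skipped entirely.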
To better illustrate the computation complexity, we compare the state-action space size of the proposed modified Q-learning based algorithm with that of the classical Q-learning algorithm. We assume that the maximum available energy and serviceable data in each sensor is x packets and that the transmission power of each sensor is quantized into y levels. On the basis of the definition of the state-action space, Table 3 gives the computation complexity of the proposed modified Q-learning based algorithm and the classical Q-learning algorithm. Table 3. Computation complexity comparison between the modified and classical Q-learning algorithms.

From Table 3, we can see that the computation complexity of the classical Q-learning algorithm increases exponentially with the number of sensors, whereas the state-action space size of the proposed modified Q-learning based algorithm is unaffected by the number of sensors.
Meanwhile, the proposed modified Q-learning scheme also contributes to balancing the energy consumption among body sensors, because it considers the amount of harvested energy of each body sensor while allocating resources. In such a situation, if a specific body sensor harvests less energy from the environment, it will not be selected as a relay. Moreover, the standard deviation of the consumed energy is lower, which indicates that the consumed energy is more evenly distributed among body sensors and that the lifetime of the overall WBAN can be extended.

Simulation Results and Analysis
In this section, we compare the proposed algorithm with three other schemes: (1) a random power allocation scheme; (2) the classical Q-learning resource allocation scheme; and (3) the joint power-QoS control (PEH-QoS) scheme proposed in the literature [35]. To verify the effectiveness of the proposed algorithm, we evaluate the performance in terms of energy efficiency and convergence speed.

Simulation Setting
In simulations, we consider a WBAN scenario in which multiple heterogeneous body sensors and one hub are deployed at different positions for various detection purposes. Five typical body sensors are considered, with the following initial energy: the ECG sensor with 20 mJ, the EMG sensor with 12 mJ, the EEG sensor with 16 mJ, the glucose sensor with 12 mJ, and the motion sensor with 18 mJ. For simplicity, we assume that current energy harvesting technology is able to provide the required conversion efficiency. The hub is placed at the center of the topology with a communication range of 10 m, and it knows the position information of all body sensors. Each body sensor is randomly placed in the topology with a communication range of 2-5 m [47]. Simultaneously, we suppose that only the body sensors are equipped with the energy harvesting function and that the energy harvesting process is Poisson-distributed with rate λ_e at arrival instants t_k. The data arrival process is also Poisson-distributed, with rate λ_d at arrival instants t_k. The proposed modified Q-learning algorithm has no prior knowledge of either process. Meanwhile, because the scenario contains considerable randomness, we set 200 time instants for one episode and average the energy efficiency to reduce the variability. For each configuration, we generate 100 independent runs and average the energy efficiency performance. All of the detailed simulation variables used in this paper are summarized in Table 4.
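The arrival processes and the averaging procedure above can be sketched in a few lines. The rates, episode length, and run count below follow the values stated in the text; the variable names and the fixed random seed are assumptions for reproducibility of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Poisson traffic/energy arrivals: lambda_d = 5, lambda_e = 3 (packets per slot)
lam_d, lam_e = 5, 3
slots, runs = 200, 100   # 200 time instants per episode, 100 independent runs

data_arrivals = rng.poisson(lam_d, size=(runs, slots))
energy_arrivals = rng.poisson(lam_e, size=(runs, slots))

# Average within each run, then across independent runs, as in the paper
mean_data = data_arrivals.mean(axis=1).mean()
mean_energy = energy_arrivals.mean(axis=1).mean()
```

Over 100 runs of 200 slots, the empirical means of the two processes settle close to their nominal rates, which is the premise behind averaging the energy efficiency over runs to suppress the randomness of the scenario.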

Results and Analysis
The influence of learning rate α and discount factor γ on energy efficiency

In order to prevent other factors from influencing the performance, we first evaluate the influence of the learning rate α and the discount factor γ on energy efficiency. We implement two scenarios in which one body sensor in direct transmission mode and two body sensors in cooperative transmission mode are deployed, respectively. The energy harvesting rate λ_e is set to 3 packet/s and the data arrival rate λ_d is set to 5 packet/s. Figures 2 and 3 show the average energy efficiency under different values of α and γ. From the results of both scenarios, we can see that either decreasing the learning rate α or increasing the discount factor γ causes instability of the energy efficiency in the proposed resource allocation algorithm; these two cases are depicted by the brown and green marks in Figures 2 and 3. This is because a smaller α leads to less exploration, in which case the proposed algorithm increasingly concentrates on the greedy action, which has a more immediate effect in increasing the users' utility. Conversely, a larger γ causes less foresight in the policy updating, which reduces the average utility in the long term [48]. From the blue marks in Figures 2 and 3, we can observe that, when α is set to 0.9 and γ is set to 0.1, the average energy efficiency is most stable, which means that the algorithm has the highest convergence speed. Furthermore, we also tried more complex scenarios in which more sensors are deployed, and the influences of the learning rate α and discount factor γ were similar. For simplicity and ease of understanding, we only demonstrate this scenario; the results clearly show that the proposed algorithm performs better with a higher α and lower γ. Consequently, we set α = 0.9 and γ = 0.1 in the following simulations.
Figure 4 illustrates the energy efficiency optimization processes of the proposed modified Q-learning algorithm and the classical Q-learning algorithm. The simulation result gives two observations. First, the proposed modified algorithm converges after about 30 episodes, compared with about 80 episodes for the classical Q-learning algorithm; this is because the proposed modified algorithm eliminates the irrelevant state and action spaces, reducing the exploration space and thereby accelerating convergence. Second, as the episodes increase, the performance of the classical Q-learning algorithm tends to stabilize; however, the proposed algorithm outperforms the classical Q-learning algorithm by approximately 20%, owing to its lower computation complexity and signaling overhead. Figure 5 presents the average energy efficiency for different numbers of body sensors deployed in the WBAN. From the result, it can be observed that the proposed algorithm has the highest energy efficiency among the classical Q-learning algorithm, the PEH-QoS algorithm proposed in the literature [35], and the random power allocation algorithm. For the Q-learning based algorithms and the PEH-QoS algorithm, the average energy efficiency increases with the number of sensors, because more body sensors lead to a higher data rate. However, it can also be observed that, as the number increases to 9, the average energy efficiency tends to stabilize; this is because these three algorithms all take the available energy into consideration when allocating transmission power. For the random power allocation algorithm, the energy efficiency rises while the number of body sensors is less than 7, but decreases as more body sensors are deployed, whereas the other algorithms merely reach a saturation point.
This is because deploying more body sensors introduces more mutual interference, which further increases the required transmission power. Figure 6 presents the standard deviation of the consumed energy versus the number of body sensors deployed. The proposed modified Q-learning scheme gives the best performance, with an average standard deviation of 10.33, compared with 15.65 for the classical Q-learning scheme, 19.98 for the PEH-QoS scheme from the literature [35], and 26.57 for the random power allocation scheme. This is because the proposed modified Q-learning scheme is able to maintain fairness among body sensors by selecting the optimal transmission mode and relay body sensors, distributing the energy consumption to the relay sensors with more residual energy. It can also be observed that, at the beginning, the standard deviation increases with the number of body sensors; at this initial stage, the number of body sensors is small and the consumed energy in each body sensor is unbalanced. As the number of body sensors increases further, the standard deviation tends to decrease and becomes stable. Another interesting finding is that the standard deviation of consumed energy of the random power allocation scheme increases sharply with the number of body sensors, because this scheme allocates transmission power randomly and does not consider the residual available energy in each body sensor. From Figures 7 and 8, it is clear that the proposed modified Q-learning algorithm and the classical Q-learning algorithm achieve higher energy efficiency than the PEH-QoS scheme from the literature [35] and the random power allocation scheme. With the increase of λ_e, the energy efficiency improves sharply.
This is because more energy can be harvested in each time slot with a higher λ_e, and the Q-learning-based algorithms are able to find the optimal coupling among energy harvesting time, transmission mode, relay selection, and power allocation. The PEH-QoS scheme from the literature [35] performs slightly better than the random power allocation scheme. Interestingly, however, when λ_e is less than 4, the random power allocation scheme attains the best energy efficiency, precisely because it does not take the available energy into account when allocating transmission power.

Figures 9 and 10 plot the energy efficiency for different data arrival rates λ_d, with the energy harvesting rate λ_e set to 3 and 5, respectively. As shown in Figures 9 and 10, the proposed algorithm still obtains the highest energy efficiency among the four algorithms across all values of λ_d. The reason is that a higher data arrival rate increases the transmission rate, and the proposed algorithm can achieve the optimal coupling between transmission rate and energy consumption, thus improving the energy efficiency. From the results of Figures 9 and 10, we also observe that, once λ_d reaches 7, the energy efficiency tends to saturate: serving more data requires more transmission power p_{n,k}, but λ_e is fixed at 3 or 5 in this simulation. However, although the PEH-QoS algorithm from the literature [35] considers energy harvesting in its resource allocation strategy, and its authors reported performance for transmission rates up to 1 Mbps, it does not take the data arrival rate of each body sensor into account. Thus, the PEH-QoS results in Figures 9 and 10 are nearly identical.
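The saturation at λ_d = 7 follows from energy causality: no matter how much data arrives, the power spent in a slot cannot exceed what the battery holds. A minimal sketch of the per-slot battery evolution, assuming a finite battery capacity and generic energy units (the paper's exact battery model is not reproduced in this excerpt):

```python
def battery_update(residual, harvested, consumed, capacity):
    """Residual energy at the next slot under the energy-causality constraint.

    Harvested energy is first added, capped by the finite battery capacity;
    the energy spent on transmission is then subtracted, floored at zero.
    """
    return max(0.0, min(capacity, residual + harvested) - consumed)
```

With a fixed harvesting rate, `consumed` is bounded by `min(capacity, residual + harvested)` in every slot, so raising the data arrival rate beyond what the harvested energy can serve no longer raises the achievable rate, and the energy efficiency flattens.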
Meanwhile, because the random power allocation scheme allocates transmission power randomly, its energy efficiency does not change with λ_e or λ_d.

In addition, because WBANs carry safety-critical information, they impose stringent reliability requirements; hence, in this simulation we also evaluate the reliability of the proposed scheme. Reliability is represented by the delivery probability, defined as the probability of successfully delivering the sensory data of each body sensor, of size P bits, within an acceptable time T. Hence, the delivery probability can be given as Pr{α_n^S · R_n^d + (1 − α_n^S) · R_n^c ≥ P/T}. Figure 11 gives the average delivery probability of the WBANs for different numbers of deployed body sensors, with P set to a constant size of 1 Mb for each body sensor. From the result, we find that, as the number of body sensors increases, the average delivery probability decreases for all schemes, because deploying more body sensors causes more mutual interference. Nevertheless, the proposed scheme still gives better performance than the other schemes throughout the tested cases. Remarkably, the proposed scheme is capable of guaranteeing an average delivery probability above 90% even in the worst case. Moreover, in conjunction with the result of Figure 5, this validates the practicability of the proposed scheme.
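The delivery-probability definition above can be evaluated empirically. The sketch below estimates Pr{α_n^S · R_n^d + (1 − α_n^S) · R_n^c ≥ P/T} from a set of sampled rates; the helper names and sampling interface are illustrative assumptions, not the paper's implementation:

```python
def effective_rate(alpha_s, rate_direct, rate_coop):
    """Time-shared rate: a fraction alpha_s of the slot uses the direct
    link at rate R^d, the remainder uses the cooperative (relay) link
    at rate R^c."""
    return alpha_s * rate_direct + (1.0 - alpha_s) * rate_coop

def delivery_probability(rate_samples, p_bits, t_seconds):
    """Empirical Pr[effective rate >= P/T]: the fraction of sampled
    effective rates that can deliver P bits within T seconds."""
    threshold = p_bits / t_seconds
    return sum(1 for r in rate_samples if r >= threshold) / len(rate_samples)
```

For instance, with P = 1 Mb and T = 1 s, the threshold is 1 Mbps, and any sampled effective rate at or above 1 Mbps counts as a successful delivery.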

Conclusions
The main motivation of this paper is to study the resource allocation scheme for EH-WBANs. Unlike in traditional WBANs, the available energy becomes another vital issue that must be considered in the resource allocation scheme. Specifically, with the goal of maximizing the average energy efficiency, we formulate the resource allocation problem as a DFMDP in which the transmission mode, relay selection, allocated time slot, power allocation, and energy constraint of each body sensor are considered. Owing to the high complexity of the problem, we solve the maximization problem using a modified Q-learning algorithm. Extensive simulations show that the proposed scheme significantly enhances the energy efficiency under different network settings. Additionally, in conjunction with the transmission reliability results, we validate the practicability of the proposed scheme in EH-WBANs.

Conflicts of Interest:
The authors declare no conflict of interest.