Optimal Management of Rechargeable Biosensors in Temperature-Sensitive Environments

Biosensors are tiny wireless medical devices that are attached to or implanted into the body of a human being or animal to monitor and control biological processes. They are distinguished from conventional sensors by their biologically derived sensing elements. Biosensors generate heat when they transmit their measurements, and their temperature rises when they are recharged by electromagnetic energy. These phenomena translate to a temperature increase in the tissues surrounding the biosensors. If the temperature increase exceeds a certain threshold, the tissues might be damaged. In this paper, we discuss the problem of finding an optimal operating policy for a rechargeable biosensor under a strict maximum temperature increase constraint. This problem can be formulated as a Markov decision process with an average reward criterion. The solution is an optimal policy that maximizes the average number of samples which can be generated by the biosensor while observing the constraint on the maximum safe temperature level. Because the size of the problem grows exponentially, a heuristic policy is also proposed. The performance of the policies is studied through simulation. A greedy policy is used as a baseline for comparison.


Introduction
Biosensors are tiny wireless devices attached to or implanted into the body of a human or animal to monitor and detect abnormalities and then relay data to a physician or provide therapy on the spot. Unlike conventional wireless sensors, biosensors are constrained by temperature as well as energy. Also, their sensing elements are biological materials such as enzymes and antibodies which are integrated into transducers for producing electrical signals in response to biological reactions and changes. Biosensors are powered either by rechargeable built-in batteries or by continuously sending electric energy to them in the form of electromagnetic waves.
The use of batteries necessitates periodic recharging which can be performed using energy resulting from vibration, motion, light, and heat. However, a more mature approach is to wirelessly collect energy from a Radio Frequency (RF) source and then convert it into usable power. This approach is widely used in the industry to transfer data and power to biosensors. It is also more practical since many biosensors can be recharged simultaneously. In essence, a charging station generates a magnetic field that can convey energy through the skin. From the penetrating magnetic field, an electric voltage is produced by induction in the receiver circuit. The induced voltage is then rectified, filtered, and stabilized to run the biosensors or recharge their batteries.
In this paper, we study a stochastic control problem which arises when a rechargeable biosensor operates in a temperature-sensitive environment like the human body. In this problem, the state of the biosensor is characterized by its current temperature and energy levels, and uncertainty exists due to the random behavior of the wireless channel between the biosensor and base station. The objective is to operate the biosensor in such a way that the average number of samples generated by the biosensor is maximized while the maximum safe temperature level is not exceeded. This control problem can be formulated as a Markov decision process (MDP) and solved to obtain an optimal operating policy.
Since the size of the MDP model increases with the number of biosensors and their states, Q-learning, which is a form of reinforcement learning, is used to obtain the optimal policy. The optimal policy is learned by interacting with a simulation model of the system. Another way to handle large MDP models is through the use of heuristic policies. This paper proposes a simple heuristic policy whose performance is sufficiently close to that of the optimal policy. A greedy policy is also proposed and used as a baseline for comparing the performance of the different policies.
The remainder of the paper is organized as follows. In the next section, the necessary background information is given. Then, the limited available literature is reviewed. After that, the model of the system under study is described, followed by the presentation of its MDP formulation. Next, it is shown how large-size MDP models can be handled using Q-learning. In addition, greedy and heuristic policies are described. Then, numerical and simulation results are presented and an example is given. Finally, conclusions are summarized and directions for further research are suggested.

Calculating the Temperature Increase
Radiation due to wireless communication and recharging are the major sources of heat in biosensor networks. The level of radiation absorbed by the human body when exposed to RF radiation is measured by the Specific Absorption Rate (SAR) which is expressed in units of W/kg. SAR records the rate at which radiation energy is absorbed per unit mass of tissue [1]. The mathematical relationship between SAR and radiation is given by
$$\mathrm{SAR} = \frac{\sigma |E|^2}{\rho} \tag{1}$$
where $E$ is the induced electric field due to radiation, and $\rho$ and $\sigma$ are the density and electrical conductivity of the tissue, respectively. As an example to appreciate the importance of this measure, it was reported in [2] that an exposure to a SAR of 8 W/kg in any gram of tissue in the head for 15 minutes may result in tissue damage. SAR is a point quantity. That is, its value varies from one location to another. In this paper, we consider only the SAR in the near field (i.e., the space around the antenna of the biosensor). The extent of the near field is given by $d_0 = \lambda / 2\pi$, where $\lambda$ is the wavelength of the carrier signal used in wireless communication. SAR in the near field is given by the following equation [3]:
$$\mathrm{SAR} = \frac{\sigma \mu \omega}{\rho \sqrt{\sigma^2 + \epsilon^2 \omega^2}} \left( \frac{I \, dl \, \sin\theta \, e^{-\alpha R}}{4\pi} \left( \frac{1}{R^2} + \frac{|\gamma|}{R} \right) \right)^2 \tag{2}$$
where $\mu$ and $\epsilon$ are the permeability and permittivity of the tissue, respectively. $dl$ is the length of the wire representing the antenna, $I$ is the current provided to the antenna, $\alpha$ is the attenuation constant, $R$ is the distance from the biosensor to the observation point, $\theta$ is the angle between the observation point and the $x$-$y$ plane, $\gamma$ is the propagation constant, and $\omega$ is the angular frequency. We assume that the radiation patterns are omnidirectional on the 2D plane and thus $\sin\theta = 1$.
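As a small illustration of the SAR definition and the near-field extent described above, the following sketch evaluates both quantities. The function names and the numerical values in the usage note are hypothetical and are not taken from the paper.

```python
import math


def sar_from_field(e_field, sigma, rho):
    """SAR in W/kg from the induced field magnitude (V/m),
    tissue conductivity (S/m), and tissue density (kg/m^3)."""
    return sigma * abs(e_field) ** 2 / rho


def near_field_extent(wavelength):
    """Extent of the near field, wavelength / (2 * pi), in the same
    length unit as the wavelength."""
    return wavelength / (2.0 * math.pi)
```

For instance, `sar_from_field(10.0, 0.5, 1000.0)` evaluates the ratio for an assumed field of 10 V/m in tissue with an assumed conductivity of 0.5 S/m and density of 1000 kg/m^3.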
The Pennes bioheat equation [4] is the standard for calculating the temperature increase in the body due to heating. The general form of the equation is
$$\rho C_p \frac{dT}{dt} = K \nabla^2 T - b (T - T_b) + \rho \, \mathrm{SAR} + P_c + Q_m \tag{3}$$
where $\rho$ is the mass density, $C_p$ is the specific heat of the tissue, $dT/dt$ is the rate of temperature increase, $K$ is the thermal conductivity of the tissue, $b$ is the blood perfusion constant which indicates how fast the heat can be taken away by the blood flow inside the tissue, and $T_b$ is the temperature of the blood and the tissue. Terms on the right side indicate the heat accumulated inside the tissue. The terms $K \nabla^2 T$ and $b (T - T_b)$ are the heat transfer due to the thermal conduction and the blood perfusion, respectively. The terms $\rho \, \mathrm{SAR}$, $P_c$, and $Q_m$ are the heat generated due to radiation, the power dissipation of circuitry, and the metabolic heating, respectively.
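The Finite-Difference Time-Domain (FDTD) method [5] is a technique that transforms the previous bioheat equation to a discrete form with discrete time and space steps. The area under consideration is divided into cells of side $\delta$, and the temperature is evaluated in a grid of points defined at the centers of the cells. Temperatures are computed at equally spaced time instants with a time step equal to $\delta_t$. Therefore, from [6], the new bioheat equation is
$$T^{m+1}(i,j) = \left[ 1 - \frac{\delta_t b}{\rho C_p} - \frac{4 \delta_t K}{\rho C_p \delta^2} \right] T^m(i,j) + \frac{\delta_t}{C_p} \mathrm{SAR} + \frac{\delta_t b}{\rho C_p} T_b + \frac{\delta_t}{\rho C_p} (P_c + Q_m) + \frac{\delta_t K}{\rho C_p \delta^2} \left[ T^m(i+1,j) + T^m(i,j+1) + T^m(i-1,j) + T^m(i,j-1) \right] \tag{4}$$
where $T^{m+1}(i,j)$ is the temperature of cell $(i,j)$ at time $m+1$, $\delta_t$ is the time step, and $\delta$ is the space step. Using (2) and (4), the temperature increase at the location of the biosensor $(i,j)$ can be found. It is assumed that the temperature of the surrounding cell points is the normal body temperature (i.e., 37 °C).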
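The discretized bioheat update described above can be sketched as a single explicit time step over a 2D grid. This is a minimal illustration rather than the authors' implementation: the function name, the argument names, and the choice to hold boundary cells fixed are assumptions, and the update is a direct forward-difference discretization of the Pennes equation.

```python
def fdtd_step(T, sar, p_c, q_m, rho, c_p, k, b, t_b, dt, dx):
    """Advance the temperature grid T (list of lists, in deg C) by one step.

    sar and p_c are grids of the same shape as T; q_m is a scalar.
    Boundary cells are left unchanged (assumption: they stay at body
    temperature, as the surrounding cell points do in the text).
    """
    rows, cols = len(T), len(T[0])
    a = dt / (rho * c_p)           # common factor dt / (rho * C_p)
    g = a * k / dx ** 2            # conduction coefficient
    new = [row[:] for row in T]    # copy so that boundaries are preserved
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbours = T[i + 1][j] + T[i - 1][j] + T[i][j + 1] + T[i][j - 1]
            new[i][j] = (
                (1.0 - a * b - 4.0 * g) * T[i][j]   # cell's own contribution
                + g * neighbours                     # thermal conduction
                + a * b * t_b                        # blood perfusion
                + dt / c_p * sar[i][j]               # radiation (rho cancels)
                + a * (p_c[i][j] + q_m)              # circuitry and metabolism
            )
    return new
```

The explicit scheme is only stable when the bracketed coefficient of the cell's own temperature stays non-negative, so the time step has to be chosen small enough relative to the space step.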

Related Work
The research on the possible biological effects caused by biosensors and how to mitigate those effects is very recent. Most of the existing research deals with other technical issues such as energy efficiency and quality of service. In this section, the limited available literature is briefly reviewed.
Tang et al. [6] were the first to propose rotating the cluster leadership in a cluster-based biosensor network to minimize the heating effects on human tissues. They proposed a genetic algorithm for computing a minimal temperature increase rotation sequence. Since using (4) in computing the temperature increase due to a sequence is computationally expensive, they proposed a scheme for estimating the possible temperature increase due to a sequence.
In another work, Tang et al. [7] addressed the issue of routing in implanted biosensor networks. They proposed a thermal-aware routing protocol that routes the data away from high-temperature areas referred to as hot spots. The location of a biosensor becomes a hot spot if the temperature of the biosensor exceeds a predefined threshold. The proposed protocol achieves a better balance of temperature increase and is also capable of load balancing.
The above two works have motivated us to explore further the bioeffects of implanted biosensor networks. As a result, we noticed a lack of information on how to optimally operate an implanted biosensor network when bounds such as the maximum temperature increase exist. Most of the existing works assume that energy is the only limiting factor in the operation of Wireless Sensor Networks (WSNs). However, this is not the case in biosensor networks where the increase in temperature is a serious limiting factor.
We have approached the problem of how to optimally operate a biosensor network from the perspective of sensor scheduling and activation in conventional wireless sensor networks. Sensor scheduling is concerned with the problem of how to dynamically choose a sensor for communication with the base station. On the other hand, sensor activation is concerned with the problem of when a sensor should be activated. Many interesting works have been done in this regard. Next, these works are briefly reviewed.
In [8], the sensor scheduling problem is formulated as an MDP. The objective is to find an operating policy that maximizes the network lifetime. The state of a sensor is characterized by its current energy level only. Three kinds of channel state information are considered: global, channel statistics, and local. Considering only the energy level at each sensor gives rise to an acyclic (i.e., loop-free) transition graph which enables the MDP model to converge in one iteration. On the other hand, if the temperature of each sensor is included in the model, the transition graph of the underlying MDP becomes cyclic. This is because when the sensor cools down (i.e., its temperature decreases), it transitions back to a less hot state. An MDP model whose transition graph is cyclic needs more time to converge.
Dynamic sensor activation in networks of rechargeable sensors is considered in [9]. The objective is to find an activation policy that maximizes the event detection probability under the constraint of a slow recharge rate of the sensor. The state of the system is characterized by the energy level of the sensor and whether or not an event would occur in the next time slot. The recharge event is random and recharges the sensor with a constant charge. The model does not include the state of the wireless channel, which is crucial when temperature is considered.
Body sensor networks [10] with energy harvesting capabilities are another kind of WSNs in which each sensor has an energy harvesting device that collects energy from ambient sources such as vibration, light, and heat. In this way, the more costly recharging method which uses radiation is avoided. The interaction between the battery recharge process and transmission with different energy levels is studied in [11]. The proposed policies utilize the sensor's knowledge of its current energy level and the state of the processes governing the generation of data and battery recharge to select the appropriate transmission mode for a given state of the network.

System Model
Figure 1 shows the system under study where a mobile subject, in this case an animal, has a biosensor implanted into its body. The biosensor has a built-in battery which is recharged by an RF power source. The role of the biosensor is to monitor and report interesting physiological events such as heart rate and blood pressure. The biosensor becomes incapable of detecting and reporting events if it does not have enough energy for transmission under any channel condition or the increase in its temperature exceeds a prespecified threshold. The latter condition causes a halt in system operation to allow the system to cool down. Both the biosensor and RF power source are under the control of the base station which initiates the measurement process. The base station generates three control signals: Sleep and Sample targeted at the biosensor and Recharge targeted at the RF power source. The system state information is assumed to be available to the base station before it generates a control signal.
Mathematically, the system can be modeled as a discrete-state system which evolves in discrete time. Therefore, the time axis is divided into slots of equal duration $\Delta t$. At the beginning of each time slot, the state of the system is observed and a control signal is generated by the base station accordingly. Each time slot is long enough to transmit a complete packet carrying a measurement. Next, the elements of the system are described.

Biosensor.
A biosensor typically contains four essential components: a biorecognition system, a transducer, a radio, and a battery.
The biorecognition system is made of elements such as enzymes and antibodies whose role is to produce a physiochemical change which is detected and measured by the signal transducer. The transducer carries out signal processing tasks. The radio circuitry is responsible for wireless communication. The battery provides power for all active modules in the biosensor and is recharged using RF energy. During a recharging period, the biosensor uses its radio module to collect energy and recharge the battery. Therefore, while its battery is being recharged, the biosensor cannot perform sensing and communication.
The location of a biosensor represents a critical point since it experiences the maximum temperature increase. This is because the tissues surrounding the biosensor might be heated continuously due to the local radiation generated by the biosensor itself and the radiation generated by the base station while recharging the biosensor.
In each time slot $t$, the state of the biosensor is characterized by two variables which are the current temperature $T_t$ and energy level $E_t$. There are $T_{\max} + 1$ safe temperature levels; that is, $T_t \in \{0, 1, \ldots, T_{\max}\}$, where the zero temperature level represents the normal body temperature and $T_{\max}$ is an upper limit which must not be exceeded. Initially, the biosensor has a total energy of $\mathcal{E}_0$ which is also the capacity of its battery. The energy required for the biosensor to successfully transmit its measurement to the base station is determined by the state of the wireless channel at the time of transmission. This transmission energy is denoted by $\mathcal{E}_k$, where $k$ is the $k$th state of the wireless channel. The temperature increase due to a transmission energy of $\mathcal{E}_k$ units is denoted by $\mathcal{T}_k$.
At the beginning of each time slot, the base station may decide to recharge the biosensor, let it transmit its measurement, or put it to sleep. The time required for a full recharge is random since it depends on the current temperature and energy levels. During this time, interesting events may occur but they will not be reported by the biosensor since it is being recharged. Also, the biosensor may be put to sleep for a random amount of time during which no measurements can be produced.
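At the beginning of the next time slot (i.e., $t + 1$), the energy level at the biosensor is given by the following equation:
$$E_{t+1} = \begin{cases} E_t - \mathcal{E}_k & \text{if } a_t = \text{Sample} \\ \min(E_t + \mathcal{E}_r, \mathcal{E}_0) & \text{if } a_t = \text{Recharge} \\ E_t & \text{if } a_t = \text{Sleep} \end{cases} \tag{5}$$
where $a_t$ is the action taken by the base station at time $t$ and $\mathcal{E}_r$ is the amount of energy gained by the biosensor. Similarly, the temperature of the biosensor at $t + 1$ is given by the following equation:
$$T_{t+1} = \begin{cases} T_t + \mathcal{T}_k & \text{if } a_t = \text{Sample} \\ T_t + \mathcal{T}_r & \text{if } a_t = \text{Recharge} \\ \max(T_t - \mathcal{T}_s, 0) & \text{if } a_t = \text{Sleep} \end{cases} \tag{6}$$
where $\mathcal{T}_s$ is the amount by which the temperature of the biosensor decreases when it is put to sleep. In the same way, $\mathcal{T}_r$ and $\mathcal{T}_k$ are the amounts by which the temperature of the biosensor increases when it is recharged and when it is allowed to transmit its measurement, respectively. $\mathcal{T}_r$ and $\mathcal{T}_k$ can be calculated using (4). $\mathcal{T}_r$ is constant since the SAR due to the base station is assumed to be constant. On the other hand, $\mathcal{T}_k$ is not constant since the SAR due to the biosensor changes with the change in transmission energy. Therefore, before $\mathcal{T}_k$ can be calculated, the SAR due to the radiation from the antenna of the biosensor is calculated using (2).
Since (2) is a function of $I$, it is assumed that the current ($I$) corresponding to each transmission power level is known.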
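The energy and temperature updates described above can be sketched as one small function. The argument names are hypothetical, and two details are assumptions made for this sketch rather than statements from the paper: the energy is capped at the battery capacity after a recharge, and the temperature level is floored at zero (normal body temperature) after a sleep.

```python
SAMPLE, RECHARGE, SLEEP = "Sample", "Recharge", "Sleep"


def next_levels(temp, energy, action, *, e_tx, t_tx, e_gain, t_rech, t_cool, e_cap):
    """Return (next temperature level, next energy level) for one time slot.

    e_tx / t_tx: energy cost and temperature rise of a transmission in the
    current channel state; e_gain / t_rech: energy gain and temperature rise
    of a recharge; t_cool: temperature drop of a sleep; e_cap: battery capacity.
    """
    if action == SAMPLE:
        return temp + t_tx, energy - e_tx
    if action == RECHARGE:
        # Assumption: the battery cannot be charged beyond its capacity.
        return temp + t_rech, min(energy + e_gain, e_cap)
    if action == SLEEP:
        # Assumption: level 0 is normal body temperature, so cooling stops there.
        return max(temp - t_cool, 0), energy
    raise ValueError(f"unknown action: {action!r}")
```

The function does not check feasibility; a caller is expected to offer only actions that keep the temperature within the safe range and the energy non-negative.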

Wireless Channel.
The communication between the biosensor and base station occurs over a Rayleigh fading channel with additive Gaussian noise. Hence, the instantaneous received Signal-to-Noise Ratio (SNR), denoted by $\gamma$, is exponentially distributed with the following probability density function [12]:
$$p(\gamma) = \frac{1}{\gamma_0} \exp\left( -\frac{\gamma}{\gamma_0} \right), \quad \gamma \ge 0 \tag{7}$$
where $\gamma_0$ is the average received SNR. Such a wireless channel can be modeled as a Finite-State Markov Chain (FSMC) [13,14]. The model can be built as follows. For a wireless channel with $K$ states, the state boundaries (i.e., SNR thresholds) are denoted by $\Gamma_1, \Gamma_2, \ldots, \Gamma_K, \Gamma_{K+1}$, where $\Gamma_1 = 0$ and $\Gamma_{K+1} = \infty$. The channel is said to be in state $k$ if the SNR is between $\Gamma_k$ and $\Gamma_{k+1}$, where $k = 1, 2, \ldots, K$. It is assumed that the SNR remains the same during packet transmission, and only transitions to the current or adjacent states are allowed.
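The steady-state probability of the $k$th state of the FSMC is given by
$$\pi_k = \int_{\Gamma_k}^{\Gamma_{k+1}} p(\gamma) \, d\gamma = \exp\left( -\frac{\Gamma_k}{\gamma_0} \right) - \exp\left( -\frac{\Gamma_{k+1}}{\gamma_0} \right) \tag{8}$$
and thus the state transition probabilities are
$$\begin{aligned} P_{k,k+1} &= \frac{N(\Gamma_{k+1}) \, \Delta t}{\pi_k}, & k &= 1, \ldots, K-1, \\ P_{k,k-1} &= \frac{N(\Gamma_k) \, \Delta t}{\pi_k}, & k &= 2, \ldots, K, \\ P_{k,k} &= 1 - P_{k,k+1} - P_{k,k-1}, & k &= 1, \ldots, K, \end{aligned} \tag{9}$$
with $P_{1,0} = P_{K,K+1} = 0$, where $N(\Gamma)$ is the average number of times per unit interval that the SNR crosses level $\Gamma$ and $\Delta t$ is the packet duration.
$N(\Gamma)$ can be computed using the following equation [15]:
$$N(\Gamma) = \sqrt{\frac{2 \pi \Gamma}{\gamma_0}} \, f_m \exp\left( -\frac{\Gamma}{\gamma_0} \right) \tag{10}$$
where $f_m$ is the maximum Doppler frequency defined as $f_m = v / \lambda$ with $v$ being the speed of the subject and $\lambda$ being the wavelength. The above channel model has been verified to be precise when the fading process is slow [13], such as in biosensor applications.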
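The construction of the finite-state channel model described above can be sketched as follows. The function and argument names are hypothetical; the code uses the exponential SNR distribution for the steady-state probabilities, the level crossing rate for a Rayleigh channel, and transitions to adjacent states only. The approximation is meaningful only for slow fading, where the off-diagonal entries stay small.

```python
import math


def build_fsmc(thresholds, avg_snr, doppler_hz, slot_s):
    """Return (steady-state probabilities, transition matrix) of the FSMC.

    thresholds: lower SNR boundary of each state, starting with 0; the upper
    boundary of the last state is taken to be infinity.
    """
    bounds = list(thresholds) + [math.inf]
    n = len(thresholds)

    # Probability mass of the exponential SNR distribution in each interval.
    pi = [math.exp(-bounds[k] / avg_snr) - math.exp(-bounds[k + 1] / avg_snr)
          for k in range(n)]

    def crossing_rate(level):
        # Average number of crossings of `level` per second.
        return (math.sqrt(2.0 * math.pi * level / avg_snr)
                * doppler_hz * math.exp(-level / avg_snr))

    P = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if k + 1 < n:                      # move up to the next state
            P[k][k + 1] = crossing_rate(bounds[k + 1]) * slot_s / pi[k]
        if k > 0:                          # move down to the previous state
            P[k][k - 1] = crossing_rate(bounds[k]) * slot_s / pi[k]
        P[k][k] = 1.0 - sum(P[k])          # remain in the same state
    return pi, P
```

A call such as `build_fsmc([0.0, 1.0], avg_snr=2.0, doppler_hz=2.0, slot_s=0.01)` yields a two-state chain; all numbers here are illustrative and not taken from the paper's parameter table.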
International Journal of Distributed Sensor Networks

MDP Formulation
An MDP is a model of a dynamic system whose behavior varies with time. The elements of an MDP model are the following [16]: a set of system states, a set of actions available in each state, a reward (or cost) function, and a state transition probability function. The solution of an MDP model (referred to as a policy) gives a rule for choosing an action at each possible system state. If the policy chooses an action at time $t$ depending only on the state of the system at time $t$, it is referred to as a stationary policy. An optimal stationary policy exists over the class of all policies if every stationary policy gives rise to an irreducible Markov chain. This means that one can limit the attention to the class of stationary policies.
In order to obtain a policy from an MDP model, it is necessary to form and solve the so-called optimality equation (or Bellman's equation). The following is the standard form of this equation with the maximization operator [17]:
$$V_{n+1}(s) = \max_{a \in A(s)} \left\{ r(s,a) + \sum_{s' \in S} P_{s,s'}(a) \, V_n(s') \right\} \tag{11}$$
where $n$ is the iteration index, $S$ is the set of system states ($s \in S$), $A(s)$ is the set of actions possible when the system is at state $s$, $r(s,a)$ is the reward/cost per step, $P$ is the system state transition probability matrix, and $V(s)$ is the optimal value of the objective function when the system is started at state $s$ and the optimal policy is followed. Equation (11) can be solved using the classical policy iteration, value iteration, and relative value iteration algorithms [17]. Next, the details of the MDP model are given.

State Set.
The state of the system at time $t$ is described by the following 3-dimensional vector:
$$s_t = (T_t, E_t, C_t) \tag{12}$$
where $T_t$, $E_t$, and $C_t$ are the current temperature of the biosensor, its energy level, and transmission power required for successful transmission at time $t$, respectively. The total number of possible states is $|S| = |S_T| \times |S_E| \times |S_C|$, where $|S_T|$, $|S_E|$, and $|S_C|$ are the numbers of possible temperatures, residual energies, and transmission energy levels, respectively.

Action Set.
In each time slot, the base station chooses an action based on the current state of the system. In each state $s$, there are three possible actions:
$$A(s) = \{\text{Sample}, \text{Recharge}, \text{Sleep}\} \tag{13}$$
where the Sample action lets the biosensor generate a measurement and transmit it to the base station, the Recharge action recharges the biosensor, and the Sleep action puts the biosensor to sleep.
The Sleep action can be performed at every system state. The other two actions, however, can only be performed at system states where the next temperature of the biosensor is within the safe temperature range. In addition, the Sample action can only be performed at system states where the remaining energy is sufficient to make a successful transmission.
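The feasibility rules above translate directly into a small helper. The function and argument names are hypothetical; temperatures and energies are treated as integer levels, as in the system model.

```python
def feasible_actions(temp, energy, *, e_tx, t_tx, t_rech, t_max):
    """Return the list of actions allowed in the given biosensor state.

    e_tx / t_tx: energy cost and temperature rise of a transmission in the
    current channel state; t_rech: temperature rise of a recharge;
    t_max: maximum safe temperature level.
    """
    actions = ["Sleep"]                       # Sleep is always allowed
    if temp + t_rech <= t_max:                # next temperature stays safe
        actions.append("Recharge")
    if temp + t_tx <= t_max and energy >= e_tx:
        actions.append("Sample")              # safe and enough energy to transmit
    return actions
```

Keeping the feasibility check separate from the state update makes it straightforward to reuse the same rules in the exact MDP solution, in a learning agent, and in heuristic policies.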

Reward Function.
Since the objective is to maximize the expected number of samples that can be generated by the biosensor, the reward function is defined as
$$r(s_t, a_t) = \begin{cases} 1 & \text{if } a_t = \text{Sample} \\ 0 & \text{otherwise} \end{cases} \tag{14}$$
This means that one unit of reward is earned every time the Sample action is performed. The long-run expected sum of rewards represents the average number of samples that can be generated by the biosensor with an initial energy of $\mathcal{E}_0$ units and maximum temperature increase of $T_{\max}$ units.

Transition Probability Function.
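After the action taken by the base station is performed, the system transits to a new state according to the transition probabilities of the present state of the wireless channel. Thus, the behavior of the system is described by $|A|$ transition probability matrices, and each such matrix is of size $|S| \times |S|$. Each matrix is denoted by $P_{s_t, s_{t+1}}(a)$ which is the probability that choosing an action $a$ when in state $s_t$ will lead to state $s_{t+1}$. More formally, $P_{s_t, s_{t+1}}(a)$ can be written as follows:
$$P_{s_t, s_{t+1}}(a) = \begin{cases} P_{k,k'} & \text{if } T_{t+1} \text{ and } E_{t+1} \text{ are the levels that result from applying action } a \text{ to } T_t \text{ and } E_t \\ 0 & \text{otherwise} \end{cases} \tag{15}$$
where $k$ and $k'$ are the states of the wireless channel at times $t$ and $t + 1$, respectively, and $P_{k,k'}$ is the corresponding channel state transition probability.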

Value Function.
The problem of finding an optimal policy for maximizing the average number of samples is formulated as an infinite-horizon MDP using the average reward criterion [16]. So, let $V^{\pi}(s_0)$ be the expected number of samples given that the policy $\pi$ is used with an initial state $s_0$. Then, the maximum expected number of samples $V^{*}(s_0)$ starting from state $s_0$ is given by
$$V^{*}(s_0) = \max_{\pi} V^{\pi}(s_0) \tag{16}$$
The optimal policy $\pi^{*}$ is the one that achieves the maximum expected number of samples at all system states. The well-known value iteration algorithm [17] is used to numerically solve the following recursive equation for $n > 0$:
$$V_n(s) = \max_{a \in A(s)} \left\{ r(s,a) + \sum_{s' \in S} P_{s,s'}(a) \, V_{n-1}(s') \right\} \tag{17}$$
In (17), the subscript $n$ denotes the iteration index. As $n \to \infty$, $V_n \to V^{*}$.
Figure 2: Using an RL algorithm (like Q-learning), the decision-making agent gradually learns the optimal policy. An action is applied to the system or simulated and then the resulting reward/cost is fed back into the knowledge base of the decision maker. The new knowledge obtained over time helps the decision maker to select better actions.
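To make the formulation concrete, the following sketch builds a tiny instance of the model and solves it numerically. It is not the authors' solver. It swaps in relative value iteration (one of the algorithms named in the text) with the standard aperiodicity transformation so that the iterates stay bounded under the average reward criterion. All names and parameter values are hypothetical, the channel is given directly as a transition matrix, and the energy cap and temperature floor are assumptions of this sketch.

```python
import itertools


def build_model(t_max, e_cap, e_tx, t_tx, e_gain, t_rech, t_cool, p_ch):
    """Return (states, actions, outcomes) for states (temp, energy, channel)."""
    n_ch = len(e_tx)
    states = list(itertools.product(range(t_max + 1), range(e_cap + 1), range(n_ch)))

    def actions(s):
        temp, energy, ch = s
        acts = ["Sleep"]
        if temp + t_rech <= t_max:
            acts.append("Recharge")
        if temp + t_tx[ch] <= t_max and energy >= e_tx[ch]:
            acts.append("Sample")
        return acts

    def outcomes(s, a):
        """List of (probability, next state, reward) for action a in state s."""
        temp, energy, ch = s
        reward = 0.0
        if a == "Sample":
            temp, energy, reward = temp + t_tx[ch], energy - e_tx[ch], 1.0
        elif a == "Recharge":
            temp, energy = temp + t_rech, min(energy + e_gain, e_cap)
        else:
            temp = max(temp - t_cool, 0)
        return [(p_ch[ch][c2], (temp, energy, c2), reward) for c2 in range(n_ch)]

    return states, actions, outcomes


def relative_value_iteration(states, actions, outcomes, tau=0.5, tol=1e-9, max_iter=20000):
    """Return (estimated average reward per slot, policy as a dict)."""
    ref = states[0]                    # reference state whose value is pinned to 0
    h = {s: 0.0 for s in states}
    q, gain = {}, 0.0
    for _ in range(max_iter):
        # Aperiodicity transformation: stay put with probability 1 - tau.
        q = {s: {a: tau * sum(p * (r + h[s2]) for p, s2, r in outcomes(s, a))
                    + (1.0 - tau) * h[s]
                 for a in actions(s)}
             for s in states}
        v = {s: max(q[s].values()) for s in states}
        gain = v[ref] / tau            # undo the scaling of the transformed gain
        h_new = {s: v[s] - v[ref] for s in states}
        converged = max(abs(h_new[s] - h[s]) for s in states) < tol
        h = h_new
        if converged:
            break
    policy = {s: max(q[s], key=q[s].get) for s in states}
    return gain, policy
```

A possible call is `relative_value_iteration(*build_model(3, 4, (1, 2), (1, 2), 2, 1, 1, ((0.9, 0.1), (0.2, 0.8))))`. The returned gain estimates the average number of samples per time slot, and the returned dictionary maps each state to the selected action.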

Handling Large-Size MDP Models
The size of the proposed MDP model depends on the number of biosensor states which is a function of the number of possible temperature and energy levels. As the number of biosensor states increases, the process of computing the transition probability matrices for the system becomes very time consuming. Also, the value iteration algorithm used for solving the MDP model becomes impractical. This section presents two methods (namely, Q-learning and heuristics) for handling MDP models with a large number of states.
Q-Learning.
Reinforcement Learning (RL) offers an alternative for obtaining the optimal policy at a significantly lower computational cost. Using a simulation model of the system under study, the decision maker in an MDP is viewed as a learning agent whose task is to learn the optimal action in each possible state of the system. As Figure 2 shows, the optimal policy is learned while the system is being driven (i.e., simulated) by the actions selected by the learning agent which stores the results of its actions in a knowledge base. The actions of the decision maker become better over time as new knowledge is obtained. Eventually, the RL algorithm converges to an optimal policy which can be used in the physical system.
Q-learning is an RL algorithm which was introduced in [18]. It is used for learning from experience. It requires that each entry in the decision-maker's knowledge base corresponds to a state-action pair. The value stored in each entry is referred to as the Q-value and is a measure of the goodness of executing an action in a particular system state. The Q-value for a state-action pair $(s, a)$ is updated as follows:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right] \tag{18}$$
where $r$ is the immediate reward obtained after executing action $a$ in state $s$, $s'$ is the next state, and $A(s')$ is the set of possible actions in state $s'$. $\alpha$ and $\gamma$ denote the learning rate and discount factor ($0 < \gamma < 1$), respectively.
The Q-learning algorithm is shown as Algorithm 1. The interaction between the learning agent and the simulator (or environment) is divided into episodes. In each episode, the system transits through a sequence of states. The length of this sequence is controlled by a parameter that specifies the number of simulated time slots. In each simulated time slot, based on the current state of the system, the learner chooses an action either based on the Q-policy or randomly. If the former is selected, the action with the highest Q-value is selected. After that, if the action is Sample, a reward of one unit is earned; otherwise, the reward is zero. The action is then simulated and the next system state is observed. Next, the Q-value is updated using (18). Also, the new system state becomes the current one and the cycle repeats.
Although Q-learning is theoretically guaranteed to obtain an optimal policy, it requires that each state-action pair be tried infinitely often in order to learn the optimal policy. The quality of the learned policy depends on how much time is spent in learning and whether every state-action pair can be tried. On the other hand, depending on the application, a certain percentage difference between the learned and optimal policies might be tolerated. This is because the system states differ in the likelihood of being visited. Thus, a default action (like Sleep) can be assigned to system states with a low likelihood of being visited.
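The episode loop and Q-value update described in this subsection can be sketched as follows. This is a minimal illustration and not a reproduction of Algorithm 1: the toy simulator, its constants, the exploration probability, and the default hyperparameters are all assumptions made for the example.

```python
import random

T_MAX, E_CAP = 2, 2                 # hypothetical temperature and energy limits
E_TX = (1, 2)                       # transmission energy in channel states 0 and 1
P_STAY = (0.9, 0.8)                 # probability that the channel keeps its state


def actions(state):
    temp, energy, ch = state
    acts = ["Sleep"]
    if temp + 1 <= T_MAX:
        acts.append("Recharge")
        if energy >= E_TX[ch]:
            acts.append("Sample")
    return acts


def step(state, action, rng):
    """Toy simulator: Sample and Recharge heat by one level, Sleep cools by one."""
    temp, energy, ch = state
    reward = 0.0
    if action == "Sample":
        temp, energy, reward = temp + 1, energy - E_TX[ch], 1.0
    elif action == "Recharge":
        temp, energy = temp + 1, min(energy + 1, E_CAP)
    else:
        temp = max(temp - 1, 0)
    if rng.random() > P_STAY[ch]:
        ch = 1 - ch                 # channel moves to the other state
    return (temp, energy, ch), reward


def q_learning(start, episodes=200, horizon=200, alpha=0.1, gamma=0.95,
               explore=0.2, seed=0):
    rng = random.Random(seed)
    q = {}                          # knowledge base: (state, action) -> Q-value
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            acts = actions(s)
            if rng.random() < explore:
                a = rng.choice(acts)                              # random action
            else:
                a = max(acts, key=lambda x: q.get((s, x), 0.0))   # Q-policy
            s2, r = step(s, a, rng)
            best_next = max(q.get((s2, x), 0.0) for x in actions(s2))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s2
    return q
```

After training, a policy can be read off by taking the feasible action with the highest Q-value in each visited state and assigning a default action such as Sleep to states that were never visited.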

Heuristic.
Since it is difficult to describe the structure of the optimal policy, a heuristic policy is proposed in this section. The goal is to design a policy which mimics the behavior of the optimal policy as closely as possible. However, before presenting such a policy, a greedy one is given to provide insight into the design of any heuristic policy.
The greedy policy is computed using Algorithm 2. The inputs to this algorithm are the set of possible system states and the set of feasible actions for each system state. The computed policy is greedy in the sense that for each system state, the feasibility of actions is checked in the following order: Sample, Recharge, and then Sleep. The first feasible action is associated with the corresponding system state.
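The fixed-order rule described above can be sketched in a few lines. This is an illustration of the idea rather than a transcription of Algorithm 2: the function names and the toy feasibility rule used in the usage note are hypothetical.

```python
ORDER = ("Sample", "Recharge", "Sleep")   # fixed priority of the greedy policy


def greedy_policy(states, feasible_actions):
    """Map every state to the first feasible action in the fixed order.

    feasible_actions(state) must return a collection that always
    contains "Sleep", so a choice exists in every state.
    """
    policy = {}
    for s in states:
        allowed = feasible_actions(s)
        policy[s] = next(a for a in ORDER if a in allowed)
    return policy


def toy_feasible(state, t_max=2, e_tx=1):
    """Hypothetical rule: Sample and Recharge each heat by one level."""
    temp, energy = state
    allowed = {"Sleep"}
    if temp + 1 <= t_max:
        allowed.add("Recharge")
        if energy >= e_tx:
            allowed.add("Sample")
    return allowed
```

With `states = [(t, e) for t in range(3) for e in range(3)]`, calling `greedy_policy(states, toy_feasible)` assigns Sample wherever it is feasible, Recharge where only the energy is insufficient, and Sleep at the maximum temperature level.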
As will be shown by simulations in the next section, the greedy policy performs poorly since it is based on a fixed order of actions. Therefore, Algorithm 2 needs to be extended to allow for a dynamic selection of actions. This objective is accomplished by introducing two control parameters. With these two control parameters, the Sample and Recharge actions are no longer selected in a fixed order or merely because they are feasible. Algorithm 3 shows how the control parameters and new heuristic policy are computed.
The essence of Algorithm 3 is as follows. If the current temperature of the biosensor (denoted by $T$) is low and its current energy level (denoted by $E$) is high, then the condition that triggers the Sample action would more likely be true and thus the Sample action could be executed. However, this would not be the case when the available energy is very close to zero. In this case, the opposite condition, which triggers the Recharge action, would more likely be true and thus a Recharge could be performed. If neither of the two conditions is true, the biosensor is put to sleep and thus its temperature decreases.

Numerical and Simulation Results
In this section, an example is first presented to illustrate the viability of the proposed MDP model. Then, the performance of the optimal policy is compared to that of the approximate policies using simulation. The impact of various system parameters on the performance of the system is also evaluated. The simulation was performed using a simulator written in MATLAB [19]. Each simulation was run for a duration of 100,000 time slots, and each data point is the average of 10 simulation runs. The number of channel states ($K$) is four, and the channel state boundaries are randomly generated.

Illustrative Example.
In this example, a wireless channel with two states is considered. The channel state transition probabilities are calculated using (9). Table 1 shows the values of the parameters involved. Figures 3(a) and 3(b) show the expected number of samples when there is no recharge and when recharge is allowed, respectively. The expected number of samples is expressed as a function of the maximum safe temperature level ($T_{\max}$) and initial energy ($\mathcal{E}_0$). The first observation is that if recharge is allowed, more samples are expected to be generated by the biosensor. In the case when recharge is not allowed, the expected number of samples is limited only by the amount of initial energy. This is confirmed by Figure 3(a) where for the same initial energy, the same expected number of samples is obtained when $T_{\max}$ is varied.
When recharge is allowed, the maximum safe temperature level ($T_{\max}$) plays a critical role. This is due to the temperature increase caused by the recharge action. In Figure 3(b), for the same initial energy, the expected number of samples increases as $T_{\max}$ is increased. Increasing $T_{\max}$ enables the Recharge action to be performed more often. On the other hand, as one would expect, if $T_{\max}$ is fixed and $\mathcal{E}_0$ is varied, the expected number of samples slightly increases when $T_{\max}$ is small. However, when $T_{\max}$ is large ($\geq 6$), the maximum possible expected number of samples can be achieved when $\mathcal{E}_0$ is at its maximum value. Therefore, for this particular example, if $\mathcal{E}_0 = 10$, the optimal value for $T_{\max}$ is 6.
Figures 4(a) and 4(b) show the optimal action for each possible system state. In Figure 4(a), for channel state 1, the Sample action is performed in 70% of the system states. The Sleep action is performed whenever the temperature reaches the maximum safe level ($T_{\max}$), and the Recharge action is performed when the remaining energy is zero and the temperature is below $T_{\max}$.
By contrast, in Figure 4(b), for channel state 2, the Sample action is performed only once at the initial system state. For this channel state, due to the higher cost of transmission, the biosensor is put in the sleep mode most of the time. However, since the cost of the Recharge action is independent of the channel state, the system recharges itself more often to enable more samples to be generated when the wireless channel switches to a state with a lesser transmission energy requirement (i.e., channel state 1).

Comparative Analysis.
In order to be able to appreciate the merit of any approximate policy, a more meaningful performance criterion is needed. In this work, the average number of time slots needed to generate a sample is used as a criterion to distinguish between the different policies available to run a system. It is calculated as the total simulation time divided by the average number of samples generated by the system while being operated by a certain policy. This measure takes into account the effect of the Recharge and Sleep actions. For example, consider Figure 5. In this figure, $T_{\max}$ is fixed at five while $\mathcal{E}_0$ is varied. The greedy policy is very costly since it requires the largest amount of time before a sample can be generated. The difference in the amount of time required by the heuristic policy and that required by the optimal policy stays around two time slots. This is a 75% reduction in time when compared to the greedy policy. The Q-policy is the best approximate policy. On average, the difference with the optimal policy stays around 1.1 time slots.
Figure 6 shows the amount of time required to generate a sample when $\mathcal{E}_0$ is fixed at five and $T_{\max}$ is varied. In this figure, when $T_{\max} = 1$, the greedy policy outperforms both the Q-policy and heuristic policy. A difference of three time slots is observed. This can be explained as follows. In the Q-policy and heuristic policy, the Recharge action can be performed in one state only (i.e., when $T = E = 0$). On the other hand, with the greedy policy, the Recharge action can be performed in more than one state (i.e., whenever $T = 0$). This, of course, leads to a reduction in the average amount of time needed to generate a sample. Other than that, for $T_{\max} \geq 2$, the Q-policy and heuristic policy are always better than the greedy policy, and their performance is close to that of the optimal policy.
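The performance criterion defined above is a simple ratio, sketched here for clarity. The function name and the sample counts in the usage note are hypothetical and are not results from the paper.

```python
def slots_per_sample(total_slots, samples_per_run):
    """Average number of time slots needed to generate one sample.

    total_slots: length of each simulation run in time slots.
    samples_per_run: number of samples generated in each run.
    """
    avg_samples = sum(samples_per_run) / len(samples_per_run)
    if avg_samples == 0:
        return float("inf")        # the policy never produced a sample
    return total_slots / avg_samples
```

For example, three hypothetical runs of 100,000 slots that produce 24,900, 25,100, and 25,000 samples give `slots_per_sample(100000, [24900, 25100, 25000])`, that is, 4.0 slots per sample. A lower value indicates that less time is spent in the Recharge and Sleep actions per generated sample.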

Conclusions
The increase in temperature due to the heat generated by biosensors is a limiting factor in the operation of biosensor networks. This problem can be modeled as a stochastic control problem using the framework of Markov decision processes. The solution is an optimal policy which ensures that the maximum safe temperature level is not exceeded. In order to handle large-size MDP models, it is shown how Q-learning can be used for obtaining the optimal policy. In addition, a heuristic policy is proposed. Its performance is comparable to that of the policies obtained by the MDP model and Q-learning.
This work can be extended in the following directions. First, the scenario of more than one rechargeable biosensor should be studied. In this case, the number of possible system states grows exponentially with the number of biosensors. Thus, techniques for eliminating equivalent states would be necessary. Second, the performance of other reinforcement learning techniques should be investigated, especially for models with a huge state space. Third, algorithms for computing heuristic policies that perform closer to the optimal policy should be developed.