Efficient Learning in Stationary and Non-stationary OSA Scenario with QoS Guaranty

In this work, the opportunistic spectrum access (OSA) problem is addressed with stationary and non-stationary Markov multi-armed bandit (MAB) frameworks. We propose a novel index based algorithm named QoS-UCB that balances exploration in terms of occupancy and quality, e


Introduction
Cognitive radio (CR) enables access to underutilized spectrum when it is not occupied by a licensed user, and thus opens new doors in spectrum sharing for communication. Opportunistic spectrum access (OSA) is an effective alternative to address the spectrum shortage issue and to improve overall spectrum efficiency. In the OSA context, primary users (PUs) are the licensed users who buy the right to use the spectrum for a certain time, while a secondary user (SU) is an unlicensed user who may be allowed to use the spectrum for communication when no PU is using it.
Several works exist on OSA spectrum allocation strategies, most of them dealing with wideband sensing and allocation. Since the core of the problem is to learn which channels are the best (in terms of a chosen criterion, e.g. availability), recent works have modeled the spectrum learning process within the multi-armed bandit (MAB) framework, where the SU senses any one of K different channels [9,10,14,15]. Depending on the traffic load of the PUs and the propagation conditions, some channels are more available than others and have better quality in terms of signal to noise ratio (SNR), and thus potentially support a reliable transmission. The availability and quality of each channel can be learnt by exploration. The objective is to handle the exploration and exploitation dilemma, i.e. exploiting the best channel while still collecting information about the channels.
The archetypal MAB problem is a reinforcement learning (RL) game where a player decides to play one of K independent slot machines at each discrete time step such that its long-term reward is maximized [1,3]. Early works considered the stationary case, in which each arm has independent and identically distributed (iid) rewards drawn from an unknown distribution that is assumed to remain constant during the game. The player aims at minimizing the expected regret over time, defined as the difference between the expected reward obtained using an infeasible ideal policy and that obtained using the selected policy. Auer et al. [3] proposed a simple mean-based index policy, called UCB1, for bounded iid rewards, which achieves logarithmic-order regret uniformly over time and not only asymptotically as in [1,11]. Anantharam et al. [2] extended the classical MAB problem to the Markov MAB, where the reward of each arm is generated by a finite-state, irreducible and aperiodic Markov chain. For several algorithms in the literature, e.g. [13,16,19,20], the authors proved that mean-based index policies are also order optimal for Markovian rewards.
Although the stationary formulation of the Markov MAB problem allows the exploration versus exploitation dilemma to be addressed appropriately, it may fail to capture a changing environment in which the observed reward distribution changes over time. For example, in an OSA scenario the probability of vacancy of each channel is likely to change over time, which exposes the limitation of stationary MAB models. In many application domains, abrupt changes in the reward distribution are an intrinsic characteristic of the problem. As stated in [8], standard soft-max and upper confidence bound (UCB) policies are not well adapted to abruptly changing environments.
The authors of [4][5][6] introduced soft-max action selection policies, i.e. EXP3 and EXP3.S, for the non-stationary iid MAB problem, where the reward distribution undergoes abrupt changes in time. Moreover, several policies belonging to the wider UCB family, such as discounted UCB (D-UCB) and sliding-window UCB (SW-UCB), are designed to address the abruptly changing non-stationary iid MAB problem [7,12,17]. These policies are also consistent with more extreme settings, such as the one presented in [18], where the reward distribution follows a Brownian motion. Even though Brownian motion can theoretically be seen as a particular Markov process, there is no straightforward link with the non-stationary Markov MAB problem. Thus, learning policies for MAB with Markovian rewards require special attention, given the growing interest in modeling cognitive radio learning with the Markov MAB framework.
To the best of our knowledge, most Markov MAB policies utilize free channels without giving sufficient weight to their quality in terms of received power. Moreover, in an OSA scenario, each channel experiences time-varying conditions due to random fading and/or primary users' activities on that channel, as shown in Fig. 1. This paper proposes two policies able to learn the best channel in terms of vacancy probability and quality, as illustrated in Fig. 1. The first policy, named QoS-UCB, is designed to learn the quality and availability of a band in stationary environments, i.e. when the reward distribution does not evolve with time. Free channels lead to transmission opportunities, but not all with the same quality, which depends on the received signal to interference plus noise ratio (SINR) on each band. The second policy, referred to as discounted QoS-UCB (DQoS-UCB), learns the same characteristics as QoS-UCB, i.e. quality and availability, but in non-stationary environments, i.e. when the reward distribution evolves with time, which allows the overall network performance to be increased. The latter policy is called DQoS-UCB because of its discount factor, which gives more weight to recent observations than to observations acquired in the past.
The rest of the paper is organized as follows. In Section II, we formulate the stationary Markov MAB problem as a CR learning process and present a novel machine learning algorithm based on the local channel quality information. Section III introduces the non-stationary Markov MAB problem and proposes the DQoS-UCB learning policy for it. Numerical results are presented in Section IV, verifying the validity and efficiency of the proposed QoS-UCB and DQoS-UCB policies in stationary and non-stationary environments. Finally, Section V concludes the paper.

Stationary Markov MAB Problem Formulation
We consider a network with a single SU transmitter (Tx) and receiver (Rx) pair and a set of channels $\mathcal{K} = \{1, \cdots, K\}$. K PUs use the K different channels independently of the SU, which can access one of the K channels if it is not occupied. This paper considers a Markov MAB framework in which we interchangeably refer to a channel as an arm, to the SU as a player, and to the action "sensing a channel" as playing an arm. The $i$-th arm is modeled by an irreducible and aperiodic discrete-time Markov chain with finite state space $S^i$. $P^i = \{p^i_{kl},\ k, l \in \{q_0, q_1\},\ q_0, q_1 \in S^i\}$ denotes the transition probability matrix of the $i$-th arm, where $q_0$ and $q_1$ are the Markov states, i.e. occupied and free, respectively. We assume that the arms are mutually independent. Let $\pi^i$ be the stationary distribution of the Markov chain, which satisfies $\pi^i_q(t) = \pi^i_q\ \forall t$ [19]. Let $S^i(t)$ be the state of channel $i$ at time $t$ and $T^i(t)$ the total number of times channel $i$ has been sensed up to time $t$. The reward associated with the observed state $q$ of channel $i$ at time $t$ is denoted by $r^i_q(t) = S^i(t)$.¹ The stationary mean reward $\mu^i$ of the $i$-th channel under the stationary distribution $\pi^i$ is given by $\mu^i = \sum_{q \in S^i} r^i_q \pi^i_q$. Furthermore, the instantaneous channel quality is considered in this work and is assumed to be a function of the SINR on the secondary radio link.² The channel quality is assumed to be a wide-sense stationary (WSS) random process within state $q$, meaning that its statistical properties, i.e. its first- and second-order moments, do not evolve over time, although the instantaneous quality $R^i_q(t)$ of the $i$-th channel varies within state $q$. Let $T^i(n)$ denote the number of times channel $i$ has been sensed up to time $n$, and let $G^i_q(T^i(n))$ denote the empirical mean of the quality observations $R^i_q$ collected on channel $i$ in state $q$; $G^{\max}_q$ is the maximum expected quality within the set of channels. Taking into account the quality information of each channel $i$, the modified mean reward is defined as $\mu^R_i = \sum_{q \in S^i} G^i_q r^i_q \pi^i_q$.
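The two quantities defined above can be illustrated numerically. The following is a minimal sketch, assuming a two-state chain (occupied $q_0$, free $q_1$) and using the standard closed form of its stationary distribution; the function names and the numerical values are hypothetical and are not taken from the paper.

```python
def stationary_distribution(p01, p10):
    """Stationary distribution (pi_q0, pi_q1) of a two-state Markov chain.

    p01: probability of moving from q0 (occupied) to q1 (free).
    p10: probability of moving from q1 (free) to q0 (occupied).
    """
    if p01 <= 0 or p10 <= 0:
        raise ValueError("an irreducible chain needs p01 > 0 and p10 > 0")
    total = p01 + p10
    return p10 / total, p01 / total


def modified_mean_reward(pi, r, g):
    """Sum over the states q of G_q * r_q * pi_q."""
    return sum(g_q * r_q * pi_q for g_q, r_q, pi_q in zip(g, r, pi))


# Hypothetical channel: leaves "occupied" with prob. 0.3, leaves "free" with prob. 0.1.
pi = stationary_distribution(p01=0.3, p10=0.1)
# With unit quality in both states the modified mean reduces to the plain mean reward.
mu = modified_mean_reward(pi, r=(0.0, 1.0), g=(1.0, 1.0))
# Illustrative quality values (0.1 when occupied, 0.8 when free).
mu_r = modified_mean_reward(pi, r=(0.0, 1.0), g=(0.1, 0.8))
```

With the reward equal to the observed state (0 for occupied, 1 for free), only the free state contributes, so the modified mean reward is the stationary probability of being free scaled by the quality in that state.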

Stationary Weak Regret
The performance of a stationary Markov MAB learning policy is measured by the regret $\Phi(n)$, defined as the loss in expected total reward incurred by the given scheme compared to the ideal policy, which plays an arm with the highest mean up to time $n$. Always playing a best arm is referred to as the optimal arm selection policy; accordingly, the other arms are referred to as suboptimal arms. The regret at time $n$ is expressed in (2), with $a(t)$ being the channel index selected by the policy $a$ at time $t$ and $q_{a(t)}$ being the state of channel $a(t)$ at time $t$. Moreover, without loss of generality, we assume that channel 1 has the highest mean reward, i.e. $\mu^R_1 \geq \mu^R_i$ for all $i \in \mathcal{K}$. The channel with the highest mean reward $\mu^R_1$ is called the optimal channel. Channels whose mean reward is strictly less than $\mu^R_1$ are suboptimal channels. The mean reward optimality gap is defined as $\Delta_i = \mu^R_1 - \mu^R_i$.

¹ However, the reward associated with state $q$ can take a positive, stationary-distributed payoff different from $S^i(t)$.
² The proposed policy can be generalized to various quality metrics, e.g. energy efficiency or SINR, but an explicit expression of the quality function is out of the scope of this paper.
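As an illustration of how such a regret can be tracked in a simulation, the sketch below computes an empirical cumulative regret (the reward the best channel would collect on average, minus the reward actually collected) together with the optimality gaps. It is a hedged, generic proxy and not the paper's exact regret expression; the function names and numbers are hypothetical.

```python
from itertools import accumulate


def cumulative_regret(mu_best, rewards):
    """Empirical regret after each step t: t * mu_best - (reward collected so far)."""
    totals = accumulate(rewards)
    return [t * mu_best - total for t, total in enumerate(totals, start=1)]


def optimality_gaps(mean_rewards):
    """Gap between the best mean reward and the mean reward of each channel."""
    best = max(mean_rewards)
    return [best - m for m in mean_rewards]


# Hypothetical run: best mean reward 0.75, observed rewards over four steps.
regret = cumulative_regret(0.75, [1, 0, 1, 1])
gaps = optimality_gaps([0.75, 0.5, 0.25])
```

A single run of this empirical regret can be negative at some steps; averaging over many runs, as done in the simulations, approximates the expected regret.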

QoS-UCB Policy
The proposed QoS-UCB policy learns a channel which is optimal in terms of probability of vacancy and quality. The learning is done from the observations acquired by sensing the channels, without any a priori statistical information about the channels. Algorithm 1 states the first contribution of this paper.
The SU first senses all channels at least once, i.e. steps 1 to 6 in Algorithm 1. After $n > K$ iterations, the QoS-UCB policy updates the index $B^i(n, T^i(n))$. At each iteration, Algorithm 1 returns the index $a(n)$ of the channel with the largest QoS-UCB index. The QoS-UCB index, i.e. $B^i(n, T^i(n))$, comprises three terms. The first term, $\bar{S}^i(T^i(n))$, is the empirical mean of the reward associated with the Markov states (occupied or free) of channel $i$ at time $n$ and represents the exploitation contribution. The second term, $Q^i(n, T^i(n))$, is related to channel quality and is estimated by observing the quality information $R^i_q(n)$. Finally, $A^i(n, T^i(n))$ forces the policy to explore other channels. This bias increases when the algorithm leads to sensing an occupied channel. The last two terms represent the exploration contribution.
Contrary to traditional UCB policies, step 22 in Algorithm 1 includes information about the channel quality and is updated at each time step. The key parameter of the QoS-UCB policy is $Q^i(n, T^i(n))$, defined in (4). Thanks to this formulation, the proposed policy selects for transmission the channel with the best quality observed in state $q_1$ up to time $n$, i.e. $G^{\max}_{q_1}(n)$. The bias $A^i(n, T^i(n))$ is the same as in classical UCB policies [9,19] and is defined in (5).

Algorithm 1 QoS-UCB policy
      Sense channel $n$
   3: Observe current state $S^n$, transmit if free
   4: $T^n(n) = 1$
   6: end for
   7: while $n > K$ do
   8:    Sense channel $i = a(n)$
         Observe current state $S^i$
  10:    if channel $i$ free then
            Observe quality $R^i_{q_1}(T^i(n))$
         …

Two coefficients, α and β, appear in (5) and (4), respectively, and act as exploration coefficients for vacancy and channel quality, respectively. If α and β both increase, QoS-UCB explores more channels in search of better quality and availability. On the contrary, if α and β both decrease, priority is given to exploitation rather than exploration: the empirical mean of the reward associated with the state of channel $i$, i.e. $\bar{S}^i(T^i(n))$, dominates, and the policy tends to exploit the best available channel found in the previous iterations, i.e. step 18. If α increases and β decreases, the algorithm favors the exploration related to the search for a more available channel. Conversely, if β increases and α decreases, the exploration related to the search for a better channel quality, e.g. in terms of SINR, is favored.
There are three different kinds of behavior during the selection process:
• Case 1: The selected channel is optimal in terms of both availability, i.e. $\bar{S}^i(T^i(n))$, and quality, i.e. $Q^i(n, T^i(n))$. Both terms then ensure that this channel is exploited.
• Case 2: The selected channel is optimal in terms of either availability or quality only. In this case, the corresponding optimal term leads to the exploitation of this channel, but the other term forces the policy to explore the others.
• Case 3: The selected channel is optimal in terms of neither availability nor quality. The last term, i.e. $A^i(n, T^i(n))$, then forces the policy to explore other channels.
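The interplay of the three terms can be sketched in code. The explicit expressions of the index are not reproduced in this text, so the sketch below rests on loudly stated assumptions: the index is taken as $B = \bar{S} - Q + A$, with a quality term $Q = \beta\,(G^{\max} - G^i)\,\ln(n)/T^i$ and a classical UCB bias $A = \sqrt{\alpha \ln(n)/T^i}$. These forms, the function names and the numbers are illustrative only and should be checked against the paper's equations before any use.

```python
import math


def qos_ucb_index(s_mean, g_mean, g_max, n, t_i, alpha, beta):
    """Assumed index B = S_mean - Q + A for a channel sensed t_i >= 1 times (n >= 1)."""
    log_ratio = math.log(n) / t_i
    quality_term = beta * (g_max - g_mean) * log_ratio  # assumed form of Q
    bias_term = math.sqrt(alpha * log_ratio)            # assumed form of A
    return s_mean - quality_term + bias_term


def select_channel(s_means, g_means, counts, n, alpha, beta):
    """Return the position of the channel with the largest assumed index."""
    g_max = max(g_means)
    scores = [
        qos_ucb_index(s, g, g_max, n, t, alpha, beta)
        for s, g, t in zip(s_means, g_means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two equally available channels; the first one has the better observed quality.
best = select_channel([0.6, 0.6], [0.9, 0.5], [10, 10], n=20, alpha=0.25, beta=0.32)
```

Under these assumptions the quality term vanishes for the channel with the best observed quality, so with equal availability and equal sensing counts that channel is preferred, which matches Case 2 above.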

Non-stationary Markov MAB Problem Formulation
Non-stationary environments are considered in this paper, as shown in Fig. 2, where the reward distributions remain constant during each period $b_j$, $j = 1, \cdots, M$, and change at unknown times. Here, $M - 1$ is the total number of times the reward distribution of a channel changes up to time $t$. The time at which the reward distribution changes is unknown to the user a priori and is assumed to be independent of the reward distribution. The $i$-th arm is modeled by an irreducible and aperiodic discrete-time Markov chain with finite state space $S^i$. $P^{i,b_j} = \{p^{i,b_j}_{kl},\ k, l \in \{q_0, q_1\},\ q_0, q_1 \in S^i\}$ denotes the state transition probability matrix of the $i$-th arm in period $b_j$, where $q_0$ and $q_1$ are the Markov states, i.e. occupied and free, respectively. We assume that the arms are mutually independent in each period. Let $\pi^i_{b_j}$ be the stationary distribution of the Markov chain in period $b_j$. As in the stationary problem, the mean reward of each channel $i$ in period $b_j$ takes into account the reward associated with the quality information and the observed state of the channel; the channel with the highest mean reward in period $b_j$ is referred to as the optimal channel for that period. The regret of a policy in the non-stationary Markov MAB environment is defined as the difference between the total reward collected by the optimal policy $a^*$ (which plays an optimal arm at each time instant) and the total reward collected by the selected policy. Note that the non-stationary regret is not measured with respect to the arm that is optimal on average, but with respect to an optimal policy $a^*$ that follows the optimal arm at each time instant. The non-stationary regret $\Phi_{ns}(n)$ defined in this section is similar to the regret for the non-stochastic bandit problem presented in [4].
In this regret, $a^*(t)$ and $a(t)$ are the channel indices selected by the optimal policy $a^*$ and the given policy $a$ at time $t$, respectively, and $q_{a^*(t)}$ and $q_{a(t)}$ are the states of channels $a^*(t)$ and $a(t)$ at time $t$, respectively. Moreover, without loss of generality, we assume that µ R

Discounted QoS-UCB (DQoS-UCB) Policy
In this section, we introduce the discounted QoS-UCB (DQoS-UCB) algorithm for the Markov MAB problem.
The motivation for the DQoS-UCB policy is to find an optimal channel with less exploration in abruptly changing environments. As stated before, the standard QoS-UCB policy is not appropriate for non-stationary environments, because its confidence interval becomes tighter as time increases. To guarantee the performance of the DQoS-UCB policy in a non-stationary environment, a discount factor λ is included in the DQoS-UCB index estimation. The idea behind the discount factor is to give more weight to recent observations than to those acquired in the past.
The proposed DQoS-UCB policy learns a channel which is optimal in terms of probability of vacancy and quality in each period $b_j$. Our contribution for the non-stationary Markov MAB is stated in Algorithm 2.

Algorithm 2 DQoS-UCB policy
Input: α, β, $a(0)$, $T^i(0) = 0$, $R^i_{q_1}(0)$ $\forall i \in \mathcal{K}$ and λ < 1
Output: $a(n + 1)$
   1: for $n = 1$ to $K$ do
   2:    Sense channel $n$
   3:    Observe current state $S^n$, transmit if free
   4:    $T^n(n) = 1$
   6: end for
   7: while $n > K$ do
   8:    Sense channel $i = a(n)$
         Observe current state $S^i(n)$
  10:    if channel $i$ free then
            Observe quality $R^i_{q_1}(T^i(n))$
         …
         $a(n + 1) = \arg\max_i B^i(n, T^i(n))$, $n = n + 1$
  28: end while

As with the QoS-UCB policy, an SU employing the DQoS-UCB policy first senses all channels at least once and, after $n > K$ iterations, updates the index $B^i(n, T^i(n))$ as in (8). However, each term of the index equation is adapted to take the non-stationary hypothesis into account.
$T^i(n)$ is still the number of times channel $i$ has been sensed by the DQoS-UCB policy up to time $n$, while its discounted counterpart is the discounted number of times channel $i$ has been sensed up to time $n$. Contrary to the QoS-UCB policy, the empirical mean of the observed states, $\bar{S}^i(T^i(n))$, and the quality, $G^i_{q_1}(T^i(n))$, are estimated by taking into account a discount factor λ < 1, as shown in steps 20 and 21 of Algorithm 2. The coefficients α and β in steps 25 and 24 of Algorithm 2 are the same as in the QoS-UCB policy and weight the exploration for vacancy and channel quality, respectively.
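The effect of the discount factor on the estimates can be sketched as follows. This is a minimal illustration in the style of discounted UCB, assuming that every past observation is weighted by λ raised to its age in time steps; the class name, its interface and the numbers are hypothetical, and the exact update used in steps 20 and 21 of Algorithm 2 may differ.

```python
class DiscountedEstimate:
    """Discounted count and discounted empirical mean for one channel."""

    def __init__(self, discount):
        if not 0.0 < discount < 1.0:
            raise ValueError("discount must lie strictly between 0 and 1")
        self.discount = discount
        self.weighted_sum = 0.0
        self.weighted_count = 0.0

    def step(self, observation=None):
        """Advance one time step; pass an observation only if the channel was sensed."""
        self.weighted_sum *= self.discount
        self.weighted_count *= self.discount
        if observation is not None:
            self.weighted_sum += observation
            self.weighted_count += 1.0

    @property
    def mean(self):
        if self.weighted_count == 0.0:
            return 0.0
        return self.weighted_sum / self.weighted_count


estimate = DiscountedEstimate(discount=0.5)
estimate.step(1.0)   # channel sensed and found free
estimate.step(0.0)   # channel sensed and found occupied
estimate.step()      # channel not sensed: old observations keep fading
```

Because the discounted count stays bounded (it cannot exceed 1/(1 − λ), i.e. about 50 for λ = 0.98), a confidence term built on it does not shrink indefinitely, which is what lets a discounted policy keep reacting to changes.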

Simulation Results
In this section, the QoS-UCB and DQoS-UCB policies are investigated through simulations in stationary and non-stationary environments. We present simulation results focusing on regret, quality information and percentage of optimal channel selection. K is set to 10 channels with two states each, i.e. $q_0$ (occupied) and $q_1$ (free). Simulation results are averaged over 100 runs performed using MATLAB.

Stationary Environment
Two different scenarios, $C_1$ and $C_2$, are considered in this section, with the parameters detailed in Table 1, where the transition probability matrices $P^i$ and the channel quality information $R^i_{q_1}$ are selected arbitrarily. The stationary distribution $\pi^i_{q_1}$ and the mean reward $\mu^R_i$ are computed using $P^i$ and $R^i_q$ as detailed in Section 2. The quality reward associated with state $q_0$ is arbitrarily set to $R^i_{q_0} = 0.1$. The two scenarios are as follows:
• $C_1$: The most available channel also has the best quality, i.e. $\pi^1_{q_1} > \pi^i_{q_1}$ and $R^1_{q_1} > R^i_{q_1}$, $\forall i \in \{2, \cdots, K\}$. This is a standard scenario of the multi-armed bandit problem.
• $C_2$: The most available channel is not the one with the best quality, i.e. $\pi^2_{q_1} > \pi^i_{q_1}$, $\forall i \neq 2$, and $R^1_{q_1} > R^i_{q_1}$, $\forall i \neq 1$.
In the machine learning field, regret indicates the reward loss associated with a channel selection, and it is expressed as in (2). High values of regret suggest that many mistakes were made during the channel selection. Fig. 3 shows the evolution of the regret achieved by our QoS-UCB policy for several values of α and β, selected as in [16], for scenario $C_1$. The regret behaves logarithmically for all tested values of the couple (α, β): it first increases rather rapidly with the time steps and then very slowly. This behavior shows that the QoS-UCB policy is able to learn from past observations. The figure also shows that small values of α and β lead the policy to exploit more than explore, and thus result in a smaller logarithmic regret.
Comparison Analysis of QoS-UCB and UCB1. In this part, we compare the cumulative regret of our scheme with that of the existing UCB1 algorithm. In order to compare the convergence and other performance measures, the exploration coefficient of UCB1 is set to its best value, L = 0.25, following the authors' guidelines in [19]. Figs. 4 and 5 show the cumulative regret of QoS-UCB and UCB1 for scenarios $C_1$ and $C_2$, respectively. In order to discuss comparable results, the regret is not computed with respect to the same best channel for QoS-UCB and UCB1. Indeed, the regret of UCB1 is always computed with respect to its own best channel, i.e. the most available one, whereas the regret of QoS-UCB is computed with respect to the channel with the best mean reward $\mu^R$, which takes into account both availability and quality. In scenario $C_1$, the best channel is the same for both, i.e. channel 1, but in scenario $C_2$, channel 1 is the best channel considering availability and quality, whereas channel 2 is the most often vacant.
Our proposed QoS-UCB policy produces a much lower cumulative regret than the UCB1 approach in both scenarios. In particular, in $C_2$, where the best channel is not the same for both algorithms, QoS-UCB still converges faster to its own optimal channel than UCB1 does to its own, as can be observed in Fig. 5. Fig. 6 shows the average reward associated with the quality information observed by the QoS-UCB and UCB1 policies for scenario $C_2$. Our algorithm converges to the optimal mean reward, i.e. $\mu^R_1 = 1.43$, in the long run. The two algorithms explore channels with different goals: UCB1 computes its average reward using only the information on availability, whereas QoS-UCB also takes the channel quality information into account. As expected from Table 1, UCB1 converges to the mean reward $\mu^R_2 = 1.08$ of channel $i = 2$, which has the largest probability of being vacant, whereas QoS-UCB converges to the mean reward $\mu^R_1$ of the optimal channel, i.e. $i = 1$, which offers a theoretical guarantee of optimality in terms of availability and quality information. Hence, including the channel quality information in the index calculation makes it possible to account for the transmission quality on the secondary link, and the proposed algorithm could therefore offer a higher data rate than traditional UCB policies.
In the machine learning field, another important metric, referred to as the percentage of optimal channel selection, is used to analyze reinforcement learning policies. It reflects the fact that the CR device should transmit as much as possible in the optimal channel, since it has the highest mean reward $\mu^R_1$ for the SU. Figs. 7 and 8 compare the percentage of time the optimal channel has been selected within the pool of 10 channels for two QoS-UCB algorithms with different values of α and β and for the UCB1 algorithm. As seen in Fig. 7, both algorithms converge to the optimal channel asymptotically in the standard MAB scenario $C_1$, because channel $i = 1$ is optimal not only for QoS-UCB but also for UCB1. However, QoS-UCB converges faster than UCB1. This is an interesting result because UCB1, which does not take quality into account, appears slower in the learning process than QoS-UCB, even though channel 1 is also optimal for UCB1. In this paper, we are particularly interested in the performance of the QoS-UCB and UCB1 policies under scenario $C_2$, where the most available channel is not the one with the best quality. For QoS-UCB, optimal channel selection is calculated with respect to the optimal channel in terms of both availability and quality, while for UCB1 this performance criterion is computed exclusively with respect to the most available channel. For scenario $C_2$, as shown in Fig. 8, our proposed policy converges to the optimal channel $i = 1$ asymptotically, whereas the UCB1 policy diverges from the channel that is optimal in terms of both availability and quality and converges to the most available channel $i = 2$. We can also see that after 1000 trials the QoS-UCB algorithm plays the optimal channel more than 70% of the time in the worst case. The higher α and β are, the more time the policy spends in exploration, even if the other channels are not optimal. On the contrary, when α and β take small values, QoS-UCB tends to exploit more than explore and learns less from the remaining channels. This is why QoS-UCB converges slowly towards the optimal channel when α and β are high. From Figs. 7 and 8, we observe that QoS-UCB selects the optimal channel $i = 1$ more often than UCB1; thus QoS-UCB offers a higher average reward associated with quality in the long run and hence a potentially better transmission rate for the SU.

Non-stationary Environment
The aim of this section is to introduce a setting that allows the behavior of MAB learning policies to be analyzed in a non-stationary, abruptly changing environment. Two breakpoints, at times t = 1500 and t = 3000, are introduced in the simulation setting, each indicating an abrupt change in the reward distribution. An optimal policy is able to identify an abrupt change in the reward distribution with reduced delay and to maximize opportunistic transmission in a non-stationary environment. We compare the performance of the UCB1 (L = 0.25), QoS-UCB (α = 0.25 and β = 0.32) and DQoS-UCB (α = 0.25 and β = 0.32) policies. Moreover, we use a discount factor λ = 0.98 in the DQoS-UCB policy to cope with the non-stationary scenario $C_3$. Table 2 gives the transition probability matrices $P^i$ for the different time periods over which the distribution of each channel may change in scenario $C_3$. In this table, the transition probabilities do not change for channels 1 to 5, but change over time for channels 6 to 10. The reward associated with quality, $R^i_q$, remains stationary during the simulation; however, the estimated mean reward $\mu^R_i$ changes abruptly due to variations in channel availability. The simulation parameter settings for scenario $C_3$ are summarized in Table 3.
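A piecewise-stationary channel of this kind can be simulated with a few lines of code. The sketch below is illustrative only: it assumes a two-state chain (0 for occupied, 1 for free) whose transition probabilities switch at given breakpoints, and the function name, the `periods` format and all probabilities are hypothetical rather than the values of Table 2.

```python
import random


def simulate_channel(periods, horizon, rng, state=0):
    """Simulate channel states for t = 1..horizon (0 = occupied, 1 = free).

    periods: list of (start_time, p01, p10) sorted by start_time, with the
    first start_time equal to 1; p01 is P(occupied -> free), p10 is P(free -> occupied).
    """
    states = []
    for t in range(1, horizon + 1):
        # Transition probabilities of the latest period that has started.
        p01, p10 = next((a, b) for start, a, b in reversed(periods) if t >= start)
        u = rng.random()
        if state == 0:
            state = 1 if u < p01 else 0
        else:
            state = 0 if u < p10 else 1
        states.append(state)
    return states


# Hypothetical channel with abrupt changes in the spirit of the two breakpoints.
periods = [(1, 0.6, 0.2), (1500, 0.1, 0.7), (3001, 0.6, 0.2)]
trace = simulate_channel(periods, horizon=5000, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when averaging over repeated runs.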
Fig. 9 shows the evolution of the regret achieved by the UCB1, QoS-UCB and DQoS-UCB policies for scenario $C_3$. As seen in Fig. 9, DQoS-UCB achieves significantly lower regret than QoS-UCB and UCB1 in the non-stationary environment. DQoS-UCB wastes significantly less time than QoS-UCB in identifying an abrupt change (at t = 1500 and t = 3000) in the reward distribution, owing to the discount factor λ in the index calculation. On the other hand, the past has a higher influence on UCB1 and QoS-UCB, preventing them from adapting quickly to changes. For further analysis, we compare the average reward associated with the quality information observed with the DQoS-UCB and QoS-UCB policies in scenario $C_3$. As shown in Fig. 10, both algorithms converge to the optimal mean reward in the long run, but the DQoS-UCB policy benefits from a weaker dependency on past observations, whereas the QoS-UCB policy takes more time to converge to the optimal mean reward after an abrupt change in the reward distribution, because it considers all past observations when making its next decision. Fig. 10 shows that the DQoS-UCB policy is able to track an abrupt change in the reward distribution and achieves a higher reward in the long run, specifically when the reward distribution changes after a sufficient number of time steps. Another important criterion is the number of successful channel accesses, which is ultimately correlated with successful opportunistic transmission. We define the successful transmission percentage (STP) as the number of times a free slot is detected out of the total number of time steps n, where each time step t counts as 1 if the channel is free at time t and 0 if the channel is occupied at time t.

[Table 2 column layout: Scenario; Channel (i); $P^i_{q_0 q_1}$ and $P^i_{q_1 q_0}$ for each of the periods 0 < t < 1500, 1500 ≤ t ≤ 3000 and 3000 < t < 5000.]
The DQoS-UCB policy achieves a higher successful transmission percentage (STP) than the QoS-UCB policy in the long run, as shown in Fig. 11. The QoS-UCB policy is also able to find an optimal channel in the non-stationary environment, but it requires significantly more time to converge to an optimal channel after an abrupt change in the reward distribution. Thus, the DQoS-UCB policy is more appropriate for non-stationary environments than the QoS-UCB policy.
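The STP metric discussed above can be computed directly from the sequence of sensing outcomes. The sketch below is a minimal illustration, assuming that the metric is expressed on a 0 to 100 scale; the function name and the sample outcomes are hypothetical.

```python
def successful_transmission_percentage(free_flags):
    """Share of time steps (in percent) at which the sensed channel was free.

    free_flags: one entry per time step, truthy if the sensed channel was
    free at that step and falsy if it was occupied.
    """
    n = len(free_flags)
    if n == 0:
        raise ValueError("at least one time step is required")
    return 100.0 * sum(1 for flag in free_flags if flag) / n


# Hypothetical outcomes over four time steps: free, occupied, free, free.
stp = successful_transmission_percentage([1, 0, 1, 1])
```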

Conclusion
In this paper, efficient OSA learning algorithms, named QoS-UCB and DQoS-UCB and based on channel quality information and availability, have been proposed for the stationary and non-stationary Markov multi-armed bandit problems. Simulation results have shown that the proposed QoS-UCB policy achieves a logarithmic regret over a wide range of exploration coefficient values in the stationary Markov MAB framework. Furthermore, this paper extends the QoS-UCB policy to non-stationary environments by including a discount factor that loosens the confidence interval of the QoS-UCB policy, which otherwise becomes tighter as time increases. Numerical analysis has also shown that QoS-UCB outperforms UCB1 in terms of regret in the stationary Markov MAB scenario. In the non-stationary Markov MAB case, both the DQoS-UCB and QoS-UCB policies are able to find an optimal channel, but the QoS-UCB policy requires a longer time to converge to it. Designing a fully adaptive algorithm that is able to adjust the discount factor according to the environment is not an easy task and remains the subject of ongoing research.

Figure 1. Occupancy and channel condition considered for secondary user

Figure 4. QoS-UCB and UCB1 cumulative regret for scenario $C_1$

Figure 7. QoS-UCB and UCB1: percentage of optimal channel selection for scenario $C_1$

Figure 8. QoS-UCB and UCB1: percentage of optimal channel selection for scenario $C_2$

Fig. 11 presents the optimal channel selection percentage for the DQoS-UCB and QoS-UCB policies. It is clear from Fig. 11 that the DQoS-UCB policy finds an optimal channel quickly and concentrates on it in case of an abrupt change in the reward distribution, whereas QoS-UCB takes significantly more time to converge to an optimal channel after an abrupt change, and more

Figure 11. QoS-UCB and DQoS-UCB: optimal channel selection percentage and successful transmission percentage (STP) comparison for scenario $C_3$

Table 1. Channel parameters for the two scenarios $C_1$ and $C_2$

Table 2. State transition probabilities $P^i$ in non-stationary environment: scenario $C_3$

Table 3. Observed channel reward and estimated mean reward in non-stationary environment: scenario $C_3$