Decentralized and Dynamic Band Selection in Uplink Enhanced Licensed-Assisted Access: Deep Reinforcement Learning Approach

Enhanced licensed-assisted access (eLAA) is an operational mode that allows the use of the unlicensed band to support long-term evolution (LTE) service via carrier aggregation technology. The additional bandwidth helps meet the demands of growing mobile traffic. In uplink eLAA, which is prone to unexpected interference from WiFi access points, the two-step procedure of resource scheduling by the base station followed by a listen-before-talk (LBT) mechanism at the users can seriously degrade resource utilization. In this paper, we present a decentralized deep reinforcement learning (DRL)-based approach in which each user independently learns a dynamic band selection strategy that maximizes its own rate. Through extensive simulations, we show that the proposed DRL-based band selection scheme improves resource utilization while supporting a certain minimum quality of service (QoS).


Introduction
The rapid growth of mobile traffic demand has resulted in scarcity of the available radio spectrum. To meet this ever-increasing demand, extending systems such as long-term evolution (LTE) to the unlicensed spectrum is one of the promising approaches to boost users' quality of service by providing higher data rates [1]. In this regard, initiatives such as licensed-assisted access (LAA) [2], LTE-unlicensed (LTE-U) [3], and MulteFire (MF) [4] can be mentioned. The focus of this article is, however, on the LAA system, which 3GPP initially introduced and standardized in Rel-13 for downlink operations only [2]. Using carrier aggregation (CA) technology, carriers on the licensed band primarily carry control signals and critical data, while additional secondary carriers from the unlicensed band are used to opportunistically boost the data rates of the users [5]. To obey regional spectrum regulations, such as restrictions on the maximum transmit power and channel occupancy time [6], while coexisting fairly with existing systems such as WiFi, an LAA base station (BS) must perform a listen-before-talk (LBT) mechanism before transmitting over the unlicensed band [7][8][9]. The enhanced version of LAA, named enhanced licensed-assisted access (eLAA), which supports both uplink and downlink operations, was later approved in Rel-14 [10]. The uplink eLAA mode over the unlicensed band is designed to satisfy the channel access mechanisms of both bands: the BS performs LBT and allocates uplink resources for the scheduled users, and the scheduled users then perform a second round of LBT to check whether the channel is clear before uplink transmission [11]. The degradation of uplink channel access due to the two rounds of LBT is investigated in [12][13][14]. If a scheduled user senses an active WiFi access point (AP) that is hidden from the BS, the channel cannot be accessed, wasting the reserved uplink resource.
A scheduling-based approach in uplink eLAA can therefore significantly degrade the utilization of uplink resources in the presence of unexpected interference sources.
To improve the utilization of unlicensed band resources, several approaches have been suggested. In [15][16][17], multisubframe scheduling (MSS), a simple modification of conventional scheduling, is proposed. MSS enables a single uplink grant to indicate multiple resource allocations across multiple subframes. Providing diverse transmission opportunities may enhance resource utilization; however, resources can still be wasted if the user fails to access the channels. In [14, 18], schemes that switch between random access and scheduling are proposed, but their focus is limited to the unlicensed spectrum. A joint licensed and unlicensed band resource allocation that takes hidden nodes into account is proposed in [19] for the downlink eLAA system. Furthermore, in [20], a scheme that does not require an uplink grant, along with the required enhancements to the existing LTE system, is proposed.
In this paper, we take a new learning approach in which each user independently makes a dynamic band selection (licensed or unlicensed) for uplink transmission, without waiting for scheduling from the BS. To this end, we implement each user as a DRL agent that learns the optimal band selection strategy relying only on its own local observations, i.e., without any prior knowledge of the WiFi APs' activities and the time-varying channel conditions. Through continuous interaction with the environment, the users potentially affected by hidden nodes learn the activities of the WiFi APs and exploit this knowledge in the band selection process. The learned policy not only guarantees channel access but also ensures a transmission rate above a certain threshold, despite the presence of unpredictable hidden nodes. Such a learning approach is a useful means of handling the underlying resource utilization problems in uplink eLAA.
The rest of the paper is organized as follows. Section 2 describes the system model considered in the paper. Section 3 gives a brief overview of deep reinforcement learning (DRL), followed by the DRL formulation of the band selection problem; the proposed deep neural network architecture and training algorithm are also discussed. Simulation results are presented in Section 4, and finally, conclusions are drawn in Section 5.

System Model
We consider a single-cell uplink eLAA system that consists of an eLAA base station (BS) and N user equipments (UEs) that can also operate in the unlicensed band through carrier aggregation technology. Let 𝒩 = {1, 2, ⋯, N} denote the set of user indices, where the users are uniformly distributed within the cell, and let ℳ = {1, 2, ⋯, M} designate the set of unlicensed band interference sources, such as WiFi access points (APs), which are located outside the coverage area of the cell within a certain distance. The system model is shown in Figure 1.
To get uplink access, each UE n ∈ 𝒩 makes a scheduling request to the eLAA BS, which is responsible for allocating resources. Before granting uplink resources, the eLAA BS is required to undergo a carrier-sensing procedure within its coverage limit. Once the channel is clear, it reserves resources for uplink transmission. Then, the scheduled user performs another round of the listen-before-talk procedure before transmission. If the user detects transmission from hidden nodes, i.e., nearby WiFi APs that are outside the carrier-sensing range of the eLAA BS, then the reserved uplink resources over the unlicensed band cannot be accessed.
We assume the channel between the BS and the n-th UE, denoted h_n(t), evolves according to the Gaussian Markov block-fading autoregressive model [21] as follows:

h_n(t) = ρ_n h_n(t − 1) + √(1 − ρ_n²) e_n(t),

where ρ_n is the normalized channel correlation coefficient between slots t and (t − 1) and e_n(t) is the complex Gaussian innovation process. From Jakes' fading spectrum, ρ_n = J_0(2π f_{d,n} T), where J_0(·) is the zeroth-order Bessel function of the first kind, f_{d,n} is the Doppler frequency, and T is the slot duration. Let W_U and W_L be the total bandwidths of the unlicensed and licensed bands, respectively. At time slot t, let the numbers of users associated with the unlicensed and licensed bands be N_U(t) and N_L(t), respectively. If all UEs on the licensed band are uniformly allocated orthogonal uplink resources, then the bandwidth of each UE is constrained as

W_{n,L}(t) ≤ W_L / N_L(t).

Similarly, expecting that the total unlicensed bandwidth is equally shared among the UEs in a virtual sense, the bandwidth of each UE on the unlicensed band is constrained as

W_{n,U}(t) ≤ W_U / N_U(t).

Denoting P and N_0 as the uplink transmit power and the noise spectral density, the signal-to-noise ratio (SNR) of the received signal at the BS for unlicensed band user n (assuming it occupies the channel) is

SNR_{n,U}(t) = P |h_n(t)|² / (N_0 W_{n,U}(t)).
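As an illustration, the fading recursion and the per-user SNR above can be sketched in a few lines of plain Python. This is our own sketch under the model assumptions stated in the text; the function names and any numerical values are illustrative, not the paper's simulation parameters.

```python
import math

def gauss_markov_step(h_prev, rho, innovation):
    # One step of the Gaussian Markov block-fading AR(1) model:
    # h(t) = rho * h(t-1) + sqrt(1 - rho^2) * e(t)
    return rho * h_prev + math.sqrt(1.0 - rho ** 2) * innovation

def per_user_bandwidth(total_bw_hz, num_band_users):
    # Equal (orthogonal, or virtual on the unlicensed band) sharing
    # of a band's total bandwidth among its associated users
    return total_bw_hz / max(num_band_users, 1)

def uplink_snr(tx_power_w, h, noise_psd_w_per_hz, user_bw_hz):
    # SNR of the received uplink signal: P * |h|^2 / (N0 * W_user)
    return (tx_power_w * abs(h) ** 2) / (noise_psd_w_per_hz * user_bw_hz)
```

With ρ_n close to 1 the channel is strongly correlated across slots, while ρ_n = 0 makes it independent from slot to slot; this correlation is what the agents can later exploit.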

Figure 1: Uplink eLAA system model.

Wireless Communications and Mobile Computing
Likewise, the SNR for licensed band user n is given as

SNR_{n,L}(t) = P |h_n(t)|² / (N_0 W_{n,L}(t)).

The dynamics of each WiFi AP's activity are modeled as a discrete-time two-state Markov chain, as shown in Figure 2. Each AP is either in the active (state = 0) or inactive (state = 1) state. The transition probability from state j to state k is denoted p_{jk}. Note that the users do not have knowledge of the underlying dynamics of the WiFi APs' activities, i.e., the transition probabilities.
Let τ represent the transmission probability of an active WiFi AP. In slot t, let N_{n,cont}(t) be the number of contending active APs within the sensing range of the n-th UE. Assuming that all WiFi AP activities are independent, the probability of UE n having at least one hidden node is

P_{n,hid}(t) = 1 − (1 − τ)^{N_{n,cont}(t)}.

To calculate the uplink rate (throughput) of the users, we refer to the lookup table given in Table 1, which maps the received SNR to spectral efficiency (SE) [22]. The uplink rate of UE n using the unlicensed band is then given as

R_{n,U}(t) = W_{n,U}(t) · SE(SNR_{n,U}(t)). (8)

Similarly, the uplink rate of UE n using the licensed band is given as

R_{n,L}(t) = W_{n,L}(t) · SE(SNR_{n,L}(t)). (9)

In each time slot t, the goal of each UE is to select the band that maximizes its uplink rate. Note that if a certain band, e.g., the licensed band, is overloaded by a large number of UEs, the individual rate of the users in that band is significantly reduced. This constrains each UE to take advantage of the unlicensed band whenever the APs are inactive. Hence, learning the WiFi APs' activities and channel conditions is critical to using the uplink resources effectively while boosting the individual data rate.
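The hidden-node probability and the two-state AP activity model above can be sketched as follows; the function and parameter names are ours, for illustration only.

```python
import random

def hidden_node_prob(tau, n_contending):
    # P(at least one of N independent contending active APs transmits)
    # = 1 - (1 - tau)^N
    return 1.0 - (1.0 - tau) ** n_contending

def step_ap_state(state, p_stay, rng=random):
    # One transition of the two-state Markov chain (0 = active, 1 = inactive).
    # p_stay[j] is the probability of remaining in state j; leaving state j
    # therefore has probability 1 - p_stay[j].
    if rng.random() < p_stay[state]:
        return state
    return 1 - state
```

For example, with τ = 0.5 and two contending active APs, the probability of at least one hidden-node transmission is 1 − 0.5² = 0.75, which grows quickly with the number of contenders.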

DRL-Based Decentralized Dynamic Band Selection

Deep Reinforcement Learning (DRL): Overview. In reinforcement learning (RL), an agent learns how to behave by sequentially interacting with the environment. As shown in Figure 3, at each time t, the agent observes the state s_t ∈ 𝒮, where 𝒮 is the state space, and executes an action a_t ∈ 𝒜 from the action space 𝒜. The interaction with the environment produces the next state s_{t+1} and a scalar reward r_{t+1}. The goal of the agent is to learn an optimal policy that maximizes the discounted long-term cumulative reward, expressed as

G = ∑_{t=0}^{T} γ^t r_{t+1},

where γ ∈ [0, 1] is the discount factor and T is the total number of time steps (horizon) [23]. One of the most widely used model-free RL methods is Q-learning, in which the agent learns a policy by iteratively evaluating the state-action value function Q(s, a), defined as the expected return starting from the state s, taking the action a, and then following the policy π. To derive the optimal policy at a given state s, the action that maximizes the state-action value function should be selected, i.e.,

a* = arg max_a Q(s, a),

and optimal actions are then similarly followed in the successor states.

Figure 2: Activity model of WiFi AP as a two-state Markov chain.
In Q-learning, a lookup table stores the action value Q(s, a) for every state-action pair (s, a). The entries of the table are updated by iteratively evaluating the Bellman optimality equation:

Q(s_t, a_t) ← Q(s_t, a_t) + β [r_{t+1} + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t)],

where β ∈ [0, 1] is the learning rate. However, the lookup table becomes intractable when the state space is large. In deep Q-network (DQN), the action value function is instead approximated by a deep neural network with parameter θ, i.e., Q(s, a; θ). To stabilize the learning process, it is common to use a replay buffer D that stores transitions e = (s_t, a_t, r_{t+1}, s_{t+1}), from which minibatches of samples are randomly drawn to train the network. Moreover, a separate quasi-static target network, parametrized by θ′, is used to estimate the target value of the next state. The loss function is computed as

L(θ) = E[(r_{t+1} + γ max_{a′} Q̂(s_{t+1}, a′; θ′) − Q(s_t, a_t; θ))²]. (14)

θ is updated by following the stochastic gradient of the loss, θ ← θ − β ∇_θ L(θ), while the target parameter θ′ is updated according to θ′ ← θ every C steps [24]. The details of the DQN algorithm are summarized in Algorithm 1.
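For concreteness, the tabular Bellman update above can be written in a few lines of Python. This is a generic Q-learning sketch, not the paper's code; DQN replaces the table with a neural network fitted to the same target by gradient descent.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
    # Bellman optimality update:
    # Q(s,a) <- Q(s,a) + beta * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += beta * (target - Q[(s, a)])
    return Q[(s, a)]
```

Starting from an all-zero table, a reward of 1 moves Q(s, a) by β toward the target; repeated visits converge toward the fixed point of the Bellman equation.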

DRL Formulation for Dynamic Band Selection.
Each user is implemented as a DRL agent, specifically a deep Q-network (DQN) agent, that relies on the output of its deep neural network to make dynamic band selection decisions between the licensed and unlicensed bands. The DRL formulation is presented below.

(i) Action
In each time slot t, the n-th agent samples an action a_n(t) from the action set 𝒜 = {U, L}, where U and L denote transmission over the unlicensed and licensed bands, respectively.

(ii) State

After executing the action a_n(t), the agent receives a binary observation and a reward from the environment. The observation is o_n(t) = 1 if the uplink rate in the selected band exceeds the minimum threshold rate and o_n(t) = 0 otherwise. The state of the agent is defined as the history of action-observation pairs with length H:

s_n(t) = [a_n(t − H), o_n(t − H), ⋯, a_n(t − 1), o_n(t − 1)].

Initialize replay buffer D
Initialize action value function Q with parameter θ
Initialize target action value function Q̂ with parameter θ′ = θ
Input the initial state to the DQN
for t = 1, 2, ⋯ do
  Execute action a_t from Q using ε-greedy policy
  Observe r_{t+1} and s_{t+1} from the environment
  Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) into the replay buffer D
  Sample a random minibatch of transitions from D
  Evaluate the target y_j = r_j + γ max_{a′} Q̂(s_{j+1}, a′; θ′)
  Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to θ
  Every C steps, update the target network Q̂ according to θ′ ← θ
end for
Algorithm 1: DQN algorithm.

(iii) Reward

Depending on the selected action, the agent receives a scalar reward that measures the achieved rate in the selected band against the corresponding minimum threshold:

r_n(t + 1) = R_{n,U}(t) − R_{U,min} if a_n(t) = U, and r_n(t + 1) = R_{n,L}(t) − R_{L,min} if a_n(t) = L,

where R_{n,U}(t) and R_{n,L}(t) are given according to Equations (8) and (9), while R_{U,min} and R_{L,min} are the uplink minimum threshold rates on the unlicensed and licensed bands, respectively.
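A minimal sketch of this agent-environment interface in Python follows. The rate-minus-threshold reward is our reading of the formulation above, and all names are illustrative.

```python
def observation(action, rates, thresholds):
    # Binary feedback: 1 iff the achieved rate in the selected band
    # meets that band's minimum threshold rate.
    return 1 if rates[action] >= thresholds[action] else 0

def reward(action, rates, thresholds):
    # Scalar reward comparing the achieved rate with the threshold
    # (positive iff the QoS constraint is satisfied).
    return rates[action] - thresholds[action]

def update_state(state, action, obs, H):
    # State = sliding history of the last H action-observation pairs.
    return (state + [(action, obs)])[-H:]
```

The sliding history is what the LSTM layer of the Q-network later consumes to infer the hidden AP activity and channel state.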

Deep Neural Network Description.
For dynamic band selection, each UE trains an independent DQN. The structure of the deep neural network is shown in Figure 4. The network consists of a long short-term memory (LSTM) layer, fully connected layers, and rectified linear unit (ReLU) activation functions.
Long short-term memory (LSTM) is a class of recurrent neural networks (RNNs) designed to learn patterns in sequential data by taking time correlation into account. LSTMs were initially introduced to overcome the vanishing (exploding) gradient problem of RNNs during backpropagation. Regulated by gate functions, the cell (internal memory) state of an LSTM learns how to aggregate inputs separated in time, i.e., which experiences to keep or throw away [25]. In our formulation, note that the states of the agents, which are histories of action-observation pairs, have long-term dependencies (correlations) emanating from the dynamics of the WiFi APs' activities, which follow a two-state Markov property, and from the time-varying channel conditions of the Gaussian Markov block-fading autoregressive model. The LSTM is crucial for the learning process since it can capture the actual state by exploiting the underlying correlation in the history of action-observation pairs. Therefore, the state passes through this preprocessing step before being fed to the fully connected layers.
A deep neural network consists of multiple fully connected layers, each of which abstracts certain features of its input. Let x be the input to a layer, and let W and b be its weight matrix and bias vector, respectively.
The output vector y of a fully connected layer is then given by

y = f(Wx + b),

where f is the element-wise activation function that adds nonlinearity. In our simulations, we feed the states to an LSTM layer with 64 hidden units, whose output is fed to two fully connected hidden layers with 128 and 64 neurons. The output layer produces the action values Q(s, a) for both actions. The ReLU activation function is used in all layers to avoid the vanishing gradient problem [26]. The target network adopts the same neural network structure.
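The per-layer operation y = f(Wx + b) can be illustrated with a hand-rolled forward pass in pure Python, purely for exposition; the actual model would use a deep learning library.

```python
def relu(v):
    # Element-wise rectified linear unit: max(x, 0)
    return [x if x > 0.0 else 0.0 for x in v]

def dense(x, W, b, activation=relu):
    # Fully connected layer: y = f(W x + b), with W given row by row
    z = [sum(w * xj for w, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    return activation(z)
```

For example, passing [1.0, -1.0] through an identity weight matrix with zero bias keeps the positive component and zeroes the negative one.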

Training Algorithm Description.
The DQNs of the agents are individually trained according to Algorithm 2. The loss function given by Equation (14) is used to train each DQN, and the hyperparameters are summarized in Table 2.
Note that the agents do not have complete knowledge of the environment, such as the actions of other agents, the underlying dynamics of the WiFi APs' activities, and the varying channel conditions. Instead, through sequential interaction with the environment, each agent makes band selection decisions solely based on local feedback (reward and observation) from the base station. This significantly reduces the training complexity (cost) at each user. Moreover, since training can be conducted offline, the trained weights can be used in the deployment phase. Retraining the weights is needed only infrequently, for example, when the environment changes significantly.

Simulation Setup.
For each realization, we first distribute 10 users uniformly in a square area of 100 m × 100 m. WiFi APs are distributed within a 30 m distance from the coverage area of the cell according to a homogeneous Poisson point process (PPP) with intensity λ. Figure 5 illustrates the node deployment of the BS, users, and APs for one realization.
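The AP deployment step can be sketched with a generic homogeneous PPP sampler over a rectangle (our sketch; sampling on the annulus around the cell used in the paper would need an additional accept-reject step on the sampled positions):

```python
import math
import random

def sample_ppp(intensity, width, height, rng=random):
    # Homogeneous PPP on a width x height rectangle:
    # the point count is Poisson(intensity * area), and points are
    # placed independently and uniformly over the region.
    mean = intensity * width * height
    # Knuth's Poisson sampler (stdlib-only)
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            break
        k += 1
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height))
            for _ in range(k)]
```

Increasing the intensity λ proportionally increases the expected number of interfering APs, which is exactly the knob varied in the later experiments.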
We set the dynamics of each WiFi AP activity according to the following transition matrix:

We further assume that the uplink transmission of a user over the unlicensed band can be interfered with by any active WiFi AP m ∈ ℳ within a 30 m range. Table 3 summarizes the values of all simulation parameters used for evaluating the proposed algorithm.

Performance Evaluation.
We compared the policy learned by the DRL agents to two benchmark schemes: a random policy and a fixed distance-based policy. In the random policy, each user randomly decides which band to select, while in the fixed policy, the decision is made based on the location of the user: assuming the BS knows the location of the users, and hence their distance from the BS, at each slot t, only users within D meters from the base station transmit using unlicensed band resources, since they are less susceptible to interfering WiFi APs; the others transmit using licensed band resources. Since we assumed that a transmission from a WiFi AP can affect users within 30 m, as shown in Figure 5, users within D = 20 m from the BS are assigned to unlicensed band resources in the fixed policy. The trained DRL policy of each agent should learn this distance without any prior assumption while selecting a band. Furthermore, by learning the activities of the APs, the agents should make dynamic selections. Figure 6 compares the per user average success rate for different thresholds at history length H = 5, λ = 0.5 × 10⁻², and R_{L,min} = R_{U,min} = 4 Mbps.

for each agent n ∈ 𝒩 do
  Initialize replay buffer D_n
  Initialize action value function Q_n with parameter θ_n
  Initialize target action value function Q̂_n with parameter θ_n′ = θ_n
  Generate initial state s_{n,1} from the environment simulator
end for
for t = 1, 2, ⋯ do
  for each agent n ∈ 𝒩 do
    Execute action a_{n,t} from Q_n using ε-greedy policy
    Collect reward r_{n,t+1} and observation o_{n,t+1}
    Observe the next state s_{n,t+1} from the environment simulator
    Store the transition (s_{n,t}, a_{n,t}, r_{n,t+1}, s_{n,t+1}) into D_n
    Sample a random minibatch of transitions from D_n
    Evaluate the target y_{n,j} = r_{n,j} + γ max_{a_{n,j+1}} Q̂_n(s_{n,j+1}, a_{n,j+1}; θ_n′)
    Perform a gradient descent step on (y_{n,j} − Q_n(s_{n,j}, a_{n,j}; θ_n))² with respect to θ_n
    Every C steps, update the target network Q̂_n according to θ_n′ ← θ_n
  end for
end for
Algorithm 2: DQN training algorithm for dynamic band selection.
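The two benchmark policies are straightforward to express in code. This is our sketch; the D = 20 m default mirrors the fixed policy described above.

```python
import math
import random

def fixed_distance_policy(user_xy, bs_xy, d_threshold_m=20.0):
    # Unlicensed band ("U") iff the user is within D meters of the BS,
    # licensed band ("L") otherwise.
    dist = math.hypot(user_xy[0] - bs_xy[0], user_xy[1] - bs_xy[1])
    return "U" if dist <= d_threshold_m else "L"

def random_policy(rng=random):
    # Each user picks a band uniformly at random in every slot
    return rng.choice(["U", "L"])
```

Unlike the DRL agents, neither benchmark reacts to AP activity: the fixed policy ignores idle unlicensed spectrum at the cell edge, and the random policy ignores location altogether.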
The dynamic DRL agents achieve a success rate of around 90%, outperforming the users of the fixed distance-based policy for all of the thresholds we set. The gain over the fixed distance-based policy is attributed to two factors. First, the DRL agents, without any prior assumption, learn the optimal distance d* from the BS for making band selection decisions; in other words, if user n ∈ 𝒩 is located outside the optimal distance range (d_n > d*), it transmits over the licensed band to avoid interference from nearby WiFi APs. Second, the agents capture the dynamics of both the time-varying channel and the WiFi APs' activities and exploit them when dynamically selecting a band. This implies that in the absence of transmissions from nearby WiFi APs, user n ∈ 𝒩 exploits the opportunity to transmit over the unlicensed band even if d_n > d*, thereby avoiding overloading the other users on the licensed band.
To further investigate the gain coming from dynamic band selection decisions, we evaluate the per user average success rate for different throughput thresholds in Figure 7. As the threshold values (over both bands) increase from 3 to 5, the performance gap (per user average success rate) also increases. This indicates that the adaptability of the DRL agents is crucial to maintaining an appreciable success rate under stringent quality of service (QoS) requirements.
In Figure 8, the per user average throughput obtained by the three policies is compared for history length H = 5, λ = 0.5 × 10⁻², and R_{L,min} = R_{U,min} = 4 Mbps. As depicted, the per user average throughput achieved by the DRL agents outperforms the other two schemes. The ability of DRL to adapt to a changing environment and learn a robust policy enables the agents to outperform the fixed distance-based policy, which falls short when either of the bands is overloaded. In other words, even if there is an opportunity to transmit on the unlicensed band due to the inactivity of nearby WiFi APs, cell-edge users in the fixed policy cannot exploit it. The effect of the number of interfering WiFi APs on the performance of the DRL agents is investigated for history length H = 5 and R_{L,min} = R_{U,min} = 4 Mbps in Figure 9. As the number of WiFi APs increases (i.e., as λ increases), the gain due to dynamic band selection decisions diminishes, since the number of contenders for unlicensed band resources increases. However, the agents still retain the gain coming from learning the optimal distance for band selection. The performance of the fixed distance-based policy is unaffected by the number of WiFi APs.
Next, in Figure 10, we compare the effect of history size on the performance of the DRL agents. We observe that shorter histories tend to converge slightly faster, but the variation in convergence time is marginal, implying that the learned policy is generally insensitive to the history size. Note that all results are averaged over three simulation runs.

Conclusion and Future Works
To mitigate the underlying resource utilization problem in uplink eLAA, we presented a learning-based, fully decentralized dynamic band selection scheme. In particular, employing a deep reinforcement learning algorithm, we implemented each user as an agent that makes decisions based on the output of its DQN, without waiting for scheduling from the BS. It is shown that, despite lacking knowledge of the underlying dynamics of the WiFi APs' activities, the DRL agents successfully learn a robust policy for making dynamic band selection decisions. Such a dynamic and decentralized learning approach can significantly alleviate the resource utilization problem associated with the unlicensed band, due to hidden nodes, in the uplink eLAA system. In a future study, we want to extend this work to more complicated scenarios that involve joint resource allocation over the two bands. Moreover, to improve the gains presented in this paper, different architectures and hyperparameters should be investigated.

Data Availability
We have not used specific data from other sources for the simulations. The proposed algorithm is implemented in Python with the TensorFlow library.

Conflicts of Interest
The authors declare that they have no conflicts of interest.