Neural Networks for Energy-Efficient Self Optimization of eNodeB Antenna Tilt in 5G Mobile Network Environments

In this paper, we present an energy-efficient Self-Organizing Network (SON) architecture based on a tunable eNodeB (eNB) antenna tilt design for macrocells in a mobile network environment. This is an imperative element of mobility management in high-speed and low-latency wireless networks. The SON architecture follows a fully distributed approach with optional network information exchange with neighboring cells and the core network. Antenna tilt directly affects the antenna's radiation pattern; thus, changes in eNB antenna tilt can be used to optimize cell coverage and reduce interference in mobile networks. We apply and compare two reinforcement learning (RL) techniques for optimizing the eNB antenna tilts, i.e., Deep Q-learning using an Artificial Neural Network (ANN) and simple Stochastic Cellular Learning Automata (SCLA). The ANN is well known for its ability to learn from a vast number of inputs, while the stochastic learning technique relies on a simple action-based probability vector updated from system feedback. Neighboring cells for any one cell in the network environment are selected based on their separation distance and antenna orientation. We validate the data call performance of the network for edge users, as they directly impact the Quality of Service (QoS) in the mobile environment. Our simulation results show that the ANN performs better for edge users than SCLA. The model also satisfies the SON requirements of scalability and agility. This work is a follow-up to our earlier work, where we showed that SCLA performs better than Q-learning in a similar network environment and optimization strategy due to its low complexity, although, within the same Q-learning algorithm, more input learning parameters gave better performance.

i.e., the LOS links, the NLOS links, and the noise. In our earlier work [19], we showed that a simple RL technique (SCLA) can outperform Q-learning in a similar network environment. The link components are given respectively by [24], and the throughput of user e attached to eNB u follows from them. We can now define the utility function of one SON entity or eNB, U_e, as the sum of the EE when it is delivering data to its associated UEs. The objective of our system is to maximize this cell utility function, which is also its effective EE.

The SON follows a distributed architecture whereby each eNB has an associated agent. We compare two RL schemes, i.e., Q-learning using an ANN and SCLA. We consider an eNB e that interacts with its environment, modeled as an MDP defined by the 5-tuple (S, A, Pr_a(s, s'), Re_a(s, s'), γ).
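To make the objective concrete, the following Python sketch evaluates such a cell utility as the sum of per-UE energy efficiency; the Shannon-capacity throughput proxy, the simple power model, and all parameter values are illustrative assumptions rather than the exact expressions of [24] or Eq. 8.

```python
import numpy as np

def ue_throughput(sinr_linear, bandwidth_hz):
    """Illustrative throughput proxy: Shannon capacity for one UE
    (an assumption, not the exact expression from [24])."""
    return bandwidth_hz * np.log2(1.0 + sinr_linear)

def cell_utility(ue_sinrs, bandwidth_hz, tx_power_w, circuit_power_w):
    """Utility U_e of one eNB: sum of energy efficiency (bits/Joule) over its
    associated UEs, which the SON agent tries to maximize."""
    total_power = tx_power_w + circuit_power_w  # simple power model (assumption)
    return sum(ue_throughput(s, bandwidth_hz) / total_power for s in ue_sinrs)

# Example: one macrocell serving three UEs with different linear SINRs
print(cell_utility(ue_sinrs=[10.0, 3.0, 0.5],
                   bandwidth_hz=10e6, tx_power_w=40.0, circuit_power_w=100.0))
```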

Here, Pr_a(s, s') is the probability of moving from state s to state s' when action a is taken. Given that the Markov property defines that any future state depends only on the current state regardless of the previous states, we can rewrite Eq. 10 as Eq. 11, with R_{s,a} given as the mean of the immediate reward r_t. Therefore, if the cumulative reward is to be maximized, an optimal policy Π* is required, which can be found if R_{s,a} and Pr_a(s, s') are known. The Q-function used in Q-learning is defined as the expected discounted cumulative reward obtained by taking action a in state s and following policy Π thereafter. Since the Q-function depends on the discounted cumulative reward, it is maximized when the action selection policy is optimal, i.e., when Π = Π*.
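Written out in the standard form that this description corresponds to, the Q-function, its optimal value, and the associated Bellman optimality condition (using the notation of the 5-tuple above) are:

```latex
\[
Q^{\Pi}(s,a) = \mathbb{E}_{\Pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a\right],
\qquad
Q^{*}(s,a) = \max_{\Pi} Q^{\Pi}(s,a),
\]
\[
Q^{*}(s,a) = \sum_{s'} \mathrm{Pr}_{a}(s,s') \left[ \mathrm{Re}_{a}(s,s') + \gamma \max_{a'} Q^{*}(s',a') \right].
\]
```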
The temporal-difference update of Eq. 16 takes the standard Q-learning form, Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)], where r_{t+1} is the reward observed after performing action a_t in state s_t and α is the learning rate.

The recursive update of Eq. 16 will ultimately achieve the optimal Q-value function, i.e., Q(s, a) → Q*(s, a), as done in Algorithm 1. For ease of understanding, we define the following two terms in this paper based on Eq. 16.
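As an illustration only (not the paper's Algorithm 1 or its simulator), a minimal tabular version of this recursive update can be sketched in Python; the state count, the tilt-action set size, and the placeholder environment step are assumptions.

```python
import numpy as np

NUM_STATES, NUM_TILT_ACTIONS = 10, 5       # illustrative sizes
alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount, exploration
Q = np.zeros((NUM_STATES, NUM_TILT_ACTIONS))

def step(state, action):
    """Placeholder environment: returns (next_state, reward)."""
    next_state = np.random.randint(NUM_STATES)
    reward = np.random.rand()  # in the paper, r follows from the EE utility (Eq. 8)
    return next_state, reward

state = 0
for t in range(10_000):
    # epsilon-greedy selection over tilt-angle actions
    if np.random.rand() < epsilon:
        action = np.random.randint(NUM_TILT_ACTIONS)
    else:
        action = int(np.argmax(Q[state]))
    next_state, r = step(state, action)
    # recursive update: Q(s,a) moves toward r + gamma * max_a' Q(s',a')
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```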
The advantages of Q-learning for our system model are that it requires no closed-form model of how the environment evolves from one state to the next and learns directly from interaction. A reward r ∈ R is computed after an agent takes an action and observes the effect on the desired system response.

The reward r is a numerical real value that depends on the effect an action has on the utility function given in Eq. 8. The ANN-based Q-learning scheme is shown in Fig. 6, where each neuron computes a weighted sum of its inputs (Fig. 7) and these weights are updated during training.

In particular for HetNets, we have shown that the SCLA approach meets the SON requirements of scalability, stability, and agility. SCLA follows a distributed architecture that allows quick adaptation to changes in the environment.
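A rough sketch of these two ingredients, the utility-based reward and a neuron layer that outputs an activation of the weighted sum of its inputs, is given below; the ±1 reward values, layer sizes, and tanh activation are illustrative assumptions and not the exact choices of Eq. 19, Fig. 6, or Fig. 7.

```python
import numpy as np

def reward(utility_before, utility_after):
    """Illustrative reward: positive when the tilt action improved the cell
    utility (EE) of Eq. 8, negative otherwise (assumption, not Eq. 19 itself)."""
    return 1.0 if utility_after > utility_before else -1.0

class TinyQNetwork:
    """Sketch of the function approximator: each neuron outputs an activation
    of the weighted sum of its inputs; layer sizes are illustrative."""
    def __init__(self, n_inputs, n_hidden, n_actions, rng=np.random):
        self.W1 = rng.randn(n_inputs, n_hidden) * 0.1
        self.W2 = rng.randn(n_hidden, n_actions) * 0.1

    def forward(self, state_features):
        hidden = np.tanh(state_features @ self.W1)  # weighted sum + activation
        return hidden @ self.W2                     # one Q-value per tilt action

net = TinyQNetwork(n_inputs=4, n_hidden=16, n_actions=5)
print(net.forward(np.array([0.3, 0.7, 0.1, 0.5])))  # Q-value estimates
```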

SCLA stems from the idea of applying stochastic learning techniques to cellular automata (CAs). CAs are mathematical models for systems consisting of large numbers of simple, identical components with local interactions. The simple components act together to produce complex emergent global behavior. By combining a machine learning capability, such as stochastic learning automata, with the plain CA, the new model formed is known as SCLA. A stochastic learning automaton is a finite state machine that can learn from both stationary and non-stationary environments, requiring only environment feedback to achieve better performance [37]. As there are no predetermined relationships between the stochastic learning automaton's actions and the responses, there is no requirement for a closed-form system model. Further details on CAs, stochastic learning automata, and SCLA can be found in [36]. Similar to our earlier work [19], SCLA picks one eNB antenna tilt angle, or action a, from all possible positions for eNB e at time t. Once the tilt angle has been selected, we update the probability vector using the Discrete Pursuit Reward-Inaction (DPRI) algorithm [38]. SCLA learns on the basis of feedback from the environment: a positive or negative reinforcement signal r is based on the reward criterion we set earlier for the Q-learning algorithm (refer to Eq. 19). The eNB probability vector is updated according to the DPRI update rule. The pseudo-code of the SCLA algorithm is given below.
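A minimal sketch of such a discretized pursuit reward-inaction update is shown below; the class interface, the resolution parameter, and the handling of the binary reinforcement signal are illustrative assumptions and not the exact update equation of [38].

```python
import numpy as np

class DPRILearner:
    """Sketch of a Discrete Pursuit Reward-Inaction (DPRI) automaton selecting
    one of K antenna-tilt actions; parameter names are illustrative."""
    def __init__(self, num_actions, resolution=100):
        self.p = np.full(num_actions, 1.0 / num_actions)  # action probability vector
        self.delta = 1.0 / (num_actions * resolution)     # discrete step size
        self.chosen = np.zeros(num_actions)               # times each action was chosen
        self.rewarded = np.zeros(num_actions)             # times each action was rewarded

    def select_action(self, rng=np.random):
        return rng.choice(len(self.p), p=self.p)

    def update(self, action, reward):
        """reward is the binary reinforcement signal r from the environment."""
        self.chosen[action] += 1
        self.rewarded[action] += reward
        if reward <= 0:
            return  # reward-inaction: probabilities unchanged on penalty
        # pursuit: move probability mass toward the action with the best reward estimate
        estimates = np.divide(self.rewarded, self.chosen,
                              out=np.zeros_like(self.rewarded), where=self.chosen > 0)
        best = int(np.argmax(estimates))
        others = [i for i in range(len(self.p)) if i != best]
        self.p[others] = np.maximum(self.p[others] - self.delta, 0.0)
        self.p[best] = 1.0 - self.p[others].sum()

# Example: pick and reinforce tilt actions for one eNB
learner = DPRILearner(num_actions=5)
a = learner.select_action()
learner.update(a, reward=1)  # r = 1 when the EE utility improved (Eq. 19 criterion)
```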

Note that not all UEs that get access to the network will be able to download the complete data file, because in some cases calls may be dropped when a UE moves from one cell's coverage to the next, i.e., the handover case. The details of the simulation parameters are given in Table 2.