Machine Learning Classifier Approach with Gaussian Process, Ensemble boosted Trees, SVM, and Linear Regression for 5G Signal Coverage Mapping

This article offers a thorough analysis of machine learning classifier approaches applied to collected Received Signal Strength Indicator (RSSI) samples, which can be used to predict propagation loss in network planning to achieve maximum coverage. We estimated the RMSE of each machine learning classifier on multivariate RSSI data collected from a cluster of 6 Base Transceiver Stations (BTS) across the hilly terrain of Uttarakhand, India. Variable attributes comprise topology, environment, and forest canopy. Four machine learning classifiers were investigated to identify the classifier with the least RMSE: Gaussian Process, Ensemble Boosted Tree, SVM, and Linear Regression. Gaussian Process showed the lowest RMSE, MSE, and MAE of 1.96, 3.8774, and 1.3202 respectively, with an R-squared of 0.98, as compared to the other classifiers.


I. Introduction
Wireless networks have developed rapidly during the last two to three decades. 1G arrived in the late 1980s, worked on analog signals, and supported voice calls only. 2G arrived during the 1990s and was used for voice calls and data transmission with a bandwidth of 64 kbps. In 2000, 3G was launched with a bandwidth of 1 Mbps to 2 Mbps and supported not only voice calls but also video calls and conferencing. 4G came into existence in 2009 with a data transmission speed of 100 Mbps to 1 Gbps. The 4G network expanded worldwide and also found industrial applications. Advanced wireless technology is evolving very rapidly with the implementation of the 5G-NR network by 2020 [1]- [3]. Features of a high-speed next-generation 5G-NR network are high density (1 million nodes per km²), high capacity (10 Tbps per km²), high data rates (multi-Gbps peak rates), low latency (1 ms), high reliability (1 out of 100 million packets lost), low energy (10+ years of battery life), low complexity, and high speed of 1 Gbps. The establishment of this new technology may bring challenges and difficulties such as security and privacy. Millimeter (mm) waves required for 5G propagation have limitations: they affect human cells and tissues, get absorbed during transmission, require small antennas, and cause unpredictable loss of signal during propagation [4]- [6].
In advanced wireless networks, a remarkable growth in information is observed [7]- [9]. Manual extraction of relevant information from an enormous amount of data is not practical, and if attempted, it is prone to errors. To capture such big data, companies no longer limit themselves to surveys and questionnaires; instead, big-data capturing sources are deployed, including smartphones, cameras, and online browsing. Machine learning holds a promising solution for the analysis of big data [10]- [13]. Generally, data patterns are learned using information hidden in the big data, and then effective predictions can be made depending on the final analysis. Many machine learning algorithms are available, and an appropriate algorithm can be selected by trial and error [14]- [16]. The remainder of the paper is organized as follows. Sections II and III give a literature review and an outline of machine learning algorithms, respectively. Section IV describes the measurement setup and the data collection methodology. Finally, Sections V and VI report experimental results, discussions, conclusions, and future scope, which show the usefulness of the machine learning approach in predicting signal coverage.

II. Literature Review
Hajar El Hammouti et al. [17] proposed a signal mapping model based on field-measured data, which is applied to predict signal coverage in an outdoor network by utilizing an S-shaped sigmoid function. Effectively modeled neural networks provide a better approximation of coverage mapping. Amir Ghasemi [18] presented a crowd-sourced analysis of the Long-Term Evolution (LTE) network to build a wireless coverage predictive model of the radio access network (RAN). Janne Riihijarvi et al. [19] explored signal coverage mapping with machine learning algorithms using a data set derived from an extensive drive test and found that Random Forest, exponential smoothing of time series, and Gaussian Process are the machine learning methods that produce better results for signal coverage. This approach improves the quality of user experience and concurrently reduces the operational cost. H. Braham et al. [20] showed that accurate evaluation of coverage gives better coverage optimization by utilizing the Fixed Rank Kriging (FRK) algorithm, which provides an accurate prediction of signal coverage at locations where field measurements are not easily accessible. FRK builds a coverage map from geo-located measurements by interpolating them spatially. Carlos Oroza et al. [21] estimated the performance of a machine learning based signal loss model for different terrain and vegetation environments. Four major machine learning algorithms were explored: K-Nearest-Neighbor, AdaBoost, Random Forest, and Neural Networks. Random Forest outperformed the others with the least error. The machine learning model accomplished a 37% reduction in average prediction error. Many researchers [22]- [23] carried out exhaustive field measurements of the Received Signal Strength Indicator (RSSI) on multiband transmission channels to obtain the maximum signal coverage and angular power arrival. A hybrid approach adopting sub-6 GHz and millimeter (mm) wave bands was found to be promising.
In [24]- [25], the authors proposed the Backtracking Spiral Algorithm (BSA) for information detection and recognition of cellular coverage using a big-data method. The distribution potential of each individual cell was diagnosed on a small granular geographical grid. The high efficiency and detection capability of the proposed algorithm validate it over other existing algorithms. Aldebaro Klautau et al. [26] represented 5G scenarios by creating channel realizations using ML applied to the PHY layer. Tadilo Endeshaw et al. [27] studied the deployment of AI by combining machine learning, NLP, and data analytics techniques to increase the competence of wireless networks. Deussom Djomadji et al. [28] tuned propagation models using Particle Swarm Optimization (PSO). Data were collected using the network navigation tool 1x EVDO Rev B. The RMSE of the optimized model was compared with that of the Okumura-Hata model, and it was concluded that the model optimized using PSO performs better than the Okumura-Hata model. Chao-Kai Wen et al. [29] estimated the wireless channel based on Sparse Bayesian Learning techniques.

III. Outdoor Path Loss Prediction
5G network planning includes outdoor propagation modeling, which is required to predict outdoor signal loss. Outdoor channel models consider the effects of diffraction, reflection, scattering, and refraction of EM waves traveling between transmitter and receiver [38], [47]- [49]. Different channel models have been designed to consider the interference effects of terrain, building height, vegetation, rain, etc. Ideal path loss is predicted using the free-space path loss equation. Path loss models are classified into canonical, empirical, deterministic, and stochastic propagation models [30]- [32].

A. Free Space Path Loss Model
It is based on the Friis transmission equation and is one of the simplest first-order approximation connectivity models. EM waves travel without any interference along the Line of Sight (LOS) in free space. Received Signal Strength (RSS) decreases with the square of the distance between Tx and Rx over a single path, neglecting the ground effect [33]- [35].
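The free space path loss model is expressed in equation (1):
$$P_r(d) = \frac{P_t G_t G_r \lambda^2}{(4\pi)^2 d^2 L} \qquad (1)$$
where,
P_t = transmitted power
P_r = power received by receiver (in the same linear units as P_t in equation (1))
G_r = gain of receiving antenna
G_t = gain of transmitting antenna
λ = signal wavelength (m)
d = distance between Tx-Rx (m)
L = system loss factor (L = 1 for FSL)
Channel loss (dB) can also be expressed using equation (2), taking L = 1:
$$PL(\mathrm{dB}) = 10\log_{10}\frac{P_t}{P_r} = -10\log_{10}\left[\frac{G_t G_r \lambda^2}{(4\pi)^2 d^2}\right] \qquad (2)$$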
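As a rough numerical illustration of the free space model, the sketch below evaluates the standard Friis form with L = 1. The 3.5 GHz carrier, the unity antenna gains, and the 1 W transmit power are hypothetical values chosen only for the example; they are not taken from the measurement campaign.

```python
import math

def friis_received_power(pt_w, gt, gr, wavelength_m, d_m, loss=1.0):
    """Received power (W) from the Friis free-space equation."""
    return pt_w * gt * gr * wavelength_m**2 / ((4 * math.pi) ** 2 * d_m**2 * loss)

def path_loss_db(gt, gr, wavelength_m, d_m):
    """Free-space path loss in dB, i.e. 10*log10(Pt/Pr) with L = 1."""
    ratio = gt * gr * wavelength_m**2 / ((4 * math.pi) ** 2 * d_m**2)
    return -10 * math.log10(ratio)

# Hypothetical link: 3.5 GHz carrier, unity-gain antennas, 1 W transmit power.
wavelength = 3e8 / 3.5e9
received_100m = friis_received_power(1.0, 1.0, 1.0, wavelength, 100.0)
loss_100m = path_loss_db(1.0, 1.0, wavelength, 100.0)
loss_200m = path_loss_db(1.0, 1.0, wavelength, 200.0)

# Doubling the distance adds 20*log10(2), about 6 dB, of free-space loss.
extra_loss_db = loss_200m - loss_100m
```

The inverse-square dependence on distance shows up as a fixed increase of about 6 dB each time the Tx-Rx separation doubles, independent of the carrier frequency.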

B. Machine Learning Algorithms
Machine learning algorithms are classified as follows [36], [39]- [42]:
Supervised Learning: This algorithm learns from the specimen data and the related results to predict the correct result when given new data similar to the specimen data.
Unsupervised Learning: This algorithm learns from the specimen data without any related results; the algorithm itself has to determine the underlying data patterns and group or cluster the data based on some kind of similarity in their features.
Reinforcement Learning: This algorithm learns from specimen data that lack labels, where the result or outcome is rewarded or penalized. It is like learning by trial and error.
The algorithms described below are supervised learning algorithms.

Linear Regression
Linear regression is a simple and widely used machine learning algorithm for prediction [18]. In linear regression, a set of independent input parameters (x) is used to determine the output parameter (y), and the association between the input and output parameters is expressed in the form of the linear equation:
$$y = wx + \epsilon$$
where
ϵ = intercept on the y-axis
w = slope of the line
The linear regression algorithm tries to find the best-fit line by minimizing the root mean square error (RMSE) between the true and predicted values.
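The best-fit line for a single predictor can be sketched with the closed-form least-squares solution, which minimizes the sum of squared errors and therefore also the RMSE. The distance and RSSI pairs below are made up for illustration and are not samples from the field dataset.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w*x + eps for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx                # slope of the line
    eps = mean_y - w * mean_x    # intercept on the y-axis
    return w, eps

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Made-up (distance in km, RSSI in dBm) pairs that lie exactly on a line.
distance_km = [0.5, 1.0, 1.5, 2.0, 2.5]
rssi_dbm = [-60.0, -70.0, -80.0, -90.0, -100.0]

w, eps = fit_line(distance_km, rssi_dbm)
predicted = [w * x + eps for x in distance_km]
fit_rmse = rmse(rssi_dbm, predicted)
```

Because the toy points are perfectly linear, the fit recovers a slope of -20 dB per km and an intercept of -50 dBm with an RMSE of essentially zero; real RSSI data would leave a non-zero residual error.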

Support Vector Machine (SVM)
Support Vector Machine (SVM) can perform both classification and regression of the data. In classification, the data is assigned to one of the groups by finding a hyperplane that divides the input instances into two classes. The input vectors located closest to the hyperplane, on the margin, are the support vectors. In cases where the input data is not linearly separable, suitable kernel functions are applied, which map the data into higher dimensions where it can be classified more easily.
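The effect of mapping data into a higher dimension can be illustrated with a small sketch. This is not an SVM trainer: the feature map phi(x) = (x, x²), the toy samples, and the hand-picked hyperplane are all assumptions made for the example. The point is only that five scalar samples that no single threshold can separate become linearly separable after the mapping.

```python
def feature_map(x):
    """Explicit map of a scalar into two dimensions: phi(x) = (x, x^2)."""
    return (x, x * x)

def kernel(x, z):
    """Kernel equal to the inner product phi(x) . phi(z), without building phi."""
    return x * z + (x * x) * (z * z)

def classify(x, w=(0.0, 1.0), b=-5.0):
    """Sign of w . phi(x) + b for a hand-picked separating hyperplane."""
    phi = feature_map(x)
    score = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if score > 0 else -1

# Toy samples: the outer points belong to class +1, the inner points to class -1.
samples = [(-3.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (3.0, 1)]
predictions = [classify(x) for x, _ in samples]

# The kernel value matches the inner product computed in the mapped space.
phi_a, phi_b = feature_map(2.0), feature_map(3.0)
explicit_inner = phi_a[0] * phi_b[0] + phi_a[1] * phi_b[1]
```

A trained SVM would choose the hyperplane that maximizes the margin and would evaluate only the kernel, never the explicit map, which is what makes high-dimensional mappings affordable.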

Gaussian Process
Gaussian process regression is a non-parametric Bayesian approach that computes a probability distribution over the permissible functions that fit the data [29]. The posterior probability is deduced from the prior distribution and the data. For a linear function
$$y = wx + \epsilon$$
the posterior probability is obtained using Bayes' rule, equation (3):
$$p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)} \qquad (3)$$
To make predictions at some new point x*, one needs to calculate the predictive probability, in which all possible predictions are weighted by their posterior probability, using equation (4):
$$p(f^* \mid x^*, y, X) = \int p(f^* \mid x^*, w)\, p(w \mid y, X)\, dw \qquad (4)$$
Predictive probability is computed from posterior probability so that uncertainty measurements on the predictions can be provided by calculating their mean and variance.
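A minimal sketch of how the predictive mean and variance can be computed is given below, using the kernel (function-space) form of Gaussian process regression. The squared-exponential kernel, the unit length scale, the zero-mean prior, the small noise variance, and the three toy training points are all assumptions made for illustration; the kernel actually used in the experiments is not stated here.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale**2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_predict(x_train, y_train, x_star, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the point x_star."""
    n = len(x_train)
    k_mat = [[rbf(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_star) for x in x_train]
    alpha = solve(k_mat, y_train)   # (K + noise*I)^-1 y
    v = solve(k_mat, k_star)        # (K + noise*I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Toy data: hypothetical RSSI deviations (dB) from their mean at three positions.
x_train = [0.0, 1.0, 2.0]
y_train = [2.0, -1.0, 0.5]

mean_near, var_near = gp_predict(x_train, y_train, 1.0)   # at a training point
mean_far, var_far = gp_predict(x_train, y_train, 10.0)    # far from all data
```

At a training point the predictive mean reproduces the observation and the variance is close to zero, while far from the data the prediction falls back to the prior mean with a variance close to the prior variance. This is the uncertainty measure that the predictive distribution provides.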

Ensemble Learning
Sometimes a single machine learning algorithm is not able to provide the desired results; in such cases, the expected result can be obtained by combining the available algorithms [43]. The final result can be calculated by voting on or averaging the results of the individual algorithms. The major ensemble algorithms used are Bagging and Boosting.

a) Ensemble Bagging Tree with Random Forests
In Bagging, the initial data set is first used to produce replicas of the training set by the bootstrap sampling method. Bootstrap sampling creates random samples from the initial data set, where each sample used as a training set has the same size as the initial data set [44]. A random sample is generated from the initial data by drawing with replacement, so some records are duplicated multiple times and some records are not selected even once. The initial data set is used as the test set, and the random sample sets are used as the training sets. Secondly, multiple models are built by applying the same algorithm to each of the training sets. Random Forests is a very good example of the benefits of bagging. In plain bagged trees, the best feature is selected at each split, which makes the algorithm converge faster to a unique result; when the same data set is used, a similar tree structure with associated predictions is obtained. In a random forest, by contrast, a random subset of features is provided at every split, which results in negligible association among the predictions of the sub-models.
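Bootstrap sampling and the averaging step of bagging can be sketched as follows. The ten integer records stand in for RSSI records, and the "model" is deliberately trivial (it predicts the mean of its training sample); both are assumptions for illustration, not the tree learners used in the experiments.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement from records."""
    return [rng.choice(records) for _ in records]

def bagged_mean(values, n_models, rng):
    """Average the outputs of n_models mean-predictors, each fit on its own bootstrap sample."""
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(values, rng)
        predictions.append(sum(sample) / len(sample))
    return sum(predictions) / len(predictions)

rng = random.Random(0)        # fixed seed so the run is repeatable
records = list(range(10))     # stand-in for 10 RSSI records

sample = bootstrap_sample(records, rng)
# Records never drawn into this sample (out-of-bag records).
out_of_bag = [r for r in records if r not in sample]

ensemble_prediction = bagged_mean(records, n_models=25, rng=rng)
```

Each bootstrap sample has the same size as the initial data set, typically repeats some records, and leaves others out; the left-out records are often used to estimate the error of the model trained on that sample.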

b) Ensemble Boosting Tree with AdaBoost
AdaBoost stands for Adaptive Boosting. Bagging works on 'simple voting', where each model is developed independently to provide an outcome [45], and the final result is obtained from the majority outcome of the parallel ensemble. Boosting works on 'weighted voting', where the outcome of each model contributes to the final result according to its weight. The final result is obtained by generating a sequential ensemble in which greater weights are assigned to the instances that were misclassified by the preceding models. In every iteration, a model is built to rectify the misclassifications of the preceding model, until no further corrections are required [46].
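One re-weighting step of the classification form of AdaBoost, as described above, can be sketched as follows. The labels, the predictions of the weak model, and the uniform starting weights are hypothetical; the formula used is the standard discrete AdaBoost update for labels in {-1, +1}.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost re-weighting step for labels in {-1, +1}.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified instances.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote weight of this model in the final weighted vote.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase the weights of misclassified instances, decrease the rest.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # hypothetical weak model that misses the last sample
weights = [0.2] * 5           # uniform starting weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the single misclassified instance carries half of the total weight, so the next model in the sequence is pushed to classify it correctly.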

IV. Experimental Setup and Data Collection
This research was carried out in Dehradun, Uttarakhand, India, which lies in the Himalayan ranges. Its geographical coordinates are latitude 30.3165° N, longitude 78.0322° E. Uttarakhand is often regarded as a terrain full of tree canopy and suburban environment, with a combination of mountains, forests, residential buildings, commercial complexes (2 to 6 storeys), and free space. Since 2018, exhaustive field measurements have been carried out at the fringe areas of Uttarakhand to measure the effect of trees, forests, mountains, snow, and buildings on propagation loss. 4 clusters (6 BTS each) were identified, covering a 36 km² area shown in Fig. 1. These BTS were strategically chosen to provide RSSI coverage of the tree canopy and mountain canopy. At each test point, measurements were carried out repeatedly to calculate the average RSSI variation due to changing environmental conditions. Hours of driving and route tracking were carried out around the identified BTS to collect RSSI samples using CATIA and TEMS navigation tools, as shown in Fig. 3. Where the dense forest area started, RSSI samples became inaccessible. A drive test was conducted to measure the data. As shown in Fig. 2 and Fig. 3, the drive test tools consist of a drive vehicle, a laptop, a Garmin Global Positioning System (GPS) receiver, sockets, test cables, a Sony mobile handset equipped with TEMS navigation software, MapInfo or Deskcat, and the drive test route [35], [28]. The GPS receiver was mounted on the vehicle and a Sony Ericsson mobile handset was used. RSS measurements were recorded at each test point around the selected base stations (BS), covering all roads, forests, mountains, and populated areas.

A. Dataset
A field measurement dataset of 42,500 RSSI samples is used to apply the machine learning approach to signal coverage prediction. Fig. 4 shows the architecture of 3D channel modeling and the complete procedure of real-time data collection. Table I lists the features of the wireless network that affect network planning and are required for optimum signal coverage. The received signal is a combination of signals arriving from different directions due to reflection, diffraction, and scattering.
RSSI is measured 360 degrees around each BTS in 3 sectors (alpha, beta, and gamma) to analyze the maximum coverage of the signal within a cell. The coverage threshold values of the network in the 3 sectors are summarized in Table II.

V. Experimental Results and Discussions
In this section, the performance of the machine learning classifiers is evaluated using the data collected from the experimental setup. The validation scheme was chosen before tuning in order to estimate the performance of the model on new data. Validation also helps to examine the predictive accuracy of the fitted models and avoids overfitting. 3 types of validation schemes were available:
a) Cross-Validation is used for small data sets and uses the full data set.
b) Hold out Validation is used for large data sets and uses some portion of the data set.
c) No Validation signifies no protection against overfitting.
5-fold cross-validation was used to divide the original data set into 5 disjoint sets (folds), so that the predictive accuracy of the trained models was estimated on the entire data set.
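The 5-fold scheme can be sketched as follows: each fold is held out once for testing while the model is trained on the remaining folds, and the held-out errors are averaged. The interleaved fold assignment (practical tools usually shuffle first), the mean-predictor model, and the toy data are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k disjoint, near-equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds

def cross_validate(xs, ys, k, fit, predict):
    """Average held-out mean squared error over k folds."""
    fold_mse = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        # Train on every sample that is not in the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        # Test only on the held-out fold.
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k

def fit_mean(train_x, train_y):
    """Trivial model: remember the mean of the training targets."""
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    """Predict the stored mean regardless of the input."""
    return model

xs = [float(i) for i in range(10)]
folds = k_fold_indices(len(xs), 5)
cv_mse_constant = cross_validate(xs, [5.0] * 10, 5, fit_mean, predict_mean)
cv_mse_varying = cross_validate(xs, xs, 5, fit_mean, predict_mean)
```

Because every sample is held out exactly once, the averaged error reflects performance on the entire data set while no model is ever scored on data it was trained on.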

A. Predicted Vs Response Plot
The Predicted Vs Response plot was used to analyze the performance of the classifiers by examining how well the regression model predicts across varying response values. The predicted response of each model was plotted against the true response. An efficient regression model has a predicted response nearly identical to the true response, so its points lie close to the diagonal line. The vertical distance from the diagonal line to each point is the prediction error for that point. An efficient classifier has minimum errors, with points scattered roughly symmetrically about the diagonal line.

B. Comparison of Residual Plot Outlier
The residual plots in Fig. 9 to Fig. 12 display the difference between the predicted and true responses. The variable plotted on the x-axis could be chosen from the true response, the predicted response, the record number, or one of the predictors. An efficient model has residuals distributed roughly symmetrically around 0. Fig. 10 shows that the residuals of the Gaussian Process are scattered roughly symmetrically around 0, and clear patterns in the residuals are also observed.

C. Comparison of Response Plot Outlier
The result of a regression model was viewed in the response plot, which displays the predicted response against the record number. The prediction error is displayed as vertical lines between the predicted and true responses. The performance of the Gaussian Process model was also evaluated using the residual plot after the model was trained. The difference between the true and predicted responses is displayed by the residual plot in Fig. 13, with the predicted response plotted on the x-axis. For a good model, the residuals are scattered approximately symmetrically around 0 and do not change considerably in size from left to right.

VI. Conclusion and Observation
The performance evaluation of the machine learning algorithms on the training data is tabulated in Table III.
We observed that the lowest RMSE (1.9691%) was obtained with the Gaussian Process classifier, indicating that it predicts the propagation loss most accurately, while the highest RMSE (4.692%) was observed with the Ensemble Boosted Tree. The SVM and Linear Regression classifiers hold intermediate RMSE values of 3.9823% and 4.2946%, respectively. The MSE and MAE for the Gaussian Process are also the minimum, at 3.8774 and 1.3202. Response plot outlier curves for the proposed Gaussian Process classifier and the other state-of-the-art algorithms are shown in Fig. 13, where the residuals are scattered approximately symmetrically around 0. Empirical signal coverage models, which are univariate, cannot predict signal coverage accurately because they use only one network parameter. A machine learning based signal coverage prediction model, in contrast, is multivariate: it can be designed on field RSSI measurements by considering two or more network parameters and hence predicts signal coverage more accurately. Signal coverage prediction using the machine learning model requires training candidate classifiers by trial and error and shortlisting the classifier with the minimum RMSE on the RSSI field dataset. To validate the approach practically, the classifier-based signal mapping approach was applied to a real-time wireless network at the fringe area of Uttarakhand, India. The results of this application could encourage practitioners and researchers to further validate the practicality of the approach for similar real fringe-area wireless networks.
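For reference, the four error measures reported for the classifiers can be computed as in the sketch below, using their standard definitions. The four-point toy data set is made up; only the final check uses figures quoted above, confirming that the reported Gaussian Process RMSE (1.9691) is the square root of the reported MSE (3.8774).

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MSE, MAE and R-squared between true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}

# Made-up example: one of four predictions is off by one unit.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])

# Consistency check on the reported Gaussian Process figures: RMSE = sqrt(MSE).
reported_mse = 3.8774
rmse_from_mse = math.sqrt(reported_mse)
```

Lower RMSE, MSE, and MAE indicate a better fit, whereas a better fit corresponds to an R-squared closer to 1.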